Make DataViewRowId not act like a number. #2707
get
{
    return _instance ??
        Interlocked.CompareExchange(ref _instance, new RowIdDataViewType(), null) ??
Interlocked.CompareExchange
Nice! #Resolved
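For readers unfamiliar with the idiom praised above, here is a minimal sketch of a lock-free lazy singleton built on `Interlocked.CompareExchange`. The class name `RowIdTypeSketch` is a hypothetical stand-in, not the PR's actual type, and the final `?? _instance` fallback is an assumption about how the truncated expression in the diff continues.

```csharp
using System.Threading;

// Hypothetical stand-in for a singleton type such as RowIdDataViewType.
public sealed class RowIdTypeSketch
{
    private static RowIdTypeSketch _instance;

    private RowIdTypeSketch() { }

    public static RowIdTypeSketch Instance
    {
        get
        {
            // 1. If the field is already set, return it.
            // 2. Otherwise try to publish a new object, but only if the field
            //    is still null. CompareExchange returns the ORIGINAL value of
            //    the field, so it yields null when our object won the race.
            // 3. In that case the field now holds our object, so read it back.
            return _instance ??
                Interlocked.CompareExchange(ref _instance, new RowIdTypeSketch(), null) ??
                _instance;
        }
    }
}
```

Two racing threads may each construct an object, but only one is ever published, so every caller observes the same reference without taking a lock.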
@@ -118,6 +120,8 @@ public static PrimitiveDataViewType PrimitiveTypeFromKind(DataKind kind)
    return DateTimeDataViewType.Instance;
if (kind == DataKind.DZ)
    return DateTimeOffsetDataViewType.Instance;
if (kind == DataKind.UG)
DataKind.UG
Since DataKind.UG is only for these RowId instances, perhaps we should rename DataKind.UG to DataKind.RowId. When I first ran across UG, I thought it was for holding large unsigned integers.... Could be a more user-friendly nomenclature, I guess.
A comment and a quick question before I sign off:
Comment: It looks like this was designed so that we could take slices of an
Question: Can we just remove all the
That definitely was not my intent, and I did not design it with that in mind. I think any code that uses it that way will have problems with biased samples and the like. Please read here for more details on what it is used for.
In reply to: 467108963
Thank you @eerhardt!
Talked to @TomFinley offline. For now, let's not worry about the use of RowId as a helper in the sum I pointed out earlier.
We're accumulating integers, but the critical point is how large those integers might conceivably get. Let's imagine we have a dataset with 1 billion features (sparse, of course), and 20 billion instances. The accumulation of lengths will exceed the capacity of a
Have I seen datasets where the feature array is 1 billion large? Yes, a few times. Have I seen datasets with many billions of examples? Also yes. So I don't view it as ridiculous that a
In reply to: 467106243
Refers to: src/Microsoft.ML.Transforms/MissingValueReplacingUtils.cs:185 in f25882e.
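The arithmetic behind the comment above can be checked with a short sketch. The type named in the original comment is cut off in this capture, so the sketch only compares the hypothetical total (1 billion features times 20 billion instances) against the two 64-bit integer limits. `AccumulationSketch` and its members are illustrative names, not part of ML.NET.

```csharp
using System.Numerics;

// Illustrative helper; not part of the ML.NET code base.
public static class AccumulationSketch
{
    // Exact sum of the lengths of `instances` vectors, each of length `featureCount`.
    public static BigInteger TotalLength(long featureCount, long instances)
        => BigInteger.Multiply(featureCount, instances);

    // True when the value can be stored in a signed 64-bit integer.
    public static bool FitsInInt64(BigInteger value)
        => value >= long.MinValue && value <= long.MaxValue;

    // True when the value can be stored in an unsigned 64-bit integer.
    public static bool FitsInUInt64(BigInteger value)
        => value.Sign >= 0 && value <= ulong.MaxValue;
}
```

With the numbers from the comment, `TotalLength(1_000_000_000, 20_000_000_000)` is 2 × 10^19, which is larger than both `long.MaxValue` (about 9.22 × 10^18) and `ulong.MaxValue` (about 1.84 × 10^19), so neither 64-bit type could hold the accumulated total.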
- Remove it from the NumberDataViewType.
- Remove any method/operator that makes it feel like a number.

Working towards dotnet#2297
bab9e76 to 3c3efc0
Codecov Report
@@ Coverage Diff @@
## master #2707 +/- ##
==========================================
- Coverage 71.68% 71.67% -0.02%
==========================================
Files 808 808
Lines 142214 142247 +33
Branches 16131 16138 +7
==========================================
+ Hits 101945 101951 +6
- Misses 35830 35857 +27
Partials 4439 4439
Working towards #2297