Skip to content

Make DataViewRowId not act like a number. #2707

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Feb 25, 2019

Conversation

eerhardt
Copy link
Member

  • Remove it from the NumberDataViewType.
  • Remove any method/operator that makes it feel like a number.

Working towards #2297

get
{
return _instance ??
Interlocked.CompareExchange(ref _instance, new RowIdDataViewType(), null) ??
Copy link
Contributor

@rogancarr rogancarr Feb 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interlocked.CompareExchange [](start = 20, length = 27)

Nice! #Resolved

@@ -118,6 +120,8 @@ public static PrimitiveDataViewType PrimitiveTypeFromKind(DataKind kind)
return DateTimeDataViewType.Instance;
if (kind == DataKind.DZ)
return DateTimeOffsetDataViewType.Instance;
if (kind == DataKind.UG)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataKind.UG [](start = 24, length = 11)

Since DataKind.UG is only for these RowId instances, perhaps we should rename DataKind.UG to DataKind.RowId. When I first ran across UG, I thought it was for holding large unsigned integers.... Could be a more user friendly nomenclature, I guess.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wschin is currently working on #2661, which makes DataKind.UG internal. So it wouldn't be user facing.

Copy link

@yaeldekel yaeldekel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@rogancarr
Copy link
Contributor

        private DataViewRowId _valueCount;

Perhaps this should just be a long?

It looks like we're just accumulating integers here, no?


Refers to: src/Microsoft.ML.Transforms/MissingValueReplacingUtils.cs:185 in f25882e. [](commit_id = f25882e, deletion_comment = False)

@rogancarr
Copy link
Contributor

rogancarr commented Feb 25, 2019

A comment and a quick question before I sign off:

Comment: It looks like this was designed so that we could take slices of an IDataView. We lose that functionality if we remove this. But in practice, we take slices by creating dummy columns anyways, so it's probably just not necessary.

Question: Can we just remove all the Add() / Subtract() / ToDouble() / etc. methods? The whole business of counting the number of rows processed here could be done with a ulong, couldn't it? Or am I missing something?

@TomFinley
Copy link
Contributor

TomFinley commented Feb 25, 2019

Comment: It looks like this was designed so that we could take slices of an IDataView.

That definitely was not my intent, and I definitely did not design it with that in mind, and I think any code that uses it that way will have some problems with biased samples and whatnot. Please read here for more details on what it is used for.


In reply to: 467108963 [](ancestors = 467108963)

Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @eerhardt !

Copy link
Contributor

@rogancarr rogancarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Talked to @TomFinley offline — let's not worry about the use of RowId as a helper in the sum I pointed out earlier for now.

@TomFinley
Copy link
Contributor

        private DataViewRowId _valueCount;

We're accumulating integers, but the critical point is how large those integers might conceivably get. Let's imagine we have a dataset with 1 billion features (sparse, of course), and 20 billion instances. The accumulation of lengths will exceed the capacity of a ulong. With sparse vectors, these values can easily grow to be fairly large.

Have I seen datasets where the feature array is 1 billion large? Yes, a few times. Have I seen datasets with many billions of examples? Also yes. So I don't view it as ridiculous that a ulong might be judged to be insufficient to calculate this quantity. Now, then, could this code be changed to do something else? Again, maybe yes, but not without changing the implementation significantly, and it's unclear to me what the benefit is. So I'd prefer to just let @eerhardt's changes as written here stand.


In reply to: 467106243 [](ancestors = 467106243)


Refers to: src/Microsoft.ML.Transforms/MissingValueReplacingUtils.cs:185 in f25882e. [](commit_id = f25882e, deletion_comment = False)

- Remove it from the NumberDataViewType.
- Remove any method/operator that makes it feel like a number.

Working towards dotnet#2297
@eerhardt eerhardt merged commit f6d55f3 into dotnet:master Feb 25, 2019
@eerhardt eerhardt deleted the DataViewRowId branch February 25, 2019 21:48
@codecov
Copy link

codecov bot commented Feb 25, 2019

Codecov Report

Merging #2707 into master will decrease coverage by 0.01%.
The diff coverage is 26.74%.

@@            Coverage Diff             @@
##           master    #2707      +/-   ##
==========================================
- Coverage   71.68%   71.67%   -0.02%     
==========================================
  Files         808      808              
  Lines      142214   142247      +33     
  Branches    16131    16138       +7     
==========================================
+ Hits       101945   101951       +6     
- Misses      35830    35857      +27     
  Partials     4439     4439
Flag Coverage Δ
#Debug 71.67% <26.74%> (-0.02%) ⬇️
#production 67.92% <25%> (-0.02%) ⬇️
#test 85.85% <100%> (ø) ⬆️
Impacted Files Coverage Δ
src/Microsoft.ML.Parquet/ParquetLoader.cs 67.54% <0%> (ø) ⬆️
src/Microsoft.ML.Transforms/ProduceIdTransform.cs 0% <0%> (ø) ⬆️
...rosoft.ML.Transforms/MissingValueReplacingUtils.cs 34.81% <10.6%> (-2.04%) ⬇️
src/Microsoft.ML.StaticPipe/StaticSchemaShape.cs 76.5% <100%> (ø) ⬆️
src/Microsoft.ML.Data/Transforms/Hashing.cs 91.89% <100%> (+0.01%) ⬆️
...t/Microsoft.ML.Core.Tests/UnitTests/ColumnTypes.cs 75.3% <100%> (ø) ⬆️
...soft.ML.Data/Transforms/RowShufflingTransformer.cs 68.53% <100%> (ø) ⬆️
src/Microsoft.ML.Data/Data/DataViewUtils.cs 72.2% <100%> (ø) ⬆️
...rc/Microsoft.ML.Data/DataLoadSave/Binary/Codecs.cs 92.7% <100%> (+0.01%) ⬆️
test/Microsoft.ML.Tests/Transformers/HashTests.cs 100% <100%> (ø) ⬆️
... and 3 more

@ghost ghost locked as resolved and limited conversation to collaborators Mar 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants