The hashed values stored in the annotations indices are skewed by -1 #3072

Closed
sfilipi opened this issue Mar 22, 2019 · 4 comments
Labels
enhancement New feature or request

Comments

@sfilipi
Member

sfilipi commented Mar 22, 2019

Look at the hash extension sample and compare the hashed values with the values stored in the annotations of the "CategoryHashed" column.

Notice how the indices in the annotations are skewed by -1 from the values in the dataview.

// Category CategoryHashed Age AgeHashed
// MLB 36206 18 127
// NFL 19015 14 62
// NFL 19015 15 43
// MLB 36206 18 127
// MLS 6013 14 62

versus the annotations values:

// Output Data
//
// The original value of the 6012 category is MLS
// The original value of the 19014 category is NFL
// The original value of the 36205 category is MLB
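
The size of the offset can be checked directly from the two listings above. The following is a minimal Python sketch, not ML.NET code; the dictionary names are illustrative and the numbers are simply transcribed from this report.

```python
# Values shown in the "CategoryHashed" column of the data view.
hashed_in_dataview = {"MLB": 36206, "NFL": 19015, "MLS": 6013}

# Indices reported when reading the annotations.
index_in_annotations = {"MLB": 36205, "NFL": 19014, "MLS": 6012}

# Difference between the two, per category.
offsets = {
    name: hashed_in_dataview[name] - index_in_annotations[name]
    for name in hashed_in_dataview
}
print(offsets)  # {'MLB': 1, 'NFL': 1, 'MLS': 1}
```

Every category differs by exactly 1, so the shift is systematic rather than a per-value hashing discrepancy.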

@sfilipi sfilipi added the bug Something isn't working label Mar 22, 2019
@TomFinley
Contributor

TomFinley commented Mar 22, 2019

Working as intended, I believe, though perhaps some of the infrastructure surrounding this could be improved. Remember: key values are presented as indices into an enumerated set (so index 0 must refer to the first valid element), but we have simultaneously declared that the default value for keys (which must be 0) should represent the missing value rather than any particular element.

So 0 is and must be the "missing" key. The alternative is that it become some arbitrary member of whatever set is being enumerated, which is an utterly ridiculous proposition. We are also constrained by the fact that we are operating in an environment with 0 indexing. These two constraints force this compromise, which is, despite misunderstandings like this, the correct compromise.

Perhaps we could repurpose this example and documentation so as to properly explain to people what keys are.
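
The convention described in this comment can be sketched in a few lines of Python. This is an illustration only, not the ML.NET implementation: `MISSING`, `annotation_slots`, and `key_to_value` are hypothetical names, and a dict stands in for what would really be a long, mostly empty annotations vector. The numbers are taken from the report above.

```python
from typing import Optional

MISSING = 0  # the default key value is reserved for "missing"

# Zero-based slots of the annotations, transcribed from the report above.
annotation_slots = {6012: "MLS", 19014: "NFL", 36205: "MLB"}


def key_to_value(key: int) -> Optional[str]:
    """Map a stored key to its original value; key k refers to slot k - 1."""
    if key == MISSING:
        return None  # missing value, there is no slot to look up
    return annotation_slots.get(key - 1)


for key in (36206, 19015, 6013, 0):
    print(key, "->", key_to_value(key))
```

Under this reading, the stored key 36206 and the annotation index 36205 refer to the same element, and 0 never maps to any element.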

@TomFinley
Contributor

TomFinley commented Mar 22, 2019

So... remove bug, maybe rephrase as documentation? The sample must at least be improved.

@sfilipi
Member Author

sfilipi commented Apr 4, 2019

@TomFinley adding a bit more to this, as I think the appearance of skewed keys and the mapping of 0 to the missing value might be hard to solve through documentation alone. As I was working on the samples for KeyToValue and KeyToVector, seeing the loaded values skewed was confusing. Would it be better to change the mapping of the missing value to something like MaxInt? cc @Ivanidzo4ka

@sfilipi sfilipi added enhancement New feature or request and removed bug Something isn't working labels Apr 4, 2019
@sfilipi
Member Author

sfilipi commented Apr 6, 2019

Closing after discussing alternatives such as mapping the missing value to MaxInt, or switching the raw type of KeyTypes to int?.
Mapping the missing value to something other than the default value is not possible, since uninitialized variables are initialized to the default value.
Using nullable uints showed a perf hit during testing.
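
The default-value constraint can be illustrated with a small Python sketch. It is only an analogy: a zero-initialized `ctypes` array stands in for default-initialized storage of unsigned 32-bit keys, and the constant names are hypothetical.

```python
import ctypes

# A freshly allocated buffer of 32-bit unsigned keys that is never written to.
# ctypes zero-initializes it, much as default-initialized storage would be.
keys = (ctypes.c_uint32 * 4)()

MISSING_AS_ZERO = 0
MISSING_AS_MAX = 2**32 - 1  # the alternative sentinel discussed above

# With 0 as the missing key, untouched slots already read as "missing".
all_missing_zero = all(k == MISSING_AS_ZERO for k in keys)

# With a max-value sentinel, the same untouched slots would not read as
# missing unless every buffer were explicitly filled first.
all_missing_max = all(k == MISSING_AS_MAX for k in keys)

print(all_missing_zero, all_missing_max)  # True False
```

With a sentinel other than the default, every untouched slot would silently be read as a real element of the enumerated set.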

For more on KeyTypes see: https://github.com/dotnet/machinelearning/blob/master/docs/code/IDataViewTypeSystem.md#key-types
https://github.com/dotnet/machinelearning/blob/master/docs/code/KeyValues.md

@sfilipi sfilipi closed this as completed Apr 6, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 23, 2022

2 participants