The hashed values stored in the annotations indices are skewed by -1 #3072
Comments
Working as intended, I believe, though perhaps some of the infrastructure surrounding this could be improved. Remember: key values are presented as indices into an enumerated set (so 0 must be the first valid index), but we have simultaneously declared that the default value for keys (which must be 0) should represent the missing value rather than any particular element. So 0 is and must be the "missing" key. The alternative is that it becomes some arbitrary member of whatever set is being enumerated, which is an utterly ridiculous proposition. We are also constrained by the fact that we are operating in an environment with 0-based indexing. These two constraints force this compromise: a stored key value of k > 0 refers to the element at index k - 1. Despite misunderstandings like this one, it is the correct compromise. Perhaps we could repurpose this example and documentation to properly explain to people what keys are.
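To make the scheme described above concrete, here is a minimal, language-neutral sketch in Python. It is not ML.NET's API: the names `encode`, `decode`, `lookup`, and `MISSING` are hypothetical and exist only for this illustration. It assumes the convention stated in this thread, that raw value 0 means "missing" and raw value k > 0 refers to the element at zero-based index k - 1.

```python
from typing import Optional, Sequence

MISSING = 0  # raw value reserved for "no key" (hypothetical name)


def encode(index: Optional[int]) -> int:
    """Map a zero-based index (or None for missing) to a raw key value."""
    return MISSING if index is None else index + 1


def decode(raw: int) -> Optional[int]:
    """Map a raw key value back to a zero-based index (None if missing)."""
    return None if raw == MISSING else raw - 1


def lookup(raw: int, key_values: Sequence[str]) -> Optional[str]:
    """Resolve a raw key value against the enumerated set of original values."""
    index = decode(raw)
    return None if index is None else key_values[index]


categories = ["MLB", "NFL", "MLS"]

# Two present keys and one missing key, stored as raw values.
raw_column = [encode(0), encode(1), encode(None)]

# A zero-initialized buffer decodes to "all missing", not to categories[0].
zero_initialized = [0] * 3
```

Under this convention a default-initialized column reads as entirely missing, which is the property the comment above argues for. The cost is the off-by-one between the stored value and the index into the enumerated set.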
So... remove bug, maybe rephrase as documentation? The sample must at least be improved.
@TomFinley adding a bit more to this, as I think this appearance of skewed keys, and the mapping of 0 to the missing value, might be harder to solve through documentation. While I was working on the samples for KeyToValue and KeyToVector, seeing the loaded values skewed was confusing. Would it be better to map the missing value to something like MaxInt? cc @Ivanidzo4ka
Closing after discussing alternatives such as mapping the missing value to MaxInt or switching the rawType of KeyTypes to a nullable int (int?). For more on KeyTypes, see: https://github.com/dotnet/machinelearning/blob/master/docs/code/IDataViewTypeSystem.md#key-types
Look at the hash extension sample and compare the hashed values with the values stored in the annotations of the "CategoryHashed" column.
Notice how the indices in the annotations are skewed by -1 from the values in the dataview.
// Category   CategoryHashed   Age   AgeHashed
// MLB        36206            18    127
// NFL        19015            14    62
// NFL        19015            15    43
// MLB        36206            18    127
// MLS        6013             14    62
versus the annotations values:
// Output Data
//
// The original value of the 6012 category is MLS
// The original value of the 19014 category is NFL
// The original value of the 36205 category is MLB
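The relationship between the two outputs above can be checked with a short sketch. This is plain Python, not ML.NET code; the numbers are copied from the sample output in this issue, the dictionary is only a sparse stand-in for the annotations vector, and `original_value` is a hypothetical helper. It assumes the convention discussed in the comments: raw value 0 is missing, and raw value k > 0 points at annotation slot k - 1.

```python
from typing import Dict, Optional

# Raw values as printed in the CategoryHashed column of the sample output.
hashed_column = {"MLB": 36206, "NFL": 19015, "MLS": 6013}

# Annotation entries as printed by the sample: zero-based slot -> original value.
annotations: Dict[int, str] = {6012: "MLS", 19014: "NFL", 36205: "MLB"}


def original_value(raw: int, slots: Dict[int, str]) -> Optional[str]:
    """Resolve a raw key value to its original value (hypothetical helper)."""
    if raw == 0:
        return None  # 0 is reserved for the missing key
    return slots.get(raw - 1)  # stored value k maps to slot k - 1


resolved = {name: original_value(raw, annotations)
            for name, raw in hashed_column.items()}
```

Every raw value in the column resolves to its own category once 1 is subtracted, so the -1 "skew" is the key encoding at work rather than a mismatch between the column and its annotations.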