Skip to content

CategoricalHashTransform should accept Floats and Doubles #679

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ganik opened this issue Aug 14, 2018 · 6 comments
Closed

CategoricalHashTransform should accept Floats and Doubles #679

ganik opened this issue Aug 14, 2018 · 6 comments
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@ganik
Copy link
Member

ganik commented Aug 14, 2018

Currently CategoricalHashTransform accepts only Text or Key types.

If Double is passed in for ex, below error message is shown:
Error: *** System.ArgumentOutOfRangeException: 'Source column 'workclass1' has invalid type ('R8'): Expected Text or Key item type.

It would be good if it can accept numbers: Ints and Floats.

@ganik ganik added the enhancement New feature or request label Aug 14, 2018
@ganik ganik self-assigned this Aug 14, 2018
@TomFinley
Copy link
Contributor

Hi @ganik thanks for looking at this.

One detail that I am not clear on here is what we shall do with 0 values. Previously when operating over keys or text, the default values for either (missing and empty respectively, for those two types) was to map to the missing key value. In this way the mapping was sparsity preserving.

Yet, in the case of float and double, a 0.0 value is a perfectly reasonable value, and should probably not map to the missing key value. But the implication there is that this transform becomes, for input numeric types, no longer sparsity preserving: e.g., a sparse vector of length 100 and count 0 would map to a dense vector of length 100, filled with the hash values for 0.0.

Other details include, do we want to be sure negative and positive zeros have the same hash? It seems like they ought to.

It would be good if it can accept numbers: Ints and Floats.

I agree with this, but just from a strategic perspective I might elect to not include handling to integers until the work of #673 is done. Work on floats and doubles though can be done immediately.

@ganik
Copy link
Member Author

ganik commented Aug 16, 2018

Sure, I can postpone integers work until #673 done, will do only floats and doubles.
I agree that negative and positive zeros should have same hash.

If we don't map 0.0 to 0 then sparsity is not maintained. How about we map 0.0 to 0, missing values to 1 and the rest will have hash code starting from 2?

@TomFinley
Copy link
Contributor

Hi @ganik.

If we don't map 0.0 to 0 then sparsity is not maintained. How about we map 0.0 to 0, missing values to 1 and the rest will have hash code starting from 2?

Interesting, but that still has the problem that values of 0.0 will be dropped from further processing (using, say, KeyToValue or something) since we are mapping them to the NA key value, which seems undesirable. That we would now make missing values have an actual key type seems undesirable. The proposal doesn't do much to represent that and the fact that we're now taking NA values and making them explicit "real" values through the hash seems undesirable. Also even if we didn't have the property that key values have their default value be the missing value (which is in any event I think the only sensible choice considering what keys are), fixing the hash for a single value seems like it compromises the usefulness of the hash, e.g., featurization schemes that rely on multiple hashings of the same data but with different seeds.

If we consider sparsity preserving a useful property (which I think we do) one possibility I'm not sure that I like is, I could imagine is the behavior could be optional, that NA and 0 inputs to NA keys for vector inputs. (Sort of like we have the zero=true option for the MinMax and other normalizers to turn off their by default sparsity preserving behavior. We have occasionally heard from people that really honestly did want the empty string to map to a real hash value, this could make those people "happy" as well, maybe.

Incidentally, it may be helpful to read about key types, both here for a formal definition and here for a more intuitive discussion, to understand what they are, and how they're used.

@shauheen shauheen added this to the 0818 milestone Aug 24, 2018
@shauheen
Copy link
Contributor

@ganik could you please look into this issue. Thanks,

@shauheen shauheen modified the milestones: 0818, 0918 Aug 31, 2018
@ganik
Copy link
Member Author

ganik commented Sep 1, 2018

@shauheen this issue should have been closed by PR [684] (#684) and be part of v0.5. Not sure why it wasn't automatically closed.

@shauheen
Copy link
Contributor

shauheen commented Sep 4, 2018

Thanks

@shauheen shauheen closed this as completed Sep 4, 2018
@ghost ghost locked as resolved and limited conversation to collaborators Mar 29, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants