CategoricalHashTransform should accept Floats and Doubles #679

ganik · 2018-08-14T17:35:21Z

Currently CategoricalHashTransform accepts only Text or Key types.

If Double is passed in for ex, below error message is shown:
Error: *** System.ArgumentOutOfRangeException: 'Source column 'workclass1' has invalid type ('R8'): Expected Text or Key item type.

It would be good if it can accept numbers: Ints and Floats.

TomFinley · 2018-08-15T22:27:52Z

Hi @ganik thanks for looking at this.

One detail that I am not clear on here is what we shall do with 0 values. Previously when operating over keys or text, the default values for either (missing and empty respectively, for those two types) was to map to the missing key value. In this way the mapping was sparsity preserving.

Yet, in the case of float and double, a 0.0 value is a perfectly reasonable value, and should probably not map to the missing key value. But the implication there is that this transform becomes, for input numeric types, no longer sparsity preserving: e.g., a sparse vector of length 100 and count 0 would map to a dense vector of length 100, filled with the hash values for 0.0.

Other details include, do we want to be sure negative and positive zeros have the same hash? It seems like they ought to.

It would be good if it can accept numbers: Ints and Floats.

I agree with this, but just from a strategic perspective I might elect to not include handling to integers until the work of #673 is done. Work on floats and doubles though can be done immediately.

ganik · 2018-08-16T17:23:29Z

Sure, I can postpone integers work until #673 done, will do only floats and doubles.
I agree that negative and positive zeros should have same hash.

If we don't map 0.0 to 0 then sparsity is not maintained. How about we map 0.0 to 0, missing values to 1 and the rest will have hash code starting from 2?

TomFinley · 2018-08-17T15:09:33Z

Hi @ganik.

If we don't map 0.0 to 0 then sparsity is not maintained. How about we map 0.0 to 0, missing values to 1 and the rest will have hash code starting from 2?

Interesting, but that still has the problem that values of 0.0 will be dropped from further processing (using, say, KeyToValue or something) since we are mapping them to the NA key value, which seems undesirable. That we would now make missing values have an actual key type seems undesirable. The proposal doesn't do much to represent that and the fact that we're now taking NA values and making them explicit "real" values through the hash seems undesirable. Also even if we didn't have the property that key values have their default value be the missing value (which is in any event I think the only sensible choice considering what keys are), fixing the hash for a single value seems like it compromises the usefulness of the hash, e.g., featurization schemes that rely on multiple hashings of the same data but with different seeds.

If we consider sparsity preserving a useful property (which I think we do) one possibility I'm not sure that I like is, I could imagine is the behavior could be optional, that NA and 0 inputs to NA keys for vector inputs. (Sort of like we have the zero=true option for the MinMax and other normalizers to turn off their by default sparsity preserving behavior. We have occasionally heard from people that really honestly did want the empty string to map to a real hash value, this could make those people "happy" as well, maybe.

Incidentally, it may be helpful to read about key types, both here for a formal definition and here for a more intuitive discussion, to understand what they are, and how they're used.

shauheen · 2018-08-31T04:27:10Z

@ganik could you please look into this issue. Thanks,

ganik · 2018-09-01T17:02:04Z

@shauheen this issue should have been closed by PR [684] (#684) and be part of v0.5. Not sure why it wasn't automatically closed.

shauheen · 2018-09-04T17:12:10Z

Thanks

ganik added the enhancement New feature or request label Aug 14, 2018

ganik self-assigned this Aug 14, 2018

ganik mentioned this issue Aug 16, 2018

HashTransform to handle Floats and Doubles #684

Merged

shauheen added this to the 0818 milestone Aug 24, 2018

shauheen modified the milestones: 0818, 0918 Aug 31, 2018

shauheen closed this as completed Sep 4, 2018

ghost locked as resolved and limited conversation to collaborators Mar 29, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CategoricalHashTransform should accept Floats and Doubles #679

CategoricalHashTransform should accept Floats and Doubles #679

ganik commented Aug 14, 2018 •

edited

Loading

TomFinley commented Aug 15, 2018

ganik commented Aug 16, 2018

TomFinley commented Aug 17, 2018

shauheen commented Aug 31, 2018

ganik commented Sep 1, 2018 •

edited

Loading

shauheen commented Sep 4, 2018

CategoricalHashTransform should accept Floats and Doubles #679

CategoricalHashTransform should accept Floats and Doubles #679

Comments

ganik commented Aug 14, 2018 • edited Loading

TomFinley commented Aug 15, 2018

ganik commented Aug 16, 2018

TomFinley commented Aug 17, 2018

shauheen commented Aug 31, 2018

ganik commented Sep 1, 2018 • edited Loading

shauheen commented Sep 4, 2018

ganik commented Aug 14, 2018 •

edited

Loading

ganik commented Sep 1, 2018 •

edited

Loading