-
Notifications
You must be signed in to change notification settings - Fork 1.9k
FeaturizeText outputTokens uses a magical string to name a new column #2957
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@justinormont @TomFinley @sfilipi @Ivanidzo4ka Your thoughts? |
I would vote to hide that option in general and switch our examples to chain of estimators like we do here: machinelearning/docs/samples/Microsoft.ML.Samples/Dynamic/WordEmbeddingTransform.cs Line 28 in 9f87099
|
@Ivanidzo4ka I agree with what you're saying, but Update: I see what you're saying. You are saying to keep the transform, but turn off the "introspection" pieces. |
Exploding the text transform in to it's subparts will make the WordEmbedding quite error prone to use. For instance, users will often forget to lowercase their text, causing a silent degradation as it will still semi-work. I strongly prefer the explorability and simplicity of using a single transform instead of the seven or so sub-components. On the original topic of creating an explicit |
So we have this chunky transform which do lot of stuff, and have plenty of knobs. One of this knobs is I see main purpose of
I understand it can be error prone to convert text into tokens, but no one stops us from introducing another transform which would do that (and inside would be same just chain of already existing transformers) and only that, rather than having one transform which does two things. To answer @justinormont question regarding current behavior for multiple columns -> We just concat them together into one column and process it as one column. |
Another reason it's nice for the |
As suggested by @Ivanidzo4ka if users want to extract tokens as part of the pipeline , we can point them to use the machinelearning/src/Microsoft.ML.Transforms/Text/TextCatalog.cs Lines 154 to 165 in 91a8703
For V1.0 I feel it makes sense to hide the p.s. @rogancarr @Ivanidzo4ka @justinormont let me know if we have consensus on this -- i can get going on this issue |
If we add a separate step for outputting tokens, then that is actually a completely different pass through the data to create tokens that we already have. I would recommend keeping the output option, but letting the user specify a name. |
In other places of the API we have moved away from adding "smarts" (no-auto normalization, no-auto calibration). Should we adhere to this motto for If we expose |
@abgoswam in addition, there's various more steps needed than just |
So is the concern that we may want |
@rogancarr . Yeap that's my primary concern. Specifically, if a user uses the
@justinormont . Am not getting you.
|
@justinormont Thanks for the clarification. I created a small sample for this. Kindly take a look at this example #2985
|
@abgoswam: Your example #2985 rather misses the reason you want the The word embedding needs to run thru the steps in the For instance all of our, included, pre-trained word embeddings are lowercase. There is no "Cat" in the model, but there is a "cat". Beyond the hard failure of case, for most pre-trained word embeddings, having the |
One of the invariants of the As such, the recommendation by @rogancarr above #2957 (comment) seems reasonable. Here is the proposal:
|
When using
OutputTokens=true
,FeaturizeText
creates a new column called${OutputColumnName}_TransformedText
. This isn't really well documented anywhere, and it's odd behavior. I suggest that we make the tokenized text column name explicit in the API.My suggestion would be the following:
OutputTokens = [bool]
toOutputTokensColumn = [string]
, and astring.NullOrWhitespace(OutputTokensColumn)
signifies that this column will not be created.What do you all think?
The text was updated successfully, but these errors were encountered: