TextFeaturizer cannot specify n-grams for words or characters #2802
Comments
Also, it is not clear if this is even possible using a workaround.
Related to #838?
@najeeb-kazmi Yes, it is. That one proposes a wider set of functionality than we need for the proposed V1 features.
Is this strictly adding new API? Can this be done without a public API breaking change? If so, I think we can remove it from Project 13, and it can be added after v1.0. But if this requires a public API breaking change, then it can be left in Project 13.
@eerhardt The answer is "it depends". Take a look at the current options. We use binary flags to turn words and chars on and off:

    // Create a training pipeline.
    // TODO #2802: Update FeaturizeText to allow specifications of word-grams and char-grams.
    var pipeline = mlContext.Transforms.Text.FeaturizeText("Features", new string[] { "SentimentText" },
        new TextFeaturizingEstimator.Options
        {
            UseCharExtractor = true,
            UseWordExtractor = true,
            VectorNormalizer = TextFeaturizingEstimator.TextNormKind.L1
        })
        .AppendCacheCheckpoint(mlContext)
        .Append(mlContext.BinaryClassification.Trainers.StochasticDualCoordinateAscent(
            new SdcaBinaryTrainer.Options { NumThreads = 1 }));

If we want to be able to choose n-grams, then it makes more sense to get rid of these flag variables and replace them with options (e.g. how many n-grams to use, whether to do all n-grams up to a cutoff).
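To make the two knobs mentioned above concrete (the n-gram length, and whether to emit all lengths up to that cutoff), here is a minimal standalone sketch. It does not use ML.NET at all: `NgramSketch`, `Extract`, `WordNgrams`, `CharNgrams`, and the `ngramLength` / `useAllLengths` parameters are hypothetical names chosen for illustration, and the whitespace tokenization is an assumption, not a description of how TextFeaturizer tokenizes, hashes, or weights its features.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of ML.NET.
public static class NgramSketch
{
    // Enumerates n-grams over a token sequence.
    // useAllLengths == true  -> every length from 1 up to ngramLength.
    // useAllLengths == false -> only n-grams of exactly ngramLength tokens.
    public static List<string> Extract(
        IReadOnlyList<string> tokens, int ngramLength, bool useAllLengths, string separator)
    {
        if (ngramLength < 1)
            throw new ArgumentOutOfRangeException(nameof(ngramLength));

        var result = new List<string>();
        int minLength = useAllLengths ? 1 : ngramLength;
        for (int n = minLength; n <= ngramLength; n++)
        {
            // Slide a window of n tokens across the sequence.
            for (int start = 0; start + n <= tokens.Count; start++)
            {
                result.Add(string.Join(separator, tokens.Skip(start).Take(n)));
            }
        }
        return result;
    }

    // Word n-grams: tokens are whitespace-separated words (an assumption).
    public static List<string> WordNgrams(string text, int ngramLength, bool useAllLengths)
    {
        string[] words = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return Extract(words, ngramLength, useAllLengths, " ");
    }

    // Character n-grams: tokens are the individual characters of the text.
    public static List<string> CharNgrams(string text, int ngramLength, bool useAllLengths)
    {
        string[] chars = text.Select(c => c.ToString()).ToArray();
        return Extract(chars, ngramLength, useAllLengths, "");
    }
}
```

With this sketch, `NgramSketch.WordNgrams("the cat sat", 2, true)` yields `the`, `cat`, `sat`, `the cat`, `cat sat`, while `NgramSketch.CharNgrams("cats", 3, false)` yields `cat`, `ats`. Replacing the two boolean flags with per-extractor options of roughly this shape is one way the request in this issue could be expressed; whether the final API looks like this is not settled in this thread.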
The TLC text recipe defaults are word bigrams and character trigrams. Is this the default for TextFeaturizer as well?
One of the stated goals of the V1 API was for TextFeaturizer to allow updating the number of word-grams and char-grams used, along with things like the normalization.
In the current API for TextFeaturizer, it is possible to create n-grams from words and/or characters (UseCharExtractor, UseWordExtractor), but it is not possible to specify what sorts of n-grams to make.
Related to #2711