Update default n-gram length for Text Transform to match default text recipe #2870

Closed
daholste opened this issue Mar 6, 2019 · 4 comments
Labels
P2 Priority of the issue for triage purpose: Needs to be fixed at some point.

Comments

@daholste
Contributor

daholste commented Mar 6, 2019

@justinormont and the text team tuned the default n-gram lengths for the default text recipe in the internal repo.

These defaults are:
Word -- bigrams (w/ unigrams)
Character -- trigrams (w/o unigrams and bigrams)
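
To make the two defaults concrete, here is a minimal Python sketch of what "bigrams with unigrams" and "trigrams without unigrams and bigrams" produce on a short string. This is not ML.NET code: the function names `word_ngrams` and `char_ngrams` and the `use_all_lengths` flag are hypothetical and chosen only for illustration, and whitespace tokenization is an assumption.

```python
from typing import List


def word_ngrams(text: str, max_n: int = 2, use_all_lengths: bool = True) -> List[str]:
    """Word n-grams. With use_all_lengths=True, emit every length from 1 to max_n
    (so max_n=2 yields unigrams and bigrams); otherwise emit only length max_n."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    lengths = range(1, max_n + 1) if use_all_lengths else [max_n]
    grams = []
    for n in lengths:
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def char_ngrams(text: str, n: int = 3, use_all_lengths: bool = False) -> List[str]:
    """Character n-grams. With use_all_lengths=False, emit only length n
    (so n=3 yields trigrams without unigrams or bigrams)."""
    lengths = range(1, n + 1) if use_all_lengths else [n]
    return [text[i:i + k] for k in lengths for i in range(len(text) - k + 1)]


if __name__ == "__main__":
    print(word_ngrams("the quick fox"))  # unigrams followed by bigrams
    print(char_ngrams("fox!"))           # trigrams only
```

Under these assumptions, the requested word default yields both the single tokens and the adjacent pairs, while the character default yields only the three-character windows.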

One chart from his findings:
[Image: chart comparing n-gram settings, with accuracy on the y-axis]

The line with the light blue call-out represents the current ML.NET defaults (Unigram + Trichar).
The line with the light green call-out is the requested change (Bigram + Trichar).
The line with the pink call-out shows that Trigram + Trichar is more accurate but takes longer, and that accuracy crosses over at NumIterations > 8 for the Averaged Perceptron learner.

daholste changed the title from "Update default n-gram length in text transforms to match TLC" to "Update default n-gram length in Text Transform to match TLC" on Mar 6, 2019
daholste changed the title from "Update default n-gram length in Text Transform to match TLC" to "Update default n-gram length for Text Transform to match TLC" on Mar 6, 2019
daholste changed the title from "Update default n-gram length for Text Transform to match TLC" to "Update default n-gram length for Text Transform to match default text recipe" on Mar 6, 2019
@rogancarr
Contributor

Related to #2802

@zeahmed
Contributor

zeahmed commented Mar 25, 2019

@justinormont and @shauheen, do you want this to go in V1.0?

@justinormont
Contributor

That's up to @shauheen. I'd say yes, as there's a strong accuracy upside. You'll notice the large jump in accuracy (y-axis) when we move from the blue line to the green line in the graph above.

The power of defaults should never be underestimated.

Related: #2305

frank-dong-ms-zz added the P2 label (Priority of the issue for triage purpose: Needs to be fixed at some point.) on Jan 9, 2020
@najeeb-kazmi
Member

Tracking in #4749

ghost locked the issue as resolved and limited conversation to collaborators on Mar 23, 2022