Created samples for 'ProduceWordBags' and 'ProduceHashedWordBags' API. #3183
Codecov Report
@@            Coverage Diff             @@
##           master    #3183      +/-   ##
==========================================
+ Coverage   72.54%   72.58%   +0.04%
==========================================
  Files         807      807
  Lines      144774   144956     +182
  Branches    16208    16212       +4
==========================================
+ Hits       105021   105223     +202
+ Misses      35339    35318      -21
- Partials     4414     4415       +1
new TextData(){ Text = "This is an example to compute bag-of-word features using hashing." },
new TextData(){ Text = "ML.NET's ProduceHashedWordBags API produces count of Ngrams and hashes it as an index into a vector of given bit length." },
new TextData(){ Text = "It does so by first tokenizing text/string into words/tokens then " },
new TextData(){ Text = "computing Ngram and hash them to the index given by hash value." },
Ngram (start = 50, length = 5)
Ngrams? #Resolved
new TextData(){ Text = "ML.NET's ProduceHashedWordBags API produces count of Ngrams and hashes it as an index into a vector of given bit length." },
new TextData(){ Text = "It does so by first tokenizing text/string into words/tokens then " },
new TextData(){ Text = "computing Ngram and hash them to the index given by hash value." },
new TextData(){ Text = "The hashing schem reduces the size of the output feature vector" },
schem (start = 52, length = 5)
schema #Resolved

// Expected output:
// Number of Features: 256
// Features: 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 (start = 30, length = 78)
This is kinda anticlimactic.
Maybe change numberOfBits to a smaller value to get some results different from zero here? #Resolved
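To illustrate the point about numberOfBits, here is a minimal sketch of the hashing trick in plain C#, with no ML.NET dependency. It assumes the output vector has 2^numberOfBits slots (consistent with the 256 features shown above if numberOfBits was 8), counts unigrams only, and uses FNV-1a as a stand-in hash. `HashedBagSketch`, `Featurize` and `Fnv1a` are hypothetical names, and ML.NET's own hashing and tokenization will differ, so the slot positions will not match the real API.

```csharp
using System;
using System.Linq;

public static class HashedBagSketch
{
    // FNV-1a over UTF-16 code units. This is only a stand-in hash for the
    // sketch; it is NOT the hash function ML.NET uses internally.
    private static uint Fnv1a(string s)
    {
        uint hash = 2166136261;
        foreach (char c in s)
        {
            hash ^= c;
            hash *= 16777619; // wraps around on overflow (unchecked by default)
        }
        return hash;
    }

    // Splits the text on spaces, hashes each token into one of
    // 2^numberOfBits slots, and counts how many tokens land in each slot.
    public static float[] Featurize(string text, int numberOfBits)
    {
        var features = new float[1 << numberOfBits];
        uint mask = (1u << numberOfBits) - 1;
        var tokens = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var token in tokens)
        {
            features[Fnv1a(token) & mask] += 1;
        }
        return features;
    }

    public static void Main()
    {
        const string text = "This is an example to compute bag-of-word features using hashing.";
        foreach (int bits in new[] { 8, 3 })
        {
            float[] features = Featurize(text, bits);
            Console.WriteLine(
                $"numberOfBits={bits}, slots={features.Length}, non-zero={features.Count(v => v != 0)}");
        }
    }
}
```

With fewer bits the same tokens are spread over fewer slots, so the first ten values printed by a sample are much more likely to contain non-zero counts, at the cost of more hash collisions.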
new TextData(){ Text = "ML.NET's ProduceWordBags API produces bag-of-word features from input text." },
new TextData(){ Text = "It does so by first tokenizing text/string into words/tokens then " },
new TextData(){ Text = "computing Ngram and their neumeric values." },
new TextData(){ Text = "Each position in the output vector corresponds to a particular Ngram." },
Ngram (start = 103, length = 5)
https://en.wikipedia.org/wiki/N-gram
We call it NGram in code because you can't use - in a variable name, but for explanation maybe we should use n-gram? #Resolved
new TextData(){ Text = "Each position in the output vector corresponds to a particular Ngram." },
new TextData(){ Text = "The value at each position corresponds to," },
new TextData(){ Text = "the number of times Ngram occured in the data (Tf), or" },
new TextData(){ Text = "the inverse of the number of documents contain the Ngram (Idf), or." },
or. (start = 104, length = 3)
Double "or", plus a dot at the end. I would omit this "or". #Resolved
var transformedDataView = textTransformer.Transform(dataview);

// Create the prediction engine to get the bag-of-word features extracted from the text.
var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData, TransformedTextData>(textTransformer);
Why predictionEngine rather than TakeRows and ConvertToEnumerable? I would use the one we would recommend people to use to inspect data in practice.
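For reference, a rough sketch of the alternative raised in this comment: inspecting the transformed data directly instead of going through a prediction engine. It assumes the API names from released ML.NET 1.x, where the enumerable conversion is exposed as `mlContext.Data.CreateEnumerable` (the comment says ConvertToEnumerable, so treating the two as the same thing is an assumption). The class name `InspectWordBags`, the helper `FirstRows`, and the sample texts are illustrative only, and the block needs the Microsoft.ML NuGet package.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

public static class InspectWordBags
{
    public class TextData
    {
        public string Text { get; set; }
    }

    public class TransformedTextData : TextData
    {
        public float[] BagOfWordFeatures { get; set; }
    }

    // Fits ProduceWordBags on a tiny in-memory dataset and returns the
    // first `count` transformed rows as strongly typed objects.
    public static List<TransformedTextData> FirstRows(int count)
    {
        var mlContext = new MLContext();
        var samples = new List<TextData>
        {
            new TextData { Text = "This is an example to compute bag-of-word features." },
            new TextData { Text = "Each position in the output vector corresponds to a particular n-gram." },
            new TextData { Text = "The value at each position is the n-gram count." },
        };
        var dataview = mlContext.Data.LoadFromEnumerable(samples);

        var textPipeline = mlContext.Transforms.Text.ProduceWordBags(
            "BagOfWordFeatures", "Text", ngramLength: 2);
        var textTransformer = textPipeline.Fit(dataview);
        var transformedDataView = textTransformer.Transform(dataview);

        // Take only the first rows, then materialize them for inspection.
        var firstRows = mlContext.Data.TakeRows(transformedDataView, count);
        return mlContext.Data
            .CreateEnumerable<TransformedTextData>(firstRows, reuseRowObject: false)
            .ToList();
    }

    public static void Main()
    {
        foreach (var row in FirstRows(2))
        {
            Console.WriteLine($"Number of Features: {row.BagOfWordFeatures.Length}");
            Console.WriteLine("Features: " +
                string.Join(" ", row.BagOfWordFeatures.Take(10).Select(v => v.ToString("f4"))));
        }
    }
}
```

This path reads straight from the IDataView, which is closer to how one would spot-check a transformed dataset in practice; a prediction engine is aimed at scoring single examples.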

// Preview of the produced n-grams.
// Get the slot names from the column's metadata.
// If the column is a vector column the slot names corresponds to the names associated with each position in the vector.
If the column is a vector column (start = 15, length = 32)
The If clause is confusing here. I would just drop it. #Resolved
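As a hedged illustration of what the commented code does, the sketch below reads the slot names of the output column, which give the n-gram associated with each position of the feature vector. It assumes ML.NET 1.x and the `GetSlotNames` extension from `Microsoft.ML.Data`; `WordBagSlotNames`, `GetNgramNames` and the sample texts are hypothetical names chosen for the example, not code from this PR.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

public static class WordBagSlotNames
{
    public class TextData
    {
        public string Text { get; set; }
    }

    // Returns the n-gram that each position of "BagOfWordFeatures" stands for.
    public static string[] GetNgramNames()
    {
        var mlContext = new MLContext();
        var samples = new List<TextData>
        {
            new TextData { Text = "This is an example to compute bag-of-word features." },
            new TextData { Text = "Each position in the output vector corresponds to a particular n-gram." },
        };
        var dataview = mlContext.Data.LoadFromEnumerable(samples);

        var transformedDataView = mlContext.Transforms.Text
            .ProduceWordBags("BagOfWordFeatures", "Text", ngramLength: 2)
            .Fit(dataview)
            .Transform(dataview);

        // The slot names live in the column's annotations (metadata);
        // there is one name per position in the vector.
        VBuffer<ReadOnlyMemory<char>> slotNames = default;
        transformedDataView.Schema["BagOfWordFeatures"].GetSlotNames(ref slotNames);
        return slotNames.DenseValues().Select(name => name.ToString()).ToArray();
    }

    public static void Main()
    {
        string[] names = GetNgramNames();
        Console.WriteLine($"Number of Features: {names.Length}");
        Console.WriteLine("N-grams: " + string.Join("  ", names.Take(10)));
    }
}
```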
Approved with comments.
🔵
Related to #1209.