Created samples for 'ProduceWordBags' and 'ProduceHashedWordBags' API. #3183

Merged: zeahmed merged 6 commits into dotnet:master on Apr 4, 2019

Contributor

@zeahmed zeahmed commented Apr 2, 2019

Related to #1209.


codecov bot commented Apr 2, 2019

Codecov Report

Merging #3183 into master will increase coverage by 0.04%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3183      +/-   ##
==========================================
+ Coverage   72.54%   72.58%   +0.04%     
==========================================
  Files         807      807              
  Lines      144774   144956     +182     
  Branches    16208    16212       +4     
==========================================
+ Hits       105021   105223     +202     
+ Misses      35339    35318      -21     
- Partials     4414     4415       +1

| Flag | Coverage Δ |
|---|---|
| #Debug | 72.58% <ø> (+0.04%) ⬆️ |
| #production | 68.15% <ø> (+0.01%) ⬆️ |
| #test | 88.89% <ø> (+0.06%) ⬆️ |

| Impacted Files | Coverage Δ |
|---|---|
| src/Microsoft.ML.DataView/KeyDataViewType.cs | 74.57% <0%> (-3.76%) ⬇️ |
| test/Microsoft.ML.Tests/ImagesTests.cs | 98.69% <0%> (-0.13%) ⬇️ |
| ...Microsoft.ML.Tests/Transformers/NormalizerTests.cs | 100% <0%> (ø) ⬆️ |
| ...ML.Data/Transforms/ConversionsExtensionsCatalog.cs | 44.87% <0%> (ø) ⬆️ |
| ...soft.ML.TestFramework/DataPipe/TestDataPipeBase.cs | 74.03% <0%> (+0.33%) ⬆️ |
| ...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs | 85.11% <0%> (+0.4%) ⬆️ |
| ...rosoft.ML.ImageAnalytics/VectorToImageTransform.cs | 76.77% <0%> (+4.53%) ⬆️ |
| .../Microsoft.ML.Data/Transforms/ExtensionsCatalog.cs | 100% <0%> (+4.76%) ⬆️ |
| ...c/Microsoft.ML.ImageAnalytics/ExtensionsCatalog.cs | 16.66% <0%> (+5.55%) ⬆️ |
| src/Microsoft.ML.Transforms/NormalizerCatalog.cs | 84.78% <0%> (+8.11%) ⬆️ |

new TextData(){ Text = "This is an example to compute bag-of-word features using hashing." },
new TextData(){ Text = "ML.NET's ProduceHashedWordBags API produces count of Ngrams and hashes it as an index into a vector of given bit length." },
new TextData(){ Text = "It does so by first tokenizing text/string into words/tokens then " },
new TextData(){ Text = "computing Ngram and hash them to the index given by hash value." },
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 3, 2019

"Ngram"

Ngrams ? #Resolved

new TextData(){ Text = "ML.NET's ProduceHashedWordBags API produces count of Ngrams and hashes it as an index into a vector of given bit length." },
new TextData(){ Text = "It does so by first tokenizing text/string into words/tokens then " },
new TextData(){ Text = "computing Ngram and hash them to the index given by hash value." },
new TextData(){ Text = "The hashing schem reduces the size of the output feature vector" },
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 3, 2019

"schem"

schema #Resolved


// Expected output:
// Number of Features: 256
// Features: 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 3, 2019

"0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000"

This is kinda anticlimactic.
Maybe change numberOfBits to a smaller value so that some of the values shown here are non-zero? #Resolved
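
To illustrate the point about numberOfBits: assuming the hashed output vector has 2^numberOfBits slots (so 256 features would correspond to 8 bits), a handful of tokens lands in only a few slots, and the first ten printed values can easily all be zero. The sketch below is plain C#, not ML.NET. `HashedWordCounts` and the FNV-1a-style hash are hypothetical stand-ins chosen for illustration, and the sketch hashes single words rather than n-grams.

```csharp
using System;
using System.Linq;

public static class HashedBagSketch
{
    // FNV-1a-style hash over UTF-16 code units. This is only a simple,
    // deterministic stand-in (an assumption), not the hash ML.NET uses.
    private static uint Fnv1a(string s)
    {
        uint hash = 2166136261;
        foreach (char c in s)
        {
            hash ^= c;
            hash *= 16777619;
        }
        return hash;
    }

    // Counts whitespace-separated tokens into a vector of 2^numberOfBits slots.
    public static float[] HashedWordCounts(string text, int numberOfBits)
    {
        var counts = new float[1 << numberOfBits];
        uint mask = (uint)counts.Length - 1;
        var tokens = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var token in tokens)
            counts[Fnv1a(token) & mask] += 1;
        return counts;
    }

    public static void Main()
    {
        const string text = "This is an example to compute bag-of-word features using hashing.";
        foreach (int bits in new[] { 8, 5 })
        {
            var features = HashedWordCounts(text, bits);
            int nonZero = features.Count(v => v > 0);
            Console.WriteLine($"numberOfBits={bits}: {features.Length} slots, {nonZero} non-zero");
        }
    }
}
```

With fewer bits the same counts are spread over fewer slots, so a short preview of the vector is more likely to show non-zero entries, at the cost of more hash collisions.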

new TextData(){ Text = "ML.NET's ProduceWordBags API produces bag-of-word features from input text." },
new TextData(){ Text = "It does so by first tokenizing text/string into words/tokens then " },
new TextData(){ Text = "computing Ngram and their numeric values." },
new TextData(){ Text = "Each position in the output vector corresponds to a particular Ngram." },
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 3, 2019

"Ngram"

https://en.wikipedia.org/wiki/N-gram
We call it NGram in code because you can't use - in a variable name, but for the explanation maybe we should use n-gram? #Resolved

new TextData(){ Text = "Each position in the output vector corresponds to a particular Ngram." },
new TextData(){ Text = "The value at each position corresponds to," },
new TextData(){ Text = "the number of times Ngram occurred in the data (Tf), or" },
new TextData(){ Text = "the inverse of the number of documents contain the Ngram (Idf), or." },
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 3, 2019

"or."

Double "or", plus a dot at the end. I would omit this "or". #Resolved

var transformedDataView = textTransformer.Transform(dataview);

// Create the prediction engine to get the bag-of-word features extracted from the text.
var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData, TransformedTextData>(textTransformer);
Contributor

Why predictionEngine rather than TakeRows and ConvertToEnumerable? I would use the one we would recommend people to use to inspect data in practice.
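
A minimal sketch of the alternative this comment describes, assuming "ConvertToEnumerable" refers to `mlContext.Data.CreateEnumerable` and that the Microsoft.ML NuGet package is referenced. The row classes, column names, and sample sentences are hypothetical stand-ins for the ones in the sample under review.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;

public static class InspectTransformedRows
{
    // Hypothetical stand-ins for the sample's input and output row classes.
    private class TextData
    {
        public string Text { get; set; }
    }

    private class TransformedTextData : TextData
    {
        public float[] BagOfWordFeatures { get; set; }
    }

    public static void Main()
    {
        var mlContext = new MLContext();
        var samples = new List<TextData>
        {
            new TextData { Text = "This is an example to compute bag-of-word features." },
            new TextData { Text = "Each position in the output vector corresponds to a particular n-gram." },
        };
        var dataview = mlContext.Data.LoadFromEnumerable(samples);

        // Column names are assumptions chosen for this sketch.
        var textPipeline = mlContext.Transforms.Text.ProduceWordBags("BagOfWordFeatures", "Text");
        var transformedDataView = textPipeline.Fit(dataview).Transform(dataview);

        // Take only the first rows and read them back as objects,
        // instead of building a prediction engine.
        var firstRows = mlContext.Data.TakeRows(transformedDataView, 2);
        var rows = mlContext.Data.CreateEnumerable<TransformedTextData>(
            firstRows, reuseRowObject: false);

        foreach (var row in rows)
            Console.WriteLine($"{row.Text} -> {row.BagOfWordFeatures.Length} features");
    }
}
```

This reads the already-transformed data view directly, so the preview reflects what the pipeline produced for the training rows rather than a separate single-row prediction path.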


// Preview of the produced n-grams.
// Get the slot names from the column's metadata.
// If the column is a vector column the slot names correspond to the names associated with each position in the vector.
Contributor

@rogancarr rogancarr Apr 4, 2019

"If the column is a vector column"

The If clause is confusing here. I would just drop it. #Resolved
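
For context, a hedged sketch of reading slot names from a vector column's metadata, assuming the Microsoft.ML NuGet package and its `GetSlotNames` extension method in `Microsoft.ML.Data`. The `TextData` class, column names, and sample sentence are hypothetical and chosen only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;

public static class SlotNamesSketch
{
    // Hypothetical input row class for this sketch.
    private class TextData
    {
        public string Text { get; set; }
    }

    public static void Main()
    {
        var mlContext = new MLContext();
        var dataview = mlContext.Data.LoadFromEnumerable(new List<TextData>
        {
            new TextData { Text = "Each position in the output vector corresponds to a particular n-gram." },
        });

        var transformedDataView = mlContext.Transforms.Text
            .ProduceWordBags("BagOfWordFeatures", "Text")
            .Fit(dataview)
            .Transform(dataview);

        // The slot names are the names associated with each position in the vector.
        VBuffer<ReadOnlyMemory<char>> slotNames = default;
        transformedDataView.Schema["BagOfWordFeatures"].GetSlotNames(ref slotNames);

        int index = 0;
        foreach (var name in slotNames.DenseValues())
            Console.WriteLine($"{index++}: {name}");
    }
}
```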

Contributor

@rogancarr rogancarr left a comment

Approved with comments.

🔵

Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

:shipit:

@zeahmed zeahmed merged commit 24645ff into dotnet:master Apr 4, 2019
zeahmed added a commit to zeahmed/machinelearning that referenced this pull request Apr 8, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 23, 2022
3 participants