-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Added a test showing example of text classification using TensorFlow in ML.Net #2302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
} | ||
|
||
[ConditionalFact(typeof(Environment), nameof(Environment.Is64BitProcess))] | ||
public void TensorFlowSentimentClassificationTest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test is going to fail as Microsoft.ML.TensorFlow.TestModels nuget is not updated yet. #Resolved
@@ -11,6 +11,7 @@ | |||
using Microsoft.ML.Model; | |||
using Microsoft.ML.RunTests; | |||
using Microsoft.ML.Tools; | |||
using Microsoft.ML.Transforms; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove it. #Resolved
Codecov Report
@@ Coverage Diff @@
## master #2302 +/- ##
==========================================
+ Coverage 71.13% 71.15% +0.01%
==========================================
Files 779 779
Lines 140271 140308 +37
Branches 16046 16047 +1
==========================================
+ Hits 99788 99834 +46
+ Misses 36032 36023 -9
Partials 4451 4451
|
separatorChar: ',' | ||
); | ||
|
||
var estimator = new WordTokenizingEstimator(mlContext, new[]{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use the mlContext
methods to create these estimators? I think #2100 is making all these constructors internal. #Resolved
Array.Resize(ref processedData.Features, 600); | ||
var prediction = tfEnginePipe.Predict(processedData); | ||
|
||
Assert.Equal(2, prediction.Prediction.Length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we verify that the predictions were somewhat correct? #Resolved
var estimator = new WordTokenizingEstimator(mlContext, new[]{ | ||
new WordTokenizingTransformer.ColumnInfo("Sentiment_Text", "TokenizedWords") | ||
}).Append(new ValueMappingEstimator(mlContext, lookupMap, "Words", "Ids", new[] { ("TokenizedWords", "Features") })); | ||
var dataPipe = estimator.Fit(dataView) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dataPipe [](start = 16, length = 8)
is there a particular reason why we have dataPipe
and tfEnginePipe
separate ? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var dataPipe = estimator.Fit(dataView) | ||
.CreatePredictionEngine<TensorFlowSentiment, TensorFlowSentiment>(mlContext); | ||
|
||
string modelLocation = @"sentiment_model"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sentiment_model [](start = 37, length = 15)
so this TF model takes as input a vector of floats. Am i right ?
Perhaps we should add a comment how the model was created etc. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
var estimator = new WordTokenizingEstimator(mlContext, new[]{ | ||
new WordTokenizingTransformer.ColumnInfo("Sentiment_Text", "TokenizedWords") | ||
}).Append(new ValueMappingEstimator(mlContext, lookupMap, "Words", "Ids", new[] { ("TokenizedWords", "Features") })); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lookupMap [](start = 63, length = 9)
so we are using the word embeddings from the lookupMap
-> using the embeddings to construct Features
vector -> using the Features
vector as input to the TF model.
Am i right ? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
var estimator = new WordTokenizingEstimator(mlContext, new[]{ | ||
new WordTokenizingTransformer.ColumnInfo("Sentiment_Text", "TokenizedWords") | ||
}).Append(new ValueMappingEstimator(mlContext, lookupMap, "Words", "Ids", new[] { ("TokenizedWords", "Features") })); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ValueMappingEstimator [](start = 30, length = 21)
we should have MLContext extension for it. Looks like its missing presently #Resolved
// Then this integer vector is retrieved from the pipeline and resized to fixed length. | ||
// The second pipeline takes the resized integer vector and passed to TensoFlow and get the classification scores. | ||
var estimator = mlContext.Transforms.Text.TokenizeWords("Sentiment_Text", "TokenizedWords") | ||
.Append(mlContext.Transforms.Conversion.ValueMap(lookupMap, "Words", "Ids", new[] { ("TokenizedWords", "Features") })); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ValueMap [](start = 56, length = 8)
-
Is this transform doing the re-sizing to the fixed length ?
-
Does it matter what the fixed length is ? I presume the model was built with fixed length shaped input. But I do not see the shape specified #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are calling the Predict
API and hence in Line #899 we can resize the output of the first pipeline.
How would this work for the 'Transform' API... where the testData has 2 rows (say)
row1 -> "Hi" -> dimension 50 (say)
row2 -> "(some long sentence)" -> dimension 5000 (say)
Will it work ?
In reply to: 251978928 [](ancestors = 251978928,251978523,251977773,251976925)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we are not training the TF model at all. It is just the prediction pipeline. For the case you are mentioning, it would require the same resize operation on dataview instead of single prediction.
In reply to: 251981308 [](ancestors = 251981308,251978928,251978523,251977773,251976925)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand we are not training the TF model. The Fit() for the TFTransform would not do anyting in this example.
I wanted to know if the re-size operation on dataview would be supported -- If it is supported, can we add it to the unit test with at least 2 rows of text data + use of the Transform
API ?
This test case does single prediction (use of Predict
API) but does not show use of the Transform
API where we would have to re-size more than just 1 row of variable length vector.
In reply to: 252003040 [](ancestors = 252003040,251981308,251978928,251978523,251977773,251976925)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually not in the scope of this test. I will try to add more training related test later on but not in this PR because of the scope.
In reply to: 252020455 [](ancestors = 252020455,252003040,251981308,251978928,251978523,251977773,251976925)
separatorChar: ',' | ||
); | ||
|
||
// We cannot resize variable length vector to fixed lenght vector in ML.Net |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lenght [](start = 64, length = 6)
typo #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tensorflow update LGTM
var prediction = tfEnginePipe.Predict(processedData); | ||
|
||
Assert.Equal(2, prediction.Prediction.Length); | ||
Assert.InRange(prediction.Prediction[1], 0.650032759 - 0.01, 0.650032759 + 0.01); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use Assert.Equal. If there are only two prediction values, can we check them all? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by Assert.Equal? Here we are checking the range within particular threshold.
No need to check another value. These are probabilities.
In reply to: 252005383 [](ancestors = 252005383)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I got you what you meant with Assert.Equal. I actually want to check if my values are in range e.g. 0.64 <= prediction <= 0.66 which I cannot do with Assert.Equal, can I?
Also, I feel InRange
more readable than other when asserting thresholds.
In reply to: 252006645 [](ancestors = 252006645,252005383)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have tolerance in Assert.Equal. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It uses number of decimal places which is not applicable here.
In reply to: 252050957 [](ancestors = 252050957)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could trim 0.650032759 to 0.65, if we're comparing as ± 0.01.
separatorChar: ',' | ||
); | ||
|
||
// We cannot resize variable length vector to fixed length vector in ML.Net |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ML.Net? ML.NET? @shauheen, any comment? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.NET
is always stylized as .NET
.
https://twitter.com/terrajobst/status/1064638607707631616 #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
// We cannot resize variable length vector to fixed length vector in ML.Net | ||
// The trick here is to create two pipelines. | ||
// The first pipeline tokenzies the strings into words and maps the words to an integer which is an index in the dictionary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string"s"? I only see one string data. Also, is it "each word" instead of "words"? #Resolved
// The trick here is to create two pipelines. | ||
// The first pipeline tokenzies the strings into words and maps the words to an integer which is an index in the dictionary. | ||
// Then this integer vector is retrieved from the pipeline and resized to fixed length. | ||
// The second pipeline takes the resized integer vector and passed to TensoFlow and get the classification scores. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not straightforward to user to understand what are "first pipeline" and "second pipeline". Please cite their variable names here or move those statements to the associated pipeline declarations. #Resolved
@@ -141,5 +141,19 @@ public static KeyToValueMappingEstimator MapKeyToValue(this TransformsCatalog.Co | |||
IEnumerable<TOutputType> values, | |||
params (string source, string name)[] columns) | |||
=> new ValueMappingEstimator<TInputType, TOutputType>(CatalogUtils.GetEnvironment(catalog), keys, values, columns); | |||
|
|||
/// <summary> | |||
/// Maps specified keys to specified values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't explain how this transform works. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good but I have some comments around docs and I hope they can be addressed before merging.
/// <returns></returns> | ||
public static ValueMappingEstimator ValueMap( | ||
this TransformsCatalog.ConversionTransforms catalog, | ||
IDataView lookupMap, string keyColumn, string valueColumn, params (string input, string output)[] columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomFinley @sfilipi - is this consistent with the order we've decided on with #2064? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any thoughts guys? I saw the method above has same pattern so I followed that.
In reply to: 252014795 [](ancestors = 252014795)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(string outputColumnName, string inputColumnName)
You'll see that if you update to latest.
In reply to: 252027697 [](ancestors = 252027697,252014795)
/// Maps specified keys to specified values | ||
/// Maps the <paramref name="columns.input"/> using the keys in the dictionary to the values of dictionary. | ||
/// In this case, the <paramref name="lookupMap"/> is used as a dictionary where <paramref name="keyColumn"/> | ||
/// and <paramref name="valueColumn"/> specify the keys and values of dictionary respectively. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a value x in the input would be mapped to value stored in dictionary[x]
? #Resolved
// The first pipeline 'dataPipe' tokenzies the string into words and maps each word to an integer which is an index in the dictionary. | ||
// Then this integer vector is retrieved from the pipeline and resized to fixed length. | ||
// The second pipeline 'tfEnginePipe' takes the resized integer vector and passed to TensoFlow and get the classification scores. | ||
var estimator = mlContext.Transforms.Text.TokenizeWords("Sentiment_Text", "TokenizedWords") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okenizeWords("Sentiment_Text", "TokenizedWords" [](start = 55, length = 47)
if you rebase to latest, you'll have to swap those. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks all for the useful comments/feedback. |
This PR fixes #2301.
Also updated the TensorFlow runtime from 1.10.0 -> 1.12.0