Skip to content

Added a test showing example of text classification using TensorFlow in ML.Net #2302

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Jan 30, 2019

Conversation

zeahmed
Copy link
Contributor

@zeahmed zeahmed commented Jan 29, 2019

This PR fixes #2301.

Also updated the TensorFlow runtime from 1.10.0 -> 1.12.0

}

[ConditionalFact(typeof(Environment), nameof(Environment.Is64BitProcess))]
public void TensorFlowSentimentClassificationTest()
Copy link
Contributor Author

@zeahmed zeahmed Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is going to fail as Microsoft.ML.TensorFlow.TestModels nuget is not updated yet. #Resolved

@@ -11,6 +11,7 @@
using Microsoft.ML.Model;
using Microsoft.ML.RunTests;
using Microsoft.ML.Tools;
using Microsoft.ML.Transforms;
Copy link
Contributor Author

@zeahmed zeahmed Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove it. #Resolved

@zeahmed zeahmed changed the title Create a test showing example of text classification in TensorFlow Added a test showing example of text classification using TensorFlow in ML.Net Jan 29, 2019
@codecov
Copy link

codecov bot commented Jan 29, 2019

Codecov Report

Merging #2302 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2302      +/-   ##
==========================================
+ Coverage   71.13%   71.15%   +0.01%     
==========================================
  Files         779      779              
  Lines      140271   140308      +37     
  Branches    16046    16047       +1     
==========================================
+ Hits        99788    99834      +46     
+ Misses      36032    36023       -9     
  Partials     4451     4451
Flag Coverage Δ
#Debug 71.15% <100%> (+0.01%) ⬆️
#production 67.57% <100%> (ø) ⬆️
#test 85.07% <100%> (+0.01%) ⬆️

separatorChar: ','
);

var estimator = new WordTokenizingEstimator(mlContext, new[]{
Copy link
Member

@eerhardt eerhardt Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use the mlContext methods to create these estimators? I think #2100 is making all these constructors internal. #Resolved

Array.Resize(ref processedData.Features, 600);
var prediction = tfEnginePipe.Predict(processedData);

Assert.Equal(2, prediction.Prediction.Length);
Copy link
Member

@eerhardt eerhardt Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we verify that the predictions were somewhat correct? #Resolved

var estimator = new WordTokenizingEstimator(mlContext, new[]{
new WordTokenizingTransformer.ColumnInfo("Sentiment_Text", "TokenizedWords")
}).Append(new ValueMappingEstimator(mlContext, lookupMap, "Words", "Ids", new[] { ("TokenizedWords", "Features") }));
var dataPipe = estimator.Fit(dataView)
Copy link
Member

@abgoswam abgoswam Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dataPipe [](start = 16, length = 8)

is there a particular reason why we have dataPipe and tfEnginePipe separate ? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments in the code.


In reply to: 251916406 [](ancestors = 251916406)

var dataPipe = estimator.Fit(dataView)
.CreatePredictionEngine<TensorFlowSentiment, TensorFlowSentiment>(mlContext);

string modelLocation = @"sentiment_model";
Copy link
Member

@abgoswam abgoswam Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sentiment_model [](start = 37, length = 15)

so this TF model takes as input a vector of floats. Am i right ?

Perhaps we should add a comment how the model was created etc. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it takes integers as input.


In reply to: 251917695 [](ancestors = 251917695)


var estimator = new WordTokenizingEstimator(mlContext, new[]{
new WordTokenizingTransformer.ColumnInfo("Sentiment_Text", "TokenizedWords")
}).Append(new ValueMappingEstimator(mlContext, lookupMap, "Words", "Ids", new[] { ("TokenizedWords", "Features") }));
Copy link
Member

@abgoswam abgoswam Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lookupMap [](start = 63, length = 9)

so we are using the word embeddings from the lookupMap -> using the embeddings to construct Features vector -> using the Features vector as input to the TF model.

Am i right ? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, correct!


In reply to: 251963454 [](ancestors = 251963454)


var estimator = new WordTokenizingEstimator(mlContext, new[]{
new WordTokenizingTransformer.ColumnInfo("Sentiment_Text", "TokenizedWords")
}).Append(new ValueMappingEstimator(mlContext, lookupMap, "Words", "Ids", new[] { ("TokenizedWords", "Features") }));
Copy link
Member

@abgoswam abgoswam Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ValueMappingEstimator [](start = 30, length = 21)

we should have MLContext extension for it. Looks like its missing presently #Resolved

// Then this integer vector is retrieved from the pipeline and resized to fixed length.
// The second pipeline takes the resized integer vector and passed to TensoFlow and get the classification scores.
var estimator = mlContext.Transforms.Text.TokenizeWords("Sentiment_Text", "TokenizedWords")
.Append(mlContext.Transforms.Conversion.ValueMap(lookupMap, "Words", "Ids", new[] { ("TokenizedWords", "Features") }));
Copy link
Member

@abgoswam abgoswam Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ValueMap [](start = 56, length = 8)

  • Is this transform doing the re-sizing to the fixed length ?

  • Does it matter what the fixed length is ? I presume the model was built with fixed length shaped input. But I do not see the shape specified #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resizing is done at 893 in C# code.


In reply to: 251976925 [](ancestors = 251976925)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see shape specified in line 893


In reply to: 251977773 [](ancestors = 251977773,251976925)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry not 893. Its line 899


In reply to: 251978523 [](ancestors = 251978523,251977773,251976925)

Copy link
Member

@abgoswam abgoswam Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are calling the Predict API and hence in Line #899 we can resize the output of the first pipeline.

How would this work for the 'Transform' API... where the testData has 2 rows (say)

row1 -> "Hi" -> dimension 50 (say)
row2 -> "(some long sentence)" -> dimension 5000 (say)

Will it work ?


In reply to: 251978928 [](ancestors = 251978928,251978523,251977773,251976925)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are not training the TF model at all. It is just the prediction pipeline. For the case you are mentioning, it would require the same resize operation on dataview instead of single prediction.


In reply to: 251981308 [](ancestors = 251981308,251978928,251978523,251977773,251976925)

Copy link
Member

@abgoswam abgoswam Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand we are not training the TF model. The Fit() for the TFTransform would not do anyting in this example.

I wanted to know if the re-size operation on dataview would be supported -- If it is supported, can we add it to the unit test with at least 2 rows of text data + use of the Transform API ?

This test case does single prediction (use of Predict API) but does not show use of the Transform API where we would have to re-size more than just 1 row of variable length vector.


In reply to: 252003040 [](ancestors = 252003040,251981308,251978928,251978523,251977773,251976925)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually not in the scope of this test. I will try to add more training related test later on but not in this PR because of the scope.


In reply to: 252020455 [](ancestors = 252020455,252003040,251981308,251978928,251978523,251977773,251976925)

separatorChar: ','
);

// We cannot resize variable length vector to fixed lenght vector in ML.Net
Copy link
Member

@abgoswam abgoswam Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lenght [](start = 64, length = 6)

typo #Resolved

Copy link
Member

@ericstj ericstj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tensorflow update LGTM

var prediction = tfEnginePipe.Predict(processedData);

Assert.Equal(2, prediction.Prediction.Length);
Assert.InRange(prediction.Prediction[1], 0.650032759 - 0.01, 0.650032759 + 0.01);
Copy link
Member

@wschin wschin Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use Assert.Equal. If there are only two prediction values, can we check them all? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by Assert.Equal? Here we are checking the range within particular threshold.
No need to check another value. These are probabilities.


In reply to: 252005383 [](ancestors = 252005383)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I got you what you meant with Assert.Equal. I actually want to check if my values are in range e.g. 0.64 <= prediction <= 0.66 which I cannot do with Assert.Equal, can I?
Also, I feel InRange more readable than other when asserting thresholds.


In reply to: 252006645 [](ancestors = 252006645,252005383)

Copy link
Member

@wschin wschin Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have tolerance in Assert.Equal. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It uses number of decimal places which is not applicable here.


In reply to: 252050957 [](ancestors = 252050957)

Copy link
Contributor

@justinormont justinormont Jan 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could trim 0.650032759 to 0.65, if we're comparing as ± 0.01.

separatorChar: ','
);

// We cannot resize variable length vector to fixed length vector in ML.Net
Copy link
Member

@wschin wschin Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ML.Net? ML.NET? @shauheen, any comment? #Resolved

Copy link
Member

@eerhardt eerhardt Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.NET is always stylized as .NET.

https://twitter.com/terrajobst/status/1064638607707631616 #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.


In reply to: 252005632 [](ancestors = 252005632)


// We cannot resize variable length vector to fixed length vector in ML.Net
// The trick here is to create two pipelines.
// The first pipeline tokenzies the strings into words and maps the words to an integer which is an index in the dictionary.
Copy link
Member

@wschin wschin Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string"s"? I only see one string data. Also, is it "each word" instead of "words"? #Resolved

// The trick here is to create two pipelines.
// The first pipeline tokenzies the strings into words and maps the words to an integer which is an index in the dictionary.
// Then this integer vector is retrieved from the pipeline and resized to fixed length.
// The second pipeline takes the resized integer vector and passed to TensoFlow and get the classification scores.
Copy link
Member

@wschin wschin Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not straightforward to user to understand what are "first pipeline" and "second pipeline". Please cite their variable names here or move those statements to the associated pipeline declarations. #Resolved

@@ -141,5 +141,19 @@ public static KeyToValueMappingEstimator MapKeyToValue(this TransformsCatalog.Co
IEnumerable<TOutputType> values,
params (string source, string name)[] columns)
=> new ValueMappingEstimator<TInputType, TOutputType>(CatalogUtils.GetEnvironment(catalog), keys, values, columns);

/// <summary>
/// Maps specified keys to specified values
Copy link
Member

@wschin wschin Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't explain how this transform works. #Resolved

Copy link
Member

@wschin wschin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good but I have some comments around docs and I hope they can be addressed before merging.

/// <returns></returns>
public static ValueMappingEstimator ValueMap(
this TransformsCatalog.ConversionTransforms catalog,
IDataView lookupMap, string keyColumn, string valueColumn, params (string input, string output)[] columns)
Copy link
Member

@eerhardt eerhardt Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomFinley @sfilipi - is this consistent with the order we've decided on with #2064? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts guys? I saw the method above has same pattern so I followed that.


In reply to: 252014795 [](ancestors = 252014795)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(string outputColumnName, string inputColumnName)

You'll see that if you update to latest.


In reply to: 252027697 [](ancestors = 252027697,252014795)

/// Maps specified keys to specified values
/// Maps the <paramref name="columns.input"/> using the keys in the dictionary to the values of dictionary.
/// In this case, the <paramref name="lookupMap"/> is used as a dictionary where <paramref name="keyColumn"/>
/// and <paramref name="valueColumn"/> specify the keys and values of dictionary respectively.
Copy link
Member

@wschin wschin Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a value x in the input would be mapped to value stored in dictionary[x]? #Resolved

// The first pipeline 'dataPipe' tokenzies the string into words and maps each word to an integer which is an index in the dictionary.
// Then this integer vector is retrieved from the pipeline and resized to fixed length.
// The second pipeline 'tfEnginePipe' takes the resized integer vector and passed to TensoFlow and get the classification scores.
var estimator = mlContext.Transforms.Text.TokenizeWords("Sentiment_Text", "TokenizedWords")
Copy link
Member

@sfilipi sfilipi Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okenizeWords("Sentiment_Text", "TokenizedWords" [](start = 55, length = 47)

if you rebase to latest, you'll have to swap those. #Resolved

Copy link
Member

@sfilipi sfilipi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@zeahmed zeahmed merged commit b4c1066 into dotnet:master Jan 30, 2019
@zeahmed
Copy link
Contributor Author

zeahmed commented Jan 30, 2019

Thanks all for the useful comments/feedback.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 25, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create a test for text classification in TensorFlow.
7 participants