AutoML Nuget Package won't train against TSV #4555

TheDevelolper · 2019-12-09T15:20:39Z

System information

OS version/distro: Win10 64bit
.NET Version (eg., dotnet --info):
dotnet core 3.0

Issue

What did you do?
Installed the AutoML NuGet package, set up sentiment analysis training, and selected the correct label and feature columns.

Went to train and...

What happened?

I got a message almost immediately "Failed. See more in Output pane".

The output pane for "Machine Learning" is completely empty!

What did you expect?
At least an error...

screenshot:

Here's the input data pane:


gvashishtha · 2019-12-10T00:49:14Z

@briacht any ideas on this?

TheDevelolper · 2019-12-10T08:19:23Z

I'm happy to help in any way that I can to get this resolved. I think it may have something to do with hidden characters.

Perhaps I can try with just one row.. I could duplicate it over and over so the content is the same. If that fails we know the problem is likely to be with that row.

I can then test it with a simple string to prove that works.

Then I can strip all email addresses and confidential data from the email contents and provide that data for you to look at.

How does that sound?

justinormont · 2019-12-10T18:14:02Z

It's likely an escaping issue which is not handled by ML.NET's TextLoader. Most common is quote (") or newline (\r|\n) character within a string.

See: #4460
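To check whether a file actually contains the characters mentioned above, a small scan along these lines may help. This is only a sketch: it uses Python's standard `csv` module (not ML.NET's TextLoader) to parse the file, and the helper name `find_suspect_fields` and the sample data are made up for illustration.

```python
import csv
import io

# Characters that commonly trip up simple TSV loaders when they appear inside a field.
SUSPECT_CHARACTERS = ('"', "\r", "\n")


def find_suspect_fields(tsv_text, delimiter="\t"):
    """Return (record_number, column_index, character) for every suspect character found.

    record_number counts parsed records (starting at 1, header included),
    which can differ from the physical line number when a field spans lines.
    """
    hits = []
    reader = csv.reader(io.StringIO(tsv_text, newline=""), delimiter=delimiter)
    for record_number, row in enumerate(reader, start=1):
        for column_index, field in enumerate(row):
            for character in SUSPECT_CHARACTERS:
                if character in field:
                    hits.append((record_number, column_index, character))
    return hits


# Hypothetical sample: the second record has a quoted field with a quote and a newline inside it.
sample = 'Label\tText\n1\t"He said ""hi""\nand left"\n0\tplain text\n'
problems = find_suspect_fields(sample)
for record_number, column_index, character in problems:
    print(f"record {record_number}, column {column_index}: contains {character!r}")
```

An empty result does not prove the file will load, but any hit points at a row worth cleaning first.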

TheDevelolper · 2019-12-10T22:44:26Z

> It's likely an escaping issue which is not handled by ML.NET's TextLoader. Most common is quote (") or newline (\r|\n) character within a string.

@justinormont thanks for getting back to me!

That is more than likely... Do you think I could encode the text to numerical values for each char to solve this issue?

justinormont · 2019-12-10T23:12:01Z

@kiranshub : I wouldn't recommend pre-featurizing as this can cause stealthily treacherous data leakage.

I would pre-clean the TSV file by ensuring no newlines or quotes are in the text columns.

Most CSV/TSV readers will load your file, and you can scrub it there. For instance, loading the file in Excel and using search-and-replace on quotes and newlines is a rather simple way. You can also handle this at dataset creation time. For instance, if you're exporting from a DB: SELECT REPLACE(REPLACE(REPLACE(email, '\r', ' ##R## '), '\n', ' ##N## '), '"', ' ##Q## '), which replaces each character with a new token that has white space around it.
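If the data is already in a file, the same token replacement could be sketched in Python roughly as follows. The function names `scrub` and `scrub_tsv` are hypothetical, the input is assumed to be a TSV that Python's `csv` module can parse, and the extra `##T##` token for tabs is an assumption added so a field can never contain the delimiter.

```python
import csv
import io

# Same tokens as the SQL example, plus an assumed token for tabs.
REPLACEMENTS = {
    "\r": " ##R## ",
    "\n": " ##N## ",
    '"': " ##Q## ",
    "\t": " ##T## ",
}


def scrub(text):
    """Replace newline, quote, and tab characters with whitespace-padded tokens."""
    for character, token in REPLACEMENTS.items():
        text = text.replace(character, token)
    return text


def scrub_tsv(tsv_text):
    """Parse a TSV with the csv module and re-emit it with every field scrubbed.

    The output is written by hand (tab-joined, one record per line) so that
    no quoting or escaping is reintroduced.
    """
    reader = csv.reader(io.StringIO(tsv_text, newline=""), delimiter="\t")
    lines = ["\t".join(scrub(field) for field in row) for row in reader]
    return "\n".join(lines) + "\n"


# Hypothetical sample with a quoted, multi-line text field.
dirty = 'Label\tText\n1\t"a ""b""\nc"\n'
cleaned = scrub_tsv(dirty)
print(cleaned)
```

Padding the tokens with spaces keeps them separate words, so a text featurizer can treat them as tokens rather than fusing them with neighbouring words.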

TheDevelolper · 2019-12-11T12:13:14Z

@justinormont you're awesome for helping me out here. I just wanted to ask you what you meant by " stealthily treacherous data leakage" ?

Do you mean data could be transmitted to someone else? If so I'm not sure how?

I'm sure I've misunderstood that.

justinormont · 2019-12-11T18:49:46Z

No outside transmission of data; simply overly optimistic metrics.

@YuriyGuts has a nice talk on data leakage: https://youtu.be/dWhdWxgt5SU

Data leakage, also called "target leakage", can come in many forms. The specific type you could introduce here is information from your scoring set flowing into your training dataset.

By pre-featurizing the whole dataset, the model can learn from data it should not see. AutoML handles the feature engineering for you to remove this style of dataset leakage; pre-featurizing on the whole dataset bypasses some of these safeguards and can cause your metrics to no longer be representative of how well your model will do in production.

Featurizing the dataset will (often) introduce information into each row from other rows. When featurizing data, the features should be learned only on the training dataset split. If you featurize first and split afterwards, information has already flowed from the validation (scoring) dataset to the training dataset. Since the training dataset now has information about your scoring dataset, the estimate of how well your model will do in production (the metrics, like accuracy) is artificially high.
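A toy illustration of the difference, using a hand-rolled bag-of-words featurizer. This is not how AutoML featurizes text internally; the function names and the three example sentences are invented purely to show where the information flows.

```python
def build_vocabulary(texts):
    """Learn a word -> column index mapping from the given texts."""
    vocabulary = {}
    for text in texts:
        for word in text.lower().split():
            vocabulary.setdefault(word, len(vocabulary))
    return vocabulary


def featurize(text, vocabulary):
    """Turn a text into word counts, ignoring words the vocabulary has never seen."""
    counts = [0] * len(vocabulary)
    for word in text.lower().split():
        index = vocabulary.get(word)
        if index is not None:
            counts[index] += 1
    return counts


train_texts = ["great product", "terrible service"]
validation_texts = ["awful experience"]

# Leaky: the featurizer is learned on everything, then the data is split.
leaky_vocabulary = build_vocabulary(train_texts + validation_texts)

# Safe: the featurizer is learned on the training split only.
safe_vocabulary = build_vocabulary(train_texts)

# Words the training side only knows about because it saw the validation rows.
leaked_words = set(leaky_vocabulary) - set(safe_vocabulary)
print(sorted(leaked_words))

# With the safe vocabulary, unseen validation words are simply dropped,
# which is what would happen with genuinely new text in production.
print(featurize(validation_texts[0], safe_vocabulary))
```

With the leaky vocabulary, the feature space already contains columns for the validation words, so validation metrics are measured on text the pipeline has effectively seen before.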

gvashishtha · 2019-12-11T21:32:10Z

@justinormont thanks for all your help. @kiranshub does this answer your question? Would like to close the issue if possible.

YuriyGuts · 2019-12-12T11:11:21Z

@justinormont I appreciate the reference!

TheDevelolper · 2019-12-15T15:42:48Z

@justinormont @YuriyGuts

Hi guys, sorry, no, this doesn't solve the issue. I've been trying this out for some time.

Here's the code I have:

    email_body = (email_body.replace("\r", "##R##").replace("\n", "##N##")
                  .replace(",", "##C##").replace(";", "##S##")
                  .replace("'", "##A##").replace("\"", "##Q##").replace("\t", "##T##"))

I'm trying to replace all the symbols as you say.

TheDevelolper · 2019-12-15T16:45:22Z

Hi guys,

I finally managed to solve this. So it turns out that the pandas Python library was writing a duff CSV!

It adds an index column so my csv headers were looking like this:

,column1,column2

So I needed to specify the index=False parameter in pandas (I know this is unrelated to ML.NET), but I thought I'd better include it in case someone else has the same problem:

Adding the index=False parameter (along with your previous suggestions) produced a CSV that would parse!

    df.to_csv('results.csv', index=False)
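For anyone wanting to catch this before training, a quick header check could look like the sketch below. It relies only on Python's standard `csv` module; the helper name `empty_header_columns` and the two sample strings are illustrative, with the first one mimicking the header shown above.

```python
import csv
import io


def empty_header_columns(csv_text, delimiter=","):
    """Return the positions of header cells that are empty or whitespace-only."""
    reader = csv.reader(io.StringIO(csv_text, newline=""), delimiter=delimiter)
    header = next(reader, [])
    return [index for index, name in enumerate(header) if not name.strip()]


# Shape of a file written with the index column included (unnamed first column).
with_index = ",column1,column2\n0,a,b\n"
# Shape of the same data written without the index column.
without_index = "column1,column2\na,b\n"

print(empty_header_columns(with_index))
print(empty_header_columns(without_index))
```

Any position reported for the first file marks an unnamed column, which is the symptom described above.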

TheDevelolper changed the title ~~AutoML seems won't train against TSV~~ AutoML Nuget Package won't train against TSV Dec 9, 2019

TheDevelolper closed this as completed Dec 15, 2019

LittleLittleCloud mentioned this issue Dec 18, 2019

fix TextLoader bug when there's newline between quotes #4584

Closed

ghost locked as resolved and limited conversation to collaborators Mar 19, 2022
