AutoML Nuget Package won't train against TSV #4555

Closed
TheDevelolper opened this issue Dec 9, 2019 · 11 comments

Comments

@TheDevelolper

TheDevelolper commented Dec 9, 2019

System information

  • OS version/distro: Win10 64bit
  • .NET Version (eg., dotnet --info):
    dotnet core 3.0

Issue

  • What did you do?
    Installed the AutoML NuGet package. Set up sentiment analysis training. Selected the correct label and feature columns.

Went to train and...

  • What happened?

I got a message almost immediately "Failed. See more in Output pane".

The output pane for "Machine Learning" is completely empty!

  • What did you expect?
    At least an error...

screenshot:

image

Here's the input data pane:

image

Source code / logs

Please paste or attach the code or logs or traces that would be helpful to diagnose the issue you are reporting.

@TheDevelolper TheDevelolper changed the title AutoML seems won't train against TSV AutoML Nuget Package won't train against TSV Dec 9, 2019
@gvashishtha
Contributor

@briacht any ideas on this?

@TheDevelolper
Author

I'm happy to help in any way that I can to get this resolved. I think it may have something to do with hidden characters.

Perhaps I can try with just one row. I could duplicate it over and over so the content is the same. If that fails, we know the problem is likely to be with that row.

I can then test it with a simple string to prove that works.

Then I can strip all email addresses and confidential data from the email contents and provide that data for you to look at.

How does that sound?

@justinormont
Contributor

It's likely an escaping issue which is not handled by ML.NET's TextLoader. Most common is quote (") or newline (\r|\n) character within a string.

See: #4460
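
A rough way to locate rows that might trip such a loader is to scan the raw file line by line. The sketch below is only a diagnostic aid and says nothing about how TextLoader works internally: it assumes each physical line is treated as one row and split on the separator alone, and the function name `find_suspect_lines` is hypothetical.

```python
def find_suspect_lines(text, sep="\t"):
    """Return (line_number, reason) pairs for physical lines that look risky.

    Assumption: each physical line is one row and is split on `sep` only,
    so an embedded newline or tab changes the field count.
    """
    lines = text.splitlines()
    if not lines:
        return []
    expected = len(lines[0].split(sep))  # the header defines the column count
    suspects = []
    for number, line in enumerate(lines[1:], start=2):
        if len(line.split(sep)) != expected:
            suspects.append((number, "field count differs from header"))
        elif '"' in line:
            suspects.append((number, "contains a double quote"))
    return suspects


# Tiny made-up sample: row 3 has an embedded newline, the last row has quotes.
sample = 'Label\tText\n1\tgreat product\n0\tline one\nline two\n1\tsaid "hi"\n'
for line_number, reason in find_suspect_lines(sample):
    print(line_number, reason)
```

Running this against the real TSV should narrow the search to a handful of lines, which is quicker than duplicating single rows by hand.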

@TheDevelolper
Author

> It's likely an escaping issue which is not handled by ML.NET's TextLoader. Most common is quote (") or newline (\r|\n) character within a string.

@justinormont thanks for getting back to me!

That is more than likely... Do you think I could encode the text to numerical values for each char to solve this issue?

@justinormont
Contributor

justinormont commented Dec 10, 2019

@kiranshub : I wouldn't recommend pre-featurizing as this can cause stealthily treacherous data leakage.

I would pre-clean the TSV file by ensuring no newlines or quotes are in the text columns.

Most ordinary CSV/TSV readers will load your file, and you can scrub it there. For instance, loading the file in Excel and search-replacing the quotes and newlines is a rather simple way. You can also handle this at dataset creation time. For instance, if you're exporting from a DB: SELECT REPLACE(REPLACE(REPLACE(email, '\r', ' ##R## '), '\n', ' ##N## '), '"', ' ##Q## '), which replaces each character with a new token that has white space around it.
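
If the dataset is produced from Python rather than a database, the same token replacement could look roughly like the sketch below. The tokens for carriage return, newline and quote are taken from the SQL example above; the tab token and the helper names `scrub_field` and `to_tsv_line` are assumptions added for illustration.

```python
# Tokens for \r, \n and " follow the SQL example; the tab token is an
# assumption, added because a raw tab would split a TSV field in two.
REPLACEMENTS = {
    "\r": " ##R## ",
    "\n": " ##N## ",
    "\t": " ##T## ",
    '"': " ##Q## ",
}


def scrub_field(text):
    """Replace characters that break naive TSV parsing with spaced tokens."""
    for char, token in REPLACEMENTS.items():
        text = text.replace(char, token)
    return text


def to_tsv_line(fields):
    """Join already-separated field values into one scrubbed TSV line."""
    return "\t".join(scrub_field(str(field)) for field in fields)


# Made-up row: a label plus an email body containing quotes and a line break.
print(to_tsv_line([1, 'He said "thanks"\r\nBye']))
```

Keeping white space around each token means the token is seen as its own word by a text featurizer instead of being glued to its neighbours.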

@TheDevelolper
Author

@justinormont you're awesome for helping me out here. I just wanted to ask what you meant by "stealthily treacherous data leakage"?

Do you mean data could be transmitted to someone else? If so I'm not sure how?

I'm sure I've misunderstood that.

@justinormont
Contributor

No outside transmission of data; simply overly optimistic metrics.

@YuriyGuts has a nice talk on data leakage: https://youtu.be/dWhdWxgt5SU

Data leakage, also called "target leakage", can come in many forms. The specific type you can introduce here is letting information from your scoring set flow into your training dataset.

By pre-featurizing the whole dataset, the model can learn from data it should not see. AutoML handles the feature engineering for you to remove this style of dataset leakage; pre-featurizing on the whole dataset bypasses some of these safeguards, so your metrics may no longer be representative of how well your model will do in production.

Featurizing the dataset will (often) introduce information into each row from other rows. When featurizing data, the features should be learned only on the training dataset split. If you featurize first and split afterwards, information has flowed from the validation (scoring) dataset to the training dataset. Since the training dataset now has information about your scoring dataset, the estimate of how well your model will do in production (the metrics, like accuracy) becomes artificially high.
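
A toy illustration of that flow of information, using a bag-of-words vocabulary as a stand-in featurizer. This is a simplified sketch with made-up data, not how AutoML or ML.NET featurize text; `fit_vocabulary` and `featurize` are hypothetical helpers.

```python
def fit_vocabulary(texts):
    """Learn a word -> index mapping; this is the 'featurizer' being fitted."""
    words = sorted({word for text in texts for word in text.lower().split()})
    return {word: index for index, word in enumerate(words)}


def featurize(text, vocabulary):
    """Bag-of-words counts; words unseen at fit time are dropped."""
    counts = [0] * len(vocabulary)
    for word in text.lower().split():
        if word in vocabulary:
            counts[vocabulary[word]] += 1
    return counts


train_texts = ["great service", "terrible delay"]
validation_texts = ["refund was great"]

# Leaky: the featurizer is fitted on every row before the split is applied.
leaky_vocab = fit_vocabulary(train_texts + validation_texts)

# Safe: the featurizer only ever sees the training split.
safe_vocab = fit_vocabulary(train_texts)

# Words the training features know about only because validation rows were seen.
leaked_words = sorted(set(leaky_vocab) - set(safe_vocab))
print(leaked_words)
```

The leaked words are harmless in a toy like this, but with statistics such as word frequencies or normalisation constants the same pattern lets validation rows shape the training features, which is what inflates the metrics.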

@gvashishtha
Contributor

@justinormont thanks for all your help. @kiranshub does this answer your question? Would like to close the issue if possible.

@YuriyGuts

@justinormont I appreciate the reference!

@TheDevelolper
Author

@justinormont @YuriyGuts

Hi guys, sorry, no, this doesn't solve the issue. I've been trying this out for some time.

Here's the code I have:

    email_body = email_body.replace("\r", "##R##").replace("\n", "##N##").replace(",", "##C##")
    email_body = email_body.replace(";", "##S##").replace("'", "##A##").replace("\"", "##Q##").replace("\t", "##T##")

I'm trying to replace all the symbols as you say.

@TheDevelolper
Author

Hi guys,

I finally managed to solve this. It turns out that the pandas Python library was writing a duff CSV!

It adds an index column, so my CSV headers looked like this:

,column1,column2

So I needed to pass the index=False parameter to pandas. I know this is unrelated to ML.NET, but I thought I'd better include it in case someone else has the same problem.

Adding index=False (along with your previous suggestions) produced a CSV that would parse:

    df.to_csv('results.csv', index=False)
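
By default pandas writes the row index as the first column, and an unnamed index gets an empty header cell, which is where the leading comma comes from. A small standard-library check for that symptom might look like the sketch below; the function name `unnamed_columns` and the sample strings are made up for illustration.

```python
import csv
import io


def unnamed_columns(csv_text, delimiter=","):
    """Return the positions of header cells that are empty or whitespace."""
    header = next(csv.reader(io.StringIO(csv_text), delimiter=delimiter), [])
    return [position for position, name in enumerate(header) if not name.strip()]


# Header written with the default index column versus with index=False.
with_index = ",column1,column2\n0,a,b\n"
without_index = "column1,column2\na,b\n"

print(unnamed_columns(with_index))
print(unnamed_columns(without_index))
```

A non-empty result before training is a hint that the file still carries a stray index column.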

@ghost ghost locked as resolved and limited conversation to collaborators Mar 19, 2022
4 participants