AutoML Nuget Package won't train against TSV #4555
Comments
@briacht any ideas on this?
I'm happy to help in any way that I can to get this resolved. I think it may have something to do with hidden characters. Perhaps I can try with just one row, duplicated over and over so the content is the same. If that fails, we know the problem is likely to be with that row. I can then test it with a simple string to prove that works. Then I can strip all email addresses and confidential data from the email contents and provide that data to you to look at. How does that sound?
It's likely an escaping issue which is not handled by ML.NET's TextLoader. The most common culprits are a quote (") or newline (\r or \n) character within a string. See: #4460
@justinormont thanks for getting back to me! That is more than likely... Do you think I could encode the text to numerical values for each char to solve this issue?
@kiranshub: I wouldn't recommend pre-featurizing, as this can cause stealthily treacherous data leakage. I would pre-clean the TSV file by ensuring no newlines or quotes are in the text columns. Most ordinary CSV/TSV readers will load your file, and you can scrub it there. For instance, loading the file in Excel and search-replacing quotes and newlines is a rather simple way. You can also handle this at dataset creation time, for instance if you're exporting from a DB:
@justinormont you're awesome for helping me out here. I just wanted to ask you what you meant by "stealthily treacherous data leakage"? Do you mean data could be transmitted to someone else? If so, I'm not sure how. I'm sure I've misunderstood that.
No outside transmission of data; simply overly optimistic metrics. @YuriyGuts has a nice talk on data leakage: https://youtu.be/dWhdWxgt5SU Data leakage, also called "target leakage", can come in many forms. The specific type you can introduce here is letting information from your scoring set flow into your training dataset. By pre-featurizing the whole dataset, the model can learn from data it should not see. AutoML handles the feature engineering for you to remove this style of dataset leakage; pre-featurizing on the whole dataset bypasses some of these safeguards, so your metrics may no longer represent how well your model will do in production. Featurizing the dataset will (often) introduce information into each row from other rows. When featurizing data, the features should be learned only on the training dataset split. If you featurize first and split afterwards, information has already flowed from the validation (scoring) dataset to the training dataset. Since the training dataset now has information about your scoring dataset, the estimate of how well your model will do in production (the metrics, like accuracy) is artificially high.
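To make the leakage described above concrete, here is a toy sketch in Python. It is only an illustration and not how AutoML featurizes text: the helper names `build_vocabulary` and `featurize` and the sample sentences are made up for this example. It compares a vocabulary learned on all rows before the split with one learned on the training split only.

```python
def build_vocabulary(texts):
    """Collect the set of lowercase tokens seen in the given texts."""
    return {token for text in texts for token in text.lower().split()}


def featurize(text, vocabulary):
    """Count how many tokens of a text are known to the vocabulary."""
    return sum(token in vocabulary for token in text.lower().split())


# Hypothetical sample data, split into training and validation rows.
train = ["great service", "terrible delay"]
validation = ["awful refund experience"]

# Leaky: the vocabulary is learned on every row before the split.
leaky_vocab = build_vocabulary(train + validation)

# Safe: the vocabulary is learned on the training split only.
safe_vocab = build_vocabulary(train)

# Tokens that only the validation rows could have contributed.
leaked_tokens = leaky_vocab - safe_vocab
```

With the leaky vocabulary, every token of the validation row is already "known" to the featurizer, which is the kind of information flow that tends to inflate validation metrics.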
@justinormont thanks for all your help. @kiranshub does this answer your question? Would like to close the issue if possible.
@justinormont I appreciate the reference!
Hi guys, sorry, no, this doesn't solve the issue. I've been trying this out for some time. Here's the code I have:
email_body = email_body.replace("\r", "##R##").replace("\n", "##N##").replace(",", "##C##").replace(";", "##S##").replace("'", "##A##").replace("\"", "##Q##").replace("\t", "##T##")
I'm trying to replace all the symbols, as you suggested.
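For readers following along, the pre-cleaning idea from earlier in the thread (no newlines, tabs, or double quotes inside text columns) could be sketched roughly as follows. This is a minimal standard-library sketch under assumptions, not code from the thread: `scrub_text` and `write_tsv` are hypothetical helper names, and replacing the offending characters with a space or a single quote is just one possible choice.

```python
import csv
import io


def scrub_text(value: str) -> str:
    """Replace characters that commonly break delimited-text loaders."""
    # Collapse newlines and tabs into single spaces.
    for bad in ("\r\n", "\r", "\n", "\t"):
        value = value.replace(bad, " ")
    # Swap double quotes for single quotes so no quoting is needed.
    return value.replace('"', "'")


def write_tsv(rows, header) -> str:
    """Return TSV text for the rows, scrubbing every string field first."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer, delimiter="\t", quoting=csv.QUOTE_NONE, lineterminator="\n"
    )
    writer.writerow(header)
    for row in rows:
        writer.writerow([scrub_text(v) if isinstance(v, str) else v for v in row])
    return buffer.getvalue()
```

Because every field is scrubbed before writing, `csv.QUOTE_NONE` can be used, so the output contains no quoting or escaping for a downstream loader to misread.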
Hi guys, I finally managed to solve this. It turns out that the pandas Python library was writing a duff CSV! It adds an index column, so my CSV headers were looking like this: ,column1,column2 So I needed to specify the index=False parameter in pandas. I know this is unrelated to ML.NET, but I thought I had better include it in case someone else has the same problem. Adding the index=False parameter (along with your previous suggestions) produced a CSV that would parse!
df.to_csv('results.csv', index=False)
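A quick way to spot the symptom described above (a header row that starts with an empty column name, as pandas writes when an unnamed index is included) is to inspect the first row of the file before training. The sketch below uses only the standard library; `blank_header_positions` is a hypothetical helper name, and the two sample strings are made up to mirror the header shown in the comment.

```python
import csv
import io


def blank_header_positions(csv_text, delimiter=","):
    """Return indexes of header columns whose name is empty or whitespace."""
    header = next(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    return [i for i, name in enumerate(header) if not name.strip()]


# Header shape produced when the unnamed index column is written.
with_index = ",column1,column2\n0,a,b\n"

# Header shape after exporting with index=False.
without_index = "column1,column2\na,b\n"
```

Here `blank_header_positions(with_index)` reports position 0, while the export without the index column reports no blank headers.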
System information
dotnet core 3.0
Issue
Installed the AutoML NuGet package. Set up sentiment analysis training. Selected the correct label and feature columns.
Went to train and...
I got a message almost immediately "Failed. See more in Output pane".
The output pane for "Machine Learning" is completely empty!
At least an error...
screenshot:
Here's the input data pane:
Source code / logs