[AutoML v0.16.0] InferColumn doesn't work on tricky csv file #4460

Closed
LittleLittleCloud opened this issue Nov 8, 2019 · 9 comments
Labels
P2 Priority of the issue for triage purpose: Needs to be fixed at some point.


@LittleLittleCloud
Contributor

LittleLittleCloud commented Nov 8, 2019

For some CSV files that contain double quotes in their fields, the InferColumns API doesn't work properly. This is probably because, when guessing the delimiter, AutoML counts candidate delimiters that appear inside double quotes, which should be ignored. (Or, when splitting lines, it splits on \n characters that appear inside double quotes.)
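
To make the hypothesis above concrete, here is a minimal sketch of quote-aware delimiter counting and record splitting. It is not AutoML's or TextLoader's actual code; the class name `QuoteAwareCsv` and both method names are hypothetical, and the sketch assumes the common convention that a doubled quote ("") inside a quoted field is an escaped quote.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only (not part of ML.NET or AutoML).
public static class QuoteAwareCsv
{
    // Counts occurrences of `delimiter` that lie outside double-quoted fields.
    // A doubled quote ("") inside a quoted field is treated as an escaped quote.
    public static int CountDelimiters(string text, char delimiter)
    {
        int count = 0;
        bool inQuotes = false;
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c == '"')
            {
                if (inQuotes && i + 1 < text.Length && text[i + 1] == '"')
                {
                    i++; // escaped quote: skip its partner and stay inside the field
                }
                else
                {
                    inQuotes = !inQuotes; // opening or closing quote
                }
            }
            else if (c == delimiter && !inQuotes)
            {
                count++;
            }
        }
        return count;
    }

    // Splits text into records on '\n', ignoring newlines inside quoted fields.
    // An escaped quote ("") toggles the state twice, so the state is unchanged.
    public static List<string> SplitRecords(string text)
    {
        var records = new List<string>();
        int start = 0;
        bool inQuotes = false;
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == '"')
            {
                inQuotes = !inQuotes;
            }
            else if (text[i] == '\n' && !inQuotes)
            {
                records.Add(text.Substring(start, i - start).TrimEnd('\r'));
                start = i + 1;
            }
        }
        if (start < text.Length)
        {
            records.Add(text.Substring(start).TrimEnd('\r'));
        }
        return records;
    }
}
```

A delimiter guesser built on `CountDelimiters` would compare per-record counts for each candidate (comma, tab, semicolon) without being misled by commas or newlines inside quoted text.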

steps to reproduce:
download this dataset

using Microsoft.ML;
using Microsoft.ML.AutoML;

// TrainDataPath is the path to the downloaded CSV file.
MLContext mlContext = new MLContext();
var inputColumnInformation = new ColumnInformation();
inputColumnInformation.LabelColumnName = @"review_scores_rating";
var train = mlContext.Auto().InferColumns(TrainDataPath, inputColumnInformation);

Updated

The dataset above actually works with the latest AutoML/ModelBuilder. To reproduce the error, please use this dataset instead:

jigsaw.txt

@justinormont
Contributor

This is an issue with the TextLoader in ML.NET. It does not currently support escaped quotes in a quoted field.

The TextLoader has rather limited support for TSV/CSV files.

The issue is noted in the old repo: https://github.com/dotnet/machinelearning-automl/issues/193 ("Infercolumn fails to parse new lines inside quoted text")


@vinodshanbhag :
Wikidetox fails in benchmarking because of this.
Tools like Excel are able to handle this.


@justinormont :
@CESARDELATORRE: what do you think about writing an example of converting a dataset from CSV/TSV to IDV?

The TextLoader cannot handle many CSV/TSV files. Using a more general reader and writing the result out as IDV would then allow the AutoML code to read the IDV format.

Basic example:

To be clear, this would be an example (docs/example code) of how a user could convert their data before it comes to AutoML. This would allow us to process files like this issue is referencing.
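
As an illustration of the conversion described above (not the example originally posted in this thread), a minimal sketch might look like the following. It assumes the rows have already been parsed by some CSV reader that handles quoting correctly; the `ReviewRow` type, its columns, and the `IdvConverter` class are hypothetical. `LoadFromEnumerable`, `SaveAsBinary`, and `LoadFromBinary` are the ML.NET calls used to write and read the binary IDataView (IDV) format.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.ML;

// Hypothetical row type; a real dataset would define its own columns.
public class ReviewRow
{
    public string Text { get; set; }
    public float Rating { get; set; }
}

// Hypothetical helper, for illustration only.
public static class IdvConverter
{
    // Writes rows that were parsed by a quote-aware CSV reader to a binary IDV file.
    public static void SaveAsIdv(IEnumerable<ReviewRow> rows, string idvPath)
    {
        var mlContext = new MLContext();
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);
        using (var stream = File.Create(idvPath))
        {
            mlContext.Data.SaveAsBinary(data, stream);
        }
    }

    // Reads the IDV file back as an IDataView, without going through TextLoader.
    public static IDataView LoadIdv(MLContext mlContext, string idvPath)
    {
        return mlContext.Data.LoadFromBinary(idvPath);
    }
}
```

Because the text fields are stored as typed values rather than re-serialized as delimited text, embedded quotes and newlines no longer need to be escaped.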


@CESARDELATORRE :
@justinormont - It's a good idea. However, this example should be a workaround for cases like that.
It might also be a good example because those issues with "numeric value" happen in ML.NET 0.11 per se.

For instance, I was using another dataset yesterday (just migrating to ML.NET v0.11) where the column Label had values like:

  • "1"
  • "0"

ML.NET transformers were not able to convert those values to Boolean (every value became 0) or to Float (every value became NaN). See the ML.NET issue I created:

#2824

Interestingly, those conversions to Boolean were working properly until ML.NET v0.10...

So, yes, this can be a good example. However, for AutoML, this example should be a workaround. For most cases, CSV/TSV files should be the default approach, since that is the most common type of dataset.


@justinormont :
@CESARDELATORRE - This is a side-effect of turning off quoting by default in ML.NET:
#2630

Non-issue:
I think AutoML will be unaffected by ML.NET changing its quoting defaults, as we sweep over both choices (and our heuristics default to using quoting when all else is equal). We should verify.

Issue:
AutoML will be affected by the TextLoader not supporting common TSV/CSV files. The workaround proposed above tells a user how to convert their TSV/CSV to IDV (bypassing the TextLoader).

@LittleLittleCloud
Contributor Author

Good to know about it
Thanks!

@harishsk
Contributor

@LittleLittleCloud I am assuming your question has been answered and am closing the issue. Please feel free to reopen the issue if you have more questions.

@justinormont
Contributor

We may want to open an issue to improve the TextLoader to support more common TSV/CSV formats.

@LittleLittleCloud
Contributor Author

LittleLittleCloud commented Nov 20, 2019

Is there a way to bypass this restriction in AutoML? For example, we could provide an internal column-inference API that uses TextReader or something similar. Since we can create an IDataView from an IEnumerable, it should not be too hard.

I can help investigate that. ModelBuilder already has some handling for escaped quotes inside quoted fields, so maybe we can reuse that. @justinormont @JakeRadMSFT

@JakeRadMSFT
Contributor

@LittleLittleCloud let's give that a shot. Can you open an issue on the Model Builder side?

@antoniovs1029
Member

antoniovs1029 commented May 14, 2020

So as discussed offline with @LittleLittleCloud :

  1. The issue he mentioned with CSV files that have new lines inside quoted fields is fixed in "Enable TextLoader to accept new lines in quoted fields" (#5125).
  2. We're not clear about what the issue he mentioned regarding double quotes ("") was. The file he shared as a repro doesn't contain new lines inside quoted fields (so it is unaffected by #5125), but it does contain double quotes, and we are able to run InferColumn() on it without any problem. ML.NET's TextLoader has always been capable of loading CSV files with double quotes, so there's no issue in the TextLoader regarding this.

@antoniovs1029
Member

PR #5125 has fixed the problem in the TextLoader: it can now load new line characters that are part of quoted fields, through a new option called readMultilines.
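
For readers who want to try the option directly on the TextLoader, a minimal sketch follows. It assumes an ML.NET version that includes #5125, a comma-separated file with a header row, and two hypothetical columns (`Text`, `Rating`); the `MultilineLoader` class is also hypothetical. The option is set through the `ReadMultilines` field of `TextLoader.Options`, together with `AllowQuoting`.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical helper, for illustration only.
public static class MultilineLoader
{
    public static IDataView Load(MLContext mlContext, string path)
    {
        var options = new TextLoader.Options
        {
            Separators = new[] { ',' },
            HasHeader = true,
            AllowQuoting = true,   // treat "..." as a single field
            ReadMultilines = true, // allow new lines inside quoted fields
            Columns = new[]
            {
                // Hypothetical schema: adjust names, types, and indices to the real file.
                new TextLoader.Column("Text", DataKind.String, 0),
                new TextLoader.Column("Rating", DataKind.Single, 1),
            },
        };
        return mlContext.Data.LoadFromTextFile(path, options);
    }
}
```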

As mentioned here: #5125 (comment) we'll leave it up to the ModelBuilder to decide how to surface this new option through AutoML, and if they actually want to expose this through InferColumn() to properly fix the issue that was created here by @LittleLittleCloud .

@antoniovs1029
Member

So this has been fixed by #5125 and #5148 😄

@ghost ghost locked as resolved and limited conversation to collaborators Mar 20, 2022