[AutoML v0.16.0] InferColumn doesn't work on tricky csv file #4460
Comments
This is an issue in ML.NET's TextLoader, which does not currently support escaped quotes inside a quoted field. The TextLoader has rather limited support for TSV/CSV files. The issue is noted in the old repo: https://github.com/dotnet/machinelearning-automl/issues/193 ("Infercolumn fails to parse new lines inside quoted text")
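To make the failing input concrete, here is a minimal sketch in Python (not ML.NET code) of a record whose quoted field contains a comma, an escaped quote written as `""`, and an embedded newline. The sample data is made up for illustration. Python's standard `csv` module follows the common CSV quoting convention, so it shows what a parser with full quoting support would be expected to return, next to what naive line splitting produces.

```python
import csv
import io

# Hypothetical sample: a header row plus one record whose second field
# contains a comma, an escaped quote ("") and an embedded newline.
sample = 'id,comment_text\n1,"He said ""hi"", then left.\nSecond line."\n'

# A quote-aware parser keeps the quoted field together as one value.
rows = list(csv.reader(io.StringIO(sample, newline="")))

# Naive splitting on "\n" breaks the record in the middle of the quoted field.
naive_lines = [line for line in sample.split("\n") if line]

print(len(rows))         # number of logical records, header included
print(len(naive_lines))  # number of physical lines
```

The mismatch between logical records and physical lines is the kind of input that a loader without escaped-quote and multiline support cannot split correctly.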
Good to know about it.
@LittleLittleCloud I am assuming your question has been answered and am closing the issue. Please feel free to reopen the issue if you have more questions.
We may want to open an issue to improve the TextLoader to support more common TSV/CSV formats.
Is there a way to bypass the restriction in AutoML, for example by providing an internal columnInference API using TextReader or something similar? I can help investigate that. ModelBuilder already does something to handle escaped quotes inside quoted fields, so maybe we can use that. @justinormont @JakeRadMSFT
@LittleLittleCloud let's give that a shot. Can you open an issue on the Model Builder side?
So, as discussed offline with @LittleLittleCloud:
PR #5125 has fixed the problem in the TextLoader: newline characters can now be loaded as part of quoted fields through a new option called readMultilines. As mentioned here: #5125 (comment), we'll leave it up to ModelBuilder to decide how to surface this new option through AutoML, and whether they actually want to expose it through InferColumn() to properly fix the issue that was created here by @LittleLittleCloud.
For some CSV files that contain double quotes in their fields, the inferColumn API doesn't work properly. This is probably because, when guessing the delimiter, AutoML takes candidates inside double quotes into consideration, which should be ignored. (Or, when splitting lines, it splits on \n inside double quotes.)
Steps to reproduce:
Download this dataset
Updated
The dataset actually works with the latest AutoML/ModelBuilder. To reproduce the error, please use this dataset:
jigsaw.txt
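The behavior suggested above, ignoring delimiter candidates and newlines that fall inside double quotes, can be sketched as follows. This is a rough Python illustration under stated assumptions, not AutoML's actual inference code: the candidate list, the function names `count_outside_quotes` and `guess_delimiter`, and the rule "pick the candidate with the same non-zero count in every record" are all hypothetical choices made for the example.

```python
from collections import Counter

# Hypothetical candidate set; a real implementation may consider others.
CANDIDATES = [",", "\t", ";", "|"]


def count_outside_quotes(text, candidates=CANDIDATES):
    """Return one Counter per logical record, counting only the candidate
    characters that appear outside double-quoted fields."""
    records = []
    counts = Counter()
    in_quotes = False
    has_content = False
    for ch in text:
        if ch == '"':
            # An escaped quote ("") toggles twice, so the state is unchanged.
            in_quotes = not in_quotes
            has_content = True
        elif ch == "\n" and not in_quotes:
            # Only an unquoted newline ends a record.
            records.append(counts)
            counts = Counter()
            has_content = False
        else:
            has_content = True
            if ch in candidates and not in_quotes:
                counts[ch] += 1
    if has_content:
        records.append(counts)
    return records


def guess_delimiter(text, candidates=CANDIDATES):
    """Pick the candidate whose count is identical and non-zero in every
    record; return None when no candidate qualifies."""
    records = count_outside_quotes(text, candidates)
    best, best_count = None, 0
    for cand in candidates:
        per_record = {rec[cand] for rec in records}
        if len(per_record) == 1:
            n = per_record.pop()
            if n > best_count:
                best, best_count = cand, n
    return best


# Made-up sample: semicolons, a comma, and a newline appear inside quotes.
sample = 'id,comment_text\n1,"a; b; c\nd; e"\n2,"x ""y"", z"\n'
print(repr(guess_delimiter(sample)))
```

The sketch treats only `\n` as a record terminator and does not handle `\r\n` or unbalanced quotes, which a production implementation would need to consider.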