[Python] Datatypes not preserved for partition fields in roundtrip to partitioned parquet dataset #22510
Comments
Joris Van den Bossche / @jorisvandenbossche: When a partitioned dataset is written, the partition columns are not stored in the data files themselves; they are encoded in the directory structure (in your case you would have "age=77", "age=32", etc. sub-folders). Currently, we don't save any metadata about the columns used to partition, and since those columns are also absent from the actual parquet files (where a schema of the data is stored), we can't recover that information from there either. So when reading a partitioned dataset, (py)arrow has little information about the type of a partition column. The current logic is to try to convert the values to ints and otherwise leave them as strings, and then to convert those values to a Dictionary type (which corresponds to the categorical type in pandas). This logic is here: arrow/python/pyarrow/parquet.py Lines 585 to 609 in 06fd2da
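To make the inference described above concrete, here is a minimal pure-Python sketch. It is not the code from parquet.py; the helper name infer_partition_column and its return shape are assumptions made only for illustration. It parses "key=value" directory names, tries an integer conversion, falls back to strings, and then dictionary-encodes the result into categories plus integer codes.

```python
def infer_partition_column(dir_names):
    """Hypothetical helper, not pyarrow API.

    Takes directory names such as ["age=77", "age=32"] and returns
    (key, categories, codes), mimicking "try int, else string, then
    dictionary-encode".
    """
    key = None
    raw_values = []
    for name in dir_names:
        # Split "age=77" into the column name and the raw string value.
        k, _, v = name.partition("=")
        key = k
        raw_values.append(v)

    # Try to interpret every value as an int; otherwise keep the strings.
    try:
        values = [int(v) for v in raw_values]
    except ValueError:
        values = raw_values

    # Dictionary encoding: unique categories plus one integer code per row.
    categories = sorted(set(values))
    index_of = {c: i for i, c in enumerate(categories)}
    codes = [index_of[v] for v in values]
    return key, categories, codes


key, categories, codes = infer_partition_column(["age=77", "age=32", "age=77"])
```

Because only the directory names are available, the original int64 type is never consulted, and a value such as "True" stays a string, which is the behaviour the related issues describe.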
There is currently no option to change this, so for now the workaround is to convert the categorical column back to an integer column in pandas. Related issues about the type of the partition column: ARROW-3388 (booleans as strings), ARROW-5666 (strings with underscores interpreted as int)
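The workaround can be sketched as follows. The frame below is a hand-built stand-in for what comes back from a partitioned dataset (the "name" column and the values are assumptions for illustration, and no parquet file is read here); only the astype call is the actual workaround.

```python
import pandas as pd

# Stand-in for a DataFrame loaded from a partitioned dataset: the
# partition column "age" arrives as a categorical instead of int64.
loaded = pd.DataFrame(
    {
        "name": ["a", "b", "c"],  # hypothetical data column
        "age": pd.Categorical([77, 32, 77]),
    }
)

# Workaround: cast the categorical partition column back to an integer dtype.
loaded["age"] = loaded["age"].astype("int64")
```

This works only when every category is a valid integer; a partition column that held strings would need a different target dtype.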
Wes McKinney / @wesm: |
Joris Van den Bossche / @jorisvandenbossche: |
This now prevents storing and loading a pandas dataframe partitioned on a date column: pandas-dev/pandas#53008
Data types are not preserved when a pandas data frame is saved as a partitioned parquet dataset using pyarrow and then read back. When the data frame is saved without partitioning, the data types are preserved.
Case 1: Saving a partitioned dataset - Data Types are NOT preserved
Output:
The output shows that the data type of age is int64 in the original pandas data frame, but it changes to category once the data frame is saved locally as a partitioned dataset and loaded back.
Case 2: Non-partitioned dataset - Data types are preserved
Output:
Versions
Environment: Python 3.7.3
pyarrow 0.14.1
Reporter: Naga
Note: This issue was originally created as ARROW-6114. Please see the migration documentation for further details.