
[Python] Datatypes not preserved for partition fields in roundtrip to partitioned parquet dataset #22510


Open
asfimport opened this issue Aug 2, 2019 · 4 comments

Comments

@asfimport
Collaborator

asfimport commented Aug 2, 2019

Data types are not preserved when a pandas DataFrame is partitioned and saved as a Parquet dataset using pyarrow, but they are preserved when the DataFrame is written without partitioning.

Case 1: Saving a partitioned dataset - Data types are NOT preserved

# Saving a pandas DataFrame to local disk as a partitioned Parquet dataset using pyarrow
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test'
partition_cols = ['age']
print('Datatypes before saving the dataset')
print(df.dtypes)
# preserve_index is an argument of Table.from_pandas, not of write_to_dataset
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(table, path, partition_cols=partition_cols)

# Loading the partitioned Parquet dataset from local disk
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)

Output:

Datatypes before saving the dataset
age int64
name object
dtype: object

Datatypes after loading the dataset
name object
age category
dtype: object
From the output above, we can see that the data type of age is int64 in the original pandas DataFrame, but it changed to category after saving to local disk and loading back.

Case 2: Non-partitioned dataset - Data types are preserved

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

print('Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow')
df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test_without_partition'
print('Datatypes before saving the dataset')
print(df.dtypes)
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(table, path)

# Loading the non-partitioned Parquet dataset from local disk
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)

Output:

Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow
Datatypes before saving the dataset
age int64
name object
dtype: object

Datatypes after loading the dataset
age int64
name object
dtype: object

Versions

  • Python 3.7.3
  • pyarrow 0.14.1

Reporter: Naga


Note: This issue was originally created as ARROW-6114. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
[~bnriiitb] Thanks for opening the issue.

So when a partitioned dataset is written, the partition columns are not stored in the actual data files, but are encoded in the directory structure (in your case you would have "age=77", "age=32", etc. sub-folders; the sketch below lists the resulting layout).
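To make the layout concrete, here is a small sketch added for illustration (not part of the original comment); the part-file names are generated and will differ:

import os

# Walk the dataset directory written by write_to_dataset in Case 1:
# the partition values live in the directory names, not in the files.
for root, _, files in os.walk('test'):
    for name in files:
        print(os.path.join(root, name))

# Expected layout (part-file names will differ):
# test/age=32/<uuid>.parquet
# test/age=77/<uuid>.parquet
# test/age=234/<uuid>.parquet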

Currently, we don't save any metadata about the columns used to partition, and since they are also not stored in the actual Parquet files (which is where the schema of the data lives), we don't have that information from there either.

So when reading a partitioned dataset, (py)arrow does not have much information about the type of the partition column. Currently, the logic is to try to convert the values to ints and otherwise leave them as strings, and those values are then converted to a dictionary type (corresponding to the categorical type in pandas). This logic is here:

if len(self.partition_keys) > 0:
    if partitions is None:
        raise ValueError('Must pass partition sets')

    # Here, the index is the categorical code of the partition where
    # this piece is located. Suppose we had
    #
    # /foo=a/0.parq
    # /foo=b/0.parq
    # /foo=c/0.parq
    #
    # Then we assign a=0, b=1, c=2. And the resulting Table pieces will
    # have a DictionaryArray column named foo having the constant index
    # value as indicated. The distinct categories of the partition have
    # been computed in the ParquetManifest
    for i, (name, index) in enumerate(self.partition_keys):
        # The partition code is the same for all values in this piece
        indices = np.array([index], dtype='i4').repeat(len(table))

        # This is set of all partition values, computed as part of the
        # manifest, so ['a', 'b', 'c'] as in our example above.
        dictionary = partitions.levels[i].dictionary

        arr = pa.DictionaryArray.from_arrays(indices, dictionary)
        table = table.append_column(name, arr)
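
To illustrate what that logic produces, here is a minimal standalone sketch (added for illustration; the values mirror the age=... folders from the report):

import numpy as np
import pyarrow as pa

# Pretend this piece came from the age=77 folder (dictionary code 0)
# and holds 3 rows: every row gets the same code.
indices = np.array([0], dtype='i4').repeat(3)
dictionary = pa.array([77, 32, 234])  # all distinct partition values
arr = pa.DictionaryArray.from_arrays(indices, dictionary)
print(arr.type)               # dictionary<values=int64, indices=int32, ...>
print(arr.to_pandas().dtype)  # category, hence the dtype change after the roundtrip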

There is currently no option to change this. So right now, the workaround is to convert the categorical back to an integer column in pandas (see the sketch below).
But longer term, we should maybe think about storing the type of the partition keys as metadata, along with an option to restore them as dictionary columns or not.
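
A minimal sketch of that workaround, reusing the path and column from the report above:

import pyarrow.parquet as pq

df = pq.ParquetDataset('test').read_pandas().to_pandas()
# The partition column comes back as a pandas categorical whose
# categories are the inferred (here: integer) partition values.
df['age'] = df['age'].astype('int64')
print(df.dtypes)  # age is int64 again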

Related issues about the type of the partition column: ARROW-3388 (booleans as strings), ARROW-5666 (strings with underscores interpreted as int)

@asfimport
Collaborator Author

Wes McKinney / @wesm:
Fixing this is not that easy. If anyone looks at this, I would recommend tackling it after we've got Python up and running on a C++-based dataset implementation.

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
One possibility to look at (although it would only help on the Python side, not for inferring the data type of the partition field in C++) is to ensure we store the data type information in the pandas metadata; right now we drop the partition fields from that metadata. That information could then be used when converting back to pandas. Of course, this only helps for roundtrips from/to pandas, not for a plain pyarrow Table.
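
For reference, the "pandas metadata" is a JSON blob that pyarrow stores under the b'pandas' key of the schema metadata, with one entry per column recording its original pandas dtype. A minimal sketch for inspecting it (assuming the df from the report):

import json
import pyarrow as pa

table = pa.Table.from_pandas(df, preserve_index=False)
pandas_meta = json.loads(table.schema.metadata[b'pandas'].decode('utf8'))
# One dict per column, e.g.
# {'name': 'age', 'pandas_type': 'int64', 'numpy_type': 'int64', 'metadata': None}
print([(col['name'], col['numpy_type']) for col in pandas_meta['columns']])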

@alippai
Contributor

alippai commented May 3, 2023

This now prevents storing and loading a pandas dataframe partitioned on a date column: pandas-dev/pandas#53008
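
A minimal reproduction sketch of the date case (my own example, not taken from the linked issue; the exact dtype that comes back may vary with the pyarrow version):

import datetime
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'day': [datetime.date(2023, 5, 1), datetime.date(2023, 5, 2)],
                   'value': [1.0, 2.0]})
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(table, 'test_date', partition_cols=['day'])

back = pq.ParquetDataset('test_date').read_pandas().to_pandas()
# Per the report, the partition column no longer matches the original dtype
print(df['day'].dtype, back['day'].dtype)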
