Description
Describe the bug
Parquet files generated by arrow-rs whose schema contains a nested list should name the list item `element`, as per the Parquet specification:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists
The generated files use `item` instead, probably because the implementation was based on legacy code.
A similar issue was recently fixed in Polars:
pola-rs/polars#17803
PyArrow lets you use `item` instead of `element` (the default) to support legacy files (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html), but IMHO arrow-rs should not generate legacy Parquet files.
The code in arrow-rs that implements this is:
https://github.com/apache/arrow-rs/blob/master/arrow-schema/src/field.rs#L147
IMHO the fix involves just a single-line change. I can create a PR, but I want to be sure that I'm not misreading the spec and that there is no reason for hardcoding `item`, since it seems too simple...
To Reproduce
Generate a nested Parquet file, or use the one attached to this issue, and verify (with a hex editor, parquet-schema from this repo, or a GUI tool that shows the Parquet schema, like "parquet floor") that the name associated with the list item is always `item` instead of `element`.
For example_parquet.zip, the file attached to this ticket, arrow-schema reports the following schema:
{
  REQUIRED BYTE_ARRAY school (STRING);
  REQUIRED group students (LIST) {
    REPEATED group list {
      OPTIONAL group item {
        REQUIRED BYTE_ARRAY name (STRING);
        REQUIRED INT32 age;
      }
    }
  }
  REQUIRED group teachers (LIST) {
    REPEATED group list {
      OPTIONAL group item {
        REQUIRED BYTE_ARRAY name (STRING);
        REQUIRED INT32 age;
      }
    }
  }
}
The expected schema was:
{
  REQUIRED BYTE_ARRAY school (STRING);
  REQUIRED group students (LIST) {
    REPEATED group list {
      OPTIONAL group element {
        REQUIRED BYTE_ARRAY name (STRING);
        REQUIRED INT32 age;
      }
    }
  }
  REQUIRED group teachers (LIST) {
    REPEATED group list {
      OPTIONAL group element {
        REQUIRED BYTE_ARRAY name (STRING);
        REQUIRED INT32 age;
      }
    }
  }
}
I can get parquet-schema to output `element` instead of `item` when generating the Parquet file from Python or .NET.
In the hex editor you will see `students.list.item.name` instead of the expected `students.list.element.name`.
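The manual check above can also be done on a textual schema dump. Below is a minimal std-only sketch, not part of arrow-rs or the parquet crate: the helper names `list_child_names` and `child_name` are hypothetical, and it assumes the dump has one field per line in the same shape as the schemas shown above. It collects the name of the child directly under each `REPEATED group list {` line, which the spec expects to be `element`.

```rust
/// Collects the name of the field found directly under every
/// `REPEATED group list {` line of a textual Parquet schema dump.
/// Hypothetical helper; assumes one field per line.
fn list_child_names(schema_dump: &str) -> Vec<String> {
    let mut names = Vec::new();
    // Ignore indentation and blank lines so flattened dumps work too.
    let mut lines = schema_dump
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty());
    while let Some(line) = lines.next() {
        if line.starts_with("REPEATED group list") {
            // The next line declares the list item.
            if let Some(name) = lines.next().and_then(child_name) {
                names.push(name);
            }
        }
    }
    names
}

/// Extracts the field name from a line such as
/// `OPTIONAL group item {` or `OPTIONAL INT32 element;`.
fn child_name(line: &str) -> Option<String> {
    let mut tokens = line.split_whitespace();
    let _repetition = tokens.next()?; // REQUIRED / OPTIONAL / REPEATED
    let _kind = tokens.next()?; // `group` or a physical type
    let name = tokens.next()?;
    // Strip a trailing `;` or `{` if it is glued to the name.
    Some(name.trim_end_matches(|c: char| c == ';' || c == '{').to_string())
}
```

Run against the schema reported for the attached file, `list_child_names` returns `item` for both `students` and `teachers`; a compliant file would yield `element` for each list.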