Add Option To Coerce List Type on Parquet Write #6733

Closed
@ggreco

Describe the bug
When arrow-rs writes a .parquet file whose schema contains a list, the list's inner field should be named element, as required by the Parquet specification:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists

Instead, the generated files use item, probably because legacy code was used as a reference.

A similar issue has been recently fixed in polars-rs:
pola-rs/polars#17803

PyArrow lets you use item instead of element (the default) to support legacy files (see https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html), but IMHO arrow-rs should not generate legacy Parquet files.

The code in arrow-rs that implements this is:
https://github.com/apache/arrow-rs/blob/master/arrow-schema/src/field.rs#L147

IMHO the fix only involves a single-line change. I can create a PR, but I want to be sure that I'm not misreading the spec and that there is no reason for hardcoding item, since it seems too simple...

To Reproduce
Generate a nested Parquet file, or use the one attached to this issue, and verify (with a hex editor, parquet-schema from this repo, or a GUI tool that shows the Parquet schema, such as "parquet floor") that the name associated with the list element is always item instead of element.

For example_parquet.zip, the file attached to this ticket, the schema is reported as:

{
  REQUIRED BYTE_ARRAY school (STRING);
  REQUIRED group students (LIST) {
    REPEATED group list {
      OPTIONAL group item {
        REQUIRED BYTE_ARRAY name (STRING);
        REQUIRED INT32 age;
      }
    }
  }
  REQUIRED group teachers (LIST) {
    REPEATED group list {
      OPTIONAL group item {
        REQUIRED BYTE_ARRAY name (STRING);
        REQUIRED INT32 age;
      }
    }
  }
}

The expected schema is:

{
  REQUIRED BYTE_ARRAY school (STRING);
  REQUIRED group students (LIST) {
    REPEATED group list {
      OPTIONAL group element {
        REQUIRED BYTE_ARRAY name (STRING);
        REQUIRED INT32 age;
      }
    }
  }
  REQUIRED group teachers (LIST) {
    REPEATED group list {
      OPTIONAL group element {
        REQUIRED BYTE_ARRAY name (STRING);
        REQUIRED INT32 age;
      }
    }
  }
}

I can get parquet-schema to output element instead of item when generating the Parquet file from Python or .NET.
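The manual check described above could be automated over the textual schema dump. The following is only a minimal sketch in std-only Rust, assuming the dump has the same line layout as the schemas shown in this issue; `list_element_names` is a hypothetical helper, not part of arrow-rs or the parquet crate.

```rust
/// Collects the name of the node declared directly under every
/// `REPEATED group list` line of a textual Parquet schema dump.
/// Assumption: one node per line, formatted like the dumps in this issue.
fn list_element_names(schema: &str) -> Vec<String> {
    let mut names = Vec::new();
    let mut expect_child = false;
    for line in schema.lines() {
        let tokens: Vec<&str> = line.split_whitespace().collect();
        if tokens.is_empty() {
            continue; // skip blank lines without losing state
        }
        if expect_child {
            // Child lines look like `OPTIONAL group item {` or `OPTIONAL INT32 item;`.
            if let Some(raw) = tokens.get(2) {
                let name = raw.trim_end_matches(|c: char| c == ';' || c == '{');
                names.push(name.to_string());
            }
            expect_child = false;
        }
        if tokens.starts_with(&["REPEATED", "group", "list"]) {
            expect_child = true;
        }
    }
    names
}

fn main() {
    let dump = "\
REQUIRED group students (LIST) {
  REPEATED group list {
    OPTIONAL group item {
      REQUIRED BYTE_ARRAY name (STRING);
      REQUIRED INT32 age;
    }
  }
}";
    for name in list_element_names(dump) {
        let status = if name == "element" { "spec-compliant" } else { "legacy" };
        println!("{name}: {status}");
    }
}
```

Any name other than element in the result points at a list written with the legacy naming.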

In the hex editor you will see students.list.item.name instead of the expected students.list.element.name.
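To make the difference between the two column paths concrete, here is a small std-only Rust sketch that rewrites a legacy dotted path into the form expected by the spec. `to_spec_path` is a hypothetical helper, and the rule it applies (an item segment directly after a list segment is a list element) is a heuristic assumption: a struct field that is genuinely named list with a child named item would be rewritten as well.

```rust
/// Rewrites a dotted Parquet column path so that an `item` segment that
/// directly follows a `list` segment becomes `element`.
/// Heuristic only: it does not consult the actual schema.
fn to_spec_path(path: &str) -> String {
    let mut out: Vec<&str> = Vec::new();
    let mut prev_is_list = false;
    for seg in path.split('.') {
        if prev_is_list && seg == "item" {
            out.push("element");
        } else {
            out.push(seg);
        }
        prev_is_list = seg == "list";
    }
    out.join(".")
}

fn main() {
    println!("{}", to_spec_path("students.list.item.name"));
}
```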

Labels: arrow, enhancement, parquet
