Expose Avro reader to PyIceberg #1328

Fokko · 2025-05-14T13:43:42Z

Which issue does this PR close?

I've been looking into exposing the Avro readers to PyIceberg. This will give a huge benefit to PyIceberg because we can drop the Cython Avro reader.

What changes are included in this PR?

Exposes methods and structures to read the manifest lists and the manifests themselves.

Are these changes tested?

By using them in PyIceberg :)

Xuanwo · 2025-05-14T13:56:59Z

bindings/python/src/manifest.rs

+
+
+#[pyfunction]
+pub fn read_manifest_list(bs: &[u8], cb: &PartitionSpecProviderCallbackHolder) -> PyManifestList {


Passing around callbacks can be problematic between Rust and Python. Is there a specific reason for this?

Rust needs to know about the partition-spec struct to construct the `Datum`. I would favor removing this if possible, since on the PyIceberg side we also need to pass the PartitionSpecs down to the ManifestList reader.

Right now we just pass a string, which is pretty safe (I think), but I'm not an expert. Could you elaborate on your concerns?

Rust needs to know about the partition-spec struct to construct the `Datum`.

So PyIceberg itself can parse [u8] to Datum, right? If that's the case, I think we can use a separate Manifest struct that simply uses Vec<u8> for the lower bound and lets Python handle the parsing. Let's say, UnboundManifest.

cc @liurenjie1024, what do you think? Is it valuable to allow users to build Manifest without PartitionSpec?

So PyIceberg itself can parse [u8] to Datum right?

That's correct, that's what we do today 👍

Yeah I agree, there must be a cleaner way than this callback approach, it doesn't feel like the right way to go about things.

Cool, appreciate that! 🙌

cc @liurenjie1024, what do you think? Is it valuable to allow users to build Manifest without PartitionSpec?

I'm afraid not; manifest parsing needs to know the partition schema for field summaries, which are stored as binaries. I think maybe we need a Python binding for schema and partition spec?

~~Keep in mind that the current approach also has its limitations, as it does not take into account the evolution of columns~~, this has been fixed: #1334. I think the issue is more fundamental as we store it as binary, instead of its actual type. See the Schema Evolution part in the spec.

I think what @Xuanwo is suggesting is pushing the deserialization of the binaries into the native Rust types further down the line, which is not a bad idea because you might not be using the fields (for example, in a full table scan, or with a predicate that doesn't match the partition).
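To make the deferred-decoding idea concrete, here is a minimal sketch using only the standard library. It is not code from this PR: `RawBound`, `PrimitiveKind`, and `Value` are hypothetical names (with `Value` standing in for iceberg-rust's `Datum`), and only three primitive types are covered. It assumes the Iceberg single-value serialization, where int and long are little-endian and strings are UTF-8 bytes.

```rust
use std::convert::TryFrom;

// Hypothetical sketch: none of these types exist in iceberg-rust.

/// The primitive types this sketch knows how to decode.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PrimitiveKind {
    Int,
    Long,
    String,
}

/// A decoded value, standing in for iceberg-rust's `Datum`.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i32),
    Long(i64),
    String(String),
}

/// A bound kept exactly as read from the Avro file.
/// No schema or partition spec is needed to construct it.
#[derive(Debug, Clone, PartialEq)]
struct RawBound(Vec<u8>);

impl RawBound {
    /// Decode only when a caller actually needs the typed value.
    fn decode(&self, kind: PrimitiveKind) -> Result<Value, String> {
        let wrong_len = |expected: usize| {
            format!("expected {} bytes, got {}", expected, self.0.len())
        };
        match kind {
            // int: 4 bytes, little-endian
            PrimitiveKind::Int => <[u8; 4]>::try_from(self.0.as_slice())
                .map(|b| Value::Int(i32::from_le_bytes(b)))
                .map_err(|_| wrong_len(4)),
            // long: 8 bytes, little-endian
            PrimitiveKind::Long => <[u8; 8]>::try_from(self.0.as_slice())
                .map(|b| Value::Long(i64::from_le_bytes(b)))
                .map_err(|_| wrong_len(8)),
            // string: UTF-8 bytes without a length prefix
            PrimitiveKind::String => std::str::from_utf8(&self.0)
                .map(|s| Value::String(s.to_owned()))
                .map_err(|e| e.to_string()),
        }
    }
}
```

With this shape, a full table scan never pays for decoding, while a caller that does evaluate a predicate supplies the type at that point, for example `RawBound(bytes).decode(PrimitiveKind::Int)`.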

Silly question, instead of the callback, can we just serialize the partition-specs from table metadata and pass it to read_manifest_list?

And use that to reconstruct the partition_type_provider

iceberg-rust/crates/iceberg/src/spec/snapshot.rs

Lines 196 to 207 in cfe2a98

let partition_type_provider = |partition_spec_id: i32| -> Result<Option<StructType>> {
    table_metadata
        .partition_spec_by_id(partition_spec_id)
        .map(|partition_spec| partition_spec.partition_type(&schema))
        .transpose()
};

ManifestList::parse_with_version(
    &manifest_list_content,
    table_metadata.format_version(),
    partition_type_provider,
)

I think this is a good idea. Why not just pass the table metadata down from Python to Rust?

sdd · 2025-05-14T18:27:12Z

bindings/python/src/manifest.rs

+        // I don't fully comprehend the deserializer here,
+        // it works for a Type, but not for a StructType
+        // So I had to do some awkward stuff to make it work
+        let res: Result<Type, _> = serde_json::from_str(json);


Do you have an example of the JSON input that fails deserialization into a StructType? If so, I'll see what I can do.

Thanks @sdd for jumping in here 👍

I would expect the following to work:

Suggested change

let res: Result<Type, _> = serde_json::from_str(json);

let res = serde_json::from_str::<StructType>(json);

I was also able to reproduce this in a unit test:

#[test]
fn empty_struct_type() {
    let json = r#"{"type": "struct", "fields": []}"#;

    let expected = StructType {
        fields: vec![],
        id_lookup: OnceLock::default(),
        name_lookup: OnceLock::default(),
    };

    let res = serde_json::from_str::<StructType>(json).unwrap();

    assert_eq!(res, expected);
}

But it looks like we need to wrap it in the Type enum.

Xuanwo · 2025-05-15T09:04:40Z

Hi @Fokko, I experimented a bit with this PR. One possible approach is to allow Python to access our structs in _serde, which map directly to the on-disk representation without any type transformation or parsing.

We could have something like this:

#[pyfunction]
pub fn read_manifest_list_v2(bs: &[u8]) -> PyManifestList {
    let reader = apache_avro::Reader::new(bs).unwrap();
    let values = apache_avro::types::Value::Array(
        reader
            .collect::<std::result::Result<Vec<apache_avro::types::Value>, _>>()
            .unwrap(),
    );
    let manifest_list = apache_avro::from_value::<_serde::ManifestListV2>(&values).unwrap();

    PyManifestList {
        inner: manifest_list,
    }
}

Or, much better, we could expose such an API directly:

#[pyfunction]
pub fn read_manifest_list_v2(bs: &[u8]) -> PyManifestList {
    PyManifestList {
        inner: ManifestList::parse_as_is(bs),
    }
}

Our current design focuses solely on Rust users, but some users may simply want to parse the file themselves and don’t want iceberg-rust to handle any transformation (such as parsing into Datum).

We could reconsider this; perhaps we can expose these as a public API, but hide them behind a feature gate.

cc @liurenjie1024 and @sdd for ideas.

Fokko · 2025-05-15T10:08:27Z

Our current design focuses solely on Rust users, but some users may simply want to parse the file themselves and don’t want iceberg-rust to handle any transformation (such as parsing into Datum).

Yes, that makes sense to me. I think we still want Iceberg-Rust to handle some things, like setting the default values for V2 (e.g., setting 134: content to data when reading V1 metadata).

Apart from that, I think your approach is great. Curious to learn what others think.
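As an illustration of the default handling mentioned above, here is a minimal sketch, not code from this PR. `RawDataFile` and `DataContentType` are hypothetical names rather than iceberg-rust's actual `_serde` structs. It assumes the content ids from the Iceberg spec (0 = data, 1 = position deletes, 2 = equality deletes) and treats a missing field 134 as data, which is what a V1 file implies.

```rust
// Hypothetical sketch: these types are illustrative, not iceberg-rust's.

#[derive(Debug, Clone, Copy, PartialEq)]
enum DataContentType {
    Data,
    PositionDeletes,
    EqualityDeletes,
}

/// On-disk shape of a data file entry, kept "as is".
/// Field 134 (`content`) is absent in V1 files, hence the `Option`.
struct RawDataFile {
    content: Option<i32>,
    file_path: String,
}

impl RawDataFile {
    /// Apply the V2 default: a missing `content` field means data.
    fn content_type(&self) -> Result<DataContentType, String> {
        match self.content {
            None | Some(0) => Ok(DataContentType::Data),
            Some(1) => Ok(DataContentType::PositionDeletes),
            Some(2) => Ok(DataContentType::EqualityDeletes),
            Some(other) => Err(format!(
                "unknown content id {} in {}",
                other, self.file_path
            )),
        }
    }
}
```

Keeping the raw struct untouched and applying the default in an accessor means the "as is" parse stays faithful to the file, while callers still see V2 semantics.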

liurenjie1024 · 2025-05-19T07:07:21Z

bindings/python/src/manifest.rs

+pub struct PyLiteral {
+    inner: Literal,
+}
+
+
+#[pyclass]
+pub struct PyPrimitiveLiteral {
+    inner: PrimitiveLiteral,
+}


Should we consider having a values.rs module, like what we did in the core crate?

liurenjie1024 · 2025-05-19T07:45:45Z

bindings/python/src/manifest.rs

+
+
+#[pyfunction]
+pub fn read_manifest_list(bs: &[u8], cb: &PartitionSpecProviderCallbackHolder) -> PyManifestList {


I think this is a good idea. Why not just pass the table metadata down from Python to Rust?

liurenjie1024 · 2025-05-19T07:48:18Z

Our current design focuses solely on Rust users, but some users may simply want to parse the file themselves and don’t want iceberg-rust to handle any transformation (such as parsing into Datum).

I'm leaning toward this approach; it also makes the API more aligned with the Python/Java implementations.

Fokko · 2025-05-20T07:32:30Z

Thanks everyone for chiming in here. Let me summarize the discussion. I think there is consensus that the callback is not ideal.

- Supply the required information to construct the summaries:
  1. Instead of having the Fn(i32) -> Result<Option<StructType>> provider, we could pass in a HashMap<i32, StructType>. We would bind all the PartitionSpecs in PyIceberg. This is relatively straightforward, but comes at a cost when there are many PartitionSpecs (which should be okay for the majority of tables).
  2. What @kevinjqliu suggested in Expose Avro reader to PyIceberg #1328 (comment): pass in the current Schema and PartitionSpecs to Iceberg-Rust, where we can do the lazy binding on the Iceberg-Rust side.
  3. Go all the way and convert the TableMetadata to Iceberg-Rust. This is probably where we end up some day, but it requires a lot of scaffolding.
- Deserialize into Vec<u8> instead of a Datum, and convert the values into the actual type later. This removes the dependency on the Schema and the PartitionSpecs.

I'm leaning towards 2, since that aligns best with PyIceberg, where we can deserialize the manifest-list without having to know about the schema. I would like to make sure that we have consensus before moving in a certain direction, and I'm happy to follow up on that.
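For the HashMap<i32, StructType> option in the summary above, a minimal sketch of how a pre-built map could stand in for the callback. This is not code from this PR: `StructType` here is a hypothetical placeholder for iceberg-rust's type, `provider_from_map` is an illustrative name, and the error type is simplified to `String`.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for iceberg-rust's `StructType`;
// only the shape of the provider matters in this sketch.
#[derive(Debug, Clone, PartialEq)]
struct StructType {
    field_names: Vec<String>,
}

/// Wrap a map of partition spec id -> partition type in a closure with the
/// same shape as the provider the parser takes today. Python would only
/// need to hand over plain data, not a callable.
fn provider_from_map(
    partition_types: HashMap<i32, StructType>,
) -> impl Fn(i32) -> Result<Option<StructType>, String> {
    // An unknown spec id yields Ok(None), mirroring the Option in the
    // existing provider signature.
    move |spec_id| Ok(partition_types.get(&spec_id).cloned())
}
```

The cost noted in the summary shows up here as well: every PartitionSpec has to be bound and copied into the map up front, even if the manifest list only references one of them.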

Fokko added 5 commits May 13, 2025 02:25

WIP

7249542

Merge branch 'main' of github.com:apache/iceberg-rust

0260aa4

Expose Avro parsers in Python

cff3d2b

Merge branch 'main' of github.com:apache/iceberg-rust into fd-avro-py…

ee6aeda

…iceberg

Cleanup

fb44a0a

Xuanwo reviewed May 14, 2025

sdd reviewed May 14, 2025

sdd reviewed May 14, 2025

sdd reviewed May 14, 2025

Thanks Scott!

9bc9baf

Fokko mentioned this pull request May 14, 2025

Use Iceberg-Rust for parsing the ManifestList and Manifests apache/iceberg-python#2004

Draft

Merge branch 'main' of github.com:apache/iceberg-rust into fd-avro-py…

24b02e3

…iceberg

liurenjie1024 reviewed May 19, 2025
