-
Notifications
You must be signed in to change notification settings - Fork 260
Expose Avro reader to PyIceberg #1328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
||
|
||
#[pyfunction] | ||
pub fn read_manifest_list(bs: &[u8], cb: &PartitionSpecProviderCallbackHolder) -> PyManifestList { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Passing around callbacks can be problematic between Rust and Python. Is there a specific reason for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rust needs to know about the partition-spec struct, to construct the `Datum. I would favor removing this if possible, since on the PyIceberg side, we also need to pass down the PartitionSpecs to the ManifestList reader.
Right now we just pass a string, which is pretty safe (I think), but I'm not an expert. Could you elaborate on your concerns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rust needs to know about the partition-spec struct, to construct the `Datum.
So PyIceberg itself can parse [u8]
to Datum
, right? If that's the case, I think we can use a separate Manifest
struct that simply uses Vec<u8>
for the lower bound and lets python handle the parsing. Let say, UnboundManifest
.
cc @liurenjie1024, what do you think? Is it valuable to allow users to build Manifest
without PartitionSpec
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So PyIceberg itself can parse
[u8]
toDatum
right?
That's correct, that's what we do today 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I agree, there must be a cleaner way than this callback approach, it doesn't feel like the right way to go about things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, appreciate that! 🙌
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @liurenjie1024, what do you think? Is it valuable to allow users to build Manifest without PartitionSpec?
I'm afraid not, manifest parsing needs to know the partition schema for field summaries, which are stored as binaries. I think maybe we need to a python binding for schema and partition spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep in mind that the current approach also has its limitations, as it does not take into account the evolution of columns, this has been fixed: #1334. I think the issue is more fundamental as we store it as binary, instead of its actual type. See the Schema Evolution part in the spec.
I think what @Xuanwo is suggesting is pushing the deserailization of the binaries into the native Rust types down the line, which is not a bad idea because you might not be using the fields (for example in a full table scan, or a predicate that doesn't match the partition).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Silly question, instead of the callback, can we just serialize the partition-specs
from table metadata and pass it to read_manifest_list
?
And use that to reconstruct the partition_type_provider
iceberg-rust/crates/iceberg/src/spec/snapshot.rs
Lines 196 to 207 in cfe2a98
let partition_type_provider = |partition_spec_id: i32| -> Result<Option<StructType>> { | |
table_metadata | |
.partition_spec_by_id(partition_spec_id) | |
.map(|partition_spec| partition_spec.partition_type(&schema)) | |
.transpose() | |
}; | |
ManifestList::parse_with_version( | |
&manifest_list_content, | |
table_metadata.format_version(), | |
partition_type_provider, | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good idea. Why not just passing table metadata down from python to rust?
// I don't fully comprehend the deserializer here, | ||
// it works for a Type, but not for a StructType | ||
// So I had to do some awkward stuff to make it work | ||
let res: Result<Type, _> = serde_json::from_str(json); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an example of the JSON input that fails deserialization into a StructType? If so I'll see what I can do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sdd for jumping in here 👍
I would expect the following to work:
let res: Result<Type, _> = serde_json::from_str(json); | |
let res = serde_json::from_str<StructType>(json); |
I was also able to reproduce this in a unit test:
#[test]
fn empty_struct_type() {
let json = r#"{"type": "struct", "fields": []}"#;
let expected = StructType {
fields: vec![],
id_lookup: OnceLock::default(),
name_lookup: OnceLock::default(),
};
let res = serde_json::from_str::<StructType>(json).unwrap();
assert_eq!(res, expected);
}
But it looks like we need to wrap it in the Type
enum.
Hi @Fokko, I experimented a bit with this PR. One possible approach is to allow Python to access our structs in We could have something like this: #[pyfunction]
pub fn read_manifest_list_v2(bs: &[u8]) -> PyManifestList {
let reader = apache_avro::Reader::new(bs).unwrap();
let values = apache_avro::types::Value::Array(
reader
.collect::<std::result::Result<Vec<apache_avro::types::Value>, _>>()
.unwrap(),
);
let manifest_list = apache_avro::from_value::<_serde::ManifestListV2>(&values).unwrap();
PyManifestList {
inner: manifest_list,
}
} Or much better if we can expose such API directly: #[pyfunction]
pub fn read_manifest_list_v2(bs: &[u8]) -> PyManifestList {
PyManifestList {
inner: ManifestList::parse_as_is(bs),
}
} Our current design focuses solely on Rust users, but some users may simply want to parse the file themselves and don’t want iceberg-rust to handle any transformation (such as parsing into We could reconsider this, perhaps we can expose these as a public API, but hide them behind a feature gate. cc @liurenjie1024 and @sdd for ideas. |
Yes, that makes sense to me. I think we still want to have Iceberg-Rust some things like setting the default values for V2 (eg, setting Apart from that, I think your approach is great. Curious to learn what others think. |
pub struct PyLiteral { | ||
inner: Literal, | ||
} | ||
|
||
|
||
#[pyclass] | ||
pub struct PyPrimitiveLiteral { | ||
inner: PrimitiveLiteral, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider having a values.rs
module like what we did in core crate?
|
||
|
||
#[pyfunction] | ||
pub fn read_manifest_list(bs: &[u8], cb: &PartitionSpecProviderCallbackHolder) -> PyManifestList { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good idea. Why not just passing table metadata down from python to rust?
I'm leaning toward to this approach, also this makes the api more aligned with python/java implementation. |
Thanks everyone for chiming in here. Let me summarize the discussion. I think there is consensus that the callback is not ideal.
I'm leaning towards 2 since that aligns the best with PyIceberg, where we can deserialize the manifest-list without having to know about the schema. I would make sure that we have consensus before moving into a certain direction, and happy to follow up on that. |
Which issue does this PR close?
I've been looking into exposing the Avro readers to PyIceberg. This will give a huge benefit to PyIceberg because we can drop the Cython Avro reader.
What changes are included in this PR?
Exposing methods and structures to read the manifest lists, and manifests itself.
Are these changes tested?
By using them in PyIceberg :)