Implement arrow-avro SchemaStore and Fingerprinting To Enable Schema Resolution #8006

jecsand838 · 2025-07-26T03:00:26Z

Which issue does this PR close?

Part of Add Avro Support #4886
Follow up to Implement arrow-avro Reader and ReaderBuilder #7834

Rationale for this change

Apache Avro’s single object encoding prefixes every record with the marker 0xC3 0x01 followed by a Rabin schema fingerprint so that readers can identify the correct writer schema without carrying the full definition in each message.
While the current arrow‑avro implementation can read container files, it cannot ingest these framed messages or handle streams where the writer schema changes over time.
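To make the framing concrete, here is a minimal sketch of how a reader might split a single-object message into its fingerprint and body. The function name `split_single_object` is hypothetical and not taken from this PR; the layout (two marker bytes, then an 8-byte little-endian fingerprint) follows the Avro specification.

```rust
/// Two-byte marker that opens every single-object encoded message.
const SINGLE_OBJECT_MAGIC: [u8; 2] = [0xC3, 0x01];

/// Splits a framed message into its 64-bit fingerprint and the Avro body.
/// Returns `None` when the marker is absent or the header is truncated.
/// (Hypothetical helper, for illustration only.)
fn split_single_object(buf: &[u8]) -> Option<(u64, &[u8])> {
    let rest = buf.strip_prefix(&SINGLE_OBJECT_MAGIC)?;
    if rest.len() < 8 {
        return None;
    }
    let (fp_bytes, body) = rest.split_at(8);
    // The spec stores the fingerprint in little-endian byte order.
    let fp = u64::from_le_bytes(fp_bytes.try_into().ok()?);
    Some((fp, body))
}
```

The returned body is a borrowed slice of the input, so identifying the schema does not copy the payload.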

The Avro specification recommends computing a 64‑bit CRC‑64‑AVRO (Rabin) hashed fingerprint of the parsed canonical form of a schema to look up the Schema from a local schema store or registry.
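For reference, a sketch of the CRC-64-AVRO computation, transcribed from the algorithm given in the Avro specification. The names `build_table`, `FP_TABLE`, and `fingerprint_rabin` are illustrative and may not match the PR's code; the input is assumed to be the bytes of the schema's parsing canonical form.

```rust
/// Fingerprint of the empty byte string, per the Avro specification.
const EMPTY: u64 = 0xc15d213aa4d7a795;

/// Builds the 256-entry lookup table at compile time.
/// `while` loops are used because `for` is not available in `const fn`.
const fn build_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut i = 0;
    while i < 256 {
        let mut fp = i as u64;
        let mut j = 0;
        while j < 8 {
            // XOR in the polynomial only when the low bit is set.
            fp = (fp >> 1) ^ (EMPTY & 0u64.wrapping_sub(fp & 1));
            j += 1;
        }
        table[i] = fp;
        i += 1;
    }
    table
}

const FP_TABLE: [u64; 256] = build_table();

/// Computes the 64-bit Rabin fingerprint of `canonical`, one byte at a time.
fn fingerprint_rabin(canonical: &[u8]) -> u64 {
    let mut fp = EMPTY;
    for &b in canonical {
        fp = (fp >> 8) ^ FP_TABLE[((fp ^ b as u64) & 0xff) as usize];
    }
    fp
}
```

A caller would pass something like `fingerprint_rabin(canonical_json.as_bytes())` and use the result as the store key.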

This PR introduces SchemaStore and fingerprinting to enable:

Zero‑copy schema identification for decoding streaming Avro messages published in single‑object format (e.g. Kafka, Pulsar, etc.) into Arrow.
Dynamic schema evolution by laying the foundation to resolve writer/reader schema differences on the fly.
NOTE: Schema Resolution support in Codec and RecordDecoder is coming in the next PR.

What changes are included in this PR?

| Area | Highlights |
| --- | --- |
| `reader/mod.rs` | Decoder now detects the `C3 01` prefix, extracts the fingerprint, looks up the writer schema in a `SchemaStore`, and switches to an LRU cached `RecordDecoder` without interrupting streaming; supports `static_store_mode` to skip the 2 byte peek for high‑throughput fixed‑schema pipelines. |
| `ReaderBuilder` | New builder configuration methods: `.with_writer_schema_store`, `.with_active_fingerprint`, `.with_static_store_mode`, `.with_reader_schema`, `.with_max_decoder_cache_size`, with rigorous validation to prevent misconfiguration. |
| Unit tests | New tests covering fingerprint generation, store registration/lookup, schema switching, unknown‑fingerprint errors, and interaction with UTF8‑view decoding. |
| Docs & Examples | Extensive inline docs with examples on all new public methods / structs. |

Are these changes tested?

Yes. New tests cover:

Fingerprinting against the canonical examples from the Avro spec
SchemaStore behavior: deduplication, duplicate registration, and lookup.
Decoder fast‑path with static_store_mode=true, ensuring the prefix is treated as payload, the 2 byte peek is skipped, and no schema switch is attempted.

Are there any user-facing changes?

N/A

Follow-Up PRs

Implement Schema Resolution Functionality in Codec and RecordDecoder
Add ID Fingerprint variant on SchemaStore for Confluent Schema Registry compatibility
Improve arrow-avro errors + add more benchmarks & examples to prepare for public release

jecsand838 · 2025-07-26T03:54:43Z

@alamb @scovich I apologize in advance for how large this one got! A substantial portion of the updates are detailed doc comments, examples, and tests. Functionally I don't think this PR is as large as it seems. However, let me know if this needs to be broken up and I'd be happy to do so.

alamb · 2025-07-26T12:30:36Z

@veronica-m-ef I wonder if you might have some time to help review this PR, as you previously contributed to this code?

alamb · 2025-07-26T12:32:36Z

Perhaps @svencowart you might also be interested and able to help review this PR?

scovich

This is definitely a big piece of work, but I don't know how to split up the functionality of this PR -- except some of the cosmetic changes, code movement, and variable renames should ideally be eliminated or moved to a different PR for clarity.

scovich · 2025-07-26T11:48:39Z

arrow-avro/src/codec.rs

+        writer: &'a Schema<'a>,
+        reader: &'a Schema<'a>,


aside: I guess this is a low-level avro schema instance, not the arrow schema Schema?
At least, I don't remember arrow Schema objects having a lifetime parameter?

Ah -- SchemaRef is from arrow-schema, but Schema is crate-local.

Perhaps we could rename to disambiguate? ArrowSchemaRef vs. [Avro]Schema?

scovich · 2025-07-26T11:54:44Z

arrow-avro/src/schema.rs

+/// <https://avro.apache.org/docs/1.11.1/specification/#parsing-canonical-form-for-schemas>
+#[inline]
+pub fn generate_canonical_form(schema: &Schema) -> String {
+    serde_json::to_string(&parse_canonical_json(schema)).unwrap()


unwrap because the to_string call can never fail for some reason?

I cleaned this up in my last commit. Ty for catching this.

scovich · 2025-07-26T15:51:06Z

arrow-avro/src/schema.rs

+    let canonical = generate_canonical_form(schema);
+    match hash_type {
+        FingerprintAlgorithm::Rabin => Fingerprint::Rabin(compute_fingerprint_rabin(&canonical)),


Converting something to a string just so we can hash it seems really expensive... but if I understand correctly, the avro spec mandates it?

Correct, unfortunately. The fingerprints are supposed to be computed from a schema in its canonical form.

Luckily, there shouldn't be a scenario where we need to parse a schema and fingerprint it while decoding.

scovich · 2025-07-26T15:54:52Z

arrow-avro/src/schema.rs

+    /// The hashing algorithm used for generating fingerprints.
+    fingerprint_algorithm: FingerprintAlgorithm,
+    /// A map from a schema's fingerprint to the schema itself.
+    schemas: HashMap<Fingerprint, Schema<'a>>,


This map probing seems vulnerable to hash collisions, because we probe only by hash?
(as opposed to passing the schema, probing by hash, and then confirming against the schema)?

From the spec:

fingerprints are not meant to provide any security guarantees, even the longer SHA-256-based ones. Most Avro applications should be surrounded by security measures that prevent attackers from writing random data and otherwise interfering with the consumers of schemas. We recommend that these surrounding mechanisms be used to prevent collision and pre-image attacks (i.e., “forgery”) on schema fingerprints, rather than relying on the security properties of the fingerprints themselves.

Granted, the chances of a collision should be vanishingly small for a reasonable number of schemas and a uniformly distributed 64-bit hash, so maybe we don't care?

I was planning to add some improvements to this logic when I got back in here for the extra hash types. However I went ahead and added a check to the register function. It was pretty trivial and was worth it.

scovich · 2025-07-26T15:57:29Z

arrow-avro/src/schema.rs

+    ///
+    /// An `Option` containing a clone of the `Schema` if found, otherwise `None`.
+    pub fn lookup(&self, fp: &Fingerprint) -> Option<Schema<'a>> {
+        self.schemas.get(fp).cloned()


That's an expensive clone (for a big schema)... should we return a reference to the schema instead, and let the caller clone it if they wish?

This was a great suggestion, ty for making it. This was included in my last commit as well.
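The two store changes discussed above (a collision check in `register` and a reference-returning `lookup`) could look roughly like this. `MiniStore` is a hypothetical stand-in that keys canonical-form strings by a raw `u64`; the PR's `SchemaStore` uses its own `Fingerprint` and `Schema` types, so treat the signatures here as assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the PR's `SchemaStore`, keyed by a raw
/// 64-bit fingerprint and holding canonical-form strings.
#[derive(Default)]
struct MiniStore {
    schemas: HashMap<u64, String>,
}

impl MiniStore {
    /// Registers a schema under `fp`. Re-registering the same schema is a
    /// no-op; a different schema under the same fingerprint is a collision.
    fn register(&mut self, fp: u64, canonical: &str) -> Result<u64, String> {
        if let Some(existing) = self.schemas.get(&fp) {
            if existing.as_str() != canonical {
                return Err(format!("fingerprint collision for {fp:#018x}"));
            }
            return Ok(fp);
        }
        self.schemas.insert(fp, canonical.to_owned());
        Ok(fp)
    }

    /// Borrows the stored schema; callers clone only if they need ownership.
    fn lookup(&self, fp: u64) -> Option<&str> {
        self.schemas.get(&fp).map(String::as_str)
    }
}
```

Confirming against the stored schema on registration keeps the hot lookup path a plain map probe, which matches the observation that fingerprinting should not happen while decoding.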

scovich · 2025-07-26T17:53:10Z

arrow-avro/src/reader/mod.rs

+        self
    }

-    fn build_impl<R: BufRead>(self, reader: &mut R) -> Result<(Header, Decoder), ArrowError> {


Are these methods deleted? Or moved? Or just github is giving a messy diff?

These methods got deleted. I was able to confirm that only the Reader would ever expect a Header and was able to change build and build_decoder to this:

    /// Create a [`Reader`] from this builder and a `BufRead`
    pub fn build<R: BufRead>(self, mut reader: R) -> Result<Reader<R>, ArrowError> {
        self.validate()?;
        let header = read_header(&mut reader)?;
        let decoder = self.make_decoder(Some(&header))?;
        Ok(Reader {
            reader,
            header,
            decoder,
            block_decoder: BlockDecoder::default(),
            block_data: Vec::new(),
            block_cursor: 0,
            finished: false,
        })
    }

    /// Create a [`Decoder`] from this builder.
    pub fn build_decoder(self) -> Result<Decoder, ArrowError> {
        self.validate()?;
        self.make_decoder(None)
    }

scovich · 2025-07-26T17:54:01Z

arrow-avro/src/reader/mod.rs

    ///
-    /// When enabled, string data from Avro files will be loaded into
-    /// Arrow's StringViewArray instead of the standard StringArray.
-    pub fn with_utf8_view(mut self, utf8_view: bool) -> Self {


There seems to be considerable code movement in this part of the file... makes it hard to see what meaningfully changed. Is there a way to clean up the diff?

100% I'm working on that now.

Tried to clean up the diff with my latest pushes. Let me know if that's better and easier to follow.

scovich · 2025-07-26T17:55:18Z

arrow-avro/src/reader/mod.rs

+                    // No initial fingerprint; the first record must contain one.
+                    // A temporary decoder is created from the reader schema.
+                    _ => {
+                        let dec = self.make_record_decoder(&reader_schema, None)?;
+                        (None, dec)


This looks error-prone... but I guess there's no way to avoid it if the spec allows this?

This is more of the default state that we'd need to cover. What we could do is if the schema_store is set without an active_fingerprint, then throw an explicit error in the Decoder that is more clear than ArrowError::ParseError(format!("Unknown fingerprint: {new_fingerprint:?}")).

I'll clean that up in the morning, this is a good call out!

I added this check + early failure to the beginning of Decoder::decode to help clean this up a bit:

    if self.active_fingerprint.is_none()
        && self.writer_schema_store.is_some()
        && !data.starts_with(&SINGLE_OBJECT_MAGIC)
    {
        return Err(ArrowError::ParseError(
            "Expected single‑object encoding fingerprint prefix for first message \
             (writer_schema_store is set but active_fingerprint is None)"
                .into(),
        ));
    }

Let me know what you think.

scovich · 2025-07-26T17:57:05Z

arrow-avro/src/reader/mod.rs

+                    // A temporary decoder is created from the reader schema.
+                    _ => {


Is it valid to ignore the case where we have Some(fp) but no schema store? That seems like an error?

100% It's an error, I'm just doing that check at the start, in the validate method:

    (None, _, Some(_), _) => Err(ArrowError::ParseError(
        "Active fingerprint requires a writer schema store".into(),
    )),

scovich · 2025-07-26T17:58:34Z

arrow-avro/src/reader/mod.rs

+                Ok(Decoder {
+                    batch_size: self.batch_size,
+                    decoded_rows: 0,
+                    active_fp: init_fp,
+                    active_decoder: initial_decoder,
+                    cache: HashMap::new(),
+                    lru: VecDeque::new(),


These two constructor calls seem to have a lot of redundancy. Would it be worthwhile to factor out the args that actually differ, and create the decoder only once, outside the match?

That was a good call out. I included this abstraction in my latest commit.

scovich

Heading out the door for a couple days, but this refresh looks way better at a glance.

Will hopefully get a more thorough pass on Wed

scovich · 2025-07-28T13:57:20Z

arrow-avro/src/reader/mod.rs

+        while self.cache.len() > self.max_cache_size {
+            if let Some(lru_key) = self.cache.keys().next().cloned() {
+                self.cache.shift_remove(&lru_key);


This will pay quadratic work for a cache with a lot of extra entries. Hopefully that's a rare case tho?

Actually... looking at the code, there is only one call site for this method, and there will be at most one extra entry to remove. We should probably just bake that in at the call site, instead of splitting the logic up like this?

Pushed this change up in my latest commit. That was a good catch.
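A rough sketch of the "evict at most one entry at the call site" idea. It uses `HashMap` plus `VecDeque` from the standard library rather than the ordered map the PR used at the time, and `DecoderCache` and its methods are hypothetical names, so this only illustrates the shape of the logic.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical bounded cache keyed by fingerprint. `order` tracks recency,
/// oldest first; `D` stands in for a decoder type.
struct DecoderCache<D> {
    max_size: usize,
    entries: HashMap<u64, D>,
    order: VecDeque<u64>,
}

impl<D> DecoderCache<D> {
    fn new(max_size: usize) -> Self {
        Self { max_size, entries: HashMap::new(), order: VecDeque::new() }
    }

    fn len(&self) -> usize {
        self.entries.len()
    }

    /// Inserts `decoder` and evicts at most one entry, since a single insert
    /// can push the cache at most one entry over its limit.
    fn insert(&mut self, fp: u64, decoder: D) {
        if self.entries.insert(fp, decoder).is_some() {
            // Key was already present: drop its old recency slot.
            self.order.retain(|k| *k != fp);
        }
        self.order.push_back(fp);
        if self.entries.len() > self.max_size {
            if let Some(oldest) = self.order.pop_front() {
                self.entries.remove(&oldest);
            }
        }
    }

    /// Removes and returns the decoder for `fp`, if cached.
    fn take(&mut self, fp: u64) -> Option<D> {
        let decoder = self.entries.remove(&fp)?;
        self.order.retain(|k| *k != fp);
        Some(decoder)
    }
}
```

Because eviction happens inline with the insert, there is no separate loop that could do repeated work over a large backlog.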

jecsand838 · 2025-07-29T04:18:24Z

Heading out the door for a couple days, but this refresh looks way better at a glance.

Will hopefully get a more thorough pass on Wed

@scovich Really appreciate the solid review on a bigger PR like this. I got those changes pushed up and the code is definitely looking much better.

scovich

Made a full pass now. Definitely headed a good direction.

scovich · 2025-07-30T19:02:40Z

arrow-avro/src/reader/mod.rs

+                // A batch is complete when its `remaining_capacity` is 0. It may be completed early if
+                // a schema change is detected or there are insufficient bytes to read the next prefix.
+                // A schema change requires a new batch.


This comment seems a bit misplaced? Should it be at L193 below?

Tho re-reading, comment seems to talk about both locations so it probably won't fit well in either one. Maybe it should be at L185 and explain the loop as a whole?

scovich · 2025-07-30T19:06:33Z

arrow-avro/src/reader/mod.rs

            // Forcing the batch to be full ensures `flush` is called next.
-            if self.decoded_rows > 0 {
-                self.decoded_rows = self.batch_size;
+            if self.remaining_capacity < self.batch_size {


Interesting.. it's possible for two schema changes to come with no rows in between?
And this check prevents emitting an empty batch in that corner case?

@scovich That was the intention. Not sure how common this scenario is in the real world, however with single object encoding you can legally receive two different schema fingerprints back to back before any rows are decoded.

Also I'll update the comment on L254 to better reflect the changes.

scovich · 2025-07-30T19:09:19Z

arrow-avro/src/reader/mod.rs

+                    initial_decoder,
+                    init_fingerprint,


Seems better to pick one or the other of init_ vs. initial_? (slight preference toward the latter)

bump? we still have both initial_ and init_ here?

Ugh, I thought I had changed that. Sorry I was tired last night. I'll make sure that's initial_fingerprint in my next push.

Ok this is resolved. Can confirm the init_ is gone in my latest push. Sorry about making you have to catch that twice.

scovich · 2025-07-30T19:11:26Z

arrow-avro/src/reader/mod.rs

-            (Some(_), _, None, true) => Err(ArrowError::ParseError(
-                "static_store_mode=true requires an active fingerprint".into(),
-            )),
            _ => Ok(()),


Instead of defaulting to Ok, it seems better to enumerate the valid cases and default (if necessary) to a generic Err?

On a related note -- have we done the full "truth table" for these values, to determine which combos are definitely valid vs. definitely invalid? Otherwise I worry we might overlook some invalid or ambiguous cases.

I pushed up those improvements and included a truth table in the comments. Really good idea.

Also I removed the self.validate()?; call from the ReaderBuilder::build method since it's only really needed in ReaderBuilder::build_decoder
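As an illustration of the "enumerate every combination" approach, here is a sketch over three boolean flags. Which combinations count as valid is an assumption made for this example (the PR's actual truth table lives in its code comments), and the third error message is invented here; the first two are quoted from the diffs above.

```rust
/// Hypothetical three-flag view of the builder's configuration.
/// The validity of each row is assumed for illustration only.
fn validate(has_store: bool, has_fingerprint: bool, static_mode: bool) -> Result<(), String> {
    match (has_store, has_fingerprint, static_mode) {
        // No store, no fingerprint, no static mode: plain decoding.
        (false, false, false) => Ok(()),
        // Store present: the fingerprint is optional unless static mode is on.
        (true, false, false) | (true, true, false) | (true, true, true) => Ok(()),
        (false, true, _) => Err("Active fingerprint requires a writer schema store".into()),
        (true, false, true) => {
            Err("static_store_mode=true requires an active fingerprint".into())
        }
        // Assumed rule, not taken from the PR.
        (false, false, true) => {
            Err("static_store_mode=true requires a writer schema store".into())
        }
    }
}
```

With no wildcard default arm, the compiler's exhaustiveness check forces any newly added flag or state to be classified explicitly as valid or invalid.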

scovich · 2025-07-30T19:14:45Z

arrow-avro/src/schema.rs

+        Some(ns) if !name.contains('.') => format!("{ns}.{name}"),
+        _ => name.to_string(),


What are the rules for formatting complex field names in avro? For example, is a.`b.c`.d (a three-deep path) allowed? What about a."hi".b? etc.

Asking because the default match arm seems a bit questionable, because it implicitly covers Some(ns) if name.contains('.') but then ignores ns?

What are the rules for formatting complex field names in avro?

Avro does not support “complex” JSON‑dot or quoted paths inside a single name.

Every identifier (whether it is a name on a record/enum/fixed type, a field name, or an enum symbol) must match the regex ^[A-Za-z_][A-Za-z0-9_]*$.

The only place a period . may appear is between such identifiers when building a namespace‑qualified full name. When the name attribute already contains a ., Avro treats that string as the full name and the separate namespace attribute (if present) must be ignored.

For example, is a.b.c.d (a three-deep path) allowed? What about a."hi".b? etc.

a."hi".b and a.`b.c`.d are invalid because of the quotes / backticks.

a.b.c.d is valid when used as the name of a record/enum/fixed type, where it is parsed as namespace: a.b.c, name: d.

Asking because the default match arm seems a bit questionable, because it implicitly covers Some(ns) if name.contains('.') but then ignores ns?

Basically once a name contains a dot it is already the fullname, and namespace must be ignored. Not sure exactly why Avro has two different ways to express a fullname, but they are both valid and should be handled.

Got it, thanks for the explanation of yet another intricacy of the spec.

Maybe worth a code comment summarizing this?
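A small sketch of the naming rule with the summary written as comments. `full_name` mirrors the match shown in the diff above; `split_full_name` is a hypothetical helper added only to show the `a.b.c.d` case.

```rust
/// Resolves an Avro full name. A `name` that already contains a dot is
/// treated as the full name, and `namespace` is ignored in that case.
fn full_name(name: &str, namespace: Option<&str>) -> String {
    match namespace {
        Some(ns) if !name.contains('.') => format!("{ns}.{name}"),
        // Covers both "no namespace" and "name is already fully qualified".
        _ => name.to_string(),
    }
}

/// Splits a full name into (namespace, simple name) at the last dot.
/// (Hypothetical helper, for illustration only.)
fn split_full_name(full: &str) -> (Option<&str>, &str) {
    match full.rsplit_once('.') {
        Some((ns, name)) => (Some(ns), name),
        None => (None, full),
    }
}
```

For example, `full_name("d", Some("a.b.c"))` and `full_name("a.b.c.d", Some("ignored"))` both yield `a.b.c.d`.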

scovich · 2025-07-30T19:39:40Z

arrow-avro/src/reader/mod.rs

+        let new_decoder = if let Some(decoder) = self.cache.shift_remove(&new_fingerprint) {
+            decoder
+        } else {


Suggested change

-        let new_decoder = if let Some(decoder) = self.cache.shift_remove(&new_fingerprint) {
-            decoder
-        } else {
+        let new_decoder = self.cache.shift_remove(&new_fingerprint).unwrap_or_else(|| {

Oh... all those ? complicate things a lot. Never mind.

Actually, what if this helper method only dealt with the decoder, and the (single) caller installed it? Instead of:

self.prepare_schema_switch(new_fp)?;

do

    let new_decoder = match self.cache.shift_remove(&new_fingerprint) {
        Some(decoder) => decoder,
        None => self.create_decoder_for(new_fingerprint)?,
    };
    self.pending_schema = Some((new_fingerprint, new_decoder))

Where create_decoder_for is the logic from this else block?

Super clean approach! I made those changes.
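A self-contained sketch of why the `match` form plays well with `?` where `unwrap_or_else` does not. The decoder is modelled as a `String`, and this `create_decoder_for` is a hypothetical free function standing in for the method suggested above.

```rust
use std::collections::HashMap;

/// Hypothetical fallible constructor standing in for `create_decoder_for`.
fn create_decoder_for(fp: u64) -> Result<String, String> {
    if fp == 0 {
        Err("Unknown fingerprint: 0".to_string())
    } else {
        Ok(format!("decoder-{fp}"))
    }
}

/// Reuses a cached decoder or builds a new one, propagating build errors.
fn decoder_for(cache: &mut HashMap<u64, String>, fp: u64) -> Result<String, String> {
    let decoder = match cache.remove(&fp) {
        Some(decoder) => decoder,
        // `?` works here because the arm sits directly in the function body;
        // inside an `unwrap_or_else` closure it would apply to the closure's
        // return type instead of this function's.
        None => create_decoder_for(fp)?,
    };
    Ok(decoder)
}
```

The caller then owns the decoder and decides where to install it, which is the split of responsibilities proposed in the review comment.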

scovich · 2025-07-30T19:54:02Z

arrow-avro/src/reader/mod.rs

+            (None, _, Some(_)) => Err(ArrowError::ParseError(
+                "Active fingerprint requires a writer schema store".into(),
+            )),
+            _ => Ok(()),


Are we certain all other combos are valid? Perhaps better to enumerate the known-good and known-bad cases instead? Seems like we shouldn't even need a default match arm at that point.

I'm going with your enumeration suggestion. That will be much cleaner and more maintainable. Ty for calling that out.

scovich · 2025-07-30T20:06:28Z

arrow-avro/src/schema.rs

+    let mut fp = i as u64;
+    let mut j = 0;
+    while j < 8 {
+        fp = (fp >> 1) ^ (EMPTY & (0u64.wrapping_sub(fp & 1)));
+        j += 1;
+    }
+    fp


I guess we can't use iterators in const context?
Otherwise 0..8 would be helpful here.

Still doesn't seem to be stable yet: rust-lang/rust#87575

This would be nice though:

    const fn one_entry(i: usize) -> u64 {
        let mut fp = i as u64;
        for _ in 0..8 {
            fp = (fp >> 1) ^ (EMPTY & (0u64.wrapping_sub(fp & 1)));
        }
        fp
    }

I'll add comments with those details.

Yeah... if iterator machinery were all const, this whole function would just be a fold:

(0..8).fold(i as u64, |fp, _| (fp >> 1) ^ (EMPTY & (0u64.wrapping_sub(fp & 1))))

Still doesn't seem to be stable yet: rust-lang/rust#87575

Wow, that led down a fascinating rabbit hole, ending at
https://github.com/oli-obk/rfcs/blob/const-trait-impl/text/0000-const-trait-impls.md

I'll be interested to see if/when that ever lands!

jecsand838 · 2025-08-04T20:05:11Z

@scovich @alamb

I know this PR has a lot of comments and quite a few refinements were needed. I also know it was way too big.

To move forward, would it be best to simply focus on getting the schema.rs changes in first (~500-600 LOC) and then follow up with an arrow-avro/src/reader/mod.rs PR?

I do apologize for both the size of this PR and the quality issues. I'll do better going forward. Thank you again @scovich for the amount of time and effort you've put into reviewing this.

Let me know what I should do and I'll get on it immediately.

scovich · 2025-08-04T22:41:44Z

this PR has a lot of comments and quite a few refinements were needed. I also know it was way too big.

Honestly, I think the "state machine" for avro decoding is just really complex. It felt like a lot of the questions and churn ultimately come from that underlying complexity. And my lack of familiarity with the avro spec. Not sure how easily those issues could be avoided merely by raising a smaller PR?

To move forward, would it be best to simply focus on getting the schema.rs changes in first (~500-600 LOC) and then follow up with an arrow-avro/src/reader/mod.rs PR?

Interesting. By "schema.rs changes" you mean adding the new SchemaStore and fingerprinting infrastructure? With unit tests but not yet integrated into the actual decoder? That does seem like a good idea. I don't think there's any outstanding controversy in that part of the code (just a couple of follow-on items)?

I do apologize for both the size of this PR and the quality issues.

This is NOT the kind of run of the mill logic that I would normally associate with "quality issues."
For me, at least, this has been an invigorating PR to review -- not a frustrating one.
Some hard core software engineering here with depth and subtlety to work through.

Let me know what I should do and I'll get on it immediately.

Now that we know more, the split you propose seems quite reasonable; I'm not sure it was obvious two weeks ago tho?

jecsand838 · 2025-08-04T23:04:20Z

To move forward, would it be best to simply focus on getting the schema.rs changes in first (~500-600 LOC) and then follow up with an arrow-avro/src/reader/mod.rs PR?

Interesting. By "schema.rs changes" you mean adding the new SchemaStore and fingerprinting infrastructure? With unit tests but not yet integrated into the actual decoder? That does seem like a good idea. I don't think there's any outstanding controversy in that part of the code (just a couple of follow-on items)?

That's correct and that's my understanding as well.

I do apologize for both the size of this PR and the quality issues.

This is NOT the kind of run of the mill logic that I would normally associate with "quality issues." For me, at least, this has been an invigorating PR to review -- not a frustrating one. Some hard core software engineering here with depth and subtlety to work through.

I really appreciate that! I just thought I should have caught some of that subtlety upfront.

Let me know what I should do and I'll get on it immediately.

Now that we know more, the split you propose seems quite reasonable; I'm not sure it was obvious two weeks ago tho?

It crossed my mind honestly. I just thought that:

It would be helpful to see the context of how the SchemaStore would be used.
I saw the functionality as coupled (even if loosely) and wanted to ensure there was test coverage in mod.rs

I'll get that PR up right now and link it here.

# Which issue does this PR close?

- Part of #4886
- Pre-work for #8006

# Rationale for this change

Apache Avro’s [single object encoding](https://avro.apache.org/docs/1.11.1/specification/#single-object-encoding) prefixes every record with the marker `0xC3 0x01` followed by a `Rabin` [schema fingerprint](https://avro.apache.org/docs/1.11.1/specification/#schema-fingerprints) so that readers can identify the correct writer schema without carrying the full definition in each message. While the current `arrow‑avro` implementation can read container files, it cannot ingest these framed messages or handle streams where the writer schema changes over time.

The Avro specification recommends computing a 64‑bit CRC‑64‑AVRO (Rabin) hashed fingerprint of the [parsed canonical form of a schema](https://avro.apache.org/docs/1.11.1/specification/#parsing-canonical-form-for-schemas) to look up the `Schema` from a local schema store or registry.

This PR introduces **`SchemaStore`** and **fingerprinting** to enable:

* **Zero‑copy schema identification** for decoding streaming Avro messages published in single‑object format (i.e. Kafka, Pulsar, etc) into Arrow.
* **Dynamic schema evolution** by laying the foundation to resolve writer reader schema differences on the fly.

**NOTE:** Integration with `Decoder` and `Reader` coming in next PR.

# What changes are included in this PR?

| Area | Highlights |
| --- | --- |
| **`schema.rs`** | *New* `Fingerprint`, `SchemaStore`, and `SINGLE_OBJECT_MAGIC`; canonical‑form generator; Rabin fingerprint calculator; `compare_schemas` helper. |
| **`lib.rs`** | `mod schema` is now `pub` |
| **Unit tests** | New tests covering fingerprint generation, store registration/lookup, unknown‑fingerprint errors, and interaction with UTF8‑view decoding. |
| **Docs & Examples** | Extensive inline docs with examples on all new public methods / structs. |

# Are these changes tested?

Yes. New tests cover:

1. **Fingerprinting** against the canonical examples from the Avro spec
2. **`SchemaStore` behavior** deduplication, duplicate registration, and lookup.

# Are there any user-facing changes?

N/A

jecsand838 · 2025-08-05T20:40:07Z

@scovich @alamb Now this one is much smaller. Were there any remaining changes still needed?

scovich

Good enough for now, I think.

There are lots of follow-ups to simplify the logic, especially places where we funnel two distinct cases through one code path that has to tease them back apart. But I think there's ongoing work to address those?

jecsand838 · 2025-08-05T23:37:27Z

Good enough for now, I think.

There are lots of follow-ups to simplify the logic, especially places where we funnel two distinct cases through one code path that has to tease them back apart. But I think there's ongoing work to address those?

100% I have that code mostly ready. Should only take me a few hours to get that follow-up PR up.

alamb

Thanks @jecsand838 and @scovich for the review -- very much appreciated

I skimmed this PR and it (as the other PRs from @jecsand838 ) looks well commented and tested (and therefore maintainable). I am relying on @scovich 's work for the detailed review.

Thank you

alamb · 2025-08-07T20:33:08Z

🚀

@jecsand838

# Which issue does this PR close?

- Part of #4886
- Extends work initiated in #8006

# Rationale for this change

This introduces support for Confluent schema registry ID handling in the arrow-avro crate, adding compatibility with Confluent's wire format. These improvements enable streaming Apache Kafka, Redpanda, and Pulsar messages with Avro schemas directly into arrow-rs.

# What changes are included in this PR?

- Adds Confluent support
- Adds initial support for SHA256 and MD5 algorithm types. Rabin remains the default.

# Are these changes tested?

Yes, existing tests are all passing, and tests for ID handling have been added. Benchmark results show no appreciable changes.

# Are there any user-facing changes?

- Confluent users need to provide the ID fingerprint when using the `set` method, unlike the `register` method which generates it from the schema on the fly. Existing API behavior has been maintained.
- SchemaStore TryFrom now accepts a `&HashMap<Fingerprint, AvroSchema>`, rather than a `&[AvroSchema]`

Huge shout out to @jecsand838 for his collaboration on this!

---------

Co-authored-by: Connor Sanders <[email protected]>

github-actions bot added the arrow Changes to the arrow crate label Jul 26, 2025

jecsand838 force-pushed the avro-reader-schema-store branch 4 times, most recently from b549452 to a4a4df8 Compare July 26, 2025 03:27

jecsand838 force-pushed the avro-reader-schema-store branch from a4a4df8 to ca39cba Compare July 26, 2025 04:06

Added arrow-avro SchemaStore and fingerprint support to schema.rs…

4890faa

… and made the schema module public. Integrated new `SchemaStore` to the `Decoder` in `reader/mod.rs`. Stubbed out `AvroField::resolve_from_writer_and_reader` in `codec.rs`. Added new tests to cover changes

jecsand838 force-pushed the avro-reader-schema-store branch from ca39cba to 4890faa Compare July 26, 2025 04:42

scovich reviewed Jul 26, 2025

View reviewed changes

jecsand838 and others added 4 commits July 27, 2025 16:44

Update arrow-avro/src/reader/mod.rs

f8040c6

Co-authored-by: Ryan Johnson <[email protected]>

Update arrow-avro/src/reader/mod.rs

f2f34d0

Co-authored-by: Ryan Johnson <[email protected]>

Update arrow-avro/src/reader/mod.rs

3db9aed

Co-authored-by: Ryan Johnson <[email protected]>

Address PR Comments

da7b1b9

jecsand838 force-pushed the avro-reader-schema-store branch from 9dde02c to da7b1b9 Compare July 28, 2025 04:41

scovich reviewed Jul 28, 2025

View reviewed changes

jecsand838 and others added 2 commits July 28, 2025 12:05

Update arrow-avro/src/reader/mod.rs

ed7fb49

Co-authored-by: Ryan Johnson <[email protected]>

Update arrow-avro/src/reader/mod.rs

8e6face

Co-authored-by: Ryan Johnson <[email protected]>

jecsand838 force-pushed the avro-reader-schema-store branch 4 times, most recently from 0608fd1 to 98ae29a Compare July 29, 2025 04:00

Address Remaining PR Comments

25c3899

jecsand838 force-pushed the avro-reader-schema-store branch from 98ae29a to 25c3899 Compare July 29, 2025 04:06

jecsand838 requested a review from scovich July 29, 2025 04:18

scovich reviewed Jul 30, 2025

View reviewed changes

Update arrow-avro/src/reader/mod.rs

bf55dba

Co-authored-by: Ryan Johnson <[email protected]>

jecsand838 force-pushed the avro-reader-schema-store branch from 98f0b91 to 4f734e2 Compare August 4, 2025 15:33

Update arrow-avro/src/reader/mod.rs

9c828c6

Co-authored-by: Ryan Johnson <[email protected]>

jecsand838 force-pushed the avro-reader-schema-store branch from 4924637 to 9c828c6 Compare August 4, 2025 19:16

jecsand838 mentioned this pull request Aug 4, 2025

Add arrow-avro SchemaStore and fingerprinting #8039

Merged

jecsand838 force-pushed the avro-reader-schema-store branch from 6bec449 to 78b6633 Compare August 5, 2025 08:15

Remove LRU cache feature and refactor StoreStore + AvroSchema handlin…

dc56c70

…g in arrow-avro reader.

jecsand838 force-pushed the avro-reader-schema-store branch from 78b6633 to dc56c70 Compare August 5, 2025 08:21

Merge branch 'main' into avro-reader-schema-store

5a1b7e4

scovich approved these changes Aug 5, 2025

View reviewed changes

arrow-avro/src/reader/mod.rs Outdated Show resolved Hide resolved

arrow-avro/src/reader/mod.rs Outdated Show resolved Hide resolved

arrow-avro/src/reader/mod.rs Outdated Show resolved Hide resolved

jecsand838 and others added 3 commits August 5, 2025 18:37

Update arrow-avro/src/reader/mod.rs

fc6a5cb

Co-authored-by: Ryan Johnson <[email protected]>

Update arrow-avro/src/reader/mod.rs

eca7ca3

Co-authored-by: Ryan Johnson <[email protected]>

Update arrow-avro/src/reader/mod.rs

ef9051e

Co-authored-by: Ryan Johnson <[email protected]>

jecsand838 force-pushed the avro-reader-schema-store branch from 6724d75 to ef9051e Compare August 5, 2025 23:40

alamb approved these changes Aug 6, 2025

View reviewed changes

jecsand838 mentioned this pull request Aug 6, 2025

Add arrow-avro Decoder Benchmarks #8025

Merged

Merge branch 'apache:main' into avro-reader-schema-store

da0cfc2

jecsand838 force-pushed the avro-reader-schema-store branch from 71f63c7 to 5edc763 Compare August 7, 2025 18:17

Updated decoder benchmarks to include Avro single-object prefix gener…

c450401

…ation and schema fingerprinting.

jecsand838 force-pushed the avro-reader-schema-store branch from 5edc763 to c450401 Compare August 7, 2025 18:34

alamb merged commit 4a21443 into apache:main Aug 7, 2025
24 checks passed

jecsand838 deleted the avro-reader-schema-store branch August 10, 2025 00:25

nathaniel-d-ef mentioned this pull request Aug 28, 2025

Adds Confluent wire format handling to arrow-avro crate #8242

Merged
