Metadata Log Entries metadata table #667

kevinjqliu · 2024-04-28T16:59:56Z

Resolves #594 (and part of #511)

This PR creates a metadata table for "Metadata Log Entries", similar to its Spark equivalent (metadata_log_entries).

To query the metadata table, use:

tbl.inspect.metadata_log_entries()

References

Spark metadata log entries table is implemented in MetadataLogEntriesTable.java
Add Snapshots table metadata #524 (snapshots metadata table)
Add Refs metadata table #602 (references metadata table)
Add entries metadata table #551 (entries metadata table)

Metadata log: appending the latest metadata

The metadata log entries table appends the latest TableMetadata to the log as part of its operation (code).

This leads to surprising behavior: the last row of the metadata log entries table depends on when the query ran.

For example,

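import time

# `spark` is an active SparkSession; `identifier` is the table name
a = spark.sql(f"SELECT * FROM {identifier}.metadata_log_entries").toPandas()
time.sleep(5)  # wait before running the same query again
b = spark.sql(f"SELECT * FROM {identifier}.metadata_log_entries").toPandas()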

(Pdb) display(a)
display (a):                 timestamp                                               file  latest_snapshot_id  latest_schema_id  latest_sequence_number
0 2024-04-28 17:21:31.336  s3://warehouse/default/table_metadata_log_entr...                 NaN               NaN                     NaN
1 2024-04-28 17:21:31.531  s3://warehouse/default/table_metadata_log_entr...        4.105762e+18               0.0                     0.0
2 2024-04-28 17:21:31.600  s3://warehouse/default/table_metadata_log_entr...        7.201925e+18               0.0                     0.0
3 2024-04-28 17:21:34.204  s3://warehouse/default/table_metadata_log_entr...        1.984627e+18               0.0                     0.0

(Pdb) display(b)
display (b):                 timestamp                                               file  latest_snapshot_id  latest_schema_id  latest_sequence_number
0 2024-04-28 17:21:31.336  s3://warehouse/default/table_metadata_log_entr...                 NaN               NaN                     NaN
1 2024-04-28 17:21:31.531  s3://warehouse/default/table_metadata_log_entr...        4.105762e+18               0.0                     0.0
2 2024-04-28 17:21:31.600  s3://warehouse/default/table_metadata_log_entr...        7.201925e+18               0.0                     0.0
3 2024-04-28 17:21:42.336  s3://warehouse/default/table_metadata_log_entr...        1.984627e+18               0.0                     0.0

# Notice that the timestamps in the last rows of a and b differ by more than 5 seconds
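
A minimal sketch of the append described above, for readers who want to see the shape of the operation. It uses hypothetical stand-in dataclasses (`MetadataLogEntry` and `TableState` here are illustrative, not the PyIceberg classes) and only assumes that the stored log lists previous metadata files while the current file and its last-updated time live on the loaded table state. The final row is therefore built at query time rather than read from the log.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class MetadataLogEntry:
    """Hypothetical stand-in: one metadata file and when it became current."""

    metadata_file: str
    timestamp_ms: int


@dataclass
class TableState:
    """Hypothetical stand-in for the parts of a loaded table used here."""

    metadata_location: str
    last_updated_ms: int
    metadata_log: List[MetadataLogEntry] = field(default_factory=list)


def metadata_log_entries(table: TableState) -> List[MetadataLogEntry]:
    # The stored log covers previous metadata files only, so the
    # currently loaded metadata file is added as the final row.
    current = MetadataLogEntry(table.metadata_location, table.last_updated_ms)
    # List concatenation returns a new list; the stored log is not mutated.
    return table.metadata_log + [current]
```

Because the concatenation builds a new list, calling the function repeatedly does not grow the stored log; only the returned rows include the current file.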

Snapshot `sequence-number` default value

There's an issue when reading V1 metadata: the snapshot's sequence-number is read as None instead of 0. According to the Iceberg spec, when reading v1 metadata for v2, the Snapshot field sequence-number must default to 0 (source).

Snapshot JSON:
sequence-number was added and is required; default to 0 when reading v1 metadata

Similarly, when writing V1 metadata, the sequence-number must not be written. This is achieved by the new field_serializer function in TableMetadataCommonFields.

Writing v1 metadata:
Snapshot field sequence-number should not be written
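
The two spec rules quoted above can be sketched with plain dictionaries. This is an illustration of the read and write behavior only, not the pydantic `field_serializer` used in this PR; the helper names `read_snapshot` and `write_snapshot` are hypothetical.

```python
from typing import Any, Dict

SEQUENCE_NUMBER = "sequence-number"


def read_snapshot(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a snapshot dict, defaulting `sequence-number` to 0.

    v1 metadata has no `sequence-number`, so a missing key is read as 0.
    """
    snapshot = dict(raw)
    snapshot.setdefault(SEQUENCE_NUMBER, 0)
    return snapshot


def write_snapshot(snapshot: Dict[str, Any], format_version: int) -> Dict[str, Any]:
    """Return the dict to serialize for the given table format version.

    For v1 the `sequence-number` key is dropped; otherwise it is kept as-is.
    """
    if format_version == 1:
        return {k: v for k, v in snapshot.items() if k != SEQUENCE_NUMBER}
    return dict(snapshot)
```

Keeping the default on the read side and the omission on the write side means the in-memory value is always an integer, while v1 files never gain a field the spec forbids.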

pyiceberg/table/metadata.py

pyiceberg/table/__init__.py

sungwy · 2024-04-30T12:51:11Z

pyiceberg/table/snapshots.py

@@ -226,7 +226,8 @@ def __eq__(self, other: Any) -> bool:
 class Snapshot(IcebergBaseModel):
    snapshot_id: int = Field(alias="snapshot-id")
    parent_snapshot_id: Optional[int] = Field(alias="parent-snapshot-id", default=None)
-    sequence_number: Optional[int] = Field(alias="sequence-number", default=None)
+    # cannot import `INITIAL_SEQUENCE_NUMBER` due to circular import
+    sequence_number: Optional[int] = Field(alias="sequence-number", default=0)


Is there a reason the default value for the sequence number has to be changed to 0 as opposed to None?

According to the spec, https://iceberg.apache.org/spec/#version-2

Snapshot JSON: sequence-number was added and is required; default to 0 when reading v1 metadata

Also added this in the PR description

@kevinjqliu Thanks for spotting this! We definitely need to read snapshot.sequence_number as 0 for v1. However, as we have observed in the test outcome, making sequence_number default to 0 here leads to sequence_number=0 being written to the snapshots of version 1 table metadata, which is not allowed by the spec:

Writing v1 metadata: Snapshot field sequence-number should not be written

I think we may need a new field_serializer in the TableMetadataCommonFields class or some other way to correct the behavior on write. WDYT?

Thank you! I missed the part about the V1 spec. Following your suggestion, I added a field_serializer for TableMetadataCommonFields. This ensures that the sequence-number field is not written for Snapshot pydantic objects in the V1 format.

pyiceberg/table/__init__.py

kevinjqliu · 2024-06-19T05:51:48Z

r? @Fokko @HonahX @syun64 please take a look when you get a chance

HonahX

@kevinjqliu Thanks for working on this! It looks great.

HonahX · 2024-06-24T07:22:16Z

pyiceberg/table/__init__.py

+        # imitates `addPreviousFile` from Java
+        # https://github.com/apache/iceberg/blob/8248663a2a1ffddd2664ea37b45882455466f71c/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1450-L1451
+        metadata_log_entries = self.tbl.metadata.metadata_log + [
+            MetadataLogEntry(metadata_file=self.tbl.metadata_location, timestamp_ms=self.tbl.metadata.last_updated_ms)


This line seems to act more like https://github.com/apache/iceberg/blob/8a70fe0ff5f241aec8856f8091c77fdce35ad256/core/src/main/java/org/apache/iceberg/MetadataLogEntriesTable.java#L62-L66.
I'm curious why you mention addPreviousFile here, since it seems more relevant to updating the metadata_log during a table commit.

BTW, this reminds me that non-REST catalogs currently do not update the metadata_log field during commit.

Good catch! I linked the wrong code. I was trying to find all the places where the metadata log was modified.

HonahX · 2024-06-24T07:56:43Z

pyiceberg/table/snapshots.py

@@ -226,7 +226,8 @@ def __eq__(self, other: Any) -> bool:
 class Snapshot(IcebergBaseModel):
    snapshot_id: int = Field(alias="snapshot-id")
    parent_snapshot_id: Optional[int] = Field(alias="parent-snapshot-id", default=None)
-    sequence_number: Optional[int] = Field(alias="sequence-number", default=None)
+    # cannot import `INITIAL_SEQUENCE_NUMBER` due to circular import
+    sequence_number: Optional[int] = Field(alias="sequence-number", default=0)


@kevinjqliu Thanks for spotting this! We definitely need to read snapshot.sequence_number as 0 for v1. However, as we have observed in the test outcome, making sequence_number default to 0 here leads to sequence_number=0 being written to the snapshots of version 1 table metadata, which is not allowed by the spec:

Writing v1 metadata: Snapshot field sequence-number should not be written

I think we may need a new field_serializer in the TableMetadataCommonFields class or some other way to correct the behavior on write. WDYT?

HonahX

@kevinjqliu Thanks for the quick update! Just have one minor comment for the test. Overall it looks great!

tests/table/test_snapshots.py

HonahX · 2024-06-26T19:10:18Z

Merged! @kevinjqliu Thanks for more great metadata table work! Thanks @corleyma @syun64 @Fokko for reviewing!

kevinjqliu added 6 commits April 28, 2024 11:06

add metadata_entries table with tests

9a0423d

make test work

ecec57e

remove comment

9e506c2

add doc

9c77d57

make lint

b26f08f

comment

58b0609

kevinjqliu marked this pull request as ready for review April 28, 2024 17:18

comment

f7dd165

corleyma reviewed Apr 29, 2024

pyiceberg/table/metadata.py (outdated, resolved)

corleyma reviewed Apr 29, 2024

pyiceberg/table/__init__.py (outdated, resolved)

use pa.field and set nullable properly

4655c97

sungwy reviewed Apr 30, 2024

pyiceberg/table/__init__.py (outdated, resolved)

Fokko mentioned this pull request May 13, 2024

Add metadata tables #511

Closed

8 tasks

kevinjqliu mentioned this pull request May 14, 2024

PyIceberg Near-Term Roadmap #736

Closed

39 tasks

kevinjqliu added 6 commits June 18, 2024 18:43

Merge branch 'main' into kevinjqliu/metadata_log_entries

d989802

use table snapshot_as_of_timestamp instead

8613c2e

default sequence_number to 0

f45f2ea

improve test

2d417da

improve docs

502e5d8

add comment

1cd5f93

fix tests with string output for sequence-number

3fe675d

HonahX reviewed Jun 24, 2024

HonahX mentioned this pull request Jun 24, 2024

[Feature] Support Metadata Log Update For Non-Rest Catalogs #849

Closed

kevinjqliu added 4 commits June 24, 2024 10:33

comment

6ce55f4

revert INITIAL_SEQUENCE_NUMBER changes

aa068ff

exclude sequence_number for v1

ab74a3e

inline function

f274ef9

kevinjqliu requested a review from HonahX June 24, 2024 18:35

HonahX approved these changes Jun 25, 2024

tests/table/test_snapshots.py (outdated, resolved)

test_serialize_snapshot_without_sequence_number

a338e17

Fokko approved these changes Jun 26, 2024

HonahX merged commit 9cb3cd5 into apache:main Jun 26, 2024
7 checks passed

kevinjqliu deleted the kevinjqliu/metadata_log_entries branch June 26, 2024 19:47
