
[feature request] Allow engines to time travel #600


Closed
kevinjqliu opened this issue Apr 12, 2024 · 9 comments

Comments

@kevinjqliu
Contributor

Feature Request / Improvement

When engines, such as Daft, read from the Table object (see scan_iceberg), it would be great if PyIceberg transparently handles time travel.

For example, to query an Iceberg table at a specific commit or timestamp, we can use PyIceberg to time travel to the particular snapshot-id or timestamp and then pass it into the engine.

There are several options to achieve this:

  1. Construct Table object with the metadata of a specific Snapshot. Maybe a function like Table.as_of(snapshot_id/timestamp) -> Table. This will make time travel transparent to the engine.
  2. Pass the Snapshot object to the engine. The function Table.snapshot_by_id -> Snapshot already exists, and a Snapshot represents a specific Iceberg commit. The engine would need to be able to read from both Snapshot and Table.

Happy to explore other options as well.

@sungwy
Collaborator

sungwy commented Apr 13, 2024

I think this is a great discussion item @kevinjqliu - thank you for raising this.

I'm a bit torn between whether we (PyIceberg) should be responsible for creating separate instances of Tables specific to that snapshot_id like you suggest, or if the engines should be responsible for accepting the snapshot_id as an argument to their API, and pass the value to the scan function, which already supports the snapshot ID as an argument.

For instance, the Polars scan function you shared takes the pyiceberg.Table as an argument and calls pyiceberg.Table.scan, but it doesn't expose the other arguments that scan supports, such as the snapshot ID.
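
A minimal sketch of the engine-side approach, assuming a hypothetical engine entry point named `scan_iceberg` that accepts an optional `snapshot_id` and forwards it to `Table.scan` as a keyword argument. `FakeTable` is a stand-in so the sketch runs without a catalog; it is not part of PyIceberg, and a real engine would do more with the returned scan.

```python
from typing import Any, Optional


def scan_iceberg(table: Any, snapshot_id: Optional[int] = None) -> Any:
    """Hypothetical engine entry point: forward the snapshot ID to PyIceberg.

    `table` is expected to behave like pyiceberg.table.Table, whose scan()
    accepts a `snapshot_id` keyword. None means "use the current snapshot".
    """
    return table.scan(snapshot_id=snapshot_id)


class FakeTable:
    """Stand-in for pyiceberg.table.Table, used only to exercise the sketch."""

    def scan(self, snapshot_id: Optional[int] = None) -> dict:
        # A real Table.scan returns a scan object; a dict is enough here.
        return {"snapshot_id": snapshot_id}


current = scan_iceberg(FakeTable())                   # no time travel
pinned = scan_iceberg(FakeTable(), snapshot_id=1234)  # read one snapshot
```

With this shape, the Table object stays "the table and its metadata", and choosing a snapshot remains an argument the engine exposes to its users.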

@kevinjqliu
Contributor Author

+1, I agree with you. Passing the snapshot-id should be an engine-specific implementation detail. I was thinking about the Spark/Trino syntax of AS OF <snapshot-id> / <timestamp>.

I interpret the Table object in PyIceberg to represent the "Table and its metadata", not the "Table state at a specific snapshot". That means reading a specific snapshot requires passing the snapshot ID as an argument to the scan function.

@corleyma

Still, an api like Table.as_of(snapshot_id/timestamp) -> Snapshot would be useful, even if reading requires then passing the correct arguments to Table.scan. In general it should be easier for pyiceberg users to get:

  • the snapshot id for a timestamp
  • the path to the metadata json file for a given snapshot id.
    • I really wish this was a property of the Snapshot class; is that possible or does this break correspondence between PyIceberg models and Iceberg spec?

@kevinjqliu
Contributor Author

an api like Table.as_of(snapshot_id/timestamp) -> Snapshot would be useful

Yea, it's helpful in situations where we need to inspect the Table state and get back the exact information at a specific time/snapshot-id. For example, being able to quickly get the metadata json of a specific snapshot.

This is more of a PyIceberg feature and not related to any engine implementation using PyIceberg.

@gupteaj

gupteaj commented Apr 19, 2024

Presto time travel reference - https://prestodb.io/docs/0.286/connector/iceberg.html#time-travel-using-version-system-version-and-timestamp-system-time
Time travel by snapshot_id uses an equality match, whereas the timestamp clause matches the closest (or equal) timestamp from the snapshots catalog table.
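
A minimal sketch of timestamp resolution, assuming "closest (or equal)" means the most recent snapshot committed at or before the requested timestamp. `SnapshotLogEntry` and `snapshot_id_as_of` are hypothetical names for illustration, not Presto or PyIceberg APIs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SnapshotLogEntry:
    """Simplified stand-in for one entry of a table's snapshot log."""
    snapshot_id: int
    timestamp_ms: int


def snapshot_id_as_of(log: List[SnapshotLogEntry], timestamp_ms: int) -> Optional[int]:
    """Return the ID of the latest snapshot committed at or before timestamp_ms."""
    candidates = [e for e in log if e.timestamp_ms <= timestamp_ms]
    if not candidates:
        return None  # the table had no snapshot yet at that time
    return max(candidates, key=lambda e: e.timestamp_ms).snapshot_id


# Illustrative data only.
log = [
    SnapshotLogEntry(snapshot_id=101, timestamp_ms=1_000),
    SnapshotLogEntry(snapshot_id=102, timestamp_ms=2_000),
    SnapshotLogEntry(snapshot_id=103, timestamp_ms=3_000),
]
```

Here a timestamp of 2_500 resolves to snapshot 102, an exact match on 2_000 also resolves to 102, and a timestamp before the first commit resolves to nothing.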

@sungwy
Collaborator

sungwy commented May 2, 2024

  • the path to the metadata json file for a given snapshot id.
    • I really wish this was a property of the Snapshot class; is that possible or does this break correspondence between PyIceberg models and Iceberg spec?

Hi @corleyma, this is an interesting suggestion. But I think it is important that we draw a distinction between a snapshot and a metadata json file. A metadata json file can exist without a snapshot - for example, if the table is newly created, the table will have a metadata json file describing its current state, without any snapshots committed to it. In the spec, you can see that the TableMetadata object that corresponds 1-1 to the metadata json file has the current-snapshot-id as an optional field.

Moreover, multiple different snapshots can also be committed between two consecutive metadata json files.

In summary, a snapshot is an optional property of a metadata json file, and hence I don't think it makes sense for a metadata json file to be an attribute of a Snapshot.
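
A rough sketch of that relationship, using simplified dataclasses rather than the real PyIceberg models. The field names mirror the spec's `current-snapshot-id` and `snapshots` fields; the location string is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Snapshot:
    snapshot_id: int
    timestamp_ms: int


@dataclass
class TableMetadata:
    """Simplified: one instance corresponds to one metadata json file."""
    location: str
    current_snapshot_id: Optional[int] = None  # optional, as in the spec
    snapshots: List[Snapshot] = field(default_factory=list)


# A newly created table: a metadata file exists, but no snapshot does.
new_table = TableMetadata(location="s3://bucket/db/tbl")

# A later metadata file can list several snapshots at once.
later = TableMetadata(
    location="s3://bucket/db/tbl",
    current_snapshot_id=2,
    snapshots=[Snapshot(1, 1_000), Snapshot(2, 2_000)],
)
```

The containment runs from metadata file to snapshots, which is why a back-reference from a Snapshot to a single metadata file is not well defined in this model.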

@corleyma

corleyma commented May 3, 2024

Moreover, multiple different snapshots can also be committed between two consecutive metadata json files.

In what situations would that occur? In my (possibly incorrect) mental model of how Iceberg works, any new snapshot should result in a new metadata file.

Anyway, I think the fact that not every metadata json file has a snapshot is less of a concern, so long as every snapshot is supposed to have a uniquely associated metadata json file... but it sounds like maybe that's not the case.

(As a separate aside: the fact that creating a table doesn't initialize a snapshot seems like a weird edge case to my mind. Is there a reason the spec was designed this way? And are there other scenarios where changes to table metadata don't result in a new snapshot?)

@kevinjqliu
Contributor Author

A little late to the party here, but to summarize:

To perform time travel, the engine can read (scan) a particular snapshot of the Iceberg table. The snapshot to scan can be specified either by snapshot id or by timestamp.

#748 provides a helper function to look up a particular snapshot id based on a timestamp.

Closing this out, please let me know if we should reopen this for any other discussion items.

@sungwy
Collaborator

sungwy commented Jun 21, 2024

Here's an example of how time travel can be supported in an engine using PyIceberg: Eventual-Inc/Daft#2426


4 participants