[feature request] Allow engines to time travel #600
Comments
I think this is a great discussion item @kevinjqliu - thank you for raising this. I'm a bit torn on whether we (PyIceberg) should be responsible for creating separate instances of Tables specific to that snapshot_id like you suggest, or whether the engines should be responsible for accepting the For instance, the Polars scan function you shared takes the pyiceberg.Table as an arg, and it uses pyiceberg.Table.scan, but it just doesn't utilize the rest of the function args it supports.
+1, I agree with you. Passing the snapshot-id should be an engine-specific implementation detail. I was thinking about the Spark/Trino syntax of I interpret the
Still, an API like
Yeah, it's helpful in situations where we need to manipulate the Table state and get back the exact information at the specific time/snapshot-id. For example, being able to quickly get the metadata json of a specific snapshot. This is more of a PyIceberg feature and not related to any engine implementation using PyIceberg.
Presto time travel reference - https://prestodb.io/docs/0.286/connector/iceberg.html#time-travel-using-version-system-version-and-timestamp-system-time
Hi @corleyma, this is an interesting suggestion. But I think it is important that we draw a distinction between a snapshot and a metadata json file. A metadata json file can exist without a snapshot - for example, if the table is newly created, the table will have a metadata json file describing its current state, without any snapshots committed to it. In the spec, you can see that the TableMetadata object that corresponds 1-1 to the metadata json file has the current-snapshot-id as an optional field. Moreover, multiple different snapshots can also be committed between two consecutive metadata json files. In summary, a snapshot is an optional property of a metadata json file, and hence I don't think it makes sense for a metadata json file to be an attribute of a Snapshot.
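To make the distinction above concrete, here is a minimal sketch of the relationship between a metadata json file and its snapshots. `MetadataFileSketch` and its fields are hypothetical, simplified stand-ins for illustration only; they are not PyIceberg's `TableMetadata` class, and the ids are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetadataFileSketch:
    """Hypothetical, simplified stand-in for one metadata json file."""

    table_uuid: str
    # Optional: a table can have metadata without any committed snapshot.
    current_snapshot_id: Optional[int] = None
    snapshot_ids: List[int] = field(default_factory=list)


# A newly created table: a metadata file exists, but no snapshot does.
v1 = MetadataFileSketch(table_uuid="example-uuid")

# A later metadata file may record more than one new snapshot at once.
v2 = MetadataFileSketch(
    table_uuid="example-uuid",
    current_snapshot_id=12,
    snapshot_ids=[11, 12],
)
```

In this toy model the snapshot hangs off the metadata file rather than the other way around, which mirrors the point that a snapshot is an optional property of a metadata json file.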
In what situations would that occur? In my (possibly incorrect) mental model of how Iceberg works, any new snapshot should result in a new metadata file. Anyway, I think the fact that not every metadata json file has a snapshot is less of a concern, so long as every snapshot is supposed to have a uniquely associated metadata json file... but it sounds like maybe that's not the case. (As a separate aside: the fact that creating a table doesn't initialize a snapshot seems like a weird edge case to my mind. Is there a reason the spec was designed this way? And, are there other scenarios where changes to table metadata don't result in a new snapshot?)
A little late to the party here, but to summarize: In order to perform time travel, the engine can read (scan) a particular snapshot of the Iceberg table. The scan of a particular snapshot can be specified either by the snapshot id or by a timestamp. #748 provides the helper function to look up a particular snapshot id based on a timestamp. Closing this out, please let me know if we should reopen this for any other discussion items.
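As a rough illustration of the timestamp-to-snapshot lookup summarized above, here is a self-contained sketch. `SnapshotLogEntry` and `snapshot_id_as_of` are hypothetical names invented for this example, and the log data is made up; this is not the helper added in #748, only one plausible way such a lookup could work, assuming the log is sorted by commit time.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SnapshotLogEntry:
    """Hypothetical stand-in for one entry of a table's snapshot log."""

    snapshot_id: int
    timestamp_ms: int


def snapshot_id_as_of(
    log: Sequence[SnapshotLogEntry], timestamp_ms: int
) -> Optional[int]:
    """Return the id of the latest snapshot committed at or before timestamp_ms.

    Assumes `log` is sorted by timestamp_ms in ascending order.
    Returns None if no snapshot existed yet at that time.
    """
    timestamps = [entry.timestamp_ms for entry in log]
    # Index of the first entry strictly after the requested time.
    idx = bisect_right(timestamps, timestamp_ms)
    if idx == 0:
        return None
    return log[idx - 1].snapshot_id


# Made-up example log with three commits.
log = [
    SnapshotLogEntry(snapshot_id=101, timestamp_ms=1_000),
    SnapshotLogEntry(snapshot_id=102, timestamp_ms=2_000),
    SnapshotLogEntry(snapshot_id=103, timestamp_ms=3_000),
]
```

An engine could then hand the resolved id to the table scan (the thread mentions that `pyiceberg.Table.scan` accepts more arguments than engines currently use), so the time travel logic stays in one place.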
Here's an example of how time travel can be supported in an engine using PyIceberg: Eventual-Inc/Daft#2426 |
Feature Request / Improvement
When engines, such as Daft, read from the Table object (see scan_iceberg), it would be great if PyIceberg transparently handles time travel. For example, to query an Iceberg table at a specific commit or timestamp, we can use PyIceberg to time travel to the particular snapshot-id or timestamp and then pass it into the engine.
There are several options to achieve this:
- Return a Table object with the metadata of a specific Snapshot. Maybe a function like Table.as_of(snapshot_id/timestamp) -> Table. This will make time travel transparent to the engine.
- Pass the Snapshot object to the engine. The function Table.snapshot_by_id -> Snapshot already exists, and represents a specific Iceberg commit. The engine will need to be able to read from both Snapshot and Table.
Happy to explore other options as well.
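A minimal sketch of what the first option could look like from the engine's side, under stated assumptions: `PinnedTable` and `FakeTable` are hypothetical classes written for this example, not PyIceberg APIs, and the only assumption about the real table is that its `scan` accepts a `snapshot_id` keyword.

```python
from dataclasses import dataclass
from typing import Any, Optional, Protocol


class Scannable(Protocol):
    """Anything exposing scan(snapshot_id=...), as assumed in this sketch."""

    def scan(self, snapshot_id: Optional[int] = None, **kwargs: Any) -> Any: ...


@dataclass(frozen=True)
class PinnedTable:
    """Hypothetical wrapper: a view of a table fixed to one snapshot."""

    table: Scannable
    snapshot_id: int

    def scan(self, **kwargs: Any) -> Any:
        # Always scan the pinned snapshot, ignoring any caller-supplied id.
        kwargs.pop("snapshot_id", None)
        return self.table.scan(snapshot_id=self.snapshot_id, **kwargs)


class FakeTable:
    """Toy table that records which snapshot a scan was asked for."""

    def scan(self, snapshot_id: Optional[int] = None, **kwargs: Any) -> dict:
        return {"snapshot_id": snapshot_id, **kwargs}


# An engine that only calls .scan() would read snapshot 102 without knowing it.
pinned = PinnedTable(table=FakeTable(), snapshot_id=102)
```

With a wrapper like this the engine keeps calling `scan()` as before, which is the sense in which the first option makes time travel transparent; the second option instead requires the engine to understand Snapshot objects directly.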