Skip to content

Support Appends with TimeTransform Partitions #703

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 81 commits into from

Conversation

sungwy
Copy link
Collaborator

@sungwy sungwy commented May 6, 2024

Support writes with TimeTransforms.

TimeTransforms are can be supported natively within pyarrow as pyarrow compute functions, without requiring conversions back and forth between Arrow and Python data types.

@sungwy sungwy requested a review from Fokko May 6, 2024 17:06
felixscherz and others added 3 commits May 6, 2024 10:32
Bumps [mkdocs-section-index](https://github.com/oprypin/mkdocs-section-index) from 0.3.8 to 0.3.9.
- [Release notes](https://github.com/oprypin/mkdocs-section-index/releases)
- [Commits](oprypin/mkdocs-section-index@v0.3.8...v0.3.9)

---
updated-dependencies:
- dependency-name: mkdocs-section-index
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [cython](https://github.com/cython/cython) from 3.0.8 to 3.0.10.
- [Release notes](https://github.com/cython/cython/releases)
- [Changelog](https://github.com/cython/cython/blob/master/CHANGES.rst)
- [Commits](cython/cython@3.0.8...3.0.10)

---
updated-dependencies:
- dependency-name: cython
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks Great! Thanks @syun64

def pyarrow_transform(self, source: IcebergType) -> "Callable[[pa.Array], pa.Array]": ...

@property
def supports_pyarrow_transform(self) -> bool:
Copy link
Contributor

@HonahX HonahX May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding a public method, how about we maintaining a private list of transform that supports pyarrow transform for the check at the beginning of append. For transforms that does not yet support pyarrow transform, we could throw NotImplementedError in pyarrow_transform

WDYT?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was following the existing convention for can_transform and preserves_order that are specified as class properties. I'm in favor of keeping it consistent with these

Copy link
Contributor

@HonahX HonahX May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sharing the context. My initial concern was that unlike can_transform and preserve_oder which behave differently across transforms, in the end the support_pyarrow_transform would return True for all the transforms, making it a little bit redundant. But on second thought, since we will likely not support pyarrow transform for other transforms in 0.7.0 release, this support_pyarrow_transform can be a useful property in the upcoming release.

We could deprecate this property after we support pyarrow for all transforms later. WDYT?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'm on the same page! To add to that - I'm not yet sure if we will be able to support the other Transform classes in support_pyarrow_transform. The key idea here is that we want to use native pyarrow compute functions to apply the equivalent transform without converting the values back and forth between Arrow and Python data types. I'm not yet sure if we'll be able to do the same for BucketTransform for instance, with the existing range of pyarrow compute functions.

dependabot bot and others added 5 commits May 6, 2024 22:39
Bumps [tqdm](https://github.com/tqdm/tqdm) from 4.66.2 to 4.66.3.
- [Release notes](https://github.com/tqdm/tqdm/releases)
- [Commits](tqdm/tqdm@v4.66.2...v4.66.3)

---
updated-dependencies:
- dependency-name: tqdm
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [werkzeug](https://github.com/pallets/werkzeug) from 3.0.1 to 3.0.3.
- [Release notes](https://github.com/pallets/werkzeug/releases)
- [Changelog](https://github.com/pallets/werkzeug/blob/main/CHANGES.rst)
- [Commits](pallets/werkzeug@3.0.1...3.0.3)

---
updated-dependencies:
- dependency-name: werkzeug
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.3 to 3.1.4.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](pallets/jinja@3.1.3...3.1.4)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
elif isinstance(source, TimestamptzType):
epoch = datetime.EPOCH_TIMESTAMPTZ
else:
raise ValueError(f"Cannot apply month transform for type: {source}")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Cannot apply month transform for type" should be referring to 'hour', not 'month'.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the check - appreciate it!

dependabot bot and others added 9 commits May 8, 2024 15:10
Bumps [tenacity](https://github.com/jd/tenacity) from 8.2.3 to 8.3.0.
- [Release notes](https://github.com/jd/tenacity/releases)
- [Commits](jd/tenacity@8.2.3...8.3.0)

---
updated-dependencies:
- dependency-name: tenacity
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [mkdocstrings](https://github.com/mkdocstrings/mkdocstrings) from 0.25.0 to 0.25.1.
- [Release notes](https://github.com/mkdocstrings/mkdocstrings/releases)
- [Changelog](https://github.com/mkdocstrings/mkdocstrings/blob/main/CHANGELOG.md)
- [Commits](mkdocstrings/mkdocstrings@0.25.0...0.25.1)

---
updated-dependencies:
- dependency-name: mkdocstrings
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [coverage](https://github.com/nedbat/coveragepy) from 7.5.0 to 7.5.1.
- [Release notes](https://github.com/nedbat/coveragepy/releases)
- [Changelog](https://github.com/nedbat/coveragepy/blob/master/CHANGES.rst)
- [Commits](nedbat/coveragepy@7.5.0...7.5.1)

---
updated-dependencies:
- dependency-name: coverage
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [sqlalchemy](https://github.com/sqlalchemy/sqlalchemy) from 2.0.29 to 2.0.30.
- [Release notes](https://github.com/sqlalchemy/sqlalchemy/releases)
- [Changelog](https://github.com/sqlalchemy/sqlalchemy/blob/main/CHANGES.rst)
- [Commits](https://github.com/sqlalchemy/sqlalchemy/commits)

---
updated-dependencies:
- dependency-name: sqlalchemy
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [flask-cors](https://github.com/corydolphin/flask-cors) from 4.0.0 to 4.0.1.
- [Release notes](https://github.com/corydolphin/flask-cors/releases)
- [Changelog](https://github.com/corydolphin/flask-cors/blob/main/CHANGELOG.md)
- [Commits](corydolphin/flask-cors@4.0.0...4.0.1)

---
updated-dependencies:
- dependency-name: flask-cors
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
dependabot bot and others added 10 commits May 29, 2024 06:54
Bumps [coverage](https://github.com/nedbat/coveragepy) from 7.5.2 to 7.5.3.
- [Release notes](https://github.com/nedbat/coveragepy/releases)
- [Changelog](https://github.com/nedbat/coveragepy/blob/master/CHANGES.rst)
- [Commits](nedbat/coveragepy@7.5.2...7.5.3)

---
updated-dependencies:
- dependency-name: coverage
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [pydantic](https://github.com/pydantic/pydantic) from 2.7.1 to 2.7.2.
- [Release notes](https://github.com/pydantic/pydantic/releases)
- [Changelog](https://github.com/pydantic/pydantic/blob/main/HISTORY.md)
- [Commits](pydantic/pydantic@v2.7.1...v2.7.2)

---
updated-dependencies:
- dependency-name: pydantic
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [requests](https://github.com/psf/requests) from 2.32.2 to 2.32.3.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](psf/requests@v2.32.2...v2.32.3)

---
updated-dependencies:
- dependency-name: requests
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [getdaft](https://github.com/Eventual-Inc/Daft) from 0.2.24 to 0.2.25.
- [Release notes](https://github.com/Eventual-Inc/Daft/releases)
- [Commits](Eventual-Inc/Daft@v0.2.24...v0.2.25)

---
updated-dependencies:
- dependency-name: getdaft
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [moto](https://github.com/getmoto/moto) from 5.0.8 to 5.0.9.
- [Release notes](https://github.com/getmoto/moto/releases)
- [Changelog](https://github.com/getmoto/moto/blob/master/CHANGELOG.md)
- [Commits](getmoto/moto@5.0.8...5.0.9)

---
updated-dependencies:
- dependency-name: moto
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good @syun64 Thanks for working on this, could you rebase to resolve the merge conflicts?

@Fokko
Copy link
Contributor

Fokko commented May 31, 2024

@syun64 It looks like something is off. I see a lot of unrelated changes 🤔

@sungwy
Copy link
Collaborator Author

sungwy commented May 31, 2024

@Fokko hmmm yeah I've seen this happen some times and I don't quite understand it - when you do a diff between the two branches, it shows the right changes: main...syun64:iceberg-python:transform-partition-writes

@sungwy
Copy link
Collaborator Author

sungwy commented May 31, 2024

I'll close this PR and reopen it and see if that helps

@sungwy sungwy closed this May 31, 2024
@sungwy sungwy deleted the transform-partition-writes branch May 31, 2024 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.