Skip to content

Prevent unsafe concurrent coordinate writes #17

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 10, 2023
Merged

Conversation

spencerkclark
Copy link
Owner

@spencerkclark spencerkclark commented May 8, 2023

When concurrently writing partitions of DataArrays in a Dataset, any coordinates carried along by those DataArrays are also written concurrently. These attached coordinates do not necessarily adhere to the same chunk structure as the DataArray itself. This is an issue, since frequently they are completely unchunked, meaning that concurrent jobs attempt to write coordinates to the same blob files on disk, opening the possibility for data corruption. In practice data corruption of coordinates has been rare, but we recently encountered a situation where it occurred.

This PR fixes this issue by dropping all coordinates when doing low-level partitioned writes. As expected, now any unchunked coordinates or data variables are written once (and only once) during the store initialization step. This PR also addresses the subtle issue of writing chunked coordinates by treating them as though they were independent data variables (previously we did not have test coverage for this case, though it is admittedly rare to encounter in practice).

Copy link

@oliverwm1 oliverwm1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Copy link
Owner Author

@spencerkclark spencerkclark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @oliverwm1!

@spencerkclark spencerkclark merged commit 1edae46 into main May 10, 2023
@spencerkclark spencerkclark deleted the coordinate-bug branch May 10, 2023 20:40
spencerkclark added a commit that referenced this pull request May 10, 2023
This bumps the version of xpartition for a new release to include the fix in #17.
spencerkclark added a commit to ai2cm/fv3net that referenced this pull request May 10, 2023
This PR bumps the version of xpartition to 0.2.1, which was just
released, and includes a safety-oriented bug fix. This fix should
eliminate the risk of corrupted coordinate data, which we only observed
once on a recently created dataset, but is important to address.

xref: spencerkclark/xpartition#17

Coverage reports (updated automatically):
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants