add section on rechunking to chunk optimisation section of tutorial #730

Merged 2 commits on Apr 30, 2021
39 changes: 39 additions & 0 deletions docs/tutorial.rst
@@ -1288,6 +1288,45 @@
bytes within chunks of an array may improve the compression ratio, depending on
the structure of the data, the compression algorithm used, and which compression
filters (e.g., byte-shuffle) have been applied.

.. _tutorial_rechunking:

Changing chunk shapes (rechunking)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sometimes you are not free to choose the initial chunking of your input data, or
you might have data saved with a chunking that is not optimal for the analysis you
have planned. In such cases it can be advantageous to re-chunk the data. For small
datasets, or when the mismatch between input and output chunks is small enough
that only a few chunks of the input array need to be read to create each chunk
of the output array, it is sufficient to simply copy the data to a new array
with the desired chunking, e.g. ::

    >>> a = zarr.zeros((10000, 10000), chunks=(100, 100), dtype='uint16', store='a.zarr')
    >>> b = zarr.array(a, chunks=(100, 200), store='b.zarr')

If the chunk shapes are badly mismatched, however, a simple copy can lead to
non-optimal data access patterns and incur a substantial performance hit when
using file-based stores. One of the most pathological examples is switching
from column-based chunking to row-based chunking, e.g. ::

    >>> a = zarr.zeros((10000, 10000), chunks=(10000, 1), dtype='uint16', store='a.zarr')
    >>> b = zarr.array(a, chunks=(1, 10000), store='b.zarr')

which would require reading every chunk of the input array repeatedly when
creating each output chunk. If the entire array fits within memory, this is
easily resolved by loading the input array into memory as a NumPy array before
converting back to Zarr with the desired chunking. ::

    >>> a = zarr.zeros((10000, 10000), chunks=(10000, 1), dtype='uint16', store='a.zarr')
    >>> b = a[...]
    >>> c = zarr.array(b, chunks=(1, 10000), store='c.zarr')
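
Whether this in-memory route is feasible can be checked up front from the
array's uncompressed size via the ``nbytes`` property, e.g. ::

    >>> a.nbytes  # 10000 * 10000 * 2 bytes for uint16; compare against available RAM
    200000000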

For datasets which have mismatched chunks and which do not fit in memory, a
more sophisticated approach to rechunking, such as that offered by the
`rechunker <https://github.com/pangeo-data/rechunker>`_ package and discussed
`here <https://medium.com/pangeo/rechunker-the-missing-link-for-chunked-array-analytics-5b2359e9dc11>`_,
may offer a substantial improvement in performance.
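
As a minimal sketch (assuming ``rechunker`` is installed; the ``max_mem`` bound
and the ``'c2.zarr'`` and ``'tmp.zarr'`` store paths below are illustrative
choices, and the API may differ between versions), rechunking the column-chunked
array ``a`` above into row chunks might look like::

    >>> from rechunker import rechunk
    >>> # build an execution plan that bounds peak memory per task and stages
    >>> # intermediate results in a temporary store
    >>> plan = rechunk(a, target_chunks=(1, 10000), max_mem='100MB',
    ...                target_store='c2.zarr', temp_store='tmp.zarr')
    >>> c = plan.execute()  # doctest: +SKIP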

.. _tutorial_sync:

Parallel computing and synchronization