diff --git a/docs/tutorial.rst b/docs/tutorial.rst
index 0bcb45a5c4..79b3b87c35 100644
--- a/docs/tutorial.rst
+++ b/docs/tutorial.rst
@@ -1288,6 +1288,45 @@ bytes within chunks of an array may improve the compression
 ratio, depending on the structure of the data, the compression algorithm used,
 and which compression filters (e.g., byte-shuffle) have been applied.
 
+.. _tutorial_rechunking:
+
+Changing chunk shapes (rechunking)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Sometimes you are not free to choose the initial chunking of your input data, or
+you might have data saved with a chunk shape that is not optimal for the analysis
+you have planned. In such cases it can be advantageous to re-chunk the data. For
+small datasets, or when the mismatch between input and output chunks is small
+enough that only a few chunks of the input dataset need to be read to create each
+chunk in the output array, it is sufficient to simply copy the data to a new array
+with the desired chunking, e.g. ::
+
+    >>> a = zarr.zeros((10000, 10000), chunks=(100, 100), dtype='uint16', store='a.zarr')
+    >>> b = zarr.array(a, chunks=(100, 200), store='b.zarr')
+
+If the chunk shapes are mismatched, however, a simple copy can lead to non-optimal
+data access patterns and incur a substantial performance hit when using
+file-based stores. One of the most pathological examples is
+switching from column-based chunking to row-based chunking, e.g. ::
+
+    >>> a = zarr.zeros((10000, 10000), chunks=(10000, 1), dtype='uint16', store='a.zarr')
+    >>> b = zarr.array(a, chunks=(1, 10000), store='b.zarr')
+
+which will require every chunk in the input dataset to be read repeatedly when
+creating each output chunk. If the entire array will fit within memory, this is
+simply resolved by forcing the entire input array into memory as a NumPy array
+before converting back to Zarr with the desired chunking. ::
+
+    >>> a = zarr.zeros((10000, 10000), chunks=(10000, 1), dtype='uint16', store='a.zarr')
+    >>> b = a[...]
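+    >>> # b is now an in-memory NumPy array, so the subsequent write visits each
+    >>> # output chunk only once; this copy requires a.nbytes of RAM
+    >>> # (10000 * 10000 * 2 bytes, i.e. ~200 MB, for this uint16 example)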
+    >>> c = zarr.array(b, chunks=(1, 10000), store='c.zarr')
+
+For datasets which have mismatched chunks and which do not fit in memory, a
+more sophisticated approach to rechunking, such as that offered by the
+`rechunker <https://github.com/pangeo-data/rechunker>`_ package, may offer a
+substantial improvement in performance.
+
 .. _tutorial_sync:
 
 Parallel computing and synchronization