The problem
When using the 3,600 timesteps of test Zarr data, loading is super-quick (40 it/s with batch_size=32, image_size_pixels=128, n_samples_per_timestep=4, num_workers=16). This test Zarr has chunk sizes: time=1, y=704, x=548, variable=1. It reads data at almost 200 MB/s.
But with the full Zarr dataset (exactly the same chunk size and compression), it struggles to get more than about 5 it/s, and reads data at only a few tens of MB/s.
From experimenting, I don't think the bottleneck is gcsfs: reading a single file, or searching with glob, seems about the same speed on the two Zarr datasets.
Instead, it looks like Dask takes a long time to work out what to do with all those little chunks! The full Zarr dataset has 2 million chunks. Reading is even slower when using the quarter-spatial-resolution Zarr array.
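A quick way to see the scale of the problem is to count how many dask chunks (and hence tasks) the full dataset produces just by opening it. This is a minimal sketch; the bucket path and variable name are made up:

```python
import math
import xarray as xr

# Hypothetical path and variable name, just for illustration.
# (Depending on versions, you may need a gcsfs mapper rather than a raw gs:// path.)
ds = xr.open_zarr("gs://bucket/full_dataset.zarr", consolidated=True)
da = ds["data"]

# da.chunks is a tuple (one entry per dimension) of per-chunk sizes.
# The total number of dask chunks is the product of the chunk counts
# along each dimension; with time=1, y=704, x=548, variable=1 chunking,
# that is roughly n_timesteps * n_variables.
n_chunks = math.prod(len(sizes) for sizes in da.chunks)
print(f"{n_chunks:,} chunks -> at least that many dask tasks per read")
```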
Potential solutions
- First thing I'm trying is preparing a dataset with just HRV. UPDATE: This seems to work!
- When we need more channels, re-create the dataset and put the other channels in the same chunk, so the total number of chunks stays the same.
- Use bigger chunks!
- Can Xarray read data without dask? UPDATE: Yes: xr.open_zarr(filename, chunks=None) (see the sketch after this list).
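For the last point, here's a minimal sketch of bypassing dask when opening the store (the path is a placeholder):

```python
import xarray as xr

filename = "gs://bucket/full_dataset.zarr"  # placeholder path

# Either of these returns a Dataset whose variables are lazily loaded by
# zarr itself, with no dask graph at all:
ds = xr.open_zarr(filename, chunks=None)
# or, equivalently, via the generic open_dataset entry point:
ds = xr.open_dataset(filename, engine="zarr", chunks=None)

# Indexing now goes straight to zarr/gcsfs, so there is no per-chunk
# dask task overhead when selecting a small window:
example = ds.isel(time=0, x=slice(0, 128), y=slice(0, 128))
```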
Actually, this is fixable without re-creating the Zarr! The trick is to open the Zarr file without dask by doing xr.open_dataset(filename, engine='zarr', chunks=None), then use dask.delayed to construct our own graph. This gets 30 it/s again with the full Zarr dataset (chunks: time=1), compared to about 5 it/s with the original Zarr (36 timesteps per chunk).
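For reference, a minimal sketch of what that pattern can look like (the path, helper name, window size, and example coordinates are all hypothetical):

```python
import dask
import xarray as xr

filename = "gs://bucket/full_dataset.zarr"  # hypothetical path

# 1. Open without dask: zarr handles the lazy loading, and no task graph
#    is built for the 2 million chunks in the store.
dataset = xr.open_dataset(filename, engine="zarr", chunks=None)

def load_example(t_index, x_start, y_start, size=128):
    """Read one small window and pull it into memory."""
    window = dataset.isel(
        time=t_index,
        x=slice(x_start, x_start + size),
        y=slice(y_start, y_start + size),
    )
    return window.load()

# 2. Build our own (tiny) graph: one delayed task per example in the batch,
#    instead of one dask task per zarr chunk in the whole dataset.
delayed_examples = [
    dask.delayed(load_example)(t, x, y)
    for t, x, y in [(0, 0, 0), (1, 100, 200), (2, 300, 50)]  # hypothetical indices
]
batch = dask.compute(*delayed_examples)
```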