
Loading slows down with large dataset #23

Closed
JackKelly opened this issue Jun 21, 2021 · 2 comments

@JackKelly
Member

JackKelly commented Jun 21, 2021

The problem

When using the 3,600-timestep test Zarr dataset, loading is super-quick (40 it/s with batch_size=32, image_size_pixels=128, n_samples_per_timestep=4, num_workers=16). This test Zarr has chunk sizes time=1, y=704, x=548, variable=1, and it reads data at almost 200 MB/s.

But with the full Zarr dataset (which has exactly the same chunk sizes and compression), loading struggles to get more than about 5 it/s, and reads data at only a few tens of MB/s.

From experimenting, I don't think the bottleneck is gcsfs: reading a single file, or searching with glob, runs at about the same speed on the two Zarr datasets.

Instead, it looks like Dask takes a long time to work out what to do with all those little chunks! The full Zarr dataset has 2 million chunks. Reading is even slower when using the quarter-spatial-resolution Zarr array (presumably because each chunk is even smaller).
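
One quick way to sanity-check the chunking (a sketch, using only zarr and gcsfs; the path is the one used in the timing snippet in the next comment):

import gcsfs
import zarr

# Open the store read-only and report chunk shape and chunk count per array.
fs = gcsfs.GCSFileSystem(access='read_only')
store = fs.get_mapper(
    'solar-pv-nowcasting-data/satellite/EUMETSAT/SEVIRI_RSS/OSGB36/'
    'all_zarr_int16_single_timestep.zarr')
group = zarr.open(store, mode='r')
for name, array in group.arrays():
    print(name, 'chunk shape:', array.chunks, 'number of chunks:', array.nchunks)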

Potential solutions

- The first thing I'm trying is preparing a dataset with just the HRV channel. UPDATE: this seems to work!

- When we need more channels, re-create the dataset and put the other channels in the same chunk, so the total number of chunks stays the same.

- Use bigger chunks!

- Can Xarray read data without dask? Update: yes: xr.open_zarr(filename, chunks=None) (see the sketch after this list).
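
A minimal sketch of that last point (the path is a placeholder): with chunks=None the variables come back as lazily-indexed numpy-backed arrays rather than dask arrays, so there is no big task graph to build, and each selection reads only the chunks it touches:

import xarray as xr

# Placeholder path; in practice this would be the GCS Zarr store (e.g. via an
# fsspec mapper).
ZARR_PATH = 'path/to/satellite_data.zarr'

# chunks=None disables dask entirely, so there is no large task graph to build.
ds = xr.open_zarr(ZARR_PATH, chunks=None)

# Selections are still lazy: this reads only the chunks overlapping time=0.
single_timestep = ds['stacked_eumetsat_data'].isel(time=0).values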

@JackKelly JackKelly self-assigned this Jun 21, 2021
@JackKelly
Member Author

Code for looking at read speeds (notebook cells; each %%time block is a separate cell):

import gcsfs
from pathlib import Path

PATH = Path('solar-pv-nowcasting-data/satellite/EUMETSAT/SEVIRI_RSS/OSGB36/all_zarr_int16_single_timestep.zarr/')

fs = gcsfs.GCSFileSystem(access='read_only')

%%time
# Time reading a single Zarr chunk object:
with fs.open(str(PATH / 'stacked_eumetsat_data/0.0.0.0')) as f:
    data = f.read()

len(data)

%%time
# Time listing the chunk objects with glob:
fs.glob(str(PATH / 'stacked_eumetsat_data/0.0.0.*'))

@JackKelly
Member Author

JackKelly commented Jun 21, 2021

Actually, this is fixable without re-creating the Zarr! The trick is to open the Zarr file without dask by doing xr.open_dataset(filename, engine='zarr', chunks=None), and then to use dask.delayed to construct our own graph. This gets 30 it/s again with the full Zarr dataset (chunks: time=1), compared to about 5 it/s with the original Zarr (with 36 timesteps per chunk).
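
A minimal sketch of what that pattern might look like (the function and variable names are illustrative, not the actual project code; the data variable and dim names are taken from the snippets above):

import dask
import xarray as xr

# Open the Zarr store without dask (chunks=None); ZARR_PATH is a placeholder.
ZARR_PATH = 'path/to/satellite_data.zarr'
dataset = xr.open_dataset(ZARR_PATH, engine='zarr', chunks=None)


@dask.delayed
def load_example(time_index, y_slice, x_slice):
    """Read one example; only the selected region is pulled from the store."""
    return dataset['stacked_eumetsat_data'].isel(
        time=time_index, y=y_slice, x=x_slice).values


# Build our own small graph: one task per example, instead of one dask chunk
# per Zarr chunk across the whole archive.
tasks = [load_example(t, slice(0, 128), slice(0, 128)) for t in range(32)]
batch = dask.compute(*tasks)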
