
Loading slows down with large dataset #23

Closed
JackKelly opened this issue Jun 21, 2021 · 2 comments

@JackKelly
Member

JackKelly commented Jun 21, 2021

The problem

When using the 3,600-timestep test Zarr dataset, loading is super-quick (40 it/s with batch_size=32, image_size_pixels=128, n_samples_per_timestep=4, num_workers=16). This test Zarr has chunk sizes time=1, y=704, x=548, variable=1, and it reads data at almost 200 MB/s.

But with the full Zarr dataset (which has exactly the same chunk sizes and compression), loading struggles to get more than about 5 it/s, and reads data at only a few tens of MB/s.

From experimenting, I don't think the bottleneck is gcsfs: reading a single file, or searching with glob, runs at about the same speed on the two Zarr datasets.

Instead, it looks like Dask takes a long time to work out what to do with all those little chunks! The full Zarr dataset has 2 million chunks. Reading is even slower when using the quarter-spatial-resolution Zarr array (presumably because each chunk is even smaller).
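
One quick way to sanity-check the chunking (a sketch, using only zarr and gcsfs; the path is the one used in the timing snippet in the next comment):

import gcsfs
import zarr

# Open the store read-only and report chunk shape and chunk count per array.
fs = gcsfs.GCSFileSystem(access='read_only')
store = fs.get_mapper(
    'solar-pv-nowcasting-data/satellite/EUMETSAT/SEVIRI_RSS/OSGB36/'
    'all_zarr_int16_single_timestep.zarr')
group = zarr.open(store, mode='r')
for name, array in group.arrays():
    print(name, 'chunk shape:', array.chunks, 'number of chunks:', array.nchunks)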

Potential solutions

- The first thing I'm trying is preparing a dataset with just the HRV channel. UPDATE: this seems to work!

- When we need more channels, re-create the dataset and put the other channels in the same chunk, so the total number of chunks stays the same.

- Use bigger chunks!

- Can Xarray read data without dask? Update: yes: xr.open_zarr(filename, chunks=None) (see the sketch after this list).
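
A minimal sketch of that last point (the path is a placeholder): with chunks=None the variables come back as lazily-indexed numpy-backed arrays rather than dask arrays, so there is no big task graph to build, and each selection reads only the chunks it touches:

import xarray as xr

# Placeholder path; in practice this would be the GCS Zarr store (e.g. via an
# fsspec mapper).
ZARR_PATH = 'path/to/satellite_data.zarr'

# chunks=None disables dask entirely, so there is no large task graph to build.
ds = xr.open_zarr(ZARR_PATH, chunks=None)

# Selections are still lazy: this reads only the chunks overlapping time=0.
single_timestep = ds['stacked_eumetsat_data'].isel(time=0).values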

@JackKelly JackKelly self-assigned this Jun 21, 2021
@JackKelly
Member Author

Code for looking at read speeds (notebook cells; each %%time block is a separate cell):

import gcsfs
from pathlib import Path

PATH = Path('solar-pv-nowcasting-data/satellite/EUMETSAT/SEVIRI_RSS/OSGB36/all_zarr_int16_single_timestep.zarr/')

fs = gcsfs.GCSFileSystem(access='read_only')

%%time
# Time reading a single Zarr chunk object:
with fs.open(str(PATH / 'stacked_eumetsat_data/0.0.0.0')) as f:
    data = f.read()

len(data)

%%time
# Time listing the chunk objects with glob:
fs.glob(str(PATH / 'stacked_eumetsat_data/0.0.0.*'))

@JackKelly
Member Author

JackKelly commented Jun 21, 2021

Actually, this is fixable without re-creating the Zarr! The trick is to open the Zarr file without dask by doing xr.open_dataset(filename, engine='zarr', chunks=None), and then to use dask.delayed to construct our own graph. This gets 30 it/s again with the full Zarr dataset (chunks: time=1), compared to about 5 it/s with the original Zarr (with 36 timesteps per chunk).
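
A minimal sketch of what that pattern might look like (the function and variable names are illustrative, not the actual project code; the data variable and dim names are taken from the snippets above):

import dask
import xarray as xr

# Open the Zarr store without dask (chunks=None); ZARR_PATH is a placeholder.
ZARR_PATH = 'path/to/satellite_data.zarr'
dataset = xr.open_dataset(ZARR_PATH, engine='zarr', chunks=None)


@dask.delayed
def load_example(time_index, y_slice, x_slice):
    """Read one example; only the selected region is pulled from the store."""
    return dataset['stacked_eumetsat_data'].isel(
        time=time_index, y=y_slice, x=x_slice).values


# Build our own small graph: one task per example, instead of one dask chunk
# per Zarr chunk across the whole archive.
tasks = [load_example(t, slice(0, 128), slice(0, 128)) for t in range(32)]
batch = dask.compute(*tasks)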
