This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Experiment with reading EUMETSAT data directly from raw data files (instead of from intermediate Zarr) #90

Closed
3 of 4 tasks
JackKelly opened this issue Sep 3, 2021 · 2 comments
Labels
data New data source or feature; or modification of existing data source

Comments

@JackKelly
Member

JackKelly commented Sep 3, 2021

@JackKelly
Member Author

JackKelly commented Oct 1, 2021

Hmmm... my very early experiments suggest that loading data from native EUMETSAT files might be way too slow... Some back-of-the-envelope maths:

It looks like it might take about 1 second to load and reproject a single satellite channel, at a single timestep.

We want 12 channels and at least 24 timesteps per example. That's 288 seconds (12 x 24) per example (just for satellite data).

If we buy a 64-core, 128-thread EPYC, and if we can efficiently use all the threads, then that's 2.25 seconds per example (288 seconds / 128 threads). That's 144 seconds per batch (2.25 seconds per example x 64 examples per batch), which would take 1,000 hours to generate 25,000 batches! Not OK!
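For anyone who wants to re-run the arithmetic with different assumptions, here is a minimal sketch. The constants are just the rough figures quoted above (none of them are measured benchmarks), and the function name `estimate_total_hours` is made up for illustration. It also assumes perfect scaling across threads, which is optimistic.

```python
# Back-of-the-envelope estimate. All inputs are the rough guesses from the
# comment above, not measured benchmarks.
SECONDS_PER_CHANNEL_TIMESTEP = 1.0  # load + reproject one channel at one timestep
N_CHANNELS = 12
N_TIMESTEPS = 24
N_THREADS = 128
EXAMPLES_PER_BATCH = 64
N_BATCHES = 25_000


def estimate_total_hours(
    seconds_per_load: float = SECONDS_PER_CHANNEL_TIMESTEP,
    n_channels: int = N_CHANNELS,
    n_timesteps: int = N_TIMESTEPS,
    n_threads: int = N_THREADS,
    examples_per_batch: int = EXAMPLES_PER_BATCH,
    n_batches: int = N_BATCHES,
) -> float:
    """Return wall-clock hours to generate all batches, assuming perfect scaling."""
    serial_seconds_per_example = seconds_per_load * n_channels * n_timesteps
    seconds_per_example = serial_seconds_per_example / n_threads
    seconds_per_batch = seconds_per_example * examples_per_batch
    return seconds_per_batch * n_batches / 3600


total_hours = estimate_total_hours()
print(f"{total_hours:,.0f} hours")
```

Changing `seconds_per_load` shows how much faster a single load would need to be: the total scales linearly with it.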

On Monday, I'll keep trying to see if we can speed things up. But my expectation, given these initial results, is that we'll have to revert to using an intermediate representation. Maybe the pipeline could look something like this:

  1. Download raw data. Maybe compress using bzip2.
  2. Reproject & save intermediate data to disk (maybe as NetCDF if saving on our own hardware, or Zarr on SSD / cloud). This script would know how to extend the intermediate representation whenever we download new raw data.
  3. Load the intermediate representation into nowcasting_dataset.
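One possible way for step 2 to "know how to extend" the intermediate data is to keep a record of which raw files have already been processed. The sketch below is only an illustration of that idea: `select_new_raw_files`, `extend_intermediate` and the `reproject_and_append` callable are hypothetical names, and the actual reprojection and NetCDF/Zarr writing is deliberately left out.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable, Iterable


def select_new_raw_files(
    raw_files: Iterable[Path], already_processed: set[str]
) -> list[Path]:
    """Return the raw files whose names are not yet recorded as processed."""
    return sorted(p for p in raw_files if p.name not in already_processed)


def extend_intermediate(
    raw_files: Iterable[Path],
    already_processed: set[str],
    reproject_and_append: Callable[[Path], None],
) -> list[Path]:
    """Process only the new raw files and record them as done.

    `reproject_and_append` is a hypothetical stand-in for step 2 of the
    pipeline (reproject one raw file and append it to the intermediate store).
    """
    new_files = select_new_raw_files(raw_files, already_processed)
    for path in new_files:
        reproject_and_append(path)
        already_processed.add(path.name)
    return new_files
```

Running `extend_intermediate` twice over the same directory listing does nothing the second time, so the script could be re-run safely after each download.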

As an added bonus, we could perhaps try to keep the intermediate datasets fairly small (a few TBytes), so we can keep them on SSDs, so prepare_ml_data.py really flies when it's reading from the intermediate data on SSD.

Next steps:

Try speeding up reading from .nat files. Ask the satpy folks whether we can subset before loading. Check the docs for load(). Try loading multiple .nat files and channels together. Try different resampling algorithms.

Build a little test rig for loading data in parallel in a repeatable way. Compare loading from native files, from Zarr, and from multiple NetCDF files.
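A rough sketch of what such a test rig might look like, using only the standard library. The loader functions are placeholders: `fake_load` just sleeps, and real loaders for .nat, Zarr and NetCDF would need to be plugged in. The names `time_parallel_loads` and `compare_loaders` are made up for this sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Sequence


def time_parallel_loads(
    load_fn: Callable[[Any], Any],
    items: Sequence[Any],
    n_workers: int,
    n_repeats: int = 3,
) -> List[float]:
    """Run load_fn over items in a thread pool; return wall-clock seconds per repeat."""
    durations = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            # list() forces every load to finish before the clock stops.
            list(pool.map(load_fn, items))
        durations.append(time.perf_counter() - start)
    return durations


def compare_loaders(
    loaders: Dict[str, Callable[[Any], Any]],
    items: Sequence[Any],
    n_workers: int,
    n_repeats: int = 3,
) -> Dict[str, float]:
    """Return the best (minimum) duration in seconds for each named loader."""
    return {
        name: min(time_parallel_loads(fn, items, n_workers, n_repeats))
        for name, fn in loaders.items()
    }


def fake_load(item: Any) -> Any:
    """Placeholder loader: sleeps briefly instead of reading a real file."""
    time.sleep(0.01)
    return item
```

Threads are used here for simplicity. If reprojection turns out to be CPU-bound in Python, swapping in `ProcessPoolExecutor` would probably give a fairer picture. Operating-system file caching will also affect repeat timings, so the first repeat may be the most representative of a cold read.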

@JackKelly
Member Author

In #176 I've convinced myself that loading data directly from EUMETSAT .nat files will be way too slow. So we need an intermediate.

But, I'll leave this issue open for now: We might still be able to go 'straight to the source' for the NWP files (#122)

@flowirtz flowirtz moved this to In Progress in Nowcasting Oct 15, 2021
@JackKelly JackKelly changed the title from "Experiment with reading data directly from raw data files (instead of from intermediate Zarr)" to "Experiment with reading EUMETSAT data directly from raw data files (instead of from intermediate Zarr)" Oct 20, 2021
Repository owner moved this from In Progress to Done in Nowcasting Oct 20, 2021

2 participants