Experiment with reading EUMETSAT data directly from raw data files (instead of from intermediate Zarr) #90

JackKelly · 2021-09-03T17:17:56Z

e.g., for EUMETSAT, load directly from .nat files (downloaded by #91); and for NWPs load directly from the raw NetCDF or GRIB files.

See this doc for more context.


JackKelly · 2021-10-01T16:57:58Z

Hmmm... my very early experiments suggest that loading data from native EUMETSAT files might be way too slow... Some back-of-the-envelope maths:

It looks like it might take about 1 second to load and reproject a single satellite channel, at a single timestep.

We want 12 channels and at least 24 timesteps per example. That's 288 seconds (12 x 24) per example (just for satellite data).

If we buy a 64-core, 128-thread EPYC, and if we can efficiently use all the threads, then that's 2.25 seconds per example (288 seconds / 128 threads). That's 144 seconds per batch (2.25 seconds per example x 64 examples per batch), so it would take 1,000 hours to generate 25,000 batches! Not OK!
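To make the estimate easy to re-run with different assumptions, here is a minimal sketch of the same arithmetic. The function name and parameter names are made up for illustration; the default values are the figures quoted above, and it assumes perfect scaling across threads.

```python
def hours_to_generate(
    seconds_per_channel_timestep: float = 1.0,  # rough load + reproject time
    n_channels: int = 12,
    n_timesteps: int = 24,
    n_threads: int = 128,  # assumes all threads are used with perfect efficiency
    examples_per_batch: int = 64,
    n_batches: int = 25_000,
) -> float:
    """Return the estimated wall-clock hours to generate all batches."""
    cpu_seconds_per_example = seconds_per_channel_timestep * n_channels * n_timesteps
    wall_seconds_per_example = cpu_seconds_per_example / n_threads
    seconds_per_batch = wall_seconds_per_example * examples_per_batch
    return seconds_per_batch * n_batches / 3600


print(hours_to_generate())  # 1000.0
```

Because every factor is multiplicative, any speed-up in the per-channel load time feeds straight through: getting it down to 0.1 seconds would bring the total to roughly 100 hours under the same assumptions.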

On Monday, I'll keep trying to see if we can speed things up. But my expectation, given these initial results, is that we'll have to revert back to using an intermediate representation. Maybe the pipeline could look something like this:

1. Download raw data. Maybe compress using bzip2.
2. Reproject & save intermediate data to disk (maybe as NetCDF if saving on our own hardware, or Zarr on SSD / cloud). This script would know how to extend the intermediate representation whenever we download new raw data.
3. Load the intermediate representation into nowcasting_dataset.
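One way the "extend the intermediate representation whenever we download new raw data" step could work is sketched below. This is only an illustration under assumptions: the log file of processed filenames, the function names, and the `process_fn` callback (which would do the actual reprojection and appending) are all hypothetical and not part of nowcasting_dataset.

```python
from pathlib import Path
from typing import Callable, List


def find_unprocessed(raw_dir: Path, done_log: Path, pattern: str = "*.nat") -> List[Path]:
    """Return raw files in raw_dir whose names are not yet listed in done_log."""
    done = set(done_log.read_text().splitlines()) if done_log.exists() else set()
    return sorted(p for p in raw_dir.glob(pattern) if p.name not in done)


def mark_processed(done_log: Path, paths: List[Path]) -> None:
    """Append the names of processed files to done_log, one per line."""
    with done_log.open("a") as f:
        for p in paths:
            f.write(p.name + "\n")


def extend_intermediate(
    raw_dir: Path, done_log: Path, process_fn: Callable[[Path], None]
) -> int:
    """Run process_fn (hypothetical: reproject + append) on each new raw file."""
    todo = find_unprocessed(raw_dir, done_log)
    for path in todo:
        process_fn(path)
        mark_processed(done_log, [path])  # record progress after each file
    return len(todo)
```

Recording each file only after `process_fn` returns means an interrupted run can simply be restarted and will pick up where it left off.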

As an added bonus, we could perhaps try to keep the intermediate datasets fairly small (a few TBytes), so we can keep them on SSDs, so prepare_ml_data.py really flies when it's reading from the intermediate data on SSD.

Next steps:

Try speeding up reading from .nat files. Ask the satpy folks whether we can subset before loading. Check the docs for load(). Try loading multiple .nat files and channels together. Try different resampling algorithms.

Build a little test rig for loading data in parallel in a repeatable way. Compare loading from native files, loading from Zarr, and loading from multiple NetCDF files.
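A rough sketch of what such a test rig could look like, using only the Python standard library. The function names are hypothetical, and the two "loaders" in the usage example are stand-ins that just sleep; real loaders for native, Zarr and NetCDF files would be plugged in instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median
from typing import Callable, Dict, List, Sequence


def time_loader(
    load_fn: Callable, items: Sequence, n_workers: int = 4, n_repeats: int = 3
) -> List[float]:
    """Load every item in parallel, n_repeats times; return wall-clock seconds per run."""
    durations = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            list(pool.map(load_fn, items))  # force all loads to finish
        durations.append(time.perf_counter() - start)
    return durations


def compare_loaders(
    loaders: Dict[str, Callable], items: Sequence, **kwargs
) -> Dict[str, float]:
    """Return the median wall-clock seconds for each named loader."""
    return {name: median(time_loader(fn, items, **kwargs)) for name, fn in loaders.items()}


# Stand-in loaders (assumptions for illustration only).
fake_loaders = {
    "slow": lambda item: time.sleep(0.002),
    "fast": lambda item: time.sleep(0.001),
}
print(compare_loaders(fake_loaders, items=range(8), n_workers=4, n_repeats=3))
```

Threads are used here for simplicity. If loading and reprojecting turns out to be CPU-bound in Python code, swapping in `ProcessPoolExecutor` (with picklable, module-level loader functions) may give a fairer picture of how well the work scales across cores.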

JackKelly · 2021-10-05T15:16:10Z

In #176 I've convinced myself that loading data directly from EUMETSAT .nat files will be way too slow. So we need an intermediate.

But I'll leave this issue open for now: we might still be able to go 'straight to the source' for the NWP files (#122).

JackKelly added the data New data source or feature; or modification of existing data source label Sep 3, 2021

JackKelly added this to the ESO Work Package 1: Essential tasks milestone Sep 7, 2021

This was referenced Sep 7, 2021

Re-chunk Satellite Zarr with one chunk for all channels and full spatial extent #44

Closed

Check for -1 values in sat data #29

Closed

Re-create NWP Zarr with one chunk per init_time and step and small bit depth #30

Closed

This was referenced Sep 22, 2021

Speed up prepare_ml_data.py on on-premises hardware #155

Closed

Experiment with ingesting NWP data from raw files, including relevant params at cloud heights #122

Closed

peterdudfield removed this from the WP1 essential tasks milestone Sep 24, 2021

This was referenced Sep 29, 2021

Docs: "How to add a data source" #174

Closed

Modify SatelliteDataSource so it can load from EUMETSAT's native files #176

Closed

JackKelly mentioned this issue Oct 5, 2021

Read directly from native EUMETSAT files #189

Closed

7 tasks

flowirtz moved this to In Progress in Nowcasting Oct 15, 2021

flowirtz added this to Nowcasting Oct 15, 2021

JackKelly changed the title ~~Experiment with reading data directly from raw data files (instead of from intermediate Zarr)~~ Experiment with reading EUMETSAT data directly from raw data files (instead of from intermediate Zarr) Oct 20, 2021

JackKelly closed this as completed Oct 20, 2021

Repository owner moved this from In Progress to Done in Nowcasting Oct 20, 2021
