Experiment with reading EUMETSAT data directly from raw data files (instead of from intermediate Zarr) #90
Hmmm... my very early experiments suggest that loading data directly from the native files will be slow. It looks like it takes about 1 second to load and reproject a single satellite channel, at a single timestep. We want 12 channels and at least 24 timesteps per example. That's 288 seconds (12 x 24) per example (just for satellite data). If we buy a 64-core, 128-thread EPYC, and if we can efficiently use all the threads, then that's 2.25 seconds per example (288 seconds / 128 threads). Which is 144 seconds per batch (2.25 seconds per example x 64 examples per batch). Which would take 1,000 hours to generate 25,000 batches! Not OK!

On Monday, I'll keep trying to see if we can speed things up. But my expectation, given these initial results, is that we'll have to revert back to using an intermediate representation. Maybe the pipeline could look something like this:
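As a sanity check, the back-of-envelope arithmetic above can be verified directly. All figures below are the issue's own estimates, not measurements:

```python
# Back-of-envelope check of the loading-time estimate above.
seconds_per_channel_timestep = 1.0   # ~1 s to load + reproject one channel
channels = 12
timesteps = 24
examples_per_batch = 64
threads = 128                        # hypothetical 64-core/128-thread EPYC

seconds_per_example = seconds_per_channel_timestep * channels * timesteps
assert seconds_per_example == 288

seconds_per_example_parallel = seconds_per_example / threads
assert seconds_per_example_parallel == 2.25

seconds_per_batch = seconds_per_example_parallel * examples_per_batch
assert seconds_per_batch == 144

hours_for_25k_batches = seconds_per_batch * 25_000 / 3600
assert hours_for_25k_batches == 1000
```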
As an added bonus, we could perhaps try to keep the intermediate datasets fairly small (a few TBytes), so we can keep them on SSDs.

Next steps:

- Try speeding up reading from .nat files. Ask the satpy folks if we can subset before loading? Check the docs for `load()`.
- Try loading multiple .nat files and channels together.
- Try different resample algorithms.
- Build a little test rig for loading data in parallel in a repeatable way. Compare loading from native files, loading from Zarr, and loading from multiple NetCDF files.
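The "little test rig" mentioned above could be sketched roughly like this. Everything here is a hypothetical stand-in, not the project's actual code: `load_example` simulates one loader, and the real native/Zarr/NetCDF loaders would be swapped in to compare them:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_example(index: int) -> int:
    """Hypothetical stand-in for loading one example.

    Swap in the real loaders (satpy native, Zarr, NetCDF) to compare them.
    """
    time.sleep(0.01)  # simulate I/O + reprojection work
    return index


def time_loading(n_examples: int, n_workers: int) -> float:
    """Load n_examples in parallel and return total wall-clock seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(load_example, range(n_examples)))
    # pool.map preserves input order, so runs are repeatable/comparable.
    assert results == list(range(n_examples))
    return time.perf_counter() - start


if __name__ == "__main__":
    for workers in (1, 4, 16):
        print(f"{workers:>2} workers: {time_loading(64, workers):.2f} s")
```

Threads (rather than processes) are the natural first choice here because the real work is I/O- and NumPy-bound, which releases the GIL; the same rig could be rerun with `ProcessPoolExecutor` for comparison.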
e.g., for EUMETSAT, load directly from .nat files (downloaded by #91); and for NWPs, load directly from the raw NetCDF or GRIB files.
See this doc for more context.
SatelliteDataSource so it can load from EUMETSAT's native files (#176)

Related:
If this experiment works (and we read directly from native files) then close these redundant issues: