This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Use independent processes for each "modality" #202

Closed
5 tasks
Tracked by #213
JackKelly opened this issue Oct 6, 2021 · 3 comments · Fixed by #216 or #307
Labels: data (new data source or feature; or modification of existing data source), enhancement (new feature or request)

Comments


JackKelly commented Oct 6, 2021

This issue is split from #166

Detailed Description

We could have separate files for each data source, for each batch.

For example, on disk, within the prepared_ML_data/train/ directory, we might have train/NWP/, train/satellite/, etc. And, as before, in each of these folders, we'd have one file per batch, identified by the batch number. And, importantly, train/NWP/1.nc and train/satellite/1.nc would still be perfectly aligned in time and space (just as they currently are).

Saving each "modality" as a different set of files opens up the possibility to further modularise and de-couple nowcasting_dataset.

prepare_ml_data.py could run through each modality separately, something like:

  1. Randomly sample the "positions" in time and space for each ML training example, and save to disk. (In a little more detail: find all the available t0_datetimes from across all the DataSources (see #204, "Simplify the calculation of available datetimes across all DataSources"). Randomly sample from these, and randomly sample from the available locations. This should be general enough to enable #93, "Create a proportion of examples without PV data, outside the UK".)
  2. Fire up a separate process for each modality (probably using futures.ProcessPoolExecutor). We could even have multiple processes per modality, where each process works on a different subset of the "positions" (e.g. if we want 4 processes for each modality, then split the "positions" list into quarters).
  3. Each process will read from the previously-saved "positions", and save pre-prepared batches to disk for that modality.
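The steps above could be sketched roughly as follows. The function and variable names (`prepare_batches_for_modality`, `POSITIONS`, `run_all`) are hypothetical, not part of the actual codebase; only the use of `futures.ProcessPoolExecutor` comes from the issue itself:

```python
# Hypothetical sketch of steps 2 & 3: one process per modality, each reading
# the shared "positions" plan and writing its own pre-prepared batches.
from concurrent import futures

# The pre-saved "plan" from step 1: one (t0_datetime, x_center, y_center)
# tuple per ML training example. In practice this would be loaded from disk.
POSITIONS = [("2021-10-06T12:00", 0.0, 51.5), ("2021-10-06T12:30", -1.0, 52.0)]


def prepare_batches_for_modality(modality: str, positions: list) -> str:
    # In the real code this would load the source data, select the given
    # times/locations, and write e.g. train/<modality>/<batch>.nc to disk.
    return f"{modality}: wrote {len(positions)} examples"


def run_all(modalities: list) -> list:
    # Fire up one worker process per modality; each gets the full plan.
    with futures.ProcessPoolExecutor() as executor:
        results = executor.map(
            prepare_batches_for_modality, modalities, [POSITIONS] * len(modalities)
        )
    return list(results)
```

Splitting the plan into quarters for four processes per modality would just mean passing slices of `POSITIONS` to extra workers instead of the whole list.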

By default, prepare_ml_data.py should create all modalities specified in the config yaml file. But the user should be able to pass in a command-line argument (#171) to only re-create one or a subset of modalities (e.g. if we fix a bug in the creation of batches of satellite data, and we only want to re-compute the satellite data).
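One way the command-line interface could look (the `--modalities` flag name and the modality list are illustrative assumptions, not the actual #171 design):

```python
# Hypothetical CLI sketch: select a subset of modalities to (re-)create.
import argparse

# Assumed modality set; in practice this would come from the config yaml.
ALL_MODALITIES = ["NWP", "satellite", "pv", "gsp"]


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Pre-prepare ML batches.")
    parser.add_argument(
        "--modalities",
        nargs="+",
        choices=ALL_MODALITIES,
        default=ALL_MODALITIES,  # default: create every modality
        help="Only (re-)create batches for these modalities.",
    )
    return parser.parse_args(argv)
```

So `prepare_ml_data.py --modalities satellite` would re-compute only the satellite batches, leaving the other modalities' files untouched.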

Advantages:

  • We don't have to recreate the whole pre-prepared dataset if we only want to update or add one "modality"
  • This should give us fairly easy-to-debug concurrent code.
  • When our dataset gets really big, we could use multiple machines running in parallel to create the pre-prepared batches.
  • It'd be very easy to use subsets of the data (e.g. we could share just the pre-prepared satellite batches with the MSc students)
  • We can use whatever file format makes most sense for each 'modality'. e.g. satellite images could be stored as GeoTIFFs (which would make them easy to view).
  • The code to write each batch to disk could live in the superclass for GSP and NWP; and could be overridden by the GSP or NWP classes.
  • This is one way to remove PyTorch from the code (#86).
  • Different files are read from disk concurrently. This should speed up execution time on leonardo and in the cloud.

Disadvantages:

  • It makes the "ML loading code" a little more complex. But not much more complex.
  • Some 'modalities' (like PV) will have tiny files on disk (a few kBytes per batch?). And tiny files are inefficient to load (both on the cloud and on our local hardware). But maybe this isn't a huge problem because, when training large complex models, we probably only need to load a few batches per second (not thousands per second!)
  • It's yet more "refactoring" that isn't directly improving our ML model performance :)

Subtasks, in sequence:

  1. Pre-prepare the "plan" and save it to disk (before processing any data). (Possibly do #204, "Simplify the calculation of available datetimes across all DataSources", at the same time, if it makes this issue easier.) Then, load the plan from disk and proceed as the code currently works.
  2. Implement DataSource.prepare_batch(t0_datetimes, x_centers, y_centers, dst_path) which does everything :) It loads a batch from the source data, selects the appropriate times and spatial positions, and writes the batch to disk (this solves #212, "Experiment with loading entire batches at once"). prepare_ml_data.py will read the entire pre-prepared "plan", and fire up a process (using ProcessPoolExecutor()) for each modality.
  3. Remove the code that combines batches from each DataSource into a single batch
  4. Simplify the public interface to DataSource: now that we're not combining data from different modalities, the data never needs to leave the DataSource. You could imagine that each DataSource only needs to expose three public methods: get_available_t0_datetimes(history_minutes, forecast_minutes), sample_locations_for_datetimes(t0_datetimes), and prepare_batch(t0_datetimes, x_centers, y_centers).
  5. Remove any unused functions (and their tests).
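The slimmed-down interface from subtask 4 could be sketched as an abstract base class. The method names come from the issue; the exact signatures, types, and docstrings are assumptions:

```python
# Sketch of the proposed minimal public interface for a DataSource.
from abc import ABC, abstractmethod


class DataSource(ABC):
    @abstractmethod
    def get_available_t0_datetimes(self, history_minutes: int, forecast_minutes: int):
        """Return the t0 datetimes this source can provide examples for."""

    @abstractmethod
    def sample_locations_for_datetimes(self, t0_datetimes):
        """Return one (x_center, y_center) location per t0 datetime."""

    @abstractmethod
    def prepare_batch(self, t0_datetimes, x_centers, y_centers, dst_path):
        """Load, select, and write one batch to dst_path.

        The data never needs to leave the DataSource: each subclass
        (e.g. GSP, NWP) writes its own files under its own directory.
        """
```

Keeping the interface this small is what lets each modality run in its own process: the only shared state is the pre-saved "plan" of datetimes and locations.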
@JackKelly JackKelly added enhancement New feature or request data New data source or feature; or modification of existing data source labels Oct 6, 2021
@JackKelly JackKelly changed the title Split different "modalities" into different files using different data generation scripts Each "modality" should be saved into a different set of files, using different processes Oct 7, 2021
@JackKelly (Member, Author)

I'll start work on step 1 (pre-prepare a "plan") this afternoon :)


peterdudfield commented Oct 11, 2021

I'll adjust.

I'll think about adding:

  • 'save_netcdf' to datasource_output (it could live in datasource instead; need to see what feels right). Note that it needs to save a batch of objects, not just one object.

@peterdudfield peterdudfield reopened this Oct 12, 2021
@flowirtz flowirtz moved this to In Progress in Nowcasting Oct 15, 2021
@JackKelly JackKelly changed the title Each "modality" should be saved into a different set of files, using different processes Use independent processes for each "modality" Oct 19, 2021
@JackKelly (Member, Author)

A more up-to-date, and more complete sketch of the design discussed in this issue is here: #213 (comment)

@JackKelly JackKelly self-assigned this Oct 22, 2021
@JackKelly JackKelly linked a pull request Oct 29, 2021 that will close this issue
30 tasks
Repository owner moved this from In Progress to Done in Nowcasting Nov 2, 2021