This repository was archived by the owner on Sep 11, 2023. It is now read-only.
Use independent processes for each "modality" #202
Labels
data
New data source or feature; or modification of existing data source
enhancement
New feature or request
Comments
Closed
This was referenced Oct 7, 2021
I'll start work on step 1 (pre-prepare a "plan") this afternoon :)
I'll adjust
I'll think about adding
A more up-to-date, and more complete sketch of the design discussed in this issue is here: #213 (comment)
This issue is split from #166
Detailed Description
We could have separate files for each data source, for each batch.
For example, on disk, within the `prepared_ML_data/train/` directory, we might have `train/NWP/`, `train/satellite/`, etc. And, as before, in each of these folders, we'd have one file per batch, identified by the batch number. And, importantly, `train/NWP/1.nc` and `train/satellite/1.nc`
would still be perfectly aligned in time and space (just as they currently are).

Saving each "modality" as a different set of files opens up the possibility to further modularise and de-couple `nowcasting_dataset`.

`prepare_ml_data.py` could run through each modality separately, something like:

1. [...] `t0_datetimes` from across all the DataSources (see Simplify the calculation of available datetimes across all `DataSource`s #204). Randomly sample from these; and randomly sample from the available locations... This should be general enough to enable Create a proportion of examples without PV data, outside the UK #93)
2. [...] `futures.ProcessPoolExecutor`). We could even have multiple processes per modality, where each process works on a different subset of the "positions" (e.g. if we want 4 processes for each modality, then split the "positions" list into quarters).

By default,
`prepare_ml_data.py` should create all modalities specified in the config yaml file. But the user should be able to pass in a command-line argument (#171) to re-create only one modality, or a subset of modalities (e.g. if we fix a bug in the creation of batches of satellite data, and we only want to re-compute the satellite data).

Advantages:
- [...] `GSP` and `NWP`; and could be overridden by the `GSP` or `NWP` classes.
- [...] `leonardo` and in the cloud.

Disadvantages:
Subtasks, in sequence:
1. [...] `DataSource`s #204 at the same time, if it makes #202 easier). Then, load the plan from disk and proceed as the code currently works.
2. [...] `DataSource.prepare_batch(t0_datetimes, x_centers, y_centers, dst_path)` which does everything :) It loads a batch from the source data, selects the appropriate times and spatial positions, and writes the batch to disk (this solves Experiment with loading entire batches at once #212).
3. `prepare_ml_data.py` will read the entire pre-prepared "plan", and fire up a process (using `ProcessPoolExecutor()`) for each modality.
4. [...] `DataSource`: now that we're not combining data from different modalities, the data never needs to leave the DataSource. You could imagine that each DataSource only needs to expose two or three public methods: `get_available_t0_datetimes(history_minutes, forecast_minutes)`, `sample_locations_for_datetimes(t0_datetimes)`, and `prepare_batch(t0_datetimes, center_x, center_y)`.
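
To make the "one process per modality" idea concrete, here is a minimal sketch, not the actual `nowcasting_dataset` code. Everything named here is hypothetical: the shape of the "plan" (a list of dicts with `batch_idx`, `t0_datetimes`, `x_centers`, `y_centers`), the `prepare_modality` and `prepare_all` helpers, and the choice to write JSON instead of netCDF so the sketch needs only the standard library. `prepare_modality` stands in for looping over `DataSource.prepare_batch(...)`.

```python
import json
from concurrent.futures import Executor, ProcessPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical "plan": one dict per batch, holding the positions in time and
# space that every modality must share. The field names are assumptions.
Plan = List[dict]


def prepare_modality(modality: str, plan: Plan, dst_root: str) -> List[str]:
    """Write one file per batch for a single modality.

    Stand-in for calling DataSource.prepare_batch(...) once per batch. Real
    code would load source data and write netCDF; this sketch just records
    the batch's positions as JSON.
    """
    dst_dir = Path(dst_root) / modality
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for batch in plan:
        dst_path = dst_dir / f"{batch['batch_idx']}.json"
        dst_path.write_text(json.dumps({"modality": modality, **batch}))
        written.append(str(dst_path))
    return written


def prepare_all(
    plan: Plan,
    dst_root: str,
    modalities: List[str],
    executor_cls: Callable[..., Executor] = ProcessPoolExecutor,
) -> Dict[str, List[str]]:
    """Run prepare_modality once per modality, each in its own worker.

    Every worker receives the same plan, so batch N of each modality covers
    the same times and locations.
    """
    with executor_cls(max_workers=max(1, len(modalities))) as executor:
        futures = {
            modality: executor.submit(prepare_modality, modality, plan, dst_root)
            for modality in modalities
        }
        # .result() re-raises any exception raised inside a worker.
        return {modality: future.result() for modality, future in futures.items()}
```

Passing a shorter `modalities` list (for example only `["satellite"]`) re-creates just that subset against the same plan, which is the behaviour the command-line argument would need. When using the default `ProcessPoolExecutor` from a script, call `prepare_all` under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.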