Skip to content
This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Tweak the way DataSources are represented in the Configuration model #217

Closed
Tracked by #213
JackKelly opened this issue Oct 11, 2021 · 8 comments · Fixed by #255
Closed
Tracked by #213

Tweak the way DataSources are represented in the Configuration model #217

JackKelly opened this issue Oct 11, 2021 · 8 comments · Fixed by #255
Assignees
Labels
enhancement New feature or request refactoring

Comments

@JackKelly
Copy link
Member

Detailed Description

We maybe want to make a few small changes to the Configuration model:

  • include a list of enabled_data_sources
  • maybe split the config up so each data source has its own section, so that entire section can be passed into the DataSource object. e.g. there would be an NWP section, with fields zarr_path, image_size, and channels.

Context

Splitting up the Configuration so each data source has its own section should mean that we don't have to write custom code to translate between a Configuration object and the constructor for each DataSource. Instead we just do something like SatDataSource(**config.sat_data)

@peterdudfield
Copy link
Contributor

perhaps the layout could be
config:

  • data_sources
    -- satellite
    -- nwp
    -- e.t.c

  • other subfields like Git

@JackKelly
Copy link
Member Author

JackKelly commented Oct 18, 2021

Yeah, exactly! To flesh that out one more level deep:

  • data_sources:
    • default_forecast_minutes
    • default_history_minutes
    • satellite
      • zarr_path
      • channels
      • forecast_minutes # Optionally override default_forecast_minutes for Satellite data
      • history_minutes # Optionally override default_history_minutes for Satellite data
      • image_size_pixels
    • nwp
      • zarr_path
      • channels
      • forecast_minutes # Optionally override default_forecast_minutes for NWP data
      • history_minutes # Optionally override default_history_minutes for NWP data
      • image_size_pixels
    • etc.
  • git
  • process
    • seed
    • batch_size
    • local_temp_path
  • output_data
  • etc.

I'm not entirely sure about the default_forecast_minutes thing. Maybe it's conceptually simpler to just enter forecast_minutes and history_minutes separately for each data_source. The main aim is to allow us to have different lengths for different data sources (e.g. 4 hours of historical NWPs; and only 1 hour of historical satellite)

@JackKelly
Copy link
Member Author

Also, it would be great if the field names in the config file exactly match the constructor arguments for each DataSource so, to create, say, a SatelliteDataSource, we can just do SatelliteDataSource(**config.data_sources.satellite)

Should we keep the field name zarr_path and change the DataSources to take a zarr_path argument? Or should we always use filename for all the DataSources? (Although PVDataSource requires two filenames: one for the PV timeseries data; the other for metadata.)

@peterdudfield
Copy link
Contributor

Yea I like the lay out above, I'm not quite sure how to do the 'default_forecast_minutes' from a model higher up, but we could have a play.

In the idea of being verbose, I think keep it as zarr_path, and for PVDataSource, put both the filenames in>

We can add a validator to make sure the filename or zarr_path has the right suffix

@JackKelly
Copy link
Member Author

I'm not quite sure how to do the 'default_forecast_minutes' from a model higher up

Yeah, I wonder if we should keep things simple and users just have to put the forecast_minutes into each DataSource?

BTW, are you happy to have a shot at this issue? Absolutely no worries either way - I'm happy to if you've got too much other stuff on this week!

@peterdudfield
Copy link
Contributor

Yea - i'll give it a go

@peterdudfield peterdudfield self-assigned this Oct 20, 2021
@JackKelly
Copy link
Member Author

Cheers, thank you!

To give a bit more context... and don't necessarily worry about implementing this yet... but I guess the ultimate goal is that, in prepare_ml_data.py, we can have a single line like this:

data_sources = get_data_sources(config.data_sources)

And, get_data_sources() could look something like this:

def get_data_sources(config_for_all_data_sources) -> list[DataSource]:
    data_source_name_to_class = {'satellite': SatelliteDataSource, 'nwp': NWPDataSource, ...}
    data_sources = []
    for data_source_name, data_source_class in data_source_name_to_class.items():
        config_for_data_source = getattr(config_for_all_data_sources, data_source_name)
        data_source = data_source_class(**config_for_data_source)
        data_sources.append(data_source)
    return data_sources

@peterdudfield
Copy link
Contributor

That would be very neat indeed

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request refactoring
Projects
No open projects
Status: Done
Development

Successfully merging a pull request may close this issue.

2 participants