This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Issue/166 batch pydantic #195

Merged
merged 60 commits into from
Oct 11, 2021

Conversation

peterdudfield
Contributor

@peterdudfield peterdudfield commented Oct 5, 2021

Pull Request

Description

  • Pydantic model for batch - removed Example[dict] object
  • Pydantic model for each data source
  • tools for changing model to xarray (and back)
  • adjust DataSets to use pydantic models
  • new 0.nc test/data
  • sort out subselect data
  • sort out scripts

Fixes #166

How Has This Been Tested?

  • Added specific unit tests

  • create Fake Dataset and check torch tensor is returned (unittest)

  • ran scripts prepare ml data (and validate script)

  • No

  • Yes

Checklist:

  • My code follows OCF's coding style guidelines
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked my code and corrected any misspellings

@peterdudfield peterdudfield marked this pull request as ready for review October 6, 2021 08:54
@peterdudfield peterdudfield requested review from jacobbieker and JackKelly and removed request for jacobbieker October 6, 2021 08:54
Member

@JackKelly JackKelly left a comment

Looks good!

Lots of comments, but they're mostly pretty tiny!

A few general thoughts:

  1. Wherever possible, please consider removing any dependencies on torch, in preparation for "Remove PyTorch from the code" (#86). For example, please consider using np.random.randn instead of torch.randn.

  2. I think I've made nowcasting_dataset overcomplicated by starting it with the intention of loading data on-the-fly during ML training, and then swapping to using it to create on-disk pre-prepared batches. I think we can simplify things a lot by removing support for loading on-the-fly :) (i.e. I think we can safely say that, from now on, nowcasting_dataset will just be used for pre-preparing batches!) Please see "Can we simplify the code by always keeping the data in one data type (e.g. xr.DataArray) per modality?" (#209). I think it could reduce the code size a lot :) And I feel really bad for not thinking about this earlier!
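As a hedged illustration of point 1, fake test data can be generated with NumPy alone. The helper name `make_fake_satellite_batch` and the array layout below are hypothetical, chosen only for this sketch; they are not taken from nowcasting_dataset.

```python
import numpy as np


def make_fake_satellite_batch(batch_size=2, seq_length=4, height=8, width=8, seed=0):
    """Return random float32 test data without importing torch.

    Hypothetical helper: the name and the (batch, time, y, x) layout are
    assumptions for illustration, not the layout used by nowcasting_dataset.
    """
    # Seeding the legacy global RNG keeps np.random.randn reproducible in tests.
    np.random.seed(seed)
    # Like torch.randn, np.random.randn takes the dimensions as positional ints.
    data = np.random.randn(batch_size, seq_length, height, width)
    # np.random.randn returns float64; cast to float32 to match torch's default dtype.
    return data.astype(np.float32)


fake = make_fake_satellite_batch()
print(fake.shape, fake.dtype)
```

If a test later needs a tensor, the array can be converted at the point of use, which keeps torch out of the data-preparation code.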

            )
        elif "time_30" in self.__getattribute__(key).dims:
            self.__setattr__(
                key, self.__getattribute__(key).isel(time_30=slice(start_i, end_i))
Member

Do any instances of DataSourceOutput have one or more fields with five-minutely and one or more fields with half-hourly datetime indexes? I guess not?

If that is a possibility, then start_i and end_i would be correct indexes into one of the indexes; but would be wrong indexes into the other datetime index, I think?
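The concern can be sketched with plain Python lists standing in for the two time coordinates. This is an assumption-laden illustration, not the code under review: `make_index` and `select_by_time` are hypothetical helpers, and the timestamps are made up.

```python
from datetime import datetime, timedelta


def make_index(start, step_minutes, periods):
    """Build a regularly spaced list of datetimes (stand-in for a time coordinate)."""
    return [start + timedelta(minutes=step_minutes * i) for i in range(periods)]


t0 = datetime(2021, 10, 5, 12, 0)
time_5 = make_index(t0, step_minutes=5, periods=13)   # 12:00 to 13:00, five-minutely
time_30 = make_index(t0, step_minutes=30, periods=3)  # 12:00 to 13:00, half-hourly

# Positional bounds chosen against the five-minutely index.
start_i, end_i = 6, 13

five_min_slice = time_5[start_i:end_i]    # 12:30 to 13:00, as intended
half_hour_slice = time_30[start_i:end_i]  # empty: time_30 has only 3 positions

print(five_min_slice[0], five_min_slice[-1])
print(half_hour_slice)


def select_by_time(index, start_time, end_time):
    """Select by timestamp, so the same bounds work at any resolution."""
    return [t for t in index if start_time <= t <= end_time]


# Converting the positions to timestamps first gives consistent results.
start_time, end_time = time_5[start_i], time_5[end_i - 1]
print(select_by_time(time_30, start_time, end_time))
```

In xarray, the label-based counterpart of `isel` is `sel`, for example `.sel(time_30=slice(start_time, end_time))`, which selects by coordinate value rather than by position.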

Contributor Author

Currently each data source has either 5-minute or 30-minute data, not both. I wonder if this is good enough for the moment. If so, I can add a comment. Or can we think of a case where there will be both 5-minute and 30-minute data?

from nowcasting_dataset.utils import coord_to_range


class Datetime(DataSourceOutput):
Member

This looks great!

Random thought, that isn't especially relevant to this PR(!), but it's possible that we might want to remove these datetime features from the on-disk batches, and instead compute these features on-the-fly, not least because we'll want to experiment with a variety of different ways of encoding position when we really start experimenting with the perceiver IO models. Related issue: https://github.com/openclimatefix/nowcasting_utils/issues/30

Member

I've started a separate issue to discuss this: #208

Member

But, even if we do compute these features on-the-fly, this pydantic model (and the validation code) could still be used for the datetime features computed on the fly, I guess?
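A rough sketch of that idea: compute cyclical datetime features on the fly and range-check them on construction. To stay dependency-free, a plain dataclass stands in for the pydantic model; the class name, field names, and sine/cosine encoding are assumptions for illustration and may not match the `Datetime` model in this PR.

```python
import math
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DatetimeFeatures:
    """Cyclical encodings of one timestamp (hypothetical stand-in for the pydantic model)."""

    hour_of_day_sin: float
    hour_of_day_cos: float
    day_of_year_sin: float
    day_of_year_cos: float

    def __post_init__(self):
        # Validation step: sine and cosine values must lie in [-1, 1].
        for name, value in vars(self).items():
            if not -1.0 <= value <= 1.0:
                raise ValueError(f"{name}={value} is outside [-1, 1]")


def compute_datetime_features(t: datetime) -> DatetimeFeatures:
    """Compute the features at load time instead of storing them in the batch."""
    hour_angle = 2 * math.pi * (t.hour + t.minute / 60) / 24
    # Approximates the year as 365 days; leap years are off by one day.
    day_angle = 2 * math.pi * t.timetuple().tm_yday / 365
    return DatetimeFeatures(
        hour_of_day_sin=math.sin(hour_angle),
        hour_of_day_cos=math.cos(hour_angle),
        day_of_year_sin=math.sin(day_angle),
        day_of_year_cos=math.cos(day_angle),
    )


features = compute_datetime_features(datetime(2021, 10, 5, 6, 0))
print(features)
```

Because the check runs on construction, the same validation applies whether the features are read from disk or computed during training.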

@peterdudfield peterdudfield merged commit 4fa6173 into main Oct 11, 2021
@peterdudfield peterdudfield deleted the issue/166-batch-pydantic branch October 11, 2021 09:46
Development

Successfully merging this pull request may close these issues.

Example --> Pydantic
3 participants