Conversation
Looks good!
Lots of comments, but they're mostly pretty tiny!
A few general thoughts:
- Wherever possible, please consider removing any dependencies on `torch`, in preparation for Remove PyTorch from the code #86. e.g. please consider using `np.random.randn` instead of `torch.randn`.
- I think I've made `nowcasting_dataset` over-complicated by starting `nowcasting_dataset` with the intention of loading data on-the-fly during ML training, and then swapping to using `nowcasting_dataset` to create on-disk pre-prepared batches. I think we can simplify things a lot by removing support for loading on-the-fly :) (i.e. I think we can safely say that, from now on, `nowcasting_dataset` will just be used for pre-preparing batches!) Please see Can we simplify the code by always keeping the data in one data type (e.g. xr.DataArray) per modality? #209 - I think it could reduce the code size a lot :) And I feel really bad for not thinking about this earlier!
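As a minimal sketch of the first suggestion (the helper name and shapes are illustrative, not from the repo), a fake-data generator can swap `torch.randn` for `np.random.randn` directly, since both draw from a standard normal distribution:

```python
import numpy as np

# Hypothetical torch-free fake-data helper. Where the code previously
# called torch.randn(batch_size, seq_len), np.random.randn produces an
# equivalent standard-normal array without the torch dependency.
def make_fake_data(batch_size: int, seq_len: int) -> np.ndarray:
    return np.random.randn(batch_size, seq_len)

fake = make_fake_data(4, 19)
print(fake.shape)  # (4, 19)
```

If a torch tensor is still needed at the ML-training boundary, `torch.from_numpy` can convert it there, keeping `nowcasting_dataset` itself torch-free.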
```python
        )
    elif "time_30" in self.__getattribute__(key).dims:
        self.__setattr__(
            key, self.__getattribute__(key).isel(time_30=slice(start_i, end_i))
        )
```
Do any instances of `DataSourceOutput` have one or more fields with five-minutely and one or more fields with half-hourly datetime indexes? I guess not?

If that is a possibility, then `start_i` and `end_i` would be correct indexes into one of the indexes, but would be wrong indexes into the other datetime index, I think?
Currently each data source only has 5-minute data, or only 30-minute data. I wonder if this is good enough for the moment. If so, I can add a comment -
or can we think of a case where there will be both 5-minute and 30-minute data in one data source?
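To make the pitfall above concrete, here is an illustrative sketch (the data and coordinate names are made up, mirroring the `time` / `time_30` dims discussed above) showing that the same integer positions cover very different time spans on a 5-minutely versus a half-hourly index:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two toy DataArrays: one on a 5-minutely index, one half-hourly.
time_5 = pd.date_range("2021-01-01 00:00", periods=12, freq="5min")
time_30 = pd.date_range("2021-01-01 00:00", periods=12, freq="30min")

da_5 = xr.DataArray(np.arange(12), coords={"time": time_5}, dims="time")
da_30 = xr.DataArray(np.arange(12), coords={"time_30": time_30}, dims="time_30")

# The same positional slice selects different time spans on each index:
start_i, end_i = 0, 6
print(da_5.isel(time=slice(start_i, end_i)).time.values[-1])        # ends 00:25
print(da_30.isel(time_30=slice(start_i, end_i)).time_30.values[-1])  # ends 02:30
```

So if a single `DataSourceOutput` ever mixed both resolutions, positional `start_i` / `end_i` computed for one index would need to be rescaled (or replaced with label-based `sel`) for the other.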
```python
from nowcasting_dataset.utils import coord_to_range


class Datetime(DataSourceOutput):
```
This looks great!
Random thought, that isn't especially relevant to this PR(!), but it's possible that we might want to remove these datetime features from the on-disk batches, and instead compute these features on-the-fly, not least because we'll want to experiment with a variety of different ways of encoding position when we really start experimenting with the perceiver IO models. Related issue: https://github.com/openclimatefix/nowcasting_utils/issues/30
I've started a separate issue to discuss this: #208
But, even if we do compute these features on-the-fly, this pydantic model (and the validation code) could still be used for the datetime features computed on the fly, I guess?
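A minimal sketch of that idea (field names and bounds are illustrative, not the real `Datetime` schema): the same pydantic validation can run on features computed on the fly just as easily as on features loaded from disk.

```python
from typing import List

import numpy as np
from pydantic import BaseModel, validator

# Hypothetical, simplified stand-in for the Datetime pydantic model.
# Whether the features come from an on-disk batch or are computed
# on-the-fly, the model validates them the same way before use.
class DatetimeFeatures(BaseModel):
    hour_of_day_sin: List[float]
    hour_of_day_cos: List[float]

    @validator("hour_of_day_sin", "hour_of_day_cos")
    def check_encoded_range(cls, v):
        # sin/cos encodings must lie in [-1, 1].
        if any(x < -1.0 or x > 1.0 for x in v):
            raise ValueError("encoded datetime features must lie in [-1, 1]")
        return v

# Compute the features on the fly, then validate them.
hours = np.arange(24)
features = DatetimeFeatures(
    hour_of_day_sin=np.sin(2 * np.pi * hours / 24).tolist(),
    hour_of_day_cos=np.cos(2 * np.pi * hours / 24).tolist(),
)
print(len(features.hour_of_day_sin))  # 24
```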
PR suggestions from JK

Co-authored-by: Jack Kelly <[email protected]>
Pull Request
Description
Fixes #166
How Has This Been Tested?
Added specific unit tests
Created a fake Dataset and checked that a torch tensor is returned (unit test)
Ran the prepare-ML-data scripts (and the validate script)