Conversation
Looks good!
Lots of comments, but they're mostly pretty tiny!
A few general thoughts:
- Wherever possible, please consider removing any dependencies on `torch`, in preparation for Remove PyTorch from the code #86. e.g. please consider using `np.random.randn` instead of `torch.randn`.
- I think I've made `nowcasting_dataset` over-complicated by starting `nowcasting_dataset` with the intention of loading data on-the-fly during ML training, and then swapping to using `nowcasting_dataset` to create on-disk pre-prepared batches. I think we can simplify things a lot by removing support for loading on-the-fly :) (i.e. I think we can safely say that, from now on, `nowcasting_dataset` will just be used for pre-preparing batches!) Please see Can we simplify the code by always keeping the data in one data type (e.g. xr.DataArray) per modality? #209 - I think it could reduce the code size a lot :) And I feel really bad for not thinking about this earlier!
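As a minimal sketch of the first suggestion (the helper name and shapes are illustrative, not from the repo), a fake-data generator can swap `torch.randn` for `np.random.randn` directly, since both draw from a standard normal distribution:

```python
import numpy as np

# Hypothetical torch-free fake-data helper. Where the code previously
# called torch.randn(batch_size, seq_len), np.random.randn produces an
# equivalent standard-normal array without the torch dependency.
def make_fake_data(batch_size: int, seq_len: int) -> np.ndarray:
    return np.random.randn(batch_size, seq_len)

fake = make_fake_data(4, 19)
print(fake.shape)  # (4, 19)
```

If a torch tensor is still needed at the ML-training boundary, `torch.from_numpy` can convert it there, keeping `nowcasting_dataset` itself torch-free.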
```python
        )
    elif "time_30" in self.__getattribute__(key).dims:
        self.__setattr__(
            key, self.__getattribute__(key).isel(time_30=slice(start_i, end_i))
        )
```
Do any instances of `DataSourceOutput` have one or more fields with five-minutely and one or more fields with half-hourly datetime indexes? I guess not?

If that is a possibility, then `start_i` and `end_i` would be correct indexes into one of the indexes, but would be wrong indexes into the other datetime index, I think?
Currently each data source only has 5-minute data, or only 30-minute data. I wonder if this is good enough for the moment. If so, I can add a comment -
or can we think of a case where there will be both 5-minute and 30-minute data in one data source?
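To make the pitfall above concrete, here is an illustrative sketch (the data and coordinate names are made up, mirroring the `time` / `time_30` dims discussed above) showing that the same integer positions cover very different time spans on a 5-minutely versus a half-hourly index:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two toy DataArrays: one on a 5-minutely index, one half-hourly.
time_5 = pd.date_range("2021-01-01 00:00", periods=12, freq="5min")
time_30 = pd.date_range("2021-01-01 00:00", periods=12, freq="30min")

da_5 = xr.DataArray(np.arange(12), coords={"time": time_5}, dims="time")
da_30 = xr.DataArray(np.arange(12), coords={"time_30": time_30}, dims="time_30")

# The same positional slice selects different time spans on each index:
start_i, end_i = 0, 6
print(da_5.isel(time=slice(start_i, end_i)).time.values[-1])        # ends 00:25
print(da_30.isel(time_30=slice(start_i, end_i)).time_30.values[-1])  # ends 02:30
```

So if a single `DataSourceOutput` ever mixed both resolutions, positional `start_i` / `end_i` computed for one index would need to be rescaled (or replaced with label-based `sel`) for the other.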
```python
from nowcasting_dataset.utils import coord_to_range


class Datetime(DataSourceOutput):
```
This looks great!
Random thought, that isn't especially relevant to this PR(!), but it's possible that we might want to remove these datetime features from the on-disk batches, and instead compute these features on-the-fly, not least because we'll want to experiment with a variety of different ways of encoding position when we really start experimenting with the perceiver IO models. Related issue: https://github.com/openclimatefix/nowcasting_utils/issues/30
I've started a separate issue to discuss this: #208
But, even if we do compute these features on-the-fly, this pydantic model (and the validation code) could still be used for the datetime features computed on the fly, I guess?
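A minimal sketch of that idea (field names and bounds are illustrative, not the real `Datetime` schema): the same pydantic validation can run on features computed on the fly just as easily as on features loaded from disk.

```python
from typing import List

import numpy as np
from pydantic import BaseModel, validator

# Hypothetical, simplified stand-in for the Datetime pydantic model.
# Whether the features come from an on-disk batch or are computed
# on-the-fly, the model validates them the same way before use.
class DatetimeFeatures(BaseModel):
    hour_of_day_sin: List[float]
    hour_of_day_cos: List[float]

    @validator("hour_of_day_sin", "hour_of_day_cos")
    def check_encoded_range(cls, v):
        # sin/cos encodings must lie in [-1, 1].
        if any(x < -1.0 or x > 1.0 for x in v):
            raise ValueError("encoded datetime features must lie in [-1, 1]")
        return v

# Compute the features on the fly, then validate them.
hours = np.arange(24)
features = DatetimeFeatures(
    hour_of_day_sin=np.sin(2 * np.pi * hours / 24).tolist(),
    hour_of_day_cos=np.cos(2 * np.pi * hours / 24).tolist(),
)
print(len(features.hour_of_day_sin))  # 24
```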
PR suggestions from JK

Co-authored-by: Jack Kelly <[email protected]>
Pull Request
Description
Fixes #166
How Has This Been Tested?
Added specific unit tests
Created a fake Dataset and checked that a torch tensor is returned (unit test)
Ran the prepare-ML-data scripts (and the validate script)