Simplify the calculation of available datetimes across all DataSources #220
Conversation
…ps_to_5_minutes(). Ready for review (I think)
```diff
@@ -189,50 +180,6 @@ def datetime_features_in_example(index: pd.DatetimeIndex) -> Datetime:
     return Datetime(**datetime_dict)


-def fill_30_minutes_timestamps_to_5_minutes(index: pd.DatetimeIndex) -> pd.DatetimeIndex:
```
I feel bad for deleting this function 🙂 but, as far as I can tell, this is no longer needed now that this PR simplifies the way that t0_datetimes are computed.
…ction_of_2_dataframes_of_periods()
```diff
@@ -30,6 +30,9 @@ class DataSource:
         will consist of a single timestep at t0.
     convert_to_numpy: Whether or not to convert each example to numpy.
     sample_period_minutes: The time delta between each data point

+    Attributes ending in `_len` are sequence lengths represented as integer numbers of timesteps.
+    Attributes ending in `_dur` are sequence durations represented as pd.Timedeltas.
```
rename to _length and _duration?
Good idea! I might have gotten a bit carried away... I renamed `_len` to `_length` and `_dur` to `_duration` throughout the entire nowcasting_dataset codebase :) (using `grep -r --include=*.py "_dur[^a]"`)
```diff
-        logger.debug(f"Got all start times, there are {len(self.t0_datetimes)}")
+        logger.debug(f"Got all start times, there are {len(self.t0_datetimes):,d}")
```
what does ':,d' do?
The `:d` formats the value as a decimal integer. The comma inserts a comma every three digits, so `f"{1000:,d}"` would be printed as `1,000`. It makes it more human-readable :)
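As a quick illustration of the `:,d` format specifier (this example is mine, not from the PR):

```python
# The ',' option in Python's format-spec mini-language groups digits
# in threes; 'd' formats the value as a decimal integer.
n = 1234567
print(f"{n:,d}")  # prints 1,234,567
```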
```diff
@@ -63,6 +64,61 @@ def intersection_of_datetimeindexes(indexes: List[pd.DatetimeIndex]) -> pd.Datet
     return intersection


+def intersection_of_2_dataframes_of_periods(a: pd.DataFrame, b: pd.DataFrame) -> pd.DataFrame:
```
I like this method. I can't work out in my head which is faster:

1. what you have done: passing start and end times, and then looping through one of the lists of periods
2. passing a list of valid datetimes for each, and then just taking the intersection

This might be a geeky thing: if 1 works then just leave it as 1. If there is a big slowdown, then maybe we have to look at 2, but I doubt it.
Yeah, I agree: I'm optimistic that the code as currently written should be fast enough (and this code doesn't need to be super-fast because it's not in the ML training loop)... So, like you say, let's leave it as it is for now and speed it up later if needs be.
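For illustration, here is a minimal sketch of option 1 above: intersecting two DataFrames of time periods by looping over one of them. The function name and the `start_dt`/`end_dt` column names are my assumptions, not necessarily what the PR's `intersection_of_2_dataframes_of_periods()` actually uses.

```python
import pandas as pd

def intersect_periods(a: pd.DataFrame, b: pd.DataFrame) -> pd.DataFrame:
    """Return the overlapping time periods of two DataFrames.

    Each input has 'start_dt' and 'end_dt' columns (assumed names).
    """
    rows = []
    for a_row in a.itertuples():
        # Two periods overlap when each starts before the other ends.
        overlapping = b[(b["start_dt"] < a_row.end_dt) & (b["end_dt"] > a_row.start_dt)]
        for b_row in overlapping.itertuples():
            rows.append({
                "start_dt": max(a_row.start_dt, b_row.start_dt),
                "end_dt": min(a_row.end_dt, b_row.end_dt),
            })
    return pd.DataFrame(rows, columns=["start_dt", "end_dt"])
```

The loop is O(len(a) · len(b)) in the worst case, which should be fine here since (as noted above) this code runs once during setup, not inside the ML training loop.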
…ts break. More renaming to do tomorrow!
…ughout the whole nowcasting_dataset codebase. Tests pass.
Pull Request
Please see issues #204 and #223 for a description and motivation of the work done in this PR.

The basic idea is to modify the way that the code computes the valid t0 datetimes across all the `DataSource`s, so that the code will work for any arbitrary combination of sample periods (so, for example, NWPs could be hourly; satellite data could be 5-minutely; and GSP data could be half-hourly).

The main conceptual change is that each `DataSource` now computes a list of time periods when it has valid, contiguous data (in contrast, in the old code, each `DataSource` emits a list of timestamps when it has valid data, which, of course, makes it tricky to compare `DataSource`s with different sample periods). This should make it pretty trivial to implement #135.

This PR does not fully implement #223. But this PR was getting pretty huge, so I'm going to merge as-is (the tests pass) and follow up with a subsequent PR :)
Tasks to be done in a subsequent PR (and tracked in this issue comment):

- Modify `DataSource.get_contiguous_time_periods() -> pd.DataFrame` to emit a list of valid time periods (after excluding nighttime hours)
- Modify `NowcastingDataModule` to compute the intersection of all the lists of time periods from each `DataSource`
- Remove `nd_time.get_start_datetimes()`, `DataSource.get_t0_datetimes()`, and their tests, and use grep to check they're not called from anywhere I've missed

Description
- Simplified `DataModule._get_datetimes()`
- Removed `fill_30_minutes_timestamps_to_5_minutes()`
- Renamed `_len` to `_length`; and `_dur` to `_duration` throughout the code!
- Added `nd_time.get_contiguous_time_periods() -> pd.DataFrame`
- Tested `nd_time.get_contiguous_time_periods()`
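As a rough sketch of the idea behind contiguous time periods: given a DatetimeIndex of available timestamps and the data's sample period, a new period begins wherever the gap between consecutive timestamps exceeds the sample period. The signature and column names below are my guesses, not necessarily what `nd_time.get_contiguous_time_periods()` actually does.

```python
import numpy as np
import pandas as pd

def get_contiguous_time_periods(datetimes: pd.DatetimeIndex,
                                sample_period: pd.Timedelta) -> pd.DataFrame:
    """Return one (start_dt, end_dt) row per contiguous run of timestamps."""
    assert len(datetimes) > 0
    # True wherever the gap to the next timestamp is larger than one sample period:
    gaps = np.diff(datetimes.values) > sample_period.to_timedelta64()
    # A period starts at index 0 and just after each gap; it ends just
    # before each gap and at the final timestamp.
    period_starts = np.concatenate(([0], np.flatnonzero(gaps) + 1))
    period_ends = np.concatenate((np.flatnonzero(gaps), [len(datetimes) - 1]))
    return pd.DataFrame(
        {"start_dt": datetimes[period_starts], "end_dt": datetimes[period_ends]}
    )
```

Because each `DataSource` can report periods at its own sample period, intersecting the resulting period DataFrames works even when one source is 5-minutely and another is half-hourly.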
This PR is a sub-task of issue #213
How Has This Been Tested?
Modified the unittests to work with the new code. The tests pass 🙂
Checklist: