Simplify the calculation of available datetimes across all DataSources #220

JackKelly · 2021-10-12T14:52:21Z

Pull Request

Please see issue #204 and #223 for a description and motivation of the work done in this PR.

The basic idea is to modify the way that the code computes the valid t0 datetimes across all the DataSources so that the code will work for any arbitrary combination of sample periods (so, for example, NWPs could be hourly; satellite data could be 5-minutely; and GSP-data can be half-hourly).

The main conceptual change is that each DataSource now computes a list of time periods when it has valid, contiguous data (in contrast, in the old code, each DataSource emits a list of timestamps when it has valid data which, of course, makes it tricky to compare DataSources with different sample periods).

This should make it pretty trivial to implement #135.

This PR does not fully implement #223. But this PR was getting pretty huge. So I'm going to merge as is (the tests pass) and follow on with a subsequent PR :)

Tasks to be done in a subsequent PR (and tracked in this issue comment):

Implement a DataSource.get_contiguous_time_periods() -> pd.DataFrame to emit a list of valid time periods (after excluding nighttime hours)
Enable NowcastingDataModule to compute the intersection of all the lists of time periods from each DataSource.
Split those periods into train, valid, test (although, if that's too complex, compute t0 datetimes first, and then split those t0 datetimes)
Compute valid t0 datetimes. Optionally do this at a user-defined frequency (e.g. every half hour) instead of at the minimum sample period of all the DataSources.
Remove nd_time.get_start_datetimes(), DataSource.get_t0_datetimes(), and their tests, and use grep to check they're not called from anywhere I've missed.

Description

Change README so it reflects the fact that we're no longer supporting on-the-fly loading.
Modified tests.
simplify DataModule._get_datetimes()
remove fill_30_minutes_timestamps_to_5_minutes()
rename _len to _length; and _dur to _duration throughout the code!
Implement a nd_time.get_contiguous_time_periods() -> pd.DataFrame
Write test for nd_time.get_contiguous_time_periods()
add more test cases for intersection function, where the two periods are the same

This PR is a sub-task of issue #213

How Has This Been Tested?

Modified the unittests to work with the new code. The tests pass 🙂

No
Yes

Checklist:

My code follows OCF's coding style guidelines
I have performed a self-review of my own code
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
I have checked my code and corrected any misspellings

…es(). Change README so it reflects the fact that we're no longer supporting on-the-fly loading. Tests pass. Still need to simplify DataModule._get_datetimes() and remove fill_30_minutes_timestamps_to_5_minutes(). #204 #213

review-notebook-app · 2021-10-12T14:52:25Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

…ps_to_5_minutes(). Ready for review (I think)

JackKelly · 2021-10-12T16:27:07Z

nowcasting_dataset/time.py

@@ -189,50 +180,6 @@ def datetime_features_in_example(index: pd.DatetimeIndex) -> Datetime:
    return Datetime(**datetime_dict)


-def fill_30_minutes_timestamps_to_5_minutes(index: pd.DatetimeIndex) -> pd.DatetimeIndex:


I feel bad for deleting this function 🙂 but, as far as I can tell, this is no longer needed now that this PR simplifies the way that t0_datetimes are computed.

…ction_of_2_dataframes_of_periods()

#223

peterdudfield · 2021-10-14T08:43:58Z

nowcasting_dataset/data_sources/data_source.py

@@ -30,6 +30,9 @@ class DataSource:
        will consist of a single timestep at t0.
      convert_to_numpy: Whether or not to convert each example to numpy.
      sample_period_minutes: The time delta between each data point
+
+    Attributes ending in `_len` are sequence lengths represented as integer numbers of timesteps.
+    Attributes ending in `_dur` are sequence durations represented as pd.Timedeltas.


rename to _length and _duration?

Good idea!

I might have gotten a bit carried away... I renamed _len to _length and _dur to _duration throughout the entire nowcasting_dataset codebase :) (using grep -r --include=*.py "_dur[^a]")

nowcasting_dataset/data_sources/data_source.py

peterdudfield · 2021-10-14T08:47:43Z

nowcasting_dataset/dataset/datamodule.py


-        logger.debug(f"Got all start times, there are {len(self.t0_datetimes)}")
+        logger.debug(f"Got all start times, there are {len(self.t0_datetimes):,d}")


what does ':,d' do?

the :d formats as a decimal number. The comma inserts a comma after every three zeros. So f{1000:,d} would be printed as 1,000. It makes it more human-readable :)

tests/test_datamodule.py

peterdudfield · 2021-10-14T08:55:07Z

nowcasting_dataset/time.py

@@ -63,6 +64,61 @@ def intersection_of_datetimeindexes(indexes: List[pd.DatetimeIndex]) -> pd.Datet
    return intersection


+def intersection_of_2_dataframes_of_periods(a: pd.DataFrame, b: pd.DataFrame) -> pd.DataFrame:


I like this method, I cant work out in my head what is faster,

what you have done, passing start and end times and the have to do a loop through one of the periods

passing two a list of valid datetimes, and then just taking the intersection

Other option: This might be a geeky thing and if 1. works then just leave 1. If there is a big slow down, then maybe we have to look at 2., but i doubt it

Yeah, I agree: I'm optimistic that the code as currently written should be fast enough (and this code doesn't need to be super-fast because it's not in the ML training loop)... So, like you say, let's leave it as it is for now and speed it up later if needs be.

tests/test_time.py

…ts break. More renaming to do tomorrow!

…ughout the whole nowcasting_dataset codebase. Tests pass.

JackKelly added enhancement New feature or request refactoring labels Oct 12, 2021

JackKelly self-assigned this Oct 12, 2021

Simplify DataModule._get_datetimes(). Remove fill_30_minutes_timestam…

af5c8e0

…ps_to_5_minutes(). Ready for review (I think)

JackKelly commented Oct 12, 2021

View reviewed changes

JackKelly marked this pull request as ready for review October 12, 2021 16:28

JackKelly requested review from peterdudfield and removed request for peterdudfield October 12, 2021 16:28

JackKelly marked this pull request as draft October 13, 2021 07:23

JackKelly added 8 commits October 13, 2021 08:40

document how to download NWP, satellite, and PV. #224

61f71ef

implemented intersection_of_2_dataframes_of_periods() and tests

25ca0d0

fixing linter error by removing blank line after docstring in interse…

579d449

…ction_of_2_dataframes_of_periods()

move README changes to separate PR

833323d

sketched out DataSource.get_contiguous_time_periods() #223

1667072

implement nd_time.get_contiguous_time_periods() #223

b78b438

adding test case for when two periods are exactly the same

d6ee033

Written test for nd_time.get_contiguous_time_periods(). All tests pass.

03508c4

#223