Skip to content
This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Simplify the calculation of available datetimes across all DataSources #220

Merged
merged 16 commits into from
Oct 15, 2021

Conversation

JackKelly
Copy link
Member

@JackKelly JackKelly commented Oct 12, 2021

Pull Request

Please see issue #204 and #223 for a description and motivation of the work done in this PR.

The basic idea is to modify the way that the code computes the valid t0 datetimes across all the DataSources so that the code will work for any arbitrary combination of sample periods (so, for example, NWPs could be hourly; satellite data could be 5-minutely; and GSP-data can be half-hourly).

The main conceptual change is that each DataSource now computes a list of time periods when it has valid, contiguous data (in contrast, in the old code, each DataSource emits a list of timestamps when it has valid data which, of course, makes it tricky to compare DataSources with different sample periods).

This should make it pretty trivial to implement #135.

This PR does not fully implement #223. But this PR was getting pretty huge. So I'm going to merge as is (the tests pass) and follow on with a subsequent PR :)

Tasks to be done in a subsequent PR (and tracked in this issue comment):

  • Implement a DataSource.get_contiguous_time_periods() -> pd.DataFrame to emit a list of valid time periods (after excluding nighttime hours)
  • Enable NowcastingDataModule to compute the intersection of all the lists of time periods from each DataSource.
  • Split those periods into train, valid, test (although, if that's too complex, compute t0 datetimes first, and then split those t0 datetimes)
  • Compute valid t0 datetimes. Optionally do this at a user-defined frequency (e.g. every half hour) instead of at the minimum sample period of all the DataSources.
  • Remove nd_time.get_start_datetimes(), DataSource.get_t0_datetimes(), and their tests, and use grep to check they're not called from anywhere I've missed.

Description

  • Change README so it reflects the fact that we're no longer supporting on-the-fly loading.
  • Modified tests.
  • simplify DataModule._get_datetimes()
  • remove fill_30_minutes_timestamps_to_5_minutes()
  • rename _len to _length; and _dur to _duration throughout the code!
  • Implement a nd_time.get_contiguous_time_periods() -> pd.DataFrame
  • Write test for nd_time.get_contiguous_time_periods()
  • add more test cases for intersection function, where the two periods are the same

This PR is a sub-task of issue #213

How Has This Been Tested?

Modified the unittests to work with the new code. The tests pass 🙂

  • No
  • Yes

Checklist:

  • My code follows OCF's coding style guidelines
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked my code and corrected any misspellings

…es(). Change README so it reflects the fact that we're no longer supporting on-the-fly loading. Tests pass. Still need to simplify DataModule._get_datetimes() and remove fill_30_minutes_timestamps_to_5_minutes(). #204  #213
@JackKelly JackKelly added enhancement New feature or request refactoring labels Oct 12, 2021
@JackKelly JackKelly self-assigned this Oct 12, 2021
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

…ps_to_5_minutes(). Ready for review (I think)
@@ -189,50 +180,6 @@ def datetime_features_in_example(index: pd.DatetimeIndex) -> Datetime:
return Datetime(**datetime_dict)


def fill_30_minutes_timestamps_to_5_minutes(index: pd.DatetimeIndex) -> pd.DatetimeIndex:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel bad for deleting this function 🙂 but, as far as I can tell, this is no longer needed now that this PR simplifies the way that t0_datetimes are computed.

@JackKelly JackKelly marked this pull request as ready for review October 12, 2021 16:28
@JackKelly JackKelly requested review from peterdudfield and removed request for peterdudfield October 12, 2021 16:28
@JackKelly JackKelly marked this pull request as draft October 13, 2021 07:23
@@ -30,6 +30,9 @@ class DataSource:
will consist of a single timestep at t0.
convert_to_numpy: Whether or not to convert each example to numpy.
sample_period_minutes: The time delta between each data point

Attributes ending in `_len` are sequence lengths represented as integer numbers of timesteps.
Attributes ending in `_dur` are sequence durations represented as pd.Timedeltas.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to _length and _duration?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea!

I might have gotten a bit carried away... I renamed _len to _length and _dur to _duration throughout the entire nowcasting_dataset codebase :) (using grep -r --include=*.py "_dur[^a]")


logger.debug(f"Got all start times, there are {len(self.t0_datetimes)}")
logger.debug(f"Got all start times, there are {len(self.t0_datetimes):,d}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does ':,d' do?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the :d formats as a decimal number. The comma inserts a comma after every three zeros. So f{1000:,d} would be printed as 1,000. It makes it more human-readable :)

@@ -63,6 +64,61 @@ def intersection_of_datetimeindexes(indexes: List[pd.DatetimeIndex]) -> pd.Datet
return intersection


def intersection_of_2_dataframes_of_periods(a: pd.DataFrame, b: pd.DataFrame) -> pd.DataFrame:
Copy link
Contributor

@peterdudfield peterdudfield Oct 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this method, I cant work out in my head what is faster,

  1. what you have done, passing start and end times and the have to do a loop through one of the periods
  2. passing two a list of valid datetimes, and then just taking the intersection

Other option: This might be a geeky thing and if 1. works then just leave 1. If there is a big slow down, then maybe we have to look at 2., but i doubt it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree: I'm optimistic that the code as currently written should be fast enough (and this code doesn't need to be super-fast because it's not in the ML training loop)... So, like you say, let's leave it as it is for now and speed it up later if needs be.

@JackKelly JackKelly marked this pull request as ready for review October 15, 2021 10:48
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request refactoring
Projects
No open projects
Status: Done
2 participants