This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Speed up GSPDataSource.get_locations() #305

Closed
Tracked by #341
JackKelly opened this issue Oct 28, 2021 · 9 comments · Fixed by #375
Labels: enhancement, refactoring

Comments

@JackKelly
Member

If a constant number of GSPs are available across all timesteps then we don't need to go round a slow for loop. Instead we can just randomly pick GSPs in one go using something like np.random.choice(self.metadata["gsp_id"], size=len(t0_datetimes))
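A minimal sketch of the "pick in one go" idea, assuming every GSP has data at every timestep. The `metadata` dict and `t0_datetimes` array below are hypothetical stand-ins for `self.metadata` and the real t0 datetimes, and a seeded `Generator` is used in place of `np.random.choice` only to make the example reproducible:

```python
import numpy as np

# Hypothetical stand-in for self.metadata (only the gsp_id column matters here).
metadata = {"gsp_id": np.array([1, 2, 3, 4, 5])}

# Hypothetical t0 datetimes: eight half-hourly timesteps.
t0_datetimes = np.datetime64("2021-06-01T12:00") + np.arange(8) * np.timedelta64(30, "m")

# One vectorised draw (with replacement) instead of a Python loop over t0s.
rng = np.random.default_rng(seed=42)
gsp_ids = rng.choice(metadata["gsp_id"], size=len(t0_datetimes))
```

This is only valid if the set of available GSPs is the same for every t0, which is the condition stated above.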

@peterdudfield
Contributor

Don't we need to make sure there are no NaNs for that time period? I.e. for each t0, only pick GSPs where there is data.
Perhaps I've missed something though...
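To illustrate the concern, here is a rough sketch of the per-t0 selection that a loop has to do when some GSPs are NaN at some timesteps. The array layout (rows are timesteps, columns are GSPs) and all names are assumptions for illustration, not the actual `GSPDataSource` code:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical GSP power data: rows are timesteps, columns are GSPs.
gsp_ids = np.array([1, 2, 3])
gsp_power = np.array([
    [0.5, np.nan, 0.2],     # GSP 2 has no data at this timestep
    [np.nan, np.nan, 0.7],  # only GSP 3 has data
    [0.1, 0.3, 0.4],        # all GSPs have data
])
t0_indices = [0, 1, 2]

chosen = []
for t0_idx in t0_indices:
    # Restrict the draw to GSPs that have data at this particular t0.
    has_data = ~np.isnan(gsp_power[t0_idx])
    chosen.append(int(rng.choice(gsp_ids[has_data])))
```

Because the candidate set changes from one t0 to the next, a single `choice` call over all GSP IDs could return a GSP with no data at that t0.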

@JackKelly
Member Author

yeah, that's right, but we could do that as a one-off bit of analysis... i.e. manually check there are no NaNs in GSP data, then we can vastly speed up GSPDataSource.get_locations

@JackKelly
Member Author

I think this should come before #304 tbh, because this could be a simple win, with minimal code to modify

@peterdudfield peterdudfield moved this from Todo to In Progress in Nowcasting Nov 12, 2021
@JackKelly JackKelly linked a pull request Nov 16, 2021 that will close this issue
7 tasks
Repository owner moved this from In Progress to Done in Nowcasting Nov 16, 2021
@JackKelly
Member Author

In practice, unfortunately it looks like there are NaNs in the GSP data, so we still go round the big (slow) loop:

2021-11-16 15:13:06,479 WARNING There are some nans in the gsp data, so to get x,y locations we have to do a big loop at /home/jack/dev/ocf/nowcasting_dataset/nowcasting_dataset/data_sources/gsp/gsp_data_source.py#L142

Some thoughts on next steps:

  1. Investigate the GSP data on disk by hand (e.g. in a Jupyter notebook). How are the NaNs distributed? Specifically:
    • Are there a small number of timesteps when all GSPs are NaN? If so, can we just throw away those timesteps?
    • Are there some GSPs which are always NaN? If so, can we throw them away?
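The two questions above could be checked with something like the following sketch. It uses a small made-up array (rows are timesteps, columns are GSPs) in place of the real GSP data on disk, so the names and values are assumptions:

```python
import numpy as np

# Hypothetical GSP power data: rows are timesteps, columns are GSPs.
gsp_power = np.array([
    [0.5, np.nan, 0.2],
    [0.4, np.nan, 0.3],
    [np.nan, np.nan, np.nan],  # a timestep where every GSP is NaN
    [0.1, np.nan, 0.6],
])

is_nan = np.isnan(gsp_power)

# Timesteps where all GSPs are NaN (candidates to throw away).
timesteps_all_nan = is_nan.all(axis=1)

# GSPs which are NaN at every timestep (candidates to throw away).
gsps_always_nan = is_nan.all(axis=0)

# Drop both and count the NaNs that are left.
cleaned = gsp_power[~timesteps_all_nan][:, ~gsps_always_nan]
nans_remaining = int(np.isnan(cleaned).sum())
```

If `nans_remaining` is zero after dropping those rows and columns, the vectorised selection becomes safe on the cleaned data.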

@JackKelly JackKelly reopened this Nov 16, 2021
Repository owner moved this from Done to In Progress in Nowcasting Nov 16, 2021
@peterdudfield
Contributor

That's a shame. What about #304?

@JackKelly
Member Author

I'd be keen to find an elegant solution to #305 (this issue) if we can.... parallelism can be complex to think about and debug 🙂

@JackKelly
Member Author

Hehe.. good news, @peterdudfield! Your speedy code for GSPDataSource.get_locations() is now being used, after PR #497 drops GSPs with zero PV (and I assume those GSPs also had NaNs!)
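A rough sketch of why dropping zero-PV GSPs can unlock the fast path. This is not the code from PR #497; it assumes "zero PV" means a GSP whose total generation over the whole period is zero (an all-NaN GSP also sums to zero under `np.nansum`), and all names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical GSP power data: rows are timesteps, columns are GSPs.
gsp_ids = np.array([1, 2, 3])
gsp_power = np.array([
    [0.5, np.nan, 0.0],
    [0.4, np.nan, np.nan],
    [0.1, np.nan, 0.0],
])

# Assumption: keep only GSPs with some non-zero generation.
has_pv = np.nansum(gsp_power, axis=0) > 0
kept_ids = gsp_ids[has_pv]
kept_power = gsp_power[:, has_pv]

# The fast path is only safe if no NaNs remain after the drop.
use_fast_path = not np.isnan(kept_power).any()
n_examples = 4
if use_fast_path:
    chosen = rng.choice(kept_ids, size=n_examples)
```

In this toy data the only NaNs live in the GSPs that get dropped, which matches the guess above that the zero-PV GSPs were also the ones with NaNs.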

@peterdudfield
Contributor

> Hehe.. good news, @peterdudfield! Your speedy code for GSPDataSource.get_locations() is now being used, after PR #497 drops GSPs with zero PV (and I assume those GSPs also had NaNs!)

That's good news. Close the issue?

@JackKelly
Member Author

yeah, let's close this for now! Great work!

Repository owner moved this from In Progress to Done in Nowcasting Nov 29, 2021