This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Dataset Validation script #138

Closed
3 of 4 tasks
JackKelly opened this issue Sep 17, 2021 · 3 comments · Fixed by #151
Labels
data New data source or feature; or modification of existing data source enhancement New feature or request

Comments

@JackKelly
Member

JackKelly commented Sep 17, 2021

Detailed Description

A script which goes through all the pre-prepared batches and checks:

  • That there's no overlap between train, test, and validation sets :)
  • That every batch contains the fields we'd expect
  • That every value is within the range we'd expect
  • That the duration of each example is what we'd expect
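
As a rough illustration of the four checks listed above, here is a minimal sketch using only the standard library. Everything specific in it is an assumption made for illustration: the field names `pv_yield` and `gsp_yield`, the `[0, 1]` ranges, the example length of 4 timesteps, and the idea that each example carries a string ID. The real batch format in this repository is not shown in this issue.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical field names and expected value ranges (assumptions, not the real schema).
EXPECTED_RANGES: Dict[str, Tuple[float, float]] = {
    "pv_yield": (0.0, 1.0),
    "gsp_yield": (0.0, 1.0),
}
EXPECTED_LENGTH = 4  # assumed number of timesteps per example

# A batch maps a field name to a list of examples; each example is a sequence of values.
Batch = Dict[str, List[Sequence[float]]]


def validate_batch(batch: Batch) -> List[str]:
    """Return a list of human-readable problems found in one batch."""
    problems = []
    # Check 2: every expected field is present.
    for name in sorted(set(EXPECTED_RANGES) - set(batch)):
        problems.append(f"missing field: {name}")
    for name, (low, high) in EXPECTED_RANGES.items():
        for i, example in enumerate(batch.get(name, [])):
            # Check 4: each example has the expected duration.
            if len(example) != EXPECTED_LENGTH:
                problems.append(f"{name}[{i}]: length {len(example)} != {EXPECTED_LENGTH}")
            # Check 3: every value is within the expected range (NaN also fails this).
            if any(not (low <= v <= high) for v in example):
                problems.append(f"{name}[{i}]: value outside [{low}, {high}]")
    return problems


def find_overlap(splits: Dict[str, Set[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Check 1: return the example IDs shared by each pair of splits."""
    names = sorted(splits)
    overlaps = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = splits[a] & splits[b]
            if shared:
                overlaps[(a, b)] = shared
    return overlaps
```

A script built on this would loop over every pre-prepared batch, collect the output of `validate_batch`, and fail if either that list or the result of `find_overlap` is non-empty.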

Context

Hopefully our unit-tests will catch these bugs. But, just to be super-careful, it might be nice to also have a script which goes through the entire pre-prepared dataset and checks for these issues!

@JackKelly JackKelly added enhancement New feature or request data New data source or feature; or modification of existing data source labels Sep 17, 2021
@JackKelly JackKelly added this to the WP1 essential tasks milestone Sep 17, 2021
@peterdudfield
Contributor

I'll give this a go.

@JackKelly
Member Author

Awesome, thank you!

peterdudfield added a commit that referenced this issue Sep 21, 2021
@peterdudfield peterdudfield changed the title Dataset testing script Dataset Validation script Sep 22, 2021
@peterdudfield
Contributor

On checking that every value is within the range we expect:
for PV and GSP data, I would expect values between 0 and 1, as we have normalized them.
Should we come up with a list of criteria for the others too?

Maybe a later issue would be to create a slightly better object than a TypedDict (I'm a fan of pydantic classes) that could do this validation on the fly.
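
To make "validation on the fly" concrete, here is a minimal sketch of the idea. It uses a standard-library `dataclass` with `__post_init__` as a stand-in for a pydantic class, purely to keep the example dependency-free; a pydantic model with a validator would play the same role. The class name `NormalisedSeries` and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NormalisedSeries:
    """Hypothetical container for one normalised PV or GSP time series."""

    name: str
    values: List[float]

    def __post_init__(self) -> None:
        # Runs on construction, so bad data fails as soon as the object is built.
        bad = [v for v in self.values if not (0.0 <= v <= 1.0)]
        if bad:
            raise ValueError(f"{self.name}: {len(bad)} value(s) outside [0, 1]")
```

With this, `NormalisedSeries("pv", [0.0, 0.5, 1.0])` constructs normally, while `NormalisedSeries("gsp", [0.2, 1.2])` raises a `ValueError` at creation time rather than later in training.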
