-
-
Notifications
You must be signed in to change notification settings - Fork 6
Get nowcasting_dataset
running on OCF's on-premises server
#152
Conversation
…emises.py. No changes as of yet!
… local hardware. Does not work yet!
Worth adding 'pathy' to help run tests |
I'll work on fixing the failing tests and merging |
@@ -64,7 +64,7 @@ def gsp_data_source(): | |||
|
|||
@pytest.fixture | |||
def configuration(): | |||
filename = os.path.join(os.path.dirname(nowcasting_dataset.__file__), "config", "example.yaml") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR deletes example.yaml
(see the 'conversation' tab!)
@@ -14,17 +14,20 @@ dependencies: | |||
- zarr | |||
- xarray | |||
- ipykernel | |||
- h5netcdf # For opening NetCDF files from cloud buckets. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
black
made this change (and a bunch more like this). Turns out that black violates the pep8 recommendation to have two spaces before a comment, and I can't see how to force black to allow two spaces before comments, but ho-hum :)
@@ -43,3 +44,38 @@ def gcp_to_aws(gcp_filename: str, gcs: gcsfs.GCSFileSystem, aws_filename: str, a | |||
upload_one_file( | |||
remote_filename=aws_filename, bucket=aws_bucket, local_filename=local_filename | |||
) | |||
|
|||
|
|||
def get_maximum_batch_id(path: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved from nowcasting_dataset/utils.py
(and slightly modified)
@@ -28,7 +29,6 @@ process: | |||
- lcc | |||
- mcc | |||
- hcc | |||
prcesion: 16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
precision
is left-over from when nowcasting_dataset
was used during on-the-fly ML training. precision
is only needed during ML training (not for creating batches), so removing this field
@@ -42,4 +42,3 @@ process: | |||
- VIS008 | |||
- WV_062 | |||
- WV_073 | |||
val_check_interval: 1000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
val_check_interval
is left-over from when nowcasting_dataset
was used during on-the-fly ML training.
Returns: pydantic class | ||
""" | ||
|
||
# load the file to a dictionary | ||
with open(filename, "r") as stream: | ||
with fsspec.open(filename, mode="r") as stream: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fsspec.open()
will open from local, GCP or AWS (it's magic!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we get ride of https://github.com/openclimatefix/nowcasting_dataset/blob/main/nowcasting_dataset/config/load.py#L28 then?
@@ -133,38 +132,3 @@ def pad_data( | |||
data[variable] = pad_nans(data[variable], pad_width=((0, 0), pad_shape)) # (axis0, axis1) | |||
|
|||
return data | |||
|
|||
|
|||
def get_maximum_batch_id_from_gcs(remote_path: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to nowcasting_dataset/cloud/utils.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good move
@@ -32,3 +32,9 @@ def delete_all_files_in_temp_path(path: Path): | |||
_LOG.info(f"Deleting {len(files)} files from {path}.") | |||
for f in files: | |||
os.remove(f) | |||
|
|||
|
|||
def check_path_exists(path: Union[str, Path]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add doc string?
yaml.safe_dump(d, yaml_file, default_flow_style=False) | ||
|
||
|
||
def save_configuration_to_gcs(configuration: Configuration): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
nowcasting_dataset/dataset/batch.py
Outdated
|
||
|
||
def write_batch_locally(batch: List[Example], batch_i: int): | ||
def write_batch_locally(batch: List[Example], batch_i: int, path: Union[Path, Pathy]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weoth adding a default of path = Path("~/temp/").expanduser()? or is this too hidden?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good question. Hmm... I think that, because ~/temp/
is set as the default in model.py
, we perhaps don't want to set the default twice (i.e. in model.py
and here?). But I don't have any strong feelings about this!
I have just added path
to the docstring though (oops!)
ENABLE_NEPTUNE_LOGGING = False | ||
|
||
# load configuration, this can be changed to a different filename as needed. | ||
# TODO: Pass this in as a command-line argument. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when this is done can we use Click. Is it worth making an issue for this? and then putting that issue number int eh code. Keeps the TODOs a bit tidier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I second Click!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea! And thanks for letting me know about Click - I hadn't seen that before!
I've started an issue here.
I'll add a link in the code to the issue.
@@ -175,8 +211,7 @@ def main(): | |||
iterate_over_dataloader_and_write_to_disk(datamodule.test_dataloader(), DST_TEST_PATH) | |||
_LOG.info("Done!") | |||
|
|||
# save configuration to gcs | |||
save_configuration_to_cloud(configuration=config, cloud=CLOUD) | |||
save_yaml_configuration(config) | |||
|
|||
|
|||
if __name__ == "__main__": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice work making that much nice with the direct paths
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I can't take any credit! fsspec
does all the magic :) There's still more work to do to properly use fsspec
throughout our code though (issue #153 tracks that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
tests/config/test.yaml
Outdated
output_data: | ||
filepath: solar-pv-nowcasting-data/prepared_ML_training_data/v5/ | ||
filepath: NA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this, would it be even more verbose syaing filepath: not-setting-this-as-its-for-a-test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea! Done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good - great to get it working on Leonardo - well done.
Ive added a few comments but they are all pretty minor, so choose to implement as you wish
Actually something funny is happening with the CI?CD, so lets sort that before merging |
Not sue why any of these new ones should cause an issue |
…eter/on-premises # Conflicts: # requirements.txt
Origin/peter/on premises
Pull Request
Description
Modifies
nowcasting_dataset
soprepare_ml_data.py
can be run on our local hardware, just by telling it to use theon_premises.yaml
config.Some other things this PR does:
*_zarr_path
)fsspec
and/orpathy
to provide a unified interface to GCP, AWS, and local filesystems #153)example.yaml
just to make our lives easier (so we don't have to continually updateexample.yaml
). Eithergcp.yaml
oron_premises.yaml
can be used as examples.yaml
files are (hopefully) a bit simpler and easier to understand now: Instead of my daft idea of being overly complicated and asking users to provide variousbase_paths
, and then the code joins the base path to the other paths... the user now gives the full path for each field in the config. This ends up with slightly longer lines but hopefully it's more explicit and easier for users to understand what's going on; and also simplifies the code a little.Fixes issue #121
TODO:
How Has This Been Tested?
It successfully runs on
leonardo
(there are occasional crashes, but those are unrelated to this PR, I think. See #158). And I've fixed all the unittests, so they all pass now :)Checklist: