Skip to content
This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Get nowcasting_dataset running on OCF's on-premises server #152

Merged
merged 19 commits into from
Sep 28, 2021

Conversation

JackKelly
Copy link
Member

@JackKelly JackKelly commented Sep 22, 2021

Pull Request

Description

Modifies nowcasting_dataset so prepare_ml_data.py can be run on our local hardware, just by telling it to use the on_premises.yaml config.

Some other things this PR does:

  • tweaks some of the config arguments (e.g. consistently using *_zarr_path)
  • To minimise the amount of code we have to write to handle multiple compute environments, this PR makes a start on using Pathy and fsspec to provide a unified interface to GCP, AWS, and local (Use fsspec and/or pathy to provide a unified interface to GCP, AWS, and local filesystems #153)
  • I deleted example.yaml just to make our lives easier (so we don't have to continually update example.yaml). Either gcp.yaml or on_premises.yaml can be used as examples.
  • The config yaml files are (hopefully) a bit simpler and easier to understand now: Instead of my daft idea of being overly complicated and asking users to provide various base_paths, and then the code joins the base path to the other paths... the user now gives the full path for each field in the config. This ends up with slightly longer lines but hopefully it's more explicit and easier for users to understand what's going on; and also simplifies the code a little.

Fixes issue #121

TODO:

  • Check for other uses of some of the functions I've deleted / renamed

How Has This Been Tested?

It successfully runs on leonardo (there are occasional crashes, but those are unrelated to this PR, I think. See #158). And I've fixed all the unittests, so they all pass now :)

  • No
  • Yes

Checklist:

  • My code follows OCF's coding style guidelines
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked my code and corrected any misspellings

@JackKelly JackKelly added enhancement New feature or request data New data source or feature; or modification of existing data source infrastructure/on-premises hardware labels Sep 22, 2021
@JackKelly JackKelly added this to the WP1 essential tasks milestone Sep 22, 2021
@JackKelly JackKelly self-assigned this Sep 22, 2021
@JackKelly JackKelly linked an issue Sep 22, 2021 that may be closed by this pull request
@peterdudfield
Copy link
Contributor

Worth adding 'pathy' to help run tests

@peterdudfield peterdudfield removed this from the WP1 essential tasks milestone Sep 24, 2021
@JackKelly
Copy link
Member Author

I'll work on fixing the failing tests and merging main into this branch...

@@ -64,7 +64,7 @@ def gsp_data_source():

@pytest.fixture
def configuration():
filename = os.path.join(os.path.dirname(nowcasting_dataset.__file__), "config", "example.yaml")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR deletes example.yaml (see the 'conversation' tab!)

@@ -14,17 +14,20 @@ dependencies:
- zarr
- xarray
- ipykernel
- h5netcdf # For opening NetCDF files from cloud buckets.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

black made this change (and a bunch more like this). Turns out that black violates the pep8 recommendation to have two spaces before a comment, and I can't see how to force black to allow two spaces before comments, but ho-hum :)

@@ -43,3 +44,38 @@ def gcp_to_aws(gcp_filename: str, gcs: gcsfs.GCSFileSystem, aws_filename: str, a
upload_one_file(
remote_filename=aws_filename, bucket=aws_bucket, local_filename=local_filename
)


def get_maximum_batch_id(path: str):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved from nowcasting_dataset/utils.py (and slightly modified)

@@ -28,7 +29,6 @@ process:
- lcc
- mcc
- hcc
prcesion: 16
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

precision is left-over from when nowcasting_dataset was used during on-the-fly ML training. precision is only needed during ML training (not for creating batches), so removing this field

@@ -42,4 +42,3 @@ process:
- VIS008
- WV_062
- WV_073
val_check_interval: 1000
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

val_check_interval is left-over from when nowcasting_dataset was used during on-the-fly ML training.

Returns: pydantic class
"""

# load the file to a dictionary
with open(filename, "r") as stream:
with fsspec.open(filename, mode="r") as stream:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fsspec.open() will open from local, GCP or AWS (it's magic!)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -133,38 +132,3 @@ def pad_data(
data[variable] = pad_nans(data[variable], pad_width=((0, 0), pad_shape)) # (axis0, axis1)

return data


def get_maximum_batch_id_from_gcs(remote_path: str):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to nowcasting_dataset/cloud/utils.py

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good move

@@ -32,3 +32,9 @@ def delete_all_files_in_temp_path(path: Path):
_LOG.info(f"Deleting {len(files)} files from {path}.")
for f in files:
os.remove(f)


def check_path_exists(path: Union[str, Path]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add doc string?

yaml.safe_dump(d, yaml_file, default_flow_style=False)


def save_configuration_to_gcs(configuration: Configuration):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice



def write_batch_locally(batch: List[Example], batch_i: int):
def write_batch_locally(batch: List[Example], batch_i: int, path: Union[Path, Pathy]):
Copy link
Contributor

@peterdudfield peterdudfield Sep 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weoth adding a default of path = Path("~/temp/").expanduser()? or is this too hidden?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good question. Hmm... I think that, because ~/temp/ is set as the default in model.py, we perhaps don't want to set the default twice (i.e. in model.py and here?). But I don't have any strong feelings about this!

I have just added path to the docstring though (oops!)

ENABLE_NEPTUNE_LOGGING = False

# load configuration, this can be changed to a different filename as needed.
# TODO: Pass this in as a command-line argument.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when this is done can we use Click. Is it worth making an issue for this? and then putting that issue number int eh code. Keeps the TODOs a bit tidier

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I second Click!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! And thanks for letting me know about Click - I hadn't seen that before!

I've started an issue here.

I'll add a link in the code to the issue.

@@ -175,8 +211,7 @@ def main():
iterate_over_dataloader_and_write_to_disk(datamodule.test_dataloader(), DST_TEST_PATH)
_LOG.info("Done!")

# save configuration to gcs
save_configuration_to_cloud(configuration=config, cloud=CLOUD)
save_yaml_configuration(config)


if __name__ == "__main__":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work making that much nice with the direct paths

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I can't take any credit! fsspec does all the magic :) There's still more work to do to properly use fsspec throughout our code though (issue #153 tracks that)

Copy link
Member

@jacobbieker jacobbieker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

output_data:
filepath: solar-pv-nowcasting-data/prepared_ML_training_data/v5/
filepath: NA
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this, would it be even more verbose syaing filepath: not-setting-this-as-its-for-a-test

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! Done!

Copy link
Contributor

@peterdudfield peterdudfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good - great to get it working on Leonardo - well done.

Ive added a few comments but they are all pretty minor, so choose to implement as you wish

@peterdudfield
Copy link
Contributor

Actually something funny is happening with the CI?CD, so lets sort that before merging

@peterdudfield
Copy link
Contributor

Actually something funny is happening with the CI?CD, so lets sort that before merging

Not sue why any of these new ones should cause an issue
s3fs
fsspec
pathy

@JackKelly JackKelly merged commit d449fba into main Sep 28, 2021
@JackKelly JackKelly deleted the jack/on-premises branch September 28, 2021 15:51
@JackKelly JackKelly linked an issue Oct 1, 2021 that may be closed by this pull request
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
data New data source or feature; or modification of existing data source enhancement New feature or request infrastructure/on-premises hardware
Projects
None yet
3 participants