Skip to content

simplify Dockerfile #113

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 10, 2023
Merged

simplify Dockerfile #113

merged 5 commits into from
May 10, 2023

Conversation

bochocki
Copy link
Contributor

@bochocki bochocki commented May 9, 2023

Overview

This PR uses updated versions of Python and prophet to greatly simplify the python environment setup in the Dockerfile. The code has been tested by creating a local Docker container, and sample outputs were written to the following tables in moz-fx-data-bq-data-science.bochocki:

  • tmp_desktop_kpi_forecast
  • tmp_desktop_kpi_forecast_confidences
  • tmp_mobile_kpi_forecast
  • tmp_mobile_kpi_forecast_confidences

Additional Changes

  • .gitignore: ignore additional filetypes
  • kpi_forecasting.py: set confidence intervals target from config instead of relying on hardcoded "desktop". This target is overwritten in write_confidence_intervals_to_bigquery here, but I think this change makes the it clear that we're not unintentionally using "desktop" labels on "mobile" forecasts.
  • PosteriorSampling.py: minor refactoring required to resolve errors and deprecation warnings that are now being raised by pandas as a result of package upgrades.
  • README.md: update examples
  • requirements.txt: updated packages to get easier-install versions of prophet and statsforecast.

Checklist for reviewer:

  • Commits should reference a bug or github issue, if relevant (if a bug is
    referenced, the pull request should include the bug number in the title)
  • Scan the PR and verify that no changes (particularly to
    .circleci/config.yml) will cause environment variables (particularly
    credentials) to be exposed in test logs
  • Ensure the container image will be using permissions granted to
    telemetry-airflow
    responsibly.

## Overview
This PR uses updated versions of Python and `prophet` to greatly simplify the python environment setup in the Dockerfile. The code has been tested by creating a local Docker container, and sample outputs were written to the following tables in `moz-fx-data-bq-data-science.bochocki`:
- `tmp_desktop_kpi_forecast`
- `tmp_desktop_kpi_forecast_confidences`
- `tmp_mobile_kpi_forecast`
- `tmp_mobile_kpi_forecast_confidences`

## Additional Changes
- `.gitignore`: ignore additional filetypes
- `kpi_forecasting.py`: set confidence intervals `target` from `config` instead of relying on hardcoded `"desktop"`. This `target` is overwritten in `write_confidence_intervals_to_bigquery` [here](https://github.com/mozilla/docker-etl/blob/4cfbec915375343023944d1ca23f527251a5ada8/jobs/kpi-forecasting/kpi-forecasting/Utils/DBWriter.py#L116), but I think this change makes the it clear that we're not unintentionally using "desktop" labels on "mobile" forecasts.
- `PosteriorSampling.py`: minor refactoring required to resolve errors and deprecation warnings that are now being raised by pandas as a result of package upgrades.
- `README.md`: update examples
- `requirements.txt`: updated packages to get easier-install versions of `prophet` and `statsforecast`.
@@ -31,10 +31,9 @@ def get_confidence_intervals(
uncertainty_samples["ds"] > np.datetime64(final_observed_sample_date)
]
.groupby("{}".format(aggregation_unit_of_time))
.sum()
.sum(numeric_only=True)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +89 to +91
uncertainty_samples_aggregated.iloc[0, 1:] += observed_aggregated["value"].iloc[
-1
]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same intended logic as before, but the previous code doesn't work in new versions of pandas because observed_aggregated.iloc[-1].value doesn't return a single value, it returns an array of values. Using the . column access method was also confusing, because at first glance it looks like a typo of .values which casts a pandas column to a numpy array.

@@ -71,6 +70,8 @@ def get_confidence_intervals(
columns={"y": "value"}
).sort_values(by="{}".format(aggregation_unit_of_time))

observed_aggregated = observed_aggregated.astype({"value": np.float64})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

observed_aggregated["value"] is being stored as an Int64Dtype, which is a pandas type for storing large integers. For some reason, using this type breaks the following merge on line 100:

   all_aggregated = pd.merge(
        observed_aggregated,
        uncertainty_samples_aggregated,
        on=["{}".format(aggregation_unit_of_time), "value", "type"],
        how="outer",
    )

I think using float64 instead is an okay workaround here, since the values in the confidence intervals are reported as float64 anyways.

@bochocki bochocki requested a review from perrymcmanis144 May 9, 2023 23:01
@bochocki bochocki enabled auto-merge (squash) May 9, 2023 23:36
@bochocki bochocki requested a review from irrationalagent May 9, 2023 23:36
Copy link
Contributor

@perrymcmanis144 perrymcmanis144 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very happy to see this PR. LGTM and matches the expectations I had about this work based on prior conversations we've had 👍

@bochocki bochocki merged commit 545c6f9 into main May 10, 2023
@bochocki bochocki deleted the kpi-forecasting_simplify-dockerfile branch May 10, 2023 13:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants