simplify Dockerfile #113
## Overview

This PR uses updated versions of Python and `prophet` to greatly simplify the Python environment setup in the Dockerfile. The code has been tested by creating a local Docker container, and sample outputs were written to the following tables in `moz-fx-data-bq-data-science.bochocki`:

- `tmp_desktop_kpi_forecast`
- `tmp_desktop_kpi_forecast_confidences`
- `tmp_mobile_kpi_forecast`
- `tmp_mobile_kpi_forecast_confidences`

## Additional Changes

- `.gitignore`: ignore additional filetypes
- `kpi_forecasting.py`: set the confidence intervals `target` from `config` instead of relying on a hardcoded `"desktop"`. This `target` is overwritten in `write_confidence_intervals_to_bigquery` [here](https://github.com/mozilla/docker-etl/blob/4cfbec915375343023944d1ca23f527251a5ada8/jobs/kpi-forecasting/kpi-forecasting/Utils/DBWriter.py#L116), but I think this change makes it clear that we're not unintentionally using "desktop" labels on "mobile" forecasts.
- `PosteriorSampling.py`: minor refactoring required to resolve errors and deprecation warnings that are now being raised by pandas as a result of package upgrades.
- `README.md`: update examples
- `requirements.txt`: update packages to get easier-to-install versions of `prophet` and `statsforecast`.
@@ -31,10 +31,9 @@ def get_confidence_intervals(
         uncertainty_samples["ds"] > np.datetime64(final_observed_sample_date)
     ]
     .groupby("{}".format(aggregation_unit_of_time))
-    .sum()
+    .sum(numeric_only=True)
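For context on the `numeric_only=True` change, here is a minimal sketch of how a grouped sum behaves when the frame carries a non-numeric column. The frame, its column names (`month`, `label`, `sample_0`, `sample_1`) and its values are made up for illustration and are not taken from `PosteriorSampling.py`.

```python
import pandas as pd

# Hypothetical stand-in for the filtered uncertainty samples.
samples = pd.DataFrame(
    {
        "month": ["2023-01", "2023-01", "2023-02"],
        "label": ["a", "b", "c"],  # non-numeric column (assumed for the demo)
        "sample_0": [1.0, 2.0, 3.0],
        "sample_1": [10.0, 20.0, 30.0],
    }
)

# numeric_only=True restricts the aggregation to numeric columns, so the
# string column is left out of the result instead of being "summed".
aggregated = samples.groupby("month").sum(numeric_only=True)
print(aggregated)
```

Being explicit here avoids relying on the default for `numeric_only`, which differs between pandas releases.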
uncertainty_samples_aggregated.iloc[0, 1:] += observed_aggregated["value"].iloc[
    -1
]
This is the same intended logic as before, but the previous code doesn't work in new versions of pandas because `observed_aggregated.iloc[-1].value` doesn't return a single value; it returns an array of values. Using the `.` column access method was also confusing, because at first glance it looks like a typo of `.values`, which casts a pandas column to a numpy array.
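A minimal sketch of the pattern described in this comment, using small made-up frames (the column names `month`, `s0`, `s1` and all values are assumptions, not data from the job): select the column by name first, then take the last element by position, and add that scalar to the sample columns of the first forecast row.

```python
import pandas as pd

# Hypothetical observed totals per period.
observed_aggregated = pd.DataFrame(
    {"month": ["2023-01", "2023-02"], "value": [100.0, 120.0]}
)

# Hypothetical aggregated forecast samples; column 0 is the period key,
# the remaining columns are individual samples.
uncertainty_samples_aggregated = pd.DataFrame(
    {"month": ["2023-03", "2023-04"], "s0": [5.0, 6.0], "s1": [7.0, 8.0]}
)

# Column-by-name, then last row by position: this is a single scalar.
last_observed = observed_aggregated["value"].iloc[-1]

# Add the last observed value to every sample column of the first row only.
uncertainty_samples_aggregated.iloc[0, 1:] += last_observed
print(uncertainty_samples_aggregated)
```

Bracket access (`["value"]`) also sidesteps the visual confusion with `.values` mentioned above.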
@@ -71,6 +70,8 @@ def get_confidence_intervals(
         columns={"y": "value"}
     ).sort_values(by="{}".format(aggregation_unit_of_time))
 
+    observed_aggregated = observed_aggregated.astype({"value": np.float64})
`observed_aggregated["value"]` is being stored as an `Int64Dtype`, which is pandas' nullable integer extension type. For some reason, using this type breaks the following merge on line 100:

    all_aggregated = pd.merge(
        observed_aggregated,
        uncertainty_samples_aggregated,
        on=["{}".format(aggregation_unit_of_time), "value", "type"],
        how="outer",
    )

I think using `float64` instead is an okay workaround here, since the values in the confidence intervals are reported as `float64` anyway.
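The workaround can be sketched on toy data as follows. The frames, the `month` key and the values are hypothetical, and the sketch does not try to reproduce the merge failure itself; it only shows the cast to `float64` followed by an outer merge that includes `value` among the keys.

```python
import numpy as np
import pandas as pd

# Hypothetical observed data whose "value" column uses the nullable Int64 dtype.
observed_aggregated = pd.DataFrame(
    {
        "month": ["2023-01", "2023-02"],
        "value": pd.array([100, 120], dtype="Int64"),
        "type": ["actual", "actual"],
    }
)

# Hypothetical forecast data whose "value" column is already float64.
uncertainty_samples_aggregated = pd.DataFrame(
    {"month": ["2023-03"], "value": [125.5], "type": ["forecast"]}
)

# Cast to float64 so both frames share one dtype for the "value" merge key.
observed_aggregated = observed_aggregated.astype({"value": np.float64})

all_aggregated = pd.merge(
    observed_aggregated,
    uncertainty_samples_aggregated,
    on=["month", "value", "type"],
    how="outer",
)
print(all_aggregated.dtypes)
```

Note that the cast assumes the observed values fit in a `float64` without loss, which holds for integers up to 2**53.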
Very happy to see this PR. LGTM and matches the expectations I had about this work based on prior conversations we've had 👍
## Checklist for reviewer

- … referenced, the pull request should include the bug number in the title)
- … `.circleci/config.yml`) will cause environment variables (particularly credentials) to be exposed in test logs
- … `telemetry-airflow` responsibly.