Description
This is a proposal for an addition to the procedure for preprocessing -> fitting -> predicting, currently used in the package.
Current behaviour:
tib <- tibble(
time_value = c(1:5, 1:5), geo_value = rep(letters[1:2], each = 5),
x = rnorm(10), y = rnorm(10)
) |> as_epi_df()
r <- epi_recipe(tib) |> # stores a data template
step_*() |>
# more steps
f <- frosting() |>
layer_*() |>
# more layers
ewf <- epi_workflow(r, parsnip::fit_engine(), f) # up to now, no processing or estimation has occurred
ewf <- fit(ewf, tib) # this runs the preprocessing and model fitting on tib, first checking that tib
# matches the template in `r`, then updates the workflow, returning a new "fitted" workflow
td <- get_test_data(r, tib) # grabs the necessary rows of tib so that we can process it based on r and
# then produce a single prediction at the latest time value for all keys (geo_value + any others
# in the metadata). Depending on the steps in `r`, this is likely tib[c(5, 10), ] plus some additional
# preceding time values
p <- predict(ewf, new_data = td) # produces a forecast because we used the "tail" of the training data
Alternative, non-forecast usage as currently implemented. This is rarely used, but should work:
tib2 <- tib[0, ]
r <- epi_recipe(tib2) |> # stores a data template, no difference in behaviour from the above
step_*() |>
# more steps
f <- frosting() |>
layer_*() |>
# more layers
ewf <- epi_workflow(r, parsnip::fit_engine(), f) # up to now, no processing or estimation has occurred
ewf <- fit(ewf, tib) # this runs the preprocessing and model fitting on tib, first checking that tib
# matches the template in `r`, updates the workflow, returning a new "fitted" workflow
tda <- tib[c(1, 6), ] # first time value in each geo
p1 <- predict(ewf, new_data = tda) # this will work, assuming that we can create the desired
# leads/lags (as specified in r) with tda
Proposed adjustment:
r <- epi_recipe(tib) |> # stores all the data, not just the template
step_*() |>
# more steps
f <- frosting() |>
layer_*() |>
# more layers
ewf <- epi_workflow(r, parsnip::fit_engine(), f) # up to now, no processing or estimation has occurred
p <- forecast(ewf) # automatically treat the data stored in the recipe as training data, process it,
# fit the workflow, then predict only the future. Note that horizons are specified in the recipe.
# No "test-time" data is needed
Side issue: inheritance from {tidymodels} means that we store template information about the original data frame in the `epi_recipe` S3 object. {recipes} stores the entire data. An `epi_recipe` only stores a 0-row tibble with the column names. To get this proposal to work, we would need to change to match the {recipes} behaviour and store the original data. This could potentially be large (the reason I avoided doing this before), though note that it is the original data, not the processed data. As currently implemented, certain test-time preprocessing operations that could benefit from access to the training data (smoothing, rolling averages, etc.) can potentially be buggy because they are applied only to the test-time data (`td`).
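To make the failure mode concrete, here is a minimal base-R sketch. It is not {epipredict} code: `trailing_mean()` and the toy values are made up for illustration, and it takes the extreme case where the test-time data is a single row with no preceding history.

```r
# Hypothetical helper: trailing (right-aligned) mean over `width` points.
# Returns NA wherever fewer than `width` observations are available.
trailing_mean <- function(x, width) {
  vapply(seq_along(x), function(i) {
    if (i < width) NA_real_ else mean(x[(i - width + 1):i])
  }, numeric(1))
}

train_y <- c(2, 4, 6, 8, 10) # toy training series for one geo
test_y <- 12                 # a single test-time row, playing the role of td

# Smoothing only the test-time data: there is no history, so the result is NA
smoothed_alone <- trailing_mean(test_y, width = 3)

# Smoothing with the training data attached, then keeping the test-time row
smoothed_full <- trailing_mean(c(train_y, test_y), width = 3)
smoothed_with_history <- smoothed_full[length(smoothed_full)]
```

With the history attached, the smoothed value uses the last three observations (8, 10, 12); on its own, the test-time row cannot be smoothed at all.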
Storing the training data would help here. However, {tidymodels} actually doesn't want to merge train-time and test-time data because it tries to emphasize (pedagogically?) that operations performed on train-time data should save the necessary summary statistics to be reused on test-time data. For example, centering and scaling a predictor should save the mean and sd at train time, and use those to adjust the test-time data (rather than computing the mean and sd of the test data and using those). As with most things, time series makes this complicated, and forecasts can potentially depend on all available data (rather than just "new" data). It's likely worth thinking carefully about this problem (though perhaps that's exactly what we're doing here).
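The centering-and-scaling convention can be sketched in a few lines of base R. This is only an illustration of the idea with toy data, not how {recipes} implements it internally; all object names here are made up.

```r
set.seed(1)
train_x <- rnorm(50, mean = 10, sd = 2) # toy training predictor
test_x <- c(9, 11, 13)                  # toy test-time predictor

# "Train time": estimate and store the summary statistics
train_stats <- list(mean = mean(train_x), sd = sd(train_x))

# "Test time": reuse the stored statistics instead of recomputing them
test_scaled <- (test_x - train_stats$mean) / train_stats$sd

# What the convention avoids: scaling the test data by its own statistics
test_self_scaled <- (test_x - mean(test_x)) / sd(test_x)
```

The two results generally differ, because `test_self_scaled` is forced to have mean 0 and sd 1 regardless of how the test data relate to the training distribution.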
`forecast()` would only need the workflow as an argument, though we could potentially allow an optional `additional_data` argument. These data would be added to the train-time data, with the forecast then produced after the end of the `additional_data`.
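The intended control flow might look roughly like the following self-contained toy. Everything here is an assumption for illustration: `toy_workflow()` is a hypothetical stand-in for an `epi_workflow` that stores its training data, the "model" is a plain `lm()` of the value `ahead` steps later on the current value, and the `forecast()` generic is defined locally rather than taken from any package.

```r
# Hypothetical stand-in for a workflow that stores its training data
toy_workflow <- function(data, ahead) {
  structure(list(data = data, ahead = ahead), class = "toy_workflow")
}

forecast <- function(object, ...) UseMethod("forecast")

forecast.toy_workflow <- function(object, additional_data = NULL, ...) {
  # Append any additional data to the stored training data
  train <- rbind(object$data, additional_data)
  train <- train[order(train$time_value), ]
  n <- nrow(train)
  ahead <- object$ahead

  # "Preprocess": pair each value with the value `ahead` steps later
  design <- data.frame(
    x = train$y[seq_len(n - ahead)],
    target = train$y[seq_len(n - ahead) + ahead]
  )

  # Fit, then predict from the latest available observation only
  fit <- lm(target ~ x, data = design)
  latest <- train[n, ]
  data.frame(
    forecast_date = latest$time_value,
    target_date = latest$time_value + ahead,
    .pred = unname(predict(fit, newdata = data.frame(x = latest$y)))
  )
}

tib_toy <- data.frame(time_value = 1:10, y = 1:10 + 0.5)
wf <- toy_workflow(tib_toy, ahead = 2)

p <- forecast(wf) # forecast made from the end of the stored data
p_more <- forecast(
  wf,
  additional_data = data.frame(time_value = 11:12, y = c(11.5, 12.5))
) # forecast made from the end of additional_data
```

No test-time data is passed in either call; the only difference is that the second forecast starts from the last row of `additional_data` rather than the last stored row.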