
Add retrospective gapfilling to CPR hospital admissions signals #1539

Closed
krivard opened this issue Feb 28, 2022 · 5 comments · Fixed by #1555

krivard commented Feb 28, 2022

Because the CPR includes only one (sometimes two) reference dates for each signal, and because the CPR is not published on weekends, the resulting COVIDcast signals are only available 5 days a week. See the green line in this timeseries chart:

[timeseries chart: the green line shows the CPR-derived signal, available only 5 days a week]

We should use simple interpolation to fill these gaps retrospectively (once the next file becomes available). Extrapolation on days where no new files are posted is left as a future research project.

We'll need to do something like this (a rough sketch follows the list):

  • When computing the list of CPR files to process, extend the series backwards by one file ("additional CPR file")
  • Process the CPR files into a df as usual
  • Expand the df to cover a contiguous date sequence, NA-filling the newly inserted dates
  • Impute the missing values
  • Drop the per-signal reference dates in the additional CPR file
  • Aggregate to nation, export, etc. as usual
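
A minimal sketch of the reindex-and-impute step, assuming a single geo and hypothetical "time_value"/"value" columns (the real pipeline schema may differ):

import pandas as pd

def expand_and_impute(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper, not the actual pipeline code.
    full_range = pd.date_range(df["time_value"].min(), df["time_value"].max())
    df = (
        df.set_index("time_value")
          .reindex(full_range)       # NA-fill the skipped dates
          .rename_axis("time_value")
    )
    # Impute the gaps with a time-weighted linear fit.
    df["value"] = df["value"].interpolate(method="time")
    return df.reset_index()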

Complication

The existing imputation utility in delphi_utils.smooth assumes we only ever impute a value for date X using data from dates Y < X. We will probably need to add support for some kind of symmetric mode. The expected signal data for this change looks like [x1, NA, NA, x4] -- literally, since these gaps mostly occur on weekends. x1 is from the additional CPR file, x4 is from today's CPR file, and we want to publish the two imputed values and x4. This probably means a linear or other low-degree fit that doesn't need much context to work.
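
For concreteness, here is a toy version of that weekend pattern with a symmetric linear fill, using plain pandas rather than the delphi_utils.smooth API:

import numpy as np
import pandas as pd

# [x1, NA, NA, x4]: x1 from the additional CPR file (Friday),
# x4 from today's CPR file (Monday), two weekend days missing.
s = pd.Series([3.0, np.nan, np.nan, 6.0],
              index=pd.date_range("2022-02-25", periods=4))

# A symmetric fit uses both endpoints, unlike a causal imputation
# that only looks at earlier dates.
print(s.interpolate(method="time"))
# 2022-02-25    3.0
# 2022-02-26    4.0
# 2022-02-27    5.0
# 2022-02-28    6.0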

This revision must be completed before we can begin showing county-level hospital admissions in the visualizations on the website.

krivard added the math and Priority-P0 (Must-do; lab will self-destruct without it) labels on Feb 28, 2022
dshemetov commented:

Feels like I'm missing the context for this. Is this from new code somewhere? What is CPR?


krivard commented Mar 1, 2022

Apologies for the name confusion -- CPR is the DSEW Community Profile Report, one of the new sets of signals developed as part of the omicron effort. Pipeline code is here

dshemetov commented:

So my initial thought is that getting this into the smooth util would require some involved math. A quick solution could be to use pandas.DataFrame.interpolate.

Here is what method="time" (linear interpolation) looks like on a test dataset:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic daily series with two consecutive NaN days each week,
# mimicking the weekend gaps in the CPR-derived signals.
df = pd.DataFrame({
    "time_value": pd.date_range("2022-01-01", "2022-04-01"),
    "value": 10 * np.sin([i / 3 if i % 7 >= 2 else np.nan for i in range(91)])
}).set_index("time_value")

# Raw series, gaps intact.
plt.figure()
plt.plot(df)

# Gaps filled by time-weighted linear interpolation.
plt.figure()
plt.plot(df.interpolate(method="time"))

[plot: linear (method="time") interpolation result on the test dataset]

Here is what method="cubic" looks like:
[plot: cubic interpolation result on the test dataset]
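
Presumably the cubic plot comes from the same snippet as above with the interpolation method swapped, something like:

# Same test dataframe as above; method="cubic" needs scipy installed.
plt.figure()
plt.plot(df.interpolate(method="cubic"))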

dshemetov self-assigned this on Mar 7, 2022
dshemetov commented:

This is what the plot in the OP looks like with cubic interpolation:
[plot: the timeseries from the OP with cubic interpolation]


krivard commented Mar 9, 2022

Notes from pair session with Dmitry:

Solving this will require reading in additional, already-processed files, which seems complicated. We've previously rejected "read in extra files" as a solution for the "volume and positivity are technically for different reference dates" problem, and we should consider avoiding it here as well.

Alternatives:

  • Make the frontend do interpolation
  • Make users do interpolation
  • Make the API do interpolation

Mitigating factors:

  • ArchiveDiffer will automatically drop estimates & files whose content is identical to what's already in the API
  • download_and_parse and fetch_new_reports already handle multiple files, since the "new" report mode fetches all unprocessed/unseen files from healthdata.gov, not just the most recent.

Next steps:

  • Write a function that will re-index a dsew df to include skipped days, then compute a cubic interpolation over the gaps (see the sketch after this list)
  • Write tests to make sure interpolation doesn't cross geo_id boundaries (i.e., make geo_id 1 values between 0 and 1, and geo_id 2 values between 1e6 and 1e6+1, and check that the interpolations don't go wild)
  • Write tests to make sure that the anchor values we expect to stay the same actually do stay the same, so that ArchiveDiffer will suppress them in the final output delivered to receiving
  • Add new logic to fetch_listing which extends the active file set backwards by N files, where 1 < N < 7
  • Determine a useful value for N that is not day-of-week dependent
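
A rough sketch of the per-geo re-index and cubic interpolation described above, assuming hypothetical "geo_id"/"time_value"/"value" columns (the real dsew df schema may differ):

import pandas as pd

def interpolate_missing_dates(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical sketch: fill skipped days within each geo_id separately,
    # so interpolation never crosses geo_id boundaries.
    out = []
    for geo_id, group in df.groupby("geo_id"):
        group = group.set_index("time_value").sort_index()
        full_range = pd.date_range(group.index.min(), group.index.max())
        group = group.reindex(full_range).rename_axis("time_value")
        group["geo_id"] = geo_id  # restore the key on the NA-filled rows
        # method="cubic" needs scipy and at least 4 observed points per geo.
        group["value"] = group["value"].interpolate(method="cubic")
        out.append(group.reset_index())
    return pd.concat(out, ignore_index=True)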
