Fix problem with covid_hosp skipping state revisions. #1064
Prevents future recurrence of the HHS state hospital admissions outage active 2023-01-12.
Prerequisites:
dev
branchdev
Summary
In January 2023, we noticed that the HHS hospital admissions data had unusually high lag. On investigation, we found that timeseries datasets had not been fetched in over a week, even though new timeseries revisions were available from healthdata.gov. The cause was that successful imports of daily revisions (which in Jan 2023 had 7 days of lag, while timeseries revisions had only 1-2) were masking the new timeseries revisions from the acquisition system.
This PR changes the usage of the dataset_name column in the covid_hosp_meta table from containing the data table name (for which covid_hosp_state_timeseries is shared by both the timeseries and daily revisions pipelines) to containing the healthdata.gov dataset ID (which is unique). This lets the acquisition system check for the last known timeseries file when pulling timeseries revisions, and the last known daily file when pulling daily revisions.
This PR also changes how we track metadata. Previously, each run of the pipeline collected together all revisions posted to healthdata.gov on a particular day and recorded only one line in metadata for the whole batch, which prevented us humans from having any idea whether a particular file from healthdata.gov was actually ingested by the acquisition system. The proposed change records a line in metadata for each file from healthdata.gov that is included in the batch.
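The following is a minimal sketch of the two changes described above, not the actual acquisition code. It assumes the acquisition system decides what to fetch by comparing publication dates against the latest row recorded under a given dataset_name; the dataset IDs, dates, file names, and helper functions are all hypothetical and chosen only for illustration.

```python
from datetime import date

# Hypothetical identifiers, for illustration only.
SHARED_TABLE = "covid_hosp_state_timeseries"
TIMESERIES_ID = "example-timeseries-id"
DAILY_ID = "example-daily-id"

def last_known(meta_rows, dataset_name):
    """Latest publication date recorded under dataset_name, or None."""
    dates = [row["publication_date"] for row in meta_rows
             if row["dataset_name"] == dataset_name]
    return max(dates, default=None)

def unseen(revisions, last):
    """Revisions published strictly after the last known publication date."""
    return [r for r in revisions
            if last is None or r["publication_date"] > last]

def meta_rows_for_batch(dataset_id, revisions):
    """One metadata row per ingested file, rather than one row per batch."""
    return [{"dataset_name": dataset_id,
             "publication_date": r["publication_date"],
             "revision_timestamp": r["url"]} for r in revisions]

# Hypothetical timeseries revisions available upstream.
timeseries_available = [
    {"publication_date": date(2023, 1, 5), "url": "ts-file-a"},
    {"publication_date": date(2023, 1, 9), "url": "ts-file-b"},
]

# Old scheme: both pipelines record the shared table name, so a recent
# daily import hides the older-dated timeseries revisions.
old_meta = [
    {"dataset_name": SHARED_TABLE, "publication_date": date(2023, 1, 2)},   # timeseries
    {"dataset_name": SHARED_TABLE, "publication_date": date(2023, 1, 10)},  # daily
]
masked = unseen(timeseries_available, last_known(old_meta, SHARED_TABLE))

# New scheme: each pipeline records its own dataset ID.
new_meta = [
    {"dataset_name": TIMESERIES_ID, "publication_date": date(2023, 1, 2)},
    {"dataset_name": DAILY_ID, "publication_date": date(2023, 1, 10)},
]
found = unseen(timeseries_available, last_known(new_meta, TIMESERIES_ID))
rows = meta_rows_for_batch(TIMESERIES_ID, found)

print(len(masked), len(found), len(rows))  # prints "0 2 2"
```

With the shared key, the timeseries lookup sees the daily pipeline's most recent date and finds nothing new; with per-dataset keys it finds both revisions and writes one metadata row for each.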
This PR includes a migration to run after deploy and before the next acquisition, which will update the covid_hosp_meta table to tag state rows with their healthdata.gov ID based on the name of the revision file stored in the revision_timestamp column.
Things this PR DOES NOT include:
Changing the older_than inequality to permit pulling revisions from the current day: we exclude current-day revisions on purpose to avoid a scenario where the initial data reported for an issue turns out to be incomplete and must be updated later, i.e., needing to version our versions in addition to versioning the reference dates. Essentially, we wait until we're sure the issue from healthdata.gov is complete before ingesting it.
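To illustrate the behavior being kept, here is a small sketch of a strict upper bound on issue dates. It assumes the window check is a strict inequality on both sides and that older_than is set to the current day; the function name and dates are hypothetical and not taken from the codebase.

```python
from datetime import date

def in_fetch_window(issue_date, newer_than, older_than):
    """True if issue_date lies strictly between the two bounds.

    The strict upper bound is what excludes current-day revisions
    when older_than is today's date (an assumption of this sketch).
    """
    return newer_than < issue_date < older_than

today = date(2023, 1, 12)           # hypothetical run date
last_ingested = date(2023, 1, 9)    # hypothetical last known issue

yesterday_ok = in_fetch_window(date(2023, 1, 11), last_ingested, today)
today_ok = in_fetch_window(today, last_ingested, today)
```

Here yesterday's issue is eligible while today's is deferred to the next run, by which time the upstream file is expected to be complete.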