[DSEW CPR] [DRAFT] Add basic interpolation function and a test stub #1555

Merged
merged 14 commits into main from ds/dsew-interpolation
Apr 18, 2022

Conversation

dshemetov
Contributor

@dshemetov dshemetov commented Mar 10, 2022

Description

Adds interpolation to most DSEW CPR signals. Partially fixes #1539. Final work will be in #1576.

Changelog

  • Add an interpolation function that fills in missing rows with reasonable defaults.
  • Add tests for the interpolation.

Fixes

@dshemetov dshemetov requested review from krivard and nmdefries March 10, 2022 00:17
@dshemetov
Contributor Author

Pushed more tests and made them clearer. Also made the interpolation default to linear: it requires only two points, has simpler behavior, and should produce consistent interpolation when datasets don't fully overlap.

Want to do a few more tests today and convince myself that the archiver will be given consistent input, week after week.
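For readers who want to see what linear gap filling looks like in practice, here is a minimal sketch, assuming pandas. It is not the function added in this PR; the series, dates, and values are made up for illustration.

```python
import pandas as pd

# Hypothetical sparse series for one geo; the dates and values are made up.
raw = pd.DataFrame(
    {"val": [0.10, 0.40, 0.50]},
    index=pd.to_datetime(["2022-03-01", "2022-03-04", "2022-03-05"]),
)

# Reindex to a daily grid so the missing days show up as NaN rows.
daily = raw.asfreq("D")

# Linear interpolation needs only the two neighbouring points of each gap.
# limit_area="inside" leaves leading/trailing NaNs alone (no extrapolation).
filled = daily.interpolate(method="linear", limit_area="inside")
print(filled)
```

Because each filled value depends only on the nearest reported point on either side, two runs that see the same two neighbours should impute the same value for the days in between.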

@dshemetov
Contributor Author

dshemetov commented Mar 24, 2022

So I ran the current indicator on two overlapping date ranges to see how the outputs would differ.

dates1: 2022-03-05 -- 2022-03-11
dates2: 2022-03-08 -- 2022-03-14

Comparing the two receiving directories, we get a few cases:

  1. Files that are only in receiving_dates1 have the date ranges [20220225 -- 20220307]
  2. Files that are only in receiving_dates2 have the date ranges [20220309 -- 20220314]
  3. Common files on the left time boundary [20220226 -- 20220301] containing a few rows present in receiving_dates1 and missing in receiving_dates2
    • this is either a) an actual deletion, or b) an interpolation that loses its value on the left of the NA, so the gap can no longer be filled
  4. Common files in the middle of the time range [20220302 -- 20220304], with 90% of rows changed. This is likely updated data. Eyeballing, most of these changes are minor (~1% changes in value).
  5. Common files on the right time boundary [20220305 -- 20220308] containing a few rows present in receiving_dates2 and missing in receiving_dates1
    • either a) an actual new value, or b) an interpolation that loses its value on the right of the NA, so the gap can't be filled

Seems like it will be easy to introduce bugs here if we aren't careful. Possible issue: the archivediffer treats missing rows in (3) and (5) as deletions and deletes valid interpolations. A possible solution is to restrict the receiving output of the indicator to a given range, such as the middle common-files range (4). Can we find that time range consistently, given the strange behavior of weekends in this indicator? Unclear. Another drawback is that this time range lags behind the most recent available data (6 days in the case above). Finally, this time range is limited to a ~6-day window, because for larger windows the indicator runs into an error from having multiple values for the same (geo, time, signal) pair. @krivard thoughts?

@krivard
Contributor

krivard commented Mar 25, 2022

Good call explicitly verifying an overlapping range. Let's be a little more rigorous -- would you complete this table for one of the hospital admissions signals? That will help us figure out whether this change is doing what we intend.

File coverage for the confirmed covid-19 admissions signal:

| branch | reports param | DSEW files read | dates in receiving |
|---|---|---|---|
| main | 2022-03-05 -- 2022-03-11 | 20220307, 20220308, 20220309, 20220310, 20220311 | 20220305 -- 20220309 |
| main | 2022-03-08 -- 2022-03-14 | 20220308, 20220309, 20220310, 20220311, 20220314 | 20220306 -- 20220309, 20220312 |
| ds/dsew-interpolation | 2022-03-05 -- 2022-03-11 | 20220307, 20220308, 20220309, 20220310, 20220311 | 20220305 -- 20220309 |
| ds/dsew-interpolation | 2022-03-08 -- 2022-03-14 | 20220308, 20220309, 20220310, 20220311, 20220314 | 20220306 -- 20220312 |

@dshemetov
Contributor Author

@krivard I updated your comment with the filled table.

@krivard
Contributor

krivard commented Mar 25, 2022

Awesome. I had some time to kill while waiting for builds to finish so I fiddled with this a bit, and it looks like (at least for hospital admissions) we're getting exactly what we want:

  • neither the main branch nor the interpolation branch makes changes to previously-computed days where reports ranges overlap
  • the interpolation branch makes no changes to days where we have source data
  • the interpolation branch generates output for days where we don't have source data

What I did:

$ for branch in main ds/dsew-interpolation; do git checkout $branch; for r in 2022-03-05--2022-03-11 2022-03-08--2022-03-14; do rcv="receiving_${branch/\//_}_$r"; mkdir $rcv ||break;  env/bin/python -m delphi_utils set indicator.reports $r common.export_dir $rcv; env/bin/python -m delphi_dsew_community_profile ||break; done; done;
$ diff -rq receiving_main_2022-03-0* |less
# ^^ looks like the old pipeline doesn't sort its output, but we can get around that
$ diff -rq receiving_main_2022-03-0* |grep differ |awk '{print "diff -u <(sort " $2 ")","<(sort " $4 ")"}' |bash |less
# ^^ no files truly differ after sorting
$ diff -rq receiving_ds_dsew-interpolation_2022-03-0* |less
# ^^ new pipeline does sort its output, so there are no files that differ
$ diff -rq receiving_*_2022-03-05--2022-03-11 |less
# ^^ no new files, as expected; differing files is probably the sorting issue
$ diff -rq receiving_*_2022-03-05--2022-03-11 |grep differ |awk '{print "diff -u <(sort " $2 ")","<(sort " $4 ")"}' |bash |less
# ^^ no files truly differ
$ diff -rq receiving_*_2022-03-08--2022-03-14 |less
# ^^ new files covering the gap in the interpolation branch; differing files is probably the sorting issue
$ diff -rq receiving_*_2022-03-08--2022-03-14 |grep differ |awk '{print "diff -u <(sort " $2 ")","<(sort " $4 ")"}' |bash |less
# ^^ no files truly differ

What needs to happen next?

@dshemetov
Contributor Author

dshemetov commented Mar 25, 2022

Oh you're right, this is good news. So what I would like to double check is which signals were having differences when I diffed the receiving_ds_dsew-interpolation_2022-03-05--2022-03-11 and receiving_ds_dsew-interpolation_2022-03-08--2022-03-14 folders yesterday. There might be some numerical issues going on.

But this would be the ideal case: every adjacent time window only generates new files and the overlapping files stay identical. If we have that behavior, we're done. If we experience "deletions" like I described above, then we have to figure out some way to avoid those.
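The check described here (adjacent windows only add files, and overlapping files stay identical) can also be scripted instead of eyeballing `diff -rq` output. A rough sketch under stated assumptions: `compare_receiving` is a hypothetical helper, not part of delphi_utils, and the demo directories and rows are made up. It sorts rows before comparing so that output ordering does not register as a change.

```python
import tempfile
from pathlib import Path


def compare_receiving(dir_a, dir_b):
    """Classify CSV files as only-in-a, only-in-b, or changed, ignoring row order."""
    files_a = {p.name: p for p in Path(dir_a).glob("*.csv")}
    files_b = {p.name: p for p in Path(dir_b).glob("*.csv")}
    only_a = sorted(files_a.keys() - files_b.keys())
    only_b = sorted(files_b.keys() - files_a.keys())
    changed = []
    for name in sorted(files_a.keys() & files_b.keys()):
        # Sort the rows so that a different output order is not a difference.
        rows_a = sorted(files_a[name].read_text().splitlines())
        rows_b = sorted(files_b[name].read_text().splitlines())
        if rows_a != rows_b:
            changed.append(name)
    return only_a, only_b, changed


# Tiny demonstration with throwaway directories and made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    a, b = Path(tmp, "run_a"), Path(tmp, "run_b")
    a.mkdir()
    b.mkdir()
    (a / "20220305_x.csv").write_text("2,0.2\n1,0.1\n")
    (b / "20220305_x.csv").write_text("1,0.1\n2,0.2\n")  # same rows, different order
    (a / "20220304_x.csv").write_text("1,0.1\n")
    (b / "20220306_x.csv").write_text("1,0.3\n")
    only_a, only_b, changed = compare_receiving(a, b)

print(only_a, only_b, changed)
```

In the ideal case described above, `changed` would be empty for every pair of adjacent windows.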

@krivard
Contributor

krivard commented Mar 28, 2022

cool, give a shout if you want a pair for your local analysis; i don't have time until Wednesday but Nat might be available sooner

@dshemetov
Contributor Author

dshemetov commented Mar 28, 2022

  • I just reran the folder comparison analysis above and the only signal with differing files is naat_pct_positive_7dav. Is there anything special about that signal?
    image

  • Here is a sample deletion from 20220228_county_covid_naat_pct_positive_7dav.csv.

    • Not sure where these are coming from. My best guess is what I wrote in my first comment: the window 20220308 -- 20220314 loses a value on the left and is thus unable to interpolate. But this doesn't explain why this is only occurring in this signal.
      image
  • Here is another example 20220302_county_covid_naat_pct_positive_7dav.csv, where the majority of the file is changed, but as you can see the changes to the values are very minor (~1%).

    • FWIW, 20220302 is a date that doesn't show up on the main branch, so all these values are interpolated. So it's likely that all these changes reflect slight updates to the data that cause slightly different interpolations.
      image

@nmdefries if you have some time and this isn't too context-less for you to jump in on.

@krivard
Contributor

krivard commented Mar 28, 2022

The only scenario when this source updates previously-published estimates is for testing signals (naats), where an initial estimate is updated exactly once ~7 days later. It is unlikely that an updated estimate is the cause of what you're seeing here.

The naats code is under revision to resolve logic errors. I wouldn't trust any of the naats results until that's been completed.

@dshemetov
Contributor Author

dshemetov commented Mar 28, 2022

So I suppose we wait until the naats logic is fixed and then rerun these comparisons?

@krivard
Contributor

krivard commented Mar 31, 2022

Okay, you're right that something is definitely horked in interpolation, and for whatever reason it's only showing up with naats.

I aspirationally merged Nat's naats fix into a spare branch, like this:

$ git checkout ds/dsew-interpolation
$ git checkout -b krivard/dsew-interpolation
$ git merge ndefries/cpr-lenient-check-ts-per-publishdate

(don't merge those changes into ds/dsew-interpolation yet, in case Ananya finds something and Nat needs to rebase. this is purely a throwaway merge)

Then I ran each branch (krivard/dsew-interpolation and ndefries/cpr-lenient-check-ts-per-publishdate) twice: once with reports set to 2022-03-09--2022-03-11 and once with reports set to 2022-03-11--2022-03-14. The smaller ranges are necessary so that we can avoid overlaps unique to the naats data.

Here are the available files in those two ranges, along with the positivity and volume reference dates they contain:

| publish date | week | positivity refdate | volume refdate |
|---|---|---|---|
| 2022-03-09 | previous | 2022-02-27 | 2022-02-23 |
| 2022-03-09 | last | 2022-03-06 | 2022-03-02 |
| 2022-03-10 | previous | 2022-02-28 | 2022-02-24 |
| 2022-03-10 | last | 2022-03-07 | 2022-03-03 |
| 2022-03-11 | previous | 2022-03-01 | 2022-02-25 |
| 2022-03-11 | last | 2022-03-08 | 2022-03-04 |
| 2022-03-14 | previous | 2022-03-04 | 2022-02-28 |
| 2022-03-14 | last | 2022-03-11 | 2022-03-07 |

Sorted by positivity refdate (which is the refdate used for output) we get:

| publish date | positivity refdate | volume refdate |
|---|---|---|
| 2022-03-09 | 2022-02-27 | 2022-02-23 |
| 2022-03-10 | 2022-02-28 | 2022-02-24 |
| 2022-03-11 | 2022-03-01 | 2022-02-25 |
| 2022-03-14 | 2022-03-04 | 2022-02-28 |
| 2022-03-09 | 2022-03-06 | 2022-03-02 |
| 2022-03-10 | 2022-03-07 | 2022-03-03 |
| 2022-03-11 | 2022-03-08 | 2022-03-04 |
| 2022-03-14 | 2022-03-11 | 2022-03-07 |

There are no backfill entries across these two ranges: no reference date is reissued with updated data.
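The "no backfill" claim can be checked mechanically. A small sketch, assuming pandas, with the (publish date, positivity refdate) pairs copied from the table above; the column names are only illustrative.

```python
import pandas as pd

# (publish date, positivity refdate) pairs copied from the table above.
files = pd.DataFrame(
    [
        ("2022-03-09", "2022-02-27"),
        ("2022-03-10", "2022-02-28"),
        ("2022-03-11", "2022-03-01"),
        ("2022-03-14", "2022-03-04"),
        ("2022-03-09", "2022-03-06"),
        ("2022-03-10", "2022-03-07"),
        ("2022-03-11", "2022-03-08"),
        ("2022-03-14", "2022-03-11"),
    ],
    columns=["publish_date", "positivity_refdate"],
)

# A backfill would show up as one reference date carried by more than one publish date.
issues_per_refdate = files.groupby("positivity_refdate")["publish_date"].nunique()
backfilled = issues_per_refdate[issues_per_refdate > 1]
print(backfilled.empty)
```

An empty `backfilled` series means every reference date is issued by exactly one file in these two ranges.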

Between the two ranges (9-11 and 11-14) there is one overlapping file, 2022-03-11. If the code were working correctly, the output from that file (refdates 2022-03-01 and 2022-03-08) should match between the two runs. Right?

The output from ndefries/cpr-lenient-check-ts-per-publishdate matches, so I'm pretty sure we're not dealing with a logic bug in the naats handling per se.

The output from krivard/dsew-interpolation does not match. Sample diff (sorted; header manually edited for readability):

--- receiving_krivard_dsew-interpolation_2022-03-09--2022-03-11/20220301_county_covid_naat_pct_positive_7dav.csv 2022-03-31 17:46:26.517258105 -0400
+++ receiving_krivard_dsew-interpolation_2022-03-11--2022-03-14/20220301_county_covid_naat_pct_positive_7dav.csv 2022-03-31 17:46:26.517258105 -0400
@@ -347,7 +347,6 @@
 13055,0.046,NA,26.285714285714285
 13057,0.045,NA,383.2857142857143
 13059,0.042,NA,171.14285714285714
-13061,0.15,NA,15.535714285714285
 13063,0.013,NA,2531.8571428571427
 13067,0.038,NA,1430.4285714285713
 13069,0.051,NA,34.0
@@ -450,7 +449,6 @@
 13299,0.078,NA,27.857142857142858
 13303,0.079,NA,10.857142857142858
 13305,0.057,NA,23.0
-13309,0.0578571,NA,6.6938775510204085
 13311,0.068,NA,42.0
 13313,0.03,NA,264.0
 13317,0.0,NA,6.142857142857143
@@ -752,7 +750,6 @@
 20173,0.044,NA,686.4285714285714
 20175,0.047,NA,10.0
 20177,0.052,NA,382.2857142857143
-20181,0.0335,NA,6.0476190476190474
 20189,0.032,NA,6.571428571428571
 20191,0.071,NA,19.714285714285715
 20193,0.234,NA,6.0
[...]

Hopefully that gives you a solid place to start looking for the cause?

@dshemetov
Contributor Author

dshemetov commented Apr 1, 2022

Hm, ok. So the mismatch is still a) unique to interpolation and b) unique to the naats signal.

So just to make sure I have the terminology down: publish_date is the date on the file, positivity_refdate is the date in the positivity data, and volume_refdate is the date in the sample size data? Can you say more about how the positivity and volume data are used? Do we compute ratios for any signals by matching positivity_refdate and volume_refdate across files?

Just looking at the diffs, there are two types: a) a geo gets deleted entirely, or b) there are very small (~1%) differences in the values. I'm guessing these come from two different types of interpolation issues.

@krivard
Contributor

krivard commented Apr 1, 2022

Correct on publish date, positivity refdate, and volume refdate.

Consult #1562 and/or Nat for how the two refdates are used -- the TL;DR is that we match volume to positivity only within each file, and only within each week ("last week" or "previous week").
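To make the TL;DR concrete, here is a toy illustration of matching volume to positivity only within the same file and the same week, assuming pandas. This is not the algorithm from #1562; the frames and column names are hypothetical, and the dates are taken from the first four rows of the refdate table earlier in this thread.

```python
import pandas as pd

# One row per (file, week) for each quantity; dates copied from the refdate table.
positivity = pd.DataFrame(
    [
        ("2022-03-09", "previous", "2022-02-27"),
        ("2022-03-09", "last", "2022-03-06"),
        ("2022-03-10", "previous", "2022-02-28"),
        ("2022-03-10", "last", "2022-03-07"),
    ],
    columns=["publish_date", "week", "positivity_refdate"],
)
volume = pd.DataFrame(
    [
        ("2022-03-09", "previous", "2022-02-23"),
        ("2022-03-09", "last", "2022-03-02"),
        ("2022-03-10", "previous", "2022-02-24"),
        ("2022-03-10", "last", "2022-03-03"),
    ],
    columns=["publish_date", "week", "volume_refdate"],
)

# Join only on (publish_date, week): never across files, never across weeks.
matched = positivity.merge(volume, on=["publish_date", "week"], validate="one_to_one")
print(matched)
```

Note that the two refdates in a matched row differ by a few days; they are paired by file and week, not by date.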

@nmdefries
Contributor

nmdefries commented Apr 1, 2022

I haven't been following this super closely, but is the missing geo perhaps related to this?

@dshemetov
Contributor Author

@nmdefries shouldn't be, since we are comparing two receiving outputs from the same branch using the same geomapper source. The only thing changing in our comparisons is the time window over which the indicator aggregates input files.

@dshemetov
Contributor Author

dshemetov commented Apr 6, 2022

TL;DR: I found two concrete examples of the ways the deletions and updates occurred. It's basically what was expected:

  1. one run's data covers [a, b] and the other's covers [b, c], so the two interpolations don't cover the same dates
  2. the two runs contain interleaved data, so the interpolation imputes different values

Case 1

Let's look at the county "13061" for the naat positivity signal. Here is what the raw data looks like:

# 2022-03-09--2022-03-11
            val        se  sample_size
timestamp                             
2022-02-27  0.2  0.100905    15.714286
2022-03-07  0.0  0.000000    15.000000
2022-03-08  0.0  0.000000    15.285714

# 2022-03-11--2022-03-14
            val   se  sample_size
timestamp                        
2022-03-08  0.0  0.0    15.285714
2022-03-11  0.0  0.0    15.285714

When interpolated, we get the following:

# 2022-03-09--2022-03-11
              val        se  sample_size
timestamp                               
2022-02-27  0.200  0.100905    15.714286
2022-02-28  0.175  0.088292    15.625000
2022-03-01  0.150  0.075679    15.535714
2022-03-02  0.125  0.063066    15.446429
2022-03-03  0.100  0.050452    15.357143
2022-03-04  0.075  0.037839    15.267857
2022-03-05  0.050  0.025226    15.178571
2022-03-06  0.025  0.012613    15.089286
2022-03-07  0.000  0.000000    15.000000
2022-03-08  0.000  0.000000    15.285714

# 2022-03-11--2022-03-14
            val   se  sample_size
timestamp                        
2022-03-08  0.0  0.0    15.285714
2022-03-09  0.0  0.0    15.285714
2022-03-10  0.0  0.0    15.285714
2022-03-11  0.0  0.0    15.285714

This is a simple case: the only overlap for the data is the date 2022-03-08, where the values are the same. Clearly the rest of the values won't show up in receiving and will be viewed as deletions for this geo.
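Case 1 can be reproduced with a few lines of pandas. This is a sketch rather than the indicator's own interpolation code: `interpolate_daily` is a hypothetical helper, and only the `val` column for county 13061 is copied from the tables above.

```python
import pandas as pd


def interpolate_daily(df):
    """Fill missing days between the first and last reported date linearly."""
    return df.asfreq("D").interpolate(method="linear", limit_area="inside")


# Raw values for county 13061, copied from the tables above (val column only).
run_a = pd.DataFrame(
    {"val": [0.2, 0.0, 0.0]},
    index=pd.to_datetime(["2022-02-27", "2022-03-07", "2022-03-08"]),
)
run_b = pd.DataFrame(
    {"val": [0.0, 0.0]},
    index=pd.to_datetime(["2022-03-08", "2022-03-11"]),
)

filled_a = interpolate_daily(run_a)
filled_b = interpolate_daily(run_b)

# Dates this geo has in the first run but not in the second.
missing_in_b = filled_a.index.difference(filled_b.index)
print(missing_in_b.min().date(), missing_in_b.max().date(), len(missing_in_b))
```

The interpolated value for 2022-03-01 in the first run is the 0.15 that appears as a removed row for 13061 in the sample diff earlier in the thread.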

Case 2

Here is another geo "13059", with a less trivial case:

# 2022-03-09--2022-03-11
              val        se  sample_size
timestamp                               
2022-02-27  0.050  0.015800   190.285714
2022-02-28  0.045  0.015458   179.857143
2022-03-01  0.042  0.015333   171.142857
2022-03-06  0.015  0.008446   207.142857
2022-03-07  0.011  0.006831   233.142857
2022-03-08  0.011  0.006637   247.000000

# 2022-03-11--2022-03-14
              val        se  sample_size
timestamp                               
2022-03-01  0.042  0.015333   171.142857
2022-03-04  0.021  0.010839   175.000000
2022-03-08  0.011  0.006637   247.000000
2022-03-11  0.022  0.009534   236.714286

When interpolated, we get:

# 2022-03-09--2022-03-11
               val        se  sample_size
timestamp                                
2022-02-27  0.0500  0.015800   190.285714
2022-02-28  0.0450  0.015458   179.857143
2022-03-01  0.0420  0.015333   171.142857
2022-03-02  0.0366  0.013956   178.342857
2022-03-03  0.0312  0.012578   185.542857
2022-03-04  0.0258  0.011201   192.742857
2022-03-05  0.0204  0.009823   199.942857
2022-03-06  0.0150  0.008446   207.142857
2022-03-07  0.0110  0.006831   233.142857
2022-03-08  0.0110  0.006637   247.000000

# 2022-03-11--2022-03-14
                 val        se  sample_size
timestamp                                  
2022-03-01  0.042000  0.015333   171.142857
2022-03-02  0.035000  0.013835   172.428571
2022-03-03  0.028000  0.012337   173.714286
2022-03-04  0.021000  0.010839   175.000000
2022-03-05  0.018500  0.009788   193.000000
2022-03-06  0.016000  0.008738   211.000000
2022-03-07  0.013500  0.007687   229.000000
2022-03-08  0.011000  0.006637   247.000000
2022-03-09  0.014667  0.007602   243.571429
2022-03-10  0.018333  0.008568   240.142857
2022-03-11  0.022000  0.009534   236.714286

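Here the data from the two date ranges is interleaved: both have 2022-03-01 and 2022-03-08, but only the first has 2022-03-06 and 2022-03-07, and only the second has 2022-03-04. This naturally causes the interpolated values on 2022-03-02, 2022-03-03, and 2022-03-05 to differ between the runs, and on 2022-03-04, 2022-03-06, and 2022-03-07 a reported value in one run sits against an interpolated value in the other.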
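The same comparison for geo 13059 can be sketched in pandas. Again this is illustrative, not the indicator code: `interpolate_daily` is a hypothetical helper and only the `val` column from the tables above is used. The extra columns flag whether each differing date was actually reported in a given run.

```python
import pandas as pd


def interpolate_daily(s):
    """Fill missing days between the first and last reported date linearly."""
    return s.asfreq("D").interpolate(method="linear", limit_area="inside")


# Raw val column for county 13059, copied from the tables above.
run_a = pd.Series(
    [0.050, 0.045, 0.042, 0.015, 0.011, 0.011],
    index=pd.to_datetime(
        ["2022-02-27", "2022-02-28", "2022-03-01", "2022-03-06", "2022-03-07", "2022-03-08"]
    ),
)
run_b = pd.Series(
    [0.042, 0.021, 0.011, 0.022],
    index=pd.to_datetime(["2022-03-01", "2022-03-04", "2022-03-08", "2022-03-11"]),
)

# Keep only the dates both runs cover after interpolation.
both = pd.concat(
    {"run_a": interpolate_daily(run_a), "run_b": interpolate_daily(run_b)},
    axis=1,
    join="inner",
)

# Dates where the two runs disagree, with a flag for reported vs. interpolated.
differs = both[(both["run_a"] - both["run_b"]).abs() > 1e-9]
differs = differs.assign(
    reported_in_a=differs.index.isin(run_a.index),
    reported_in_b=differs.index.isin(run_b.index),
)
print(differs)
```

Rows where exactly one of the flags is true are the ones where a reported value in one run is compared against an interpolated value in the other.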

Conclusions

  • The first case seems like an unfortunate case of inconsistent updates across geos.
  • The second case seems like a legitimate data update and it seems reasonable to add values in that case.

Ideally we could get the archiver to ignore the first case and not delete values. I'm not sure how we can make that happen without adding even more complex logic here.

@krivard
Contributor

krivard commented Apr 7, 2022

Case 1

> This is a simple case: the only overlap for the data is the date 2022-03-08, where the values are the same. Clearly the rest of the values won't show up in receiving and will be viewed as deletions for this geo.

A deletion is only detected when a file is present in both places, but a row is missing -- if the rest of the values are in files that aren't output, we shouldn't get any deletions. In this particular case, the 11-14 run has output coverage over reference dates from 3/01 through 3/11 via other regions, so we will get deletions for 3/01-3/07, but not for 2/27-2/28.
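A toy model of that rule, to make the 3/01-3/07 vs. 2/27-2/28 distinction concrete. This is not the archiver's implementation: `find_deletions` and `days` are hypothetical helpers, files are keyed by date only, and geo 13059 stands in for the "other regions" that give the second run its coverage.

```python
from datetime import date, timedelta


def days(start, end):
    """Inclusive list of YYYYMMDD strings from start to end."""
    n = (end - start).days
    return [(start + timedelta(days=d)).strftime("%Y%m%d") for d in range(n + 1)]


def find_deletions(old_run, new_run):
    """Return {file: geos} for rows in old_run that are absent from the same file in new_run.

    Files missing entirely from new_run are skipped: a deletion is only
    detected when the file exists in both places.
    """
    deletions = {}
    for name, old_geos in old_run.items():
        if name not in new_run:
            continue
        gone = old_geos - new_run[name]
        if gone:
            deletions[name] = gone
    return deletions


# Run 9-11: both geos have rows for every reference date 2/27 through 3/08.
old_run = {d: {"13059", "13061"} for d in days(date(2022, 2, 27), date(2022, 3, 8))}
# Run 11-14: 13059 covers 3/01 through 3/11, but 13061 only covers 3/08 through 3/11.
new_run = {d: {"13059"} for d in days(date(2022, 3, 1), date(2022, 3, 11))}
for d in days(date(2022, 3, 8), date(2022, 3, 11)):
    new_run[d].add("13061")

deletions = find_deletions(old_run, new_run)
print(sorted(deletions))
```

The 2/27 and 2/28 files are never flagged because the second run does not output them at all.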

Case 2

I agree with your reasoning that the interpolator is working as intended here, but I don't like the idea of replacing the report-confirmed values for 3/06 and 3/07 with interpolated ones. This probably means extending the date range we load in each run, but that's going to get tricky with the naats matching algorithm Nat developed. If we can get it to work though, that will probably also take care of the mistaken deletions problem from Case 1.

What next

  • write up the problem for Ryan, since this is more complicated than we anticipated
  • get you, me, Nat, and maybe Ananya in a meeting to figure out how to get interpolation to activate only for reference dates where we have no data, while maintaining our ability to match total naats and naats positivity by week

@dshemetov
Contributor Author

Case 1: good clarification. The deletions come from the inconsistency in reporting across geos for overlapping dates.
Case 2: while extending the range will solve the issue of replacing a known value with an interpolation, I am skeptical that it will solve the problem of case 1. I expect that there will be data files with publish dates on the boundaries of the range that will present inconsistent reference dates across geos and lead to the same artificial deletion phenomenon.

The archiver limits us because it was designed to handle indicators that output either full date ranges or partial date ranges on each run, but it was not designed to handle partial geo reports.

@krivard
Contributor

krivard commented Apr 7, 2022

> I expect that there will be data files with publish dates on the boundaries of the range that will present inconsistent reference dates across geos and lead to the same artificial deletion phenomenon.

If we ran with reports: "all" each day, we wouldn't have boundaries 😄

...but the running time would slowly devour the world, so that's probably not the solution we want long-term

@dshemetov
Contributor Author

Yeaaaa, I don't think that's feasible even now - opening and interpolating just 7 files takes ~5 minutes on my machine, so if the full history is 300+ files, it would take hours.

@dshemetov
Contributor Author

@krivard I think this is ready for a final review, once I figure out what's failing with the build.

@krivard
Contributor

krivard commented Apr 15, 2022

pretty sure it's just the linter

@dshemetov
Contributor Author

@krivard the linter is happy! i ran this with a modification of your loop:

for branch in ds/dsew-interpolation; do git checkout $branch; for r in 2022-03-05--2022-03-11 2022-03-08--2022-03-14; do rcv="receiving_${branch/\//_}_$r"; mkdir $rcv ||break;  env/bin/python -m delphi_utils set indicator.reports $r common.export_dir $rcv; env/bin/python -m delphi_dsew_community_profile ||break; done; done;

and found that there were no changed files, only new and removed in the diff.

maybe to be safe i should try this on even more date windows...

@dshemetov
Contributor Author

Alright, I started running this

date_ranges=()
for i in {1..60}; do
    date_ranges[i]=$(date -I -d "2022-01-01 +$i days")--$(date -I -d "2022-01-01 +$(expr $i + 5) days")
done

for branch in ds/dsew-interpolation; do git checkout $branch; for r in "${date_ranges[@]}"; do rcv="receiving_${branch/\//_}_$r"; mkdir $rcv ||break;  env/bin/python -m delphi_utils set indicator.reports $r common.export_dir $rcv; env/bin/python -m delphi_dsew_community_profile ||break; done; done;

and was going to diff all consecutive folders just to make sure, but I already found something strange.

Setting reports to "2022-01-02--2022-01-06" breaks because the 2021-12-30 publish date only has 1 reference value, but add_max_ts_col expects 2. The strange thing is that the 2021-12-30 publish date file has 2 reference values and setting reports to "2022-01-02--2022-01-05" confirms this, since it runs just fine. My guess is that extending to 2022-01-06 somehow filters out reference values from the 2021-12-30 file.
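For anyone without GNU `date`, the sliding windows built by the shell loop above can be generated in Python as well. A small sketch that mirrors that loop (start date 2022-01-01, offsets 1 through 60, each window ending 5 days after it starts); `consecutive_pairs` is just an illustrative name for the folder pairs one would diff.

```python
from datetime import date, timedelta

start = date(2022, 1, 1)

# One "first--last" reports string per offset, matching the shell loop above.
date_ranges = [
    f"{start + timedelta(days=i)}--{start + timedelta(days=i + 5)}"
    for i in range(1, 61)
]

# Adjacent windows, i.e. the pairs of receiving folders to diff against each other.
consecutive_pairs = list(zip(date_ranges, date_ranges[1:]))

print(date_ranges[0], date_ranges[-1], len(consecutive_pairs))
```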

@dshemetov
Contributor Author

dshemetov commented Apr 15, 2022

I think I see what happened. The 2021-12-30 file has two reference dates: 2021-12-27 and 2021-12-20. The 2022-01-06 file also has the 2021-12-27 reference date, so when we hit this line of code, the 2022-01-06 version overwrites the 2021-12-30 one and leaves it with a single reference date.

latest_sig_df = pd.concat(
            lst
        ).groupby(
            "timestamp"
        ).apply(
            lambda x: x[x["publish_date"] == x["publish_date"].max()]
        ).drop_duplicates(
        )

Aaand it turns out that Nat already fixed this issue and I just haven't merged from main into this branch in a while.
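A toy reproduction of that overwrite, assuming pandas. It uses a `transform("max")` filter that is equivalent in spirit to the snippet above, not the repository code itself. The 2021-12-20 and 2021-12-27 reference dates come from the description above; the 2022-01-03 reference date and all values are made up.

```python
import pandas as pd

rows = pd.DataFrame(
    [
        ("2021-12-30", "2021-12-20", 1.0),
        ("2021-12-30", "2021-12-27", 2.0),
        ("2022-01-06", "2021-12-27", 2.5),  # same reference date, newer file
        ("2022-01-06", "2022-01-03", 3.0),  # hypothetical second reference date
    ],
    columns=["publish_date", "timestamp", "val"],
)
rows["publish_date"] = pd.to_datetime(rows["publish_date"])
rows["timestamp"] = pd.to_datetime(rows["timestamp"])

# Keep, for each reference date, only the rows from the latest publish date.
latest_publish = rows.groupby("timestamp")["publish_date"].transform("max")
latest = rows[rows["publish_date"] == latest_publish]

# How many reference dates each file still contributes after the filter.
refdates_per_file = latest.groupby("publish_date")["timestamp"].nunique()
print(refdates_per_file)
```

After the filter, the 2021-12-30 file contributes a single reference date, which is the situation that trips a step expecting two per publish date.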

Contributor

@krivard krivard left a comment

👍

@krivard krivard merged commit a38fcf8 into main Apr 18, 2022
@krivard krivard deleted the ds/dsew-interpolation branch April 18, 2022 16:17
Development

Successfully merging this pull request may close these issues.

Add retrospective gapfilling to CPR hospital admissions signals
3 participants