[DSEW CPR] [DRAFT] Add basic interpolation function and a test stub #1555
Pushed more and clearer tests. Also made the interpolation default to linear: it requires only two points, has simpler behavior, and should produce consistent interpolation when datasets don't fully overlap. I want to run a few more tests today and convince myself that the archiver will be given consistent input, week after week. |
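For readers following along, here is a minimal sketch of what linear interpolation over a daily index looks like in pandas. The helper name `interpolate_daily` and the single-column frame are assumptions for illustration, not the function added in this PR; the two observed values are taken from the county "13061" example later in this thread.

```python
import pandas as pd


def interpolate_daily(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fill missing days between the first and last observation."""
    # Build the full daily range spanned by the observed timestamps.
    full_index = pd.date_range(df.index.min(), df.index.max(), freq="D")
    # Reindexing inserts NaN rows for missing days; linear interpolation fills them.
    return df.reindex(full_index).interpolate(method="linear")


# Two observed points are enough for linear interpolation.
observed = pd.DataFrame(
    {"val": [0.2, 0.0]},
    index=pd.to_datetime(["2022-02-27", "2022-03-07"]),
)
filled = interpolate_daily(observed)
print(filled)
```

Because the reindexed rows are evenly spaced by day, `method="linear"` gives the same result as time-weighted interpolation here.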
So I ran the current indicator on two overlapping date ranges to see how the outputs would differ. dates1: 2022-03-05 -- 2022-03-11 Comparing the two receiving directories, we get a few cases:
Seems like it will be easy to introduce bugs here if we aren't careful. Possible issue: the archivediffer treats missing rows in (3) and (5) as deletions and deletes valid interpolations. A possible solution is to restrict the receiving output of the indicator to a given range, such as the middle common-files range (4). Can we find that time range consistently, given the strange behavior of weekends in this indicator? Unclear. Another drawback is that this time range lags behind the most recent available data (by 6 days in the case above). Finally, this time range is limited to a ~6-day window, because for larger windows the indicator errors out on multiple values for the same (geo, time, signal) pair. @krivard thoughts? |
Good call explicitly verifying an overlapping range. Let's be a little more rigorous -- would you complete this table for one of the hospital admissions signals? That will help us figure out whether this change is doing what we intend. File coverage for the
|
@krivard I updated your comment with the filled table. |
Awesome. I had some time to kill while waiting for builds to finish so I fiddled with this a bit, and it looks like (at least for hospital admissions) we're getting exactly what we want:
What I did:
What needs to happen next? |
Oh you're right, this is good news. So what I would like to double check is which signals were having differences when I diffed the But this would be the ideal case: every adjacent time window only generates new files and the overlapping files stay identical. If we have that behavior, we're done. If we experience "deletions" like I described above, then we have to figure out some way to avoid those. |
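One way to check the ideal case described above (adjacent windows only add files, and overlapping files stay identical) is sketched below. It assumes each run writes its CSVs into its own receiving directory; the function name `compare_receiving` and the directory layout are assumptions, not part of the indicator.

```python
import filecmp
from pathlib import Path


def compare_receiving(dir_a: str, dir_b: str) -> dict:
    """Hypothetical helper: classify CSV files across two receiving directories."""
    names_a = {p.name for p in Path(dir_a).glob("*.csv")}
    names_b = {p.name for p in Path(dir_b).glob("*.csv")}
    common = sorted(names_a & names_b)
    # shallow=False forces a byte-for-byte comparison instead of trusting os.stat.
    _, changed, errors = filecmp.cmpfiles(dir_a, dir_b, common, shallow=False)
    return {
        "only_in_a": sorted(names_a - names_b),
        "only_in_b": sorted(names_b - names_a),
        # Files that could not be compared are reported alongside changed ones.
        "changed": sorted(changed + errors),
    }
```

If `changed` comes back empty for every pair of adjacent windows, the overlapping files are byte-identical across runs.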
cool, give a shout if you want a pair for your local analysis; i don't have time until Wednesday but Nat might be available sooner |
@nmdefries if you have some time and this isn't too context-less for you to jump in on. |
The only scenario when this source updates previously-published estimates is for testing signals (naats), where an initial estimate is updated exactly once ~7 days later. It is unlikely that an updated estimate is the cause of what you're seeing here. The naats code is under revision to resolve logic errors. I wouldn't trust any of the naats results until that's been completed. |
So I suppose we wait until the naats logic is fixed and then rerun these comparisons? |
Okay, you're right that something is definitely horked in interpolation, and for whatever reason it's only showing up with naats. I aspirationally merged Nat's naats fix into a spare branch, like this:
(don't merge those changes into Then I ran each branch ( Here are the available files in those two ranges, along with the positivity and volume reference dates they contain:
Sorted by positivity refdate (which is the refdate used for output) we get:
There are no backfill entries across these two ranges: no reference date is reissued with updated data. Between the two ranges (9-11 and 11-14) there is one overlapping file, 2022-03-11. If the code were working correctly, the output from that file (refdates 2022-03-01 and 2022-03-08) should match between the two runs. Right? The output from The output from
Hopefully that gives you a solid place to start looking for the cause? |
Hm, ok. So the mismatch is still: a) unique to interpolation, b) unique to the naats signal. So just to make sure I have the terminology down: publish_date is the date on the file, positivity_refdate is the date in the positivity data, and volume_refdate is the date in the sample size data? Can you say more about how the positivity and volume data are used? Do we compute ratios for any signals by matching positivity_refdate and volume_refdate across files? Just looking at the diffs, there are two types: a) a geo goes missing entirely and gets deleted, or b) the values differ by very small amounts (~1%). I'm guessing these come from two different types of interpolation issues. |
Correct on publish date, positivity refdate, and volume refdate. Consult #1562 and/or Nat for how the two refdates are used -- the TL;DR is that we match volume to positivity only within each file, and only within each week ("last week" or "previous week"). |
I haven't been following this super closely, but is the missing geo perhaps related to this? |
@nmdefries shouldn't be, since we are comparing two |
TL;DR: I found two concrete examples of the ways the deletions and updates occurred. It's basically what was expected:
Case 1
Let's look at the county "13061" for the

# 2022-03-09--2022-03-11
            val        se  sample_size
timestamp
2022-02-27  0.2  0.100905    15.714286
2022-03-07  0.0  0.000000    15.000000
2022-03-08  0.0  0.000000    15.285714

# 2022-03-11--2022-03-14
            val   se  sample_size
timestamp
2022-03-08  0.0  0.0    15.285714
2022-03-11  0.0  0.0    15.285714

When interpolated, we get the following

# 2022-03-09--2022-03-11
              val        se  sample_size
timestamp
2022-02-27  0.200  0.100905    15.714286
2022-02-28  0.175  0.088292    15.625000
2022-03-01  0.150  0.075679    15.535714
2022-03-02  0.125  0.063066    15.446429
2022-03-03  0.100  0.050452    15.357143
2022-03-04  0.075  0.037839    15.267857
2022-03-05  0.050  0.025226    15.178571
2022-03-06  0.025  0.012613    15.089286
2022-03-07  0.000  0.000000    15.000000
2022-03-08  0.000  0.000000    15.285714

# 2022-03-11--2022-03-14
            val   se  sample_size
timestamp
2022-03-08  0.0  0.0    15.285714
2022-03-09  0.0  0.0    15.285714
2022-03-10  0.0  0.0    15.285714
2022-03-11  0.0  0.0    15.285714

This is a simple case: the only overlap for the data is the date 2022-03-08, where the values are the same. Clearly the rest of the values won't show up in receiving and will be viewed as deletions for this geo.

Case 2
Here is another geo "13059", with a less trivial case:

# 2022-03-09--2022-03-11
              val        se  sample_size
timestamp
2022-02-27  0.050  0.015800   190.285714
2022-02-28  0.045  0.015458   179.857143
2022-03-01  0.042  0.015333   171.142857
2022-03-06  0.015  0.008446   207.142857
2022-03-07  0.011  0.006831   233.142857
2022-03-08  0.011  0.006637   247.000000

# 2022-03-11--2022-03-14
              val        se  sample_size
timestamp
2022-03-01  0.042  0.015333   171.142857
2022-03-04  0.021  0.010839   175.000000
2022-03-08  0.011  0.006637   247.000000
2022-03-11  0.022  0.009534   236.714286

When interpolated, we get

# 2022-03-09--2022-03-11
               val        se  sample_size
timestamp
2022-02-27  0.0500  0.015800   190.285714
2022-02-28  0.0450  0.015458   179.857143
2022-03-01  0.0420  0.015333   171.142857
2022-03-02  0.0366  0.013956   178.342857
2022-03-03  0.0312  0.012578   185.542857
2022-03-04  0.0258  0.011201   192.742857
2022-03-05  0.0204  0.009823   199.942857
2022-03-06  0.0150  0.008446   207.142857
2022-03-07  0.0110  0.006831   233.142857
2022-03-08  0.0110  0.006637   247.000000

# 2022-03-11--2022-03-14
                 val        se  sample_size
timestamp
2022-03-01  0.042000  0.015333   171.142857
2022-03-02  0.035000  0.013835   172.428571
2022-03-03  0.028000  0.012337   173.714286
2022-03-04  0.021000  0.010839   175.000000
2022-03-05  0.018500  0.009788   193.000000
2022-03-06  0.016000  0.008738   211.000000
2022-03-07  0.013500  0.007687   229.000000
2022-03-08  0.011000  0.006637   247.000000
2022-03-09  0.014667  0.007602   243.571429
2022-03-10  0.018333  0.008568   240.142857
2022-03-11  0.022000  0.009534   236.714286

Here the data from the two date ranges is interleaved: both have 2022-03-01 and 2022-03-08, but only the first has 2022-03-06 and only the second has 2022-03-04. This naturally causes the interpolations on 2022-03-02, 2022-03-03, 2022-03-05, 2022-03-07 to be different.

Conclusions
Ideally we could get the archiver to ignore the first case and not delete values. I'm not sure how we can make that happen without adding even more complex logic here. |
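The Case 2 mismatch can be reproduced from the `val` column alone. The sketch below is an assumption-laden reconstruction, not the indicator's code: it uses plain pandas linear interpolation on a daily index and a hypothetical helper `interpolate_daily`, with the observed values copied from the geo "13059" tables above.

```python
import pandas as pd


def interpolate_daily(series: pd.Series) -> pd.Series:
    """Hypothetical helper: linear fill over the full daily range of the series."""
    full_index = pd.date_range(series.index.min(), series.index.max(), freq="D")
    return series.reindex(full_index).interpolate(method="linear")


# Observed `val` for geo "13059" in the 2022-03-09--2022-03-11 run.
run_a = pd.Series(
    [0.050, 0.045, 0.042, 0.015, 0.011, 0.011],
    index=pd.to_datetime(
        ["2022-02-27", "2022-02-28", "2022-03-01",
         "2022-03-06", "2022-03-07", "2022-03-08"]
    ),
)
# Observed `val` for the same geo in the 2022-03-11--2022-03-14 run.
run_b = pd.Series(
    [0.042, 0.021, 0.011, 0.022],
    index=pd.to_datetime(["2022-03-01", "2022-03-04", "2022-03-08", "2022-03-11"]),
)

filled_a = interpolate_daily(run_a)
filled_b = interpolate_daily(run_b)

# Compare only the reference dates that both runs output.
common = filled_a.index.intersection(filled_b.index)
diff = (filled_a.loc[common] - filled_b.loc[common]).abs()
changed = diff[diff > 1e-9].index.strftime("%Y-%m-%d").tolist()
print(changed)
```

Under these assumptions the two runs agree only on 2022-03-01 and 2022-03-08, the dates observed in both. Every date in between differs, including 2022-03-04 and 2022-03-06, where one run has an observed value and the other an interpolated one.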
Case 1
A deletion is only detected when a file is present in both places, but a row is missing -- if the rest of the values are in files that aren't output, we shouldn't get any deletions. In this particular case, the 11-14 run has output coverage over reference dates from 3/01 through 3/11 via other regions, so we will get deletions for 3/01-3/07, but not for 2/27-2/28.
Case 2
I agree with your reasoning that the interpolator is working as intended here, but I don't like the idea of replacing the report-confirmed values for 3/06 and 3/07 with interpolated ones. This probably means extending the date range we load in each run, but that's going to get tricky with the naats matching algorithm Nat developed. If we can get it to work though, that will probably also take care of the mistaken deletions problem from Case 1.
What next
|
Case 1: good clarification. The deletions come from the inconsistency in reporting across geos for overlapping dates. The archiver limits us because it was designed to handle indicators that output either full date ranges or partial date ranges on each run, but it was not designed to handle partial geo reports. |
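To make the deletion rule from this exchange concrete, here is a small sketch of the behavior as described in the thread: a row counts as deleted only when its file exists in both runs and the row is missing from the newer one. This is not the archiver's code. The function name, the file names, and the representation of each file as a set of geo ids are all assumptions for illustration.

```python
def find_deletions(old_run: dict, new_run: dict) -> dict:
    """Hypothetical sketch: map each shared file name to geos missing from the new run."""
    deletions = {}
    # Only files present in both runs can produce deletions.
    for filename in sorted(old_run.keys() & new_run.keys()):
        missing = old_run[filename] - new_run[filename]
        if missing:
            deletions[filename] = missing
    return deletions


# Older run: geo 13061 was interpolated across both reference dates.
old_run = {
    "20220227_county.csv": {"13059", "13061"},
    "20220305_county.csv": {"13059", "13061"},
}
# Newer run: no 02-27 file at all; the 03-05 file exists only via another geo.
new_run = {
    "20220305_county.csv": {"13059"},
    "20220310_county.csv": {"13059", "13061"},
}
print(find_deletions(old_run, new_run))
```

In this toy input the 02-27 file is absent from the newer run, so nothing in it is flagged, while the 03-05 file is present without geo 13061, so that geo is reported as deleted. That mirrors the partial-geo problem described above.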
If we ran with ...but the running time would slowly devour the world, so that's probably not the solution we want long-term |
Yeaaaa, I don't think that's feasible even now - opening and interpolating just 7 files takes ~5 minutes on my machine, so if the full history is 300+ files, it would take hours. |
…isting DSEW-CPR: Extend files to be processed for interpolation
…/covidcast-indicators into ds/dsew-interpolation
@krivard I think this is ready for a final review, once I figure out what's failing with the build. |
pretty sure it's just the linter |
@krivard the linter is happy! i ran this with a modification of your
and found that there were no changed files, only new and removed in the diff. maybe to be safe i should try this on even more date windows... |
Alright, I started running this
and was going to diff all consecutive folders, just to make sure and I already found something strange. Setting reports to "2022-01-02--2022-01-06" breaks because the 2021-12-30 publish date only has 1 reference value, but |
I think I see what happened. The 2021-12-30 file has two reference dates: 2021-12-27 and 2021-12-20. The 2022-01-06 file also has the 2021-12-27 reference date, so when we hit this line of code, the 2022-01-06 version overwrites the 2021-12-30 one and leaves it with a single reference date.
Aaand it turns out that Nat already fixed this issue and I just haven't merged from main into this branch in a while. |
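For anyone reading this later, the overwrite described above can be illustrated with a toy example. The linked line of code is not reproduced here, so the data structure below (a dict keyed by reference date only) is purely an assumption chosen to show the effect. Only the publish and reference dates named in this comment are included.

```python
from collections import defaultdict

# (publish_date, reference_date) pairs mentioned in the comment above.
reports = [
    ("2021-12-30", "2021-12-20"),
    ("2021-12-30", "2021-12-27"),
    ("2022-01-06", "2021-12-27"),
]

# Assumed structure: keying by reference date alone lets a later publish date
# silently replace an earlier one for the same reference date.
latest_by_refdate = {}
for publish_date, ref_date in reports:
    latest_by_refdate[ref_date] = publish_date

# Regroup by publish date to see what each file is left with.
refdates_per_publish = defaultdict(list)
for ref_date, publish_date in latest_by_refdate.items():
    refdates_per_publish[publish_date].append(ref_date)

print(dict(refdates_per_publish))
```

With this keying, the 2021-12-30 file ends up with only the 2021-12-20 reference date. Keying on the (publish date, reference date) pair instead would keep both entries, though whether that matches the actual fix on main is not something this sketch can confirm.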
👍
Description
Adds interpolation to most DSEW CPR signals. Partially fixes #1539. Final work will be in #1576.
Changelog
Fixes