-
Notifications
You must be signed in to change notification settings - Fork 16
Fix CPR bugs associated with non-unique timestamps by publish date #1562
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Since deduplication keeps latest publish date for a given timestamp, it is possible for an older publish date to have fewer than two unique timestamps per geo region-signal type. tune asserts
…licate timestamps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some missing documentation, and I may also be getting confused about one context of two reference dates ("last week" "previous week") and another ("last week" positivity vs volume after 2021-03-17)
"Each publish date-geo value combination should be available for both " + \ | ||
"test positivity and test volume." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Community_Profile_Report_20220225.xlsx has 158 counties where positivity is not available for "last week" but volume is, and 7 counties where volume is not available but positivity is. Will this assert inappropriately reject that file? Or do you mean that previous steps should have already dropped the entries with mismatched availability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The wording may be ambiguous on my part. "available" here means that there is a row for a given region and date. It doesn't check if the signal is NA
or not. pandas.read_excel
reads empty fields as NA/NaN
, so obs are included as NA
where the signal was not available in the original report.
The way the sample size filter is set up at the moment, any obs with missing (NA
) n or n <= 5 are dropped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rows with missing val
(test positivity) but sample size (test volume) available and > 5 are included.
Co-authored-by: Katie Mazaitis <[email protected]>
…st testing sig-combine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm! 👍 want me to wait for ananya before merging or no?
Prefer to wait for @Ananya-Joshi to take a look given the complexity of some of the logic |
Tests and some sample dates I looked at from march 2022 seem to be working fine/generally in the right range. Happy to approve pending some information on the comments above. Good job Nat! |
Description
Fix two CPR bugs:
Changelog
pull.py
Fixes
Closes #1556.