[Draft] just-in-time (JIT) meta computations #947
Conversation
….. and also restore a method that got deleted??
…nd of regular acquisition method, removed helper to run dbjobs as a standalone process
Co-authored-by: Andrew Chin <[email protected]>
Co-authored-by: Katie Mazaitis <[email protected]>
…rep-prep V4 schema revisions candidate
Bumps [tzinfo](https://github.com/tzinfo/tzinfo) from 1.2.9 to 1.2.10.
- [Release notes](https://github.com/tzinfo/tzinfo/releases)
- [Changelog](https://github.com/tzinfo/tzinfo/blob/master/CHANGES.md)
- [Commits](tzinfo/tzinfo@v1.2.9...v1.2.10)

updated-dependencies:
- dependency-name: tzinfo
  dependency-type: direct:production

Signed-off-by: dependabot[bot] <[email protected]>
…o-1.2.10 Bump tzinfo from 1.2.9 to 1.2.10 in /docs
Exposition on A/B Tests

First, I took a chunk of data and ran an A/B test between my meta just-in-time (JIT) computations and the existing database approach. All the values matched except for the lag and issue fields. I was at a loss: how can the derived values possibly have different lags?

Katie figured it out immediately: the update profiles look completely different for cumulative versus incidence signals. Consider: if we backfill 1 count to a cumulative signal on a day in the past, then all the subsequent days need to be updated (since the entire curve from that day on is raised by 1). For the incidence signal, however, this is just a single backfilled blip and the rest of the values are unaffected (a small sketch after the REPL session below illustrates this). So if the cumulative signal is reissued, its lag/issue fields are very likely to fall out of sync with the incidence signal.

![[Pasted image 20220721131752.png]]

This plot shows lag against time for 3 signals, for a single county.

However, when computing the diffed signal JIT, we only have access to a rolling window of two rows, from which we must decide on the issue (we can't look arbitrarily far into the past to see when the underlying values last changed). But more on that later.

To see if I could avoid this issue, I instead opted to collect data from the very beginning of the dataset. Consider the following query for a single county for the earliest data possible.

>>> import covidcast
>>> from datetime import date, timedelta
>>> start_day = date(2020, 1, 15)
>>> end_day = date(2020, 1, 23)
>>> df1 = covidcast.signal(data_source="jhu-csse", signal="confirmed_cumulative_num", start_day=start_day, end_day=end_day, time_type="day", geo_type="county", geo_values="02100")
>>> df2 = covidcast.signal(data_source="jhu-csse", signal="confirmed_incidence_num", start_day=start_day, end_day=end_day, time_type="day", geo_type="county", geo_values="02100")
>>> df3 = covidcast.signal(data_source="jhu-csse", signal="confirmed_7dav_incidence_num", start_day=start_day, end_day=end_day + timedelta(days=30), time_type="day", geo_type="county", geo_values="02100")
>>> df1
geo_value signal time_value issue lag missing_value missing_stderr missing_sample_size value stderr sample_size geo_type data_source
0 02100 confirmed_cumulative_num 2020-01-22 2020-05-14 113 0 5 5 0.0 None None county jhu-csse
0 02100 confirmed_cumulative_num 2020-01-23 2020-05-14 112 0 5 5 0.0 None None county jhu-csse
>>> df2
geo_value signal time_value issue lag missing_value missing_stderr missing_sample_size value stderr sample_size geo_type data_source
0 02100 confirmed_incidence_num 2020-01-22 2020-05-14 113 0 5 5 0.0 None None county jhu-csse
0 02100 confirmed_incidence_num 2020-01-23 2020-05-14 112 0 5 5 0.0 None None county jhu-csse
>>> df3
geo_value signal time_value issue lag missing_value missing_stderr missing_sample_size value stderr sample_size geo_type data_source
0 02100 confirmed_7dav_incidence_num 2020-02-20 2021-04-01 406 0 0 0 0.0 None None county jhu-csse
0 02100 confirmed_7dav_incidence_num 2020-02-21 2021-04-01 405 0 0 0 0.0 None None county jhu-csse
0 02100 confirmed_7dav_incidence_num 2020-02-22 2021-04-01 404 0 0 0 0.0 None None county jhu-csse
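As an aside, here is a minimal pandas sketch (mine, not code from this PR) of the update-profile difference described above: a single backfilled count forces every subsequent cumulative row to be reissued, while only one incidence row changes.

```python
import pandas as pd

# Toy cumulative series for five days.
cumulative = pd.Series([0, 1, 3, 6, 10], index=pd.date_range("2020-03-01", periods=5))
incidence = cumulative.diff().fillna(cumulative.iloc[0])

# Backfill one count on 2020-03-02: the whole curve from that day on rises by 1.
revised = cumulative.copy()
revised.loc["2020-03-02":] += 1
revised_incidence = revised.diff().fillna(revised.iloc[0])

print((revised != cumulative).sum())           # 4 cumulative rows must be reissued
print((revised_incidence != incidence).sum())  # only 1 incidence row changes
```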
So, because the earliest available data differs across the three signals, I decided to bump my data up a little later, so that the earliest data available for cumulative and incidence matches the smoothed signal (I chose cumulative 2020-02-13 -- 2020-05-01 and incidence 2020-02-14 -- 2020-05-01).

This data has issues too: I started out with discrepancies between the lag values of the JIT-computed and database-stored smoothed signal. Consider:

>>> end_day = date(2020, 5, 1)
>>> df1 = covidcast.signal(data_source="jhu-csse", signal="confirmed_cumulative_num", start_day=end_day - timedelta(days=7), end_day=end_day, time_type="day", geo_type="county", geo_values="02100")
>>> df2 = covidcast.signal(data_source="jhu-csse", signal="confirmed_incidence_num", start_day=end_day - timedelta(days=6), end_day=end_day, time_type="day", geo_type="county", geo_values="02100")
>>> df3 = covidcast.signal(data_source="jhu-csse", signal="confirmed_7dav_incidence_num", start_day=end_day, end_day=end_day, time_type="day", geo_type="county", geo_values="02100")
>>> df1
geo_value signal time_value issue lag missing_value missing_stderr missing_sample_size value stderr sample_size geo_type data_source
0 02100 confirmed_cumulative_num 2020-04-24 2021-04-01 342 0 0 0 0.0 None None county jhu-csse
0 02100 confirmed_cumulative_num 2020-04-25 2021-04-01 341 0 0 0 0.0 None None county jhu-csse
0 02100 confirmed_cumulative_num 2020-04-26 2021-04-01 340 0 0 0 0.0 None None county jhu-csse
0 02100 confirmed_cumulative_num 2020-04-27 2021-04-01 339 0 0 0 0.0 None None county jhu-csse
0 02100 confirmed_cumulative_num 2020-04-28 2021-04-01 338 0 0 0 0.0 None None county jhu-csse
0 02100 confirmed_cumulative_num 2020-04-29 2021-04-01 337 0 0 0 0.0 None None county jhu-csse
0 02100 confirmed_cumulative_num 2020-04-30 2021-04-01 336 0 0 0 0.0 None None county jhu-csse
0 02100 confirmed_cumulative_num 2020-05-01 2021-04-01 335 0 0 0 0.0 None None county jhu-csse
>>> df2
geo_value signal time_value issue lag missing_value missing_stderr missing_sample_size value stderr sample_size geo_type data_source
0 02100 confirmed_incidence_num 2020-04-25 2021-04-01 341 0 0 0 0.0 None None county jhu-csse
0 02100 confirmed_incidence_num 2020-04-26 2021-04-01 340 0 0 0 0.0 None None county jhu-csse
0 02100 confirmed_incidence_num 2020-04-27 2021-04-01 339 0 0 0 0.0 None None county jhu-csse
0 02100 confirmed_incidence_num 2020-04-28 2021-04-01 338 0 0 0 0.0 None None county jhu-csse
0 02100 confirmed_incidence_num 2020-04-29 2021-04-01 337 0 0 0 0.0 None None county jhu-csse
0 02100 confirmed_incidence_num 2020-04-30 2021-04-01 336 0 0 0 0.0 None None county jhu-csse
0 02100 confirmed_incidence_num 2020-05-01 2021-04-01 335 0 0 0 0.0 None None county jhu-csse
>>> df3
geo_value signal time_value issue lag missing_value missing_stderr missing_sample_size value stderr sample_size geo_type data_source
0 02100 confirmed_7dav_incidence_num 2020-05-01 2021-04-01 335 0 0 0 0.0 None None county jhu-csse

Track the lag values: the window's cumulative and incidence rows run down to lag 335, and the smoothed signal's lag is exactly 335, matching the most recently issued row in its window. This seems reasonable. A reasonable approach to computing the lag value for a derived value that depends on a window of previous values is this: treat the derived value as reissued whenever any row in its window is reissued, i.e. take the maximum issue across the window and compute the lag from it.
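A sketch of that rule (my reading of it, not necessarily the PR's exact implementation; derived_issue_and_lag is a made-up helper name):

```python
from datetime import date

def derived_issue_and_lag(window_issues: list[date], time_value: date) -> tuple[date, int]:
    """A derived value is effectively reissued whenever any row in its
    window is reissued, so use the window's maximum issue."""
    issue = max(window_issues)
    return issue, (issue - time_value).days

# The 2020-05-01 example above: all seven window rows carry issue 2021-04-01.
issue, lag = derived_issue_and_lag([date(2021, 4, 1)] * 7, date(2020, 5, 1))
print(issue, lag)  # 2021-04-01 335
```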
I reimplemented this approach. Now I'm getting new issues with the lag values. Consider:

>>> end_day = date(2020, 3, 12)
>>> df1 = covidcast.signal(data_source="jhu-csse", signal="confirmed_cumulative_num", start_day=end_day - timedelta(days=7), end_day=end_day, time_type="day", geo_type="state", geo_values="ak")
>>> df2 = covidcast.signal(data_source="jhu-csse", signal="confirmed_incidence_num", start_day=end_day - timedelta(days=6), end_day=end_day, time_type="day", geo_type="state", geo_values="ak")
>>> df3 = covidcast.signal(data_source="jhu-csse", signal="confirmed_7dav_incidence_num", start_day=end_day, end_day=end_day, time_type="day", geo_type="state", geo_values="ak")
>>> df1
geo_value signal time_value issue lag missing_value missing_stderr missing_sample_size value stderr sample_size geo_type data_source
0 ak confirmed_cumulative_num 2020-03-05 2021-04-01 392 0 0 0 0.0 None None state jhu-csse
0 ak confirmed_cumulative_num 2020-03-06 2021-04-01 391 0 0 0 0.0 None None state jhu-csse
0 ak confirmed_cumulative_num 2020-03-07 2021-04-01 390 0 0 0 0.0 None None state jhu-csse
0 ak confirmed_cumulative_num 2020-03-08 2021-04-01 389 0 0 0 0.0 None None state jhu-csse
0 ak confirmed_cumulative_num 2020-03-09 2021-04-01 388 0 0 0 0.0 None None state jhu-csse
0 ak confirmed_cumulative_num 2020-03-10 2021-04-01 387 0 0 0 0.0 None None state jhu-csse
0 ak confirmed_cumulative_num 2020-03-11 2021-04-01 386 0 0 0 0.0 None None state jhu-csse
0 ak confirmed_cumulative_num 2020-03-12 2020-10-29 231 0 0 0 0.0 None None state jhu-csse
>>> df2
geo_value signal time_value issue lag missing_value missing_stderr missing_sample_size value stderr sample_size geo_type data_source
0 ak confirmed_incidence_num 2020-03-06 2021-04-01 391 0 0 0 0.0 None None state jhu-csse
0 ak confirmed_incidence_num 2020-03-07 2021-04-01 390 0 0 0 0.0 None None state jhu-csse
0 ak confirmed_incidence_num 2020-03-08 2021-04-01 389 0 0 0 0.0 None None state jhu-csse
0 ak confirmed_incidence_num 2020-03-09 2021-04-01 388 0 0 0 0.0 None None state jhu-csse
0 ak confirmed_incidence_num 2020-03-10 2021-04-01 387 0 0 0 0.0 None None state jhu-csse
0 ak confirmed_incidence_num 2020-03-11 2021-04-01 386 0 0 0 0.0 None None state jhu-csse
0 ak confirmed_incidence_num 2020-03-12 2020-10-29 231 0 0 0 0.0 None None state jhu-csse
>>> df3
geo_value signal time_value issue lag missing_value missing_stderr missing_sample_size value stderr sample_size geo_type data_source
0 ak confirmed_7dav_incidence_num 2020-03-12 2020-10-29 231 0 0 0 0.0 None None state jhu-csse

Note that the stored smoothed value for 2020-03-12 carries issue 2020-10-29 (lag 231), even though its window contains rows reissued as late as 2021-04-01, so the max-issue approach and the database disagree here.
Hopefully my notes above help explain my testing strategy. See the comments in test_meta-cache-updater.
Force-pushed from 8aef215 to c0dff23
* separate meta operations from database.py into database_meta.py
* add tests
Force-pushed from c35bc9a to 392a0af
])
>>> df.to_csv("test-data.csv")
"""
self._insert_csv("repos/delphi/delphi-epidata/tests/acquisition/covidcast/test-data/test-data2.csv")
There's a testdata folder in the delphi-epidata folder. It may allow for easier reuse of these flat files if needed for other tests.
True, let me move these files there.
Another note on this: the directory structure here assumes that the base working directory is above the repos folder. This is inconsistent with `test_utils.py`.
Hm, I've never even looked at that file before. Looks like it's all for tests of `covid_hosp`, which I haven't really touched.
Wait, it looks like `test_utils.py` does some complicated directory traversing up the tree until it can find `testdata`.
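Something like this upward traversal, presumably (my guess at its shape; find_testdata is an illustrative name, not the actual helper in test_utils.py):

```python
from pathlib import Path

def find_testdata(start: Path) -> Path:
    # Walk up from `start` until a directory containing `testdata` is found.
    for parent in [start, *start.parents]:
        candidate = parent / "testdata"
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(f"no testdata directory above {start}")

# e.g. find_testdata(Path(__file__).resolve().parent)
```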
Yeah you’re right. Let me dig through the source to see if that’s used elsewhere.
* move meta AB tests to test_covidcast_meta from test_covidcast_meta_cache_updater
* move test data files to testdata directory
* remove unused db access in test_covidcast_meta_cache_updater
* Revert the spacing in test_covidcast_meta_cache_updater for cleaner diffs
* incorporation of test improvements from parallel branch

Updates to include improvements from the "merged key dimension table" branch (`krivard/v4-rpp-mergeddim-leftjoin`), specifically at commit hash `fbf878e`. Changes include:
- unit/integration testing refactoring and other improvements.
- Percona DBMS now used in the db docker image, plus changes for resulting compatibility issues.
- db schema names are specified in ddl files.
- removal of the obsolete index hint guessing method.
- updated comments.

This changeset is essentially just a port of the excellent work @krivard did to refactor and otherwise improve the test architecture, as she applied it to the branch mentioned above. Most of the files were simply copied over from the other branch to create this PR; I only really made edits to these files (and most edits were to strip out "mergedkey" stuff):
- src/ddl/v4_schema.sql
- src/acquisition/covidcast/database.py
- src/acquisition/covidcast/test_utils.py
- integrations/acquisition/covidcast/test_covidcast_meta_caching.py
- integrations/server/covidcast/test_covidcast_meta.py

* small cleanup to edit i made earlier
* documentation clarification

Co-authored-by: Katie Mazaitis <[email protected]>
* renamed v4 db objects: load, latest, and history tables, and their id columns (#963)

src/ddl/migrations/v4_renaming.sql contains the SQL to do the renaming on a live v4 system. Source code changes were done by these 4 shell commands:

find ./src ./tests ./integrations -type f -exec sed -i 's/signal_history/epimetric_full/g' {} \;
find ./src ./tests ./integrations -type f -exec sed -i 's/signal_latest/epimetric_latest/g' {} \;
find ./src ./tests ./integrations -type f -exec sed -i 's/signal_load/epimetric_load/g' {} \;
find ./src ./tests ./integrations -type f -exec sed -i 's/signal_data_id/epimetric_id/g' {} \;

All tests pass on the codebase as it stands in this commit; they also pass on a `delphi_web_epidata` docker image that was created before this commit and then had the migration file changes applied to it.

* whoops, ran the renaming commands on the migration script
* whitespace nit

Co-authored-by: Katie Mazaitis <[email protected]>
Co-authored-by: Katie Mazaitis <[email protected]>
Metadata threads, ANALYZE, etc.
addresses issue #941
DBLoadStateException: halt acquisition when unexpected data found in load table
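This commit title suggests a guard that raises when the load table isn't empty at the start of acquisition. A guess at its shape (illustrative only; the table name comes from the renaming commit above, and assert_load_table_empty is a made-up helper):

```python
class DBLoadStateException(Exception):
    """Raised when the load table unexpectedly contains rows."""

def assert_load_table_empty(cursor) -> None:
    # Halt acquisition if epimetric_load is not empty (leftover rows would
    # indicate a crashed or concurrent load).
    cursor.execute("SELECT COUNT(1) FROM epimetric_load")
    (count,) = cursor.fetchone()
    if count != 0:
        raise DBLoadStateException(f"load table has {count} unexpected rows")
```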
Remove remaining wip pieces in tests
I'm going to merge this into …
Addresses a part of #646.
Contains work for computing meta values for the JIT signals. This work refactors `database.py` and moves the meta computations into `database_meta.py`; the main work is in the latter file. The main testing is in `test_covidcast_meta_cache_updater.py`.

Would appreciate reviews on testing strategy, testing completeness, and clarity of the code.
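For a flavor of what a just-in-time meta computation can look like, here is a hypothetical sketch (illustrative only; MetaAccumulator is an invented name, not this PR's actual API): stream a derived signal's rows once and fold them into per-signal summary statistics instead of materializing the derived signal in the database.

```python
from dataclasses import dataclass

@dataclass
class MetaAccumulator:
    """Running summary statistics for one (source, signal) stream."""
    count: int = 0
    total: float = 0.0
    min_value: float = float("inf")
    max_value: float = float("-inf")

    def update(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.min_value = min(self.min_value, value)
        self.max_value = max(self.max_value, value)

    @property
    def mean(self) -> float:
        return self.total / self.count

acc = MetaAccumulator()
for value in (0.0, 2.0, 5.0):  # e.g. rows of a JIT-diffed signal
    acc.update(value)
print(acc.min_value, acc.max_value, round(acc.mean, 2))  # 0.0 5.0 2.33
```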
TODO: