
[Draft] just-in-time (JIT) meta computations #947


Merged
merged 49 commits into jit_computations from ds/jit-update-meta on Oct 4, 2022

Conversation

dshemetov
Contributor

@dshemetov dshemetov commented Jul 23, 2022

Addresses a part of #646.

Contains work for computing meta values for the JIT signals. The basic setup is this:

  • base signals (non-derived, aka non-JIT, signals) have their meta computed normally
  • derived signals make a request to a local API server and then use Pandas to compute the resulting meta values (see the sketch below)
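
As a rough sketch of the second path (illustrative only: the endpoint URL, parameter names, and function name here are my assumptions, not the PR's actual code):

import pandas as pd
import requests

LOCAL_API = "http://localhost:10080/epidata/covidcast/"  # hypothetical local server URL

def compute_derived_meta(source: str, signal: str) -> dict:
    # 1. Ask the local API server for the derived (JIT) signal's rows.
    rows = requests.get(LOCAL_API, params={
        "data_source": source,
        "signals": signal,
        "time_type": "day",
        "geo_type": "county",
        "time_values": "*",
        "geo_values": "*",
    }).json()["epidata"]
    # 2. Reduce the rows to meta values with pandas.
    df = pd.DataFrame(rows)
    return {
        "min_value": df["value"].min(),
        "max_value": df["value"].max(),
        "mean_value": df["value"].mean(),
        "min_lag": df["lag"].min(),
        "max_lag": df["lag"].max(),
        "max_issue": df["issue"].max(),
    }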

This work refactors database.py and moves the meta computations into database_meta.py. The main work is in the latter file. The main testing is in test_covidcast_meta_cache_updater.py.

Would appreciate reviews on testing strategy, testing completeness, and clarity of the code.

TODO:

  • Make complete tests.
    • Make a table of the integration tests that already exist for JIT computations.
      • List all the queries that have been tested. List all the possible edge cases.
    • Make a table of the meta computation A/B tests.
      • List all the queries that have been tested. List all the possible edge cases.

dshemetov and others added 19 commits May 26, 2022 13:10
….. and also restore a method that got deleted??
…nd of regular acquisition method, removed helper to run dbjobs as a standalone process
Bumps [tzinfo](https://github.com/tzinfo/tzinfo) from 1.2.9 to 1.2.10.
- [Release notes](https://github.com/tzinfo/tzinfo/releases)
- [Changelog](https://github.com/tzinfo/tzinfo/blob/master/CHANGES.md)
- [Commits](tzinfo/tzinfo@v1.2.9...v1.2.10)

---
updated-dependencies:
- dependency-name: tzinfo
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
…o-1.2.10

Bump tzinfo from 1.2.9 to 1.2.10 in /docs
@dshemetov dshemetov requested a review from krivard July 23, 2022 21:14
@dshemetov dshemetov marked this pull request as draft July 23, 2022 21:14
@dshemetov
Contributor Author

Exposition on A/B Tests

First, I took a chunk of data from jhu-csse:confirmed_cumulative_num between 2022-04-01 and 2022-06-01 to test my meta computations. I chose 4 counties and 2 states.

I ran an A/B test between my just-in-time (JIT) meta computations and the existing database approach. All the values matched except for the max_lag field, and even then, all the values matched for confirmed_cumulative_num; it was the derived signals that didn't.

I was at a loss. How can the derived values possibly have different lags?

Katie figured it out immediately: the update profiles look completely different for cumulative versus incidence signals. Consider: if we backfill 1 count to a cumulative signal on a day in the past, then all subsequent days need to be updated (since the entire curve from that day on is raised by 1). For the incidence signal, however, this is just a single backfilled blip and the rest of the values are unaffected. So if the cumulative signal is reissued, its lag/issue fields are very likely to fall out of sync with the incidence signal.
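
A toy pandas example (mine, not from the PR) of the effect: one backfilled count in the cumulative series changes every subsequent value, so every subsequent row gets reissued, while the diffed series changes at exactly one point.

import pandas as pd

cumulative = pd.Series([1, 2, 4, 4, 7, 9])
backfilled = cumulative.copy()
backfilled.iloc[2:] += 1  # backfill +1 on day 2: the whole curve from there is raised

# Cumulative: every value from day 2 onward changes, so 4 rows are reissued.
print((backfilled != cumulative).sum())                    # 4

# Incidence (first difference): only day 2 itself changes.
print((backfilled.diff() != cumulative.diff())[1:].sum())  # 1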

[Figure: lag plotted against time for 3 signals, for a single county.]

However, when computing the diffed signal JIT, we only have access to a rolling window of two rows, from which we must decide on the issue (we can't look arbitrarily far into the past to see when confirmed_cumulative_num was updated, for that would defeat the server's streaming architecture). So JIT has no alternative but to deviate from the existing behavior here and choose an issue from one of the two rows it has access to when diffing.
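
To make the constraint concrete, here is my mental model as a sketch (illustrative only, not the server's actual streaming code):

from itertools import pairwise  # Python 3.10+

def diff_stream(rows):
    """rows: (time_value, value, issue) tuples sorted by time_value.
    Yields the diffed (incidence) signal using only a 2-row rolling window."""
    for (t0, v0, i0), (t1, v1, i1) in pairwise(rows):
        # No access to older issues of the base signal here, so the diffed
        # point's issue must be chosen from one of these two rows.
        yield (t1, v1 - v0, max(i0, i1))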

But more on that later.

To see if I could avoid this issue, I instead opted to collect data from the very beginning of the jhu-csse signal history. The earliest data available is from 2020-01-22, so I grabbed data from then until 2020-05-01. It turned out that there were data discrepancies there too. This results in test-data1.csv.

Consider the following query for a single county for the earliest data possible.

>>> from datetime import date, timedelta
>>> import covidcast
>>> start_day = date(2020, 1, 15)
>>> end_day = date(2020, 1, 23)
>>> df1 = covidcast.signal(data_source="jhu-csse", signal="confirmed_cumulative_num", start_day=start_day, end_day=end_day, time_type="day", geo_type="county", geo_values="02100")
>>> df2 = covidcast.signal(data_source="jhu-csse", signal="confirmed_incidence_num", start_day=start_day, end_day=end_day, time_type="day", geo_type="county", geo_values="02100")
>>> df3 = covidcast.signal(data_source="jhu-csse", signal="confirmed_7dav_incidence_num", start_day=start_day, end_day=end_day + timedelta(days=30), time_type="day", geo_type="county", geo_values="02100")
>>> df1
	geo_value	signal	time_value	issue	lag	missing_value	missing_stderr	missing_sample_size	value	stderr	sample_size	geo_type	data_source
0	02100	confirmed_cumulative_num	2020-01-22	2020-05-14	113	0	5	5	0.0	None	None	county	jhu-csse
0	02100	confirmed_cumulative_num	2020-01-23	2020-05-14	112	0	5	5	0.0	None	None	county	jhu-csse

>>> df2
geo_value	signal	time_value	issue	lag	missing_value	missing_stderr	missing_sample_size	value	stderr	sample_size	geo_type	data_source
0	02100	confirmed_incidence_num	2020-01-22	2020-05-14	113	0	5	5	0.0	None	None	county	jhu-csse
0	02100	confirmed_incidence_num	2020-01-23	2020-05-14	112	0	5	5	0.0	None	None	county	jhu-csse

>>> df3
	geo_value	signal	time_value	issue	lag	missing_value	missing_stderr	missing_sample_size	value	stderr	sample_size	geo_type	data_source
0	02100	confirmed_7dav_incidence_num	2020-02-20	2021-04-01	406	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_7dav_incidence_num	2020-02-21	2021-04-01	405	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_7dav_incidence_num	2020-02-22	2021-04-01	404	0	0	0	0.0	None	None	county	jhu-csse
  • discrepancy 1: the earliest issues available for incidence match those for cumulative. While this is mostly inconsequential (the earliest incidence value is filled with a 0), we couldn't actually have had access to this data, so it must be a spoof
  • discrepancy 2: the earliest smoothed data available is on 2020-02-20, which lags the earliest cumulative data by about a month

For that reason, I decided to shift my data window a little later, so that the earliest data available for cumulative and incidence matches the smoothed signal (I chose 2020-02-13 to 2020-05-01 for cumulative and 2020-02-14 to 2020-05-01 for incidence). This results in test-data2.csv.

This data has issues too.

I started by finding discrepancies between the min_lag values in my JIT code and in the data. Here's what the data showed:

>>> end_day = date(2020, 5, 1)
>>> df1 = covidcast.signal(data_source="jhu-csse", signal="confirmed_cumulative_num", start_day=end_day - timedelta(days=7), end_day=end_day, time_type="day", geo_type="county", geo_values="02100")
>>> df2 = covidcast.signal(data_source="jhu-csse", signal="confirmed_incidence_num", start_day=end_day - timedelta(days=6), end_day=end_day, time_type="day", geo_type="county", geo_values="02100")
>>> df3 = covidcast.signal(data_source="jhu-csse", signal="confirmed_7dav_incidence_num", start_day=end_day, end_day=end_day, time_type="day", geo_type="county", geo_values="02100")
>>> df1
geo_value	signal	time_value	issue	lag	missing_value	missing_stderr	missing_sample_size	value	stderr	sample_size	geo_type	data_source
0	02100	confirmed_cumulative_num	2020-04-24	2021-04-01	342	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_cumulative_num	2020-04-25	2021-04-01	341	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_cumulative_num	2020-04-26	2021-04-01	340	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_cumulative_num	2020-04-27	2021-04-01	339	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_cumulative_num	2020-04-28	2021-04-01	338	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_cumulative_num	2020-04-29	2021-04-01	337	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_cumulative_num	2020-04-30	2021-04-01	336	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_cumulative_num	2020-05-01	2021-04-01	335	0	0	0	0.0	None	None	county	jhu-csse
>>> df2
geo_value	signal	time_value	issue	lag	missing_value	missing_stderr	missing_sample_size	value	stderr	sample_size	geo_type	data_source
0	02100	confirmed_incidence_num	2020-04-25	2021-04-01	341	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_incidence_num	2020-04-26	2021-04-01	340	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_incidence_num	2020-04-27	2021-04-01	339	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_incidence_num	2020-04-28	2021-04-01	338	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_incidence_num	2020-04-29	2021-04-01	337	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_incidence_num	2020-04-30	2021-04-01	336	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_incidence_num	2020-05-01	2021-04-01	335	0	0	0	0.0	None	None	county	jhu-csse
>>> df3
	geo_value	signal	time_value	issue	lag	missing_value	missing_stderr	missing_sample_size	value	stderr	sample_size	geo_type	data_source
0	02100	confirmed_7dav_incidence_num	2020-05-01	2021-04-01	335	0	0	0	0.0	None	None	county	jhu-csse

Track the lag values: the lag of confirmed_7dav_incidence_num matches the lag at the same time_value for the confirmed_cumulative_num signal.

This seems reasonable. A sensible approach to computing the lag value for a derived value that depends on a window of previous values is this (see the sketch after this list):

  • take the window of values in the base signal (here, confirmed_cumulative_num) and find the largest issue (here, 2021-04-01)
  • compute the distance between this largest issue and the time_value that we're computing for (generally the time_value we're computing for is the right-most point in the window, so in this case it's 2020-05-01); here we get 335
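
A minimal sketch of that rule (illustrative code, using the numbers from the tables above):

from datetime import date

def window_lag(window_issues, time_value):
    """Lag of a derived point: days from its time_value to the newest
    issue among the base-signal rows in its window."""
    return (max(window_issues) - time_value).days

# Eight cumulative rows all issued 2021-04-01, derived point at 2020-05-01:
print(window_lag([date(2021, 4, 1)] * 8, date(2020, 5, 1)))  # 335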

I reimplemented this approach. Now I'm getting new discrepancies with min_lag. Tracking it down, I found that the smoothed incidence signal just inherits the issue of the cumulative signal at the same time_value, without taking a max over all the rows present in the summation.
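
In terms of the window_lag sketch above, the existing data behaves roughly like the first line below instead of the second:

issue = window_issues[-1]   # inherit the issue of the right-endpoint row (what the data shows)
issue = max(window_issues)  # take the max over all rows in the summation

The following query shows this in the data: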

>>> end_day = date(2020, 3, 12)
>>> df1 = covidcast.signal(data_source="jhu-csse", signal="confirmed_cumulative_num", start_day=end_day - timedelta(days=7), end_day=end_day, time_type="day", geo_type="state", geo_values="ak")
>>> df2 = covidcast.signal(data_source="jhu-csse", signal="confirmed_incidence_num", start_day=end_day - timedelta(days=6), end_day=end_day, time_type="day", geo_type="state", geo_values="ak")
>>> df3 = covidcast.signal(data_source="jhu-csse", signal="confirmed_7dav_incidence_num", start_day=end_day, end_day=end_day, time_type="day", geo_type="state", geo_values="ak")
>>> df1
geo_value	signal	time_value	issue	lag	missing_value	missing_stderr	missing_sample_size	value	stderr	sample_size	geo_type	data_source
0	ak	confirmed_cumulative_num	2020-03-05	2021-04-01	392	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_cumulative_num	2020-03-06	2021-04-01	391	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_cumulative_num	2020-03-07	2021-04-01	390	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_cumulative_num	2020-03-08	2021-04-01	389	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_cumulative_num	2020-03-09	2021-04-01	388	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_cumulative_num	2020-03-10	2021-04-01	387	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_cumulative_num	2020-03-11	2021-04-01	386	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_cumulative_num	2020-03-12	2020-10-29	231	0	0	0	0.0	None	None	state	jhu-csse

>>> df2
geo_value	signal	time_value	issue	lag	missing_value	missing_stderr	missing_sample_size	value	stderr	sample_size	geo_type	data_source
0	ak	confirmed_incidence_num	2020-03-06	2021-04-01	391	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_incidence_num	2020-03-07	2021-04-01	390	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_incidence_num	2020-03-08	2021-04-01	389	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_incidence_num	2020-03-09	2021-04-01	388	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_incidence_num	2020-03-10	2021-04-01	387	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_incidence_num	2020-03-11	2021-04-01	386	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_incidence_num	2020-03-12	2020-10-29	231	0	0	0	0.0	None	None	state	jhu-csse

>>> df3
geo_value	signal	time_value	issue	lag	missing_value	missing_stderr	missing_sample_size	value	stderr	sample_size	geo_type	data_source
0	ak	confirmed_7dav_incidence_num	2020-03-12	2020-10-29	231	0	0	0	0.0	None	None	state	jhu-csse

@dshemetov
Contributor Author

Hopefully my notes above help explain my testing strategy. See the comments in test_covidcast_meta_cache_updater.py.

* separate meta operations from database.py into database_meta.py
* add tests
])
>>> df.to_csv("test-data.csv")
"""
self._insert_csv("repos/delphi/delphi-epidata/tests/acquisition/covidcast/test-data/test-data2.csv")
Contributor

There's a testdata folder in the delphi-epidata folder. It may allow for easier reuse of these flat files if needed for other tests.

Contributor Author

True, let me move these files there.

Contributor

Another note on this: the directory structure here assumes that the base working directory is above the repos folder, which is inconsistent with test_utils.py.

Contributor Author

Hm, I've never even looked at that file before. Looks like it's all for covid_hosp tests, which I haven't really touched.

Wait, it looks like test_utils.py does some complicated directory traversal up the tree until it can find testdata.
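
For reference, that traversal amounts to something like this (a sketch of the idea, not the actual test_utils.py code):

from pathlib import Path

def find_testdata(start: Path) -> Path:
    """Walk upward from `start` until a directory containing `testdata` is found."""
    for parent in [start, *start.parents]:
        if (parent / "testdata").is_dir():
            return parent / "testdata"
    raise FileNotFoundError(f"no testdata directory above {start}")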

Contributor

Yeah you’re right. Let me dig through the source to see if that’s used elsewhere.

dshemetov and others added 19 commits August 23, 2022 14:19
* move meta AB tests to test_covidcast_meta from test_covidcast_meta_cache_updater
* move test data files to testdata directory
* remove unused db access in test_covidcast_meta_cache_updater
* Revert the spacing in test_covidcast_meta_cache_updater for cleaner diffs
* incorporation of test improvements from parallel branch

Updates to include improvements from "merged key dimension table" branch (`krivard/v4-rpp-mergeddim-leftjoin`), specifically at commit hash `fbf878e`.
Changes include:
 - unit/integration testing refactoring and other improvements.
 - percona dbms now used in db docker image, plus changes for resulting compatibility issues.
 - db schema names are specified in ddl files.
 - removal of obsolete index hint guessing method.
 - updated comments.

This changeset is essentially just a port of the excellent work @krivard did to refactor and otherwise improve the test architecture, as she applied it to the branch mentioned above. Most of the files were simply copied over from the other branch to this one to create this PR; I only really made edits to these files (and most edits were to strip out "mergedkey" stuff):
 - src/ddl/v4_schema.sql
 - src/acquisition/covidcast/database.py
 - src/acquisition/covidcast/test_utils.py
 - integrations/acquisition/covidcast/test_covidcast_meta_caching.py
 - integrations/server/covidcast/test_covidcast_meta.py

* small cleanup to edit i made earlier

* documentation clarification

Co-authored-by: Katie Mazaitis <[email protected]>

Co-authored-by: Katie Mazaitis <[email protected]>
… columns (#963)

* renamed v4 db objects: load, latest, and history tables, and their id columns

src/ddl/migrations/v4_renaming.sql - contains the SQL to do the renaming of a live v4 system

source code changes were done by these 4 shell commands:
  find ./src ./tests ./integrations -type f -exec sed -i 's/signal_history/epimetric_full/g' {} \;
  find ./src ./tests ./integrations -type f -exec sed -i 's/signal_latest/epimetric_latest/g' {} \;
  find ./src ./tests ./integrations -type f -exec sed -i 's/signal_load/epimetric_load/g' {} \;
  find ./src ./tests ./integrations -type f -exec sed -i 's/signal_data_id/epimetric_id/g' {} \;

all tests pass on the codebase as it stands in this commit;
they also pass on a `delphi_web_epidata` docker image that was created before this commit and then had the migration file changes applied to it.

* whoops, ran the renaming commands on the migration script

* whitespace nit

Co-authored-by: Katie Mazaitis <[email protected]>

Co-authored-by: Katie Mazaitis <[email protected]>
Co-authored-by: Katie Mazaitis <[email protected]>
DBLoadStateException: halt acquisition when unexpected data found in load table (sketched after this commit list)
Remove remaining wip pieces in tests
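
A guess at what that DBLoadStateException guard looks like (illustrative only; the check and message are my assumptions, though the epimetric_load table name comes from the renaming commit above):

class DBLoadStateException(Exception):
    """Raised when the load table is not empty at the start of acquisition."""

def assert_load_table_empty(cursor):
    # Acquisition expects an empty load table; anything left over means a
    # previous run did not finish cleanly, so halt rather than mix batches.
    cursor.execute("SELECT COUNT(1) FROM epimetric_load")
    (count,) = cursor.fetchone()
    if count > 0:
        raise DBLoadStateException(
            f"unexpected data found in load table ({count} rows); halting acquisition")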
@dshemetov
Contributor Author

dshemetov commented Sep 8, 2022

I'm going to merge this into the jit_computations branch. With all the v4 updates to the tests, keeping both branches in sync duplicates the work (and I spent a lot of time merging the v4 updates in today).

@dshemetov dshemetov merged commit cc2da33 into jit_computations Oct 4, 2022
@dshemetov dshemetov deleted the ds/jit-update-meta branch October 4, 2022 22:32
@dshemetov dshemetov restored the ds/jit-update-meta branch October 7, 2022 19:40
@dshemetov dshemetov deleted the ds/jit-update-meta branch October 7, 2022 19:41
@dshemetov dshemetov mentioned this pull request Oct 7, 2022