
[Draft] just-in-time (JIT) meta computations #947


Merged
merged 49 commits into jit_computations from ds/jit-update-meta on Oct 4, 2022

Conversation

dshemetov
Contributor

@dshemetov dshemetov commented Jul 23, 2022

Addresses a part of #646.

Contains work for computing meta values for the JIT signals. The basic setup is this:

  • base signals (non-derived, aka non-JIT, signals) have their meta computed normally
  • derived signals make a request to a local API server and then use Pandas to compute the resulting meta values (see the sketch below)
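
As a rough sketch of the second path (illustrative only: the endpoint URL, parameter names, and function name here are my assumptions, not the PR's actual code):

import pandas as pd
import requests

LOCAL_API = "http://localhost:10080/epidata/covidcast/"  # hypothetical local server URL

def compute_derived_meta(source: str, signal: str) -> dict:
    # 1. Ask the local API server for the derived (JIT) signal's rows.
    rows = requests.get(LOCAL_API, params={
        "data_source": source,
        "signals": signal,
        "time_type": "day",
        "geo_type": "county",
        "time_values": "*",
        "geo_values": "*",
    }).json()["epidata"]
    # 2. Reduce the rows to meta values with pandas.
    df = pd.DataFrame(rows)
    return {
        "min_value": df["value"].min(),
        "max_value": df["value"].max(),
        "mean_value": df["value"].mean(),
        "min_lag": df["lag"].min(),
        "max_lag": df["lag"].max(),
        "max_issue": df["issue"].max(),
    }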

This work refactors database.py and moves the meta computations into database_meta.py. The main work is in the latter file. The main testing is in test_covidcast_meta_cache_updater.py.

Would appreciate reviews on testing strategy, testing completeness, and clarity of the code.

TODO:

  • Make complete tests.
    • Make a table of the integration tests that already exist for JIT computations.
      • List all the queries that have been tested. List all the possible edge cases.
    • Make a table of the meta computation A/B tests.
      • List all the queries that have been tested. List all the possible edge cases.

dshemetov and others added 19 commits May 26, 2022 13:10
….. and also restore a method that got deleted??
…nd of regular acquisition method, removed helper to run dbjobs as a standalone process
Bumps [tzinfo](https://github.com/tzinfo/tzinfo) from 1.2.9 to 1.2.10.
- [Release notes](https://github.com/tzinfo/tzinfo/releases)
- [Changelog](https://github.com/tzinfo/tzinfo/blob/master/CHANGES.md)
- [Commits](tzinfo/tzinfo@v1.2.9...v1.2.10)

---
updated-dependencies:
- dependency-name: tzinfo
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
…o-1.2.10

Bump tzinfo from 1.2.9 to 1.2.10 in /docs
@dshemetov dshemetov requested a review from krivard July 23, 2022 21:14
@dshemetov dshemetov marked this pull request as draft July 23, 2022 21:14
@dshemetov
Contributor Author

Exposition on A/B Tests

First, I took a chunk of data from jhu-csse:confirmed_cumulative_num between 2022-04-01 and 2022-06-01 to test my meta computations. I chose 4 counties and 2 states.

I ran an A/B test between my just-in-time (JIT) meta computations and the existing database approach. All the values matched except for the max_lag field, and even then, all the values matched for confirmed_cumulative_num; it was the derived signals that didn't.

I was at a loss. How can the derived values possibly have different lags?

Katie figured it out immediately: the update profiles look completely different for cumulative versus incidence signals. Consider: if we backfill 1 count to a cumulative signal on a day in the past, then all subsequent days need to be updated (since the entire curve from that day on is raised by 1). For the incidence signal, however, this is just a single backfilled blip and the rest of the values are unaffected. So if the cumulative signal is reissued, its lag/issue fields are very likely to fall out of sync with the incidence signal.
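
A toy pandas example (mine, not from the PR) of the effect: one backfilled count in the cumulative series changes every subsequent value, so every subsequent row gets reissued, while the diffed series changes at exactly one point.

import pandas as pd

cumulative = pd.Series([1, 2, 4, 4, 7, 9])
backfilled = cumulative.copy()
backfilled.iloc[2:] += 1  # backfill +1 on day 2: the whole curve from there is raised

# Cumulative: every value from day 2 onward changes, so 4 rows are reissued.
print((backfilled != cumulative).sum())                    # 4

# Incidence (first difference): only day 2 itself changes.
print((backfilled.diff() != cumulative.diff())[1:].sum())  # 1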

[Figure: lag plotted against time for 3 signals, for a single county.]

However, when computing the diffed signal JIT, we only have access to a rolling window of two rows, from which we must decide on the issue (we can't look arbitrarily far into the past to see when confirmed_cumulative_num was updated, for that would defeat the server's streaming architecture). So JIT has no alternative but to deviate from the existing behavior here and choose an issue from one of the two rows it has access to when diffing.
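
To make the constraint concrete, here is my mental model as a sketch (illustrative only, not the server's actual streaming code):

from itertools import pairwise  # Python 3.10+

def diff_stream(rows):
    """rows: (time_value, value, issue) tuples sorted by time_value.
    Yields the diffed (incidence) signal using only a 2-row rolling window."""
    for (t0, v0, i0), (t1, v1, i1) in pairwise(rows):
        # No access to older issues of the base signal here, so the diffed
        # point's issue must be chosen from one of these two rows.
        yield (t1, v1 - v0, max(i0, i1))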

But more on that later.

To see if I could avoid this issue, I instead opted to collect data from the very beginning of the jhu-csse signal history. The earliest data available is from 2020-01-22, so I grabbed data from then until 2020-05-01. It turned out that there were data discrepancies there too. This results in test-data1.csv.

Consider the following query for a single county for the earliest data possible.

>>> from datetime import date, timedelta
>>> import covidcast
>>> start_day = date(2020, 1, 15)
>>> end_day = date(2020, 1, 23)
>>> df1 = covidcast.signal(data_source="jhu-csse", signal="confirmed_cumulative_num", start_day=start_day, end_day=end_day, time_type="day", geo_type="county", geo_values="02100")
>>> df2 = covidcast.signal(data_source="jhu-csse", signal="confirmed_incidence_num", start_day=start_day, end_day=end_day, time_type="day", geo_type="county", geo_values="02100")
>>> df3 = covidcast.signal(data_source="jhu-csse", signal="confirmed_7dav_incidence_num", start_day=start_day, end_day=end_day + timedelta(days=30), time_type="day", geo_type="county", geo_values="02100")
>>> df1
	geo_value	signal	time_value	issue	lag	missing_value	missing_stderr	missing_sample_size	value	stderr	sample_size	geo_type	data_source
0	02100	confirmed_cumulative_num	2020-01-22	2020-05-14	113	0	5	5	0.0	None	None	county	jhu-csse
0	02100	confirmed_cumulative_num	2020-01-23	2020-05-14	112	0	5	5	0.0	None	None	county	jhu-csse

>>> df2
geo_value	signal	time_value	issue	lag	missing_value	missing_stderr	missing_sample_size	value	stderr	sample_size	geo_type	data_source
0	02100	confirmed_incidence_num	2020-01-22	2020-05-14	113	0	5	5	0.0	None	None	county	jhu-csse
0	02100	confirmed_incidence_num	2020-01-23	2020-05-14	112	0	5	5	0.0	None	None	county	jhu-csse

>>> df3
	geo_value	signal	time_value	issue	lag	missing_value	missing_stderr	missing_sample_size	value	stderr	sample_size	geo_type	data_source
0	02100	confirmed_7dav_incidence_num	2020-02-20	2021-04-01	406	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_7dav_incidence_num	2020-02-21	2021-04-01	405	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_7dav_incidence_num	2020-02-22	2021-04-01	404	0	0	0	0.0	None	None	county	jhu-csse
  • discrepancy 1: the earliest issues available for incidence match those for cumulative. While this is mostly inconsequential (the earliest incidence value is filled with a 0), we couldn't actually have had access to this data, so it must be a spoof
  • discrepancy 2: the earliest smoothed data available is on 2020-02-20, which lags the earliest cumulative data by about a month

For that reason, I decided to shift my data window a little later, so that the earliest data available for cumulative and incidence matches the smoothed signal (I chose 2020-02-13 to 2020-05-01 for cumulative and 2020-02-14 to 2020-05-01 for incidence). This results in test-data2.csv.

This data has issues too.

I started by finding discrepancies between the min_lag values in my JIT code and in the data. Here's what the data showed:

>>> end_day = date(2020, 5, 1)
>>> df1 = covidcast.signal(data_source="jhu-csse", signal="confirmed_cumulative_num", start_day=end_day - timedelta(days=7), end_day=end_day, time_type="day", geo_type="county", geo_values="02100")
>>> df2 = covidcast.signal(data_source="jhu-csse", signal="confirmed_incidence_num", start_day=end_day - timedelta(days=6), end_day=end_day, time_type="day", geo_type="county", geo_values="02100")
>>> df3 = covidcast.signal(data_source="jhu-csse", signal="confirmed_7dav_incidence_num", start_day=end_day, end_day=end_day, time_type="day", geo_type="county", geo_values="02100")
>>> df1
geo_value	signal	time_value	issue	lag	missing_value	missing_stderr	missing_sample_size	value	stderr	sample_size	geo_type	data_source
0	02100	confirmed_cumulative_num	2020-04-24	2021-04-01	342	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_cumulative_num	2020-04-25	2021-04-01	341	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_cumulative_num	2020-04-26	2021-04-01	340	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_cumulative_num	2020-04-27	2021-04-01	339	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_cumulative_num	2020-04-28	2021-04-01	338	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_cumulative_num	2020-04-29	2021-04-01	337	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_cumulative_num	2020-04-30	2021-04-01	336	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_cumulative_num	2020-05-01	2021-04-01	335	0	0	0	0.0	None	None	county	jhu-csse
>>> df2
geo_value	signal	time_value	issue	lag	missing_value	missing_stderr	missing_sample_size	value	stderr	sample_size	geo_type	data_source
0	02100	confirmed_incidence_num	2020-04-25	2021-04-01	341	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_incidence_num	2020-04-26	2021-04-01	340	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_incidence_num	2020-04-27	2021-04-01	339	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_incidence_num	2020-04-28	2021-04-01	338	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_incidence_num	2020-04-29	2021-04-01	337	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_incidence_num	2020-04-30	2021-04-01	336	0	0	0	0.0	None	None	county	jhu-csse
0	02100	confirmed_incidence_num	2020-05-01	2021-04-01	335	0	0	0	0.0	None	None	county	jhu-csse
>>> df3
	geo_value	signal	time_value	issue	lag	missing_value	missing_stderr	missing_sample_size	value	stderr	sample_size	geo_type	data_source
0	02100	confirmed_7dav_incidence_num	2020-05-01	2021-04-01	335	0	0	0	0.0	None	None	county	jhu-csse

Track the lag values: the lag of confirmed_7dav_incidence_num matches the lag at the same time_value for the confirmed_cumulative_num signal.

This seems reasonable. A sensible approach to computing the lag value for a derived value that depends on a window of previous values is this (see the sketch after this list):

  • take the window of values in the base signal (here, confirmed_cumulative_num) and find the largest issue (here, 2021-04-01)
  • compute the distance between this largest issue and the time_value that we're computing for (generally the time_value we're computing for is the right-most point in the window, so in this case it's 2020-05-01); here we get 335
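
A minimal sketch of that rule (illustrative code, using the numbers from the tables above):

from datetime import date

def window_lag(window_issues, time_value):
    """Lag of a derived point: days from its time_value to the newest
    issue among the base-signal rows in its window."""
    return (max(window_issues) - time_value).days

# Eight cumulative rows all issued 2021-04-01, derived point at 2020-05-01:
print(window_lag([date(2021, 4, 1)] * 8, date(2020, 5, 1)))  # 335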

I reimplemented this approach. Now I'm getting new discrepancies with min_lag. Tracking it down, I found that the smoothed incidence signal just inherits the issue of the cumulative signal at the same time_value, without taking a max over all the rows present in the summation.
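
In terms of the window_lag sketch above, the existing data behaves roughly like the first line below instead of the second:

issue = window_issues[-1]   # inherit the issue of the right-endpoint row (what the data shows)
issue = max(window_issues)  # take the max over all rows in the summation

The following query shows this in the data: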

>>> end_day = date(2020, 3, 12)
>>> df1 = covidcast.signal(data_source="jhu-csse", signal="confirmed_cumulative_num", start_day=end_day - timedelta(days=7), end_day=end_day, time_type="day", geo_type="state", geo_values="ak")
>>> df2 = covidcast.signal(data_source="jhu-csse", signal="confirmed_incidence_num", start_day=end_day - timedelta(days=6), end_day=end_day, time_type="day", geo_type="state", geo_values="ak")
>>> df3 = covidcast.signal(data_source="jhu-csse", signal="confirmed_7dav_incidence_num", start_day=end_day, end_day=end_day, time_type="day", geo_type="state", geo_values="ak")
>>> df1
geo_value	signal	time_value	issue	lag	missing_value	missing_stderr	missing_sample_size	value	stderr	sample_size	geo_type	data_source
0	ak	confirmed_cumulative_num	2020-03-05	2021-04-01	392	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_cumulative_num	2020-03-06	2021-04-01	391	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_cumulative_num	2020-03-07	2021-04-01	390	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_cumulative_num	2020-03-08	2021-04-01	389	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_cumulative_num	2020-03-09	2021-04-01	388	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_cumulative_num	2020-03-10	2021-04-01	387	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_cumulative_num	2020-03-11	2021-04-01	386	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_cumulative_num	2020-03-12	2020-10-29	231	0	0	0	0.0	None	None	state	jhu-csse

>>> df2
geo_value	signal	time_value	issue	lag	missing_value	missing_stderr	missing_sample_size	value	stderr	sample_size	geo_type	data_source
0	ak	confirmed_incidence_num	2020-03-06	2021-04-01	391	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_incidence_num	2020-03-07	2021-04-01	390	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_incidence_num	2020-03-08	2021-04-01	389	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_incidence_num	2020-03-09	2021-04-01	388	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_incidence_num	2020-03-10	2021-04-01	387	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_incidence_num	2020-03-11	2021-04-01	386	0	0	0	0.0	None	None	state	jhu-csse
0	ak	confirmed_incidence_num	2020-03-12	2020-10-29	231	0	0	0	0.0	None	None	state	jhu-csse

>>> df3
geo_value	signal	time_value	issue	lag	missing_value	missing_stderr	missing_sample_size	value	stderr	sample_size	geo_type	data_source
0	ak	confirmed_7dav_incidence_num	2020-03-12	2020-10-29	231	0	0	0	0.0	None	None	state	jhu-csse

@dshemetov
Contributor Author

Hopefully my notes above help explain my testing strategy. See the comments in test_covidcast_meta_cache_updater.py.

* separate meta operations from database.py into database_meta.py
* add tests
])
>>> df.to_csv("test-data.csv")
"""
self._insert_csv("repos/delphi/delphi-epidata/tests/acquisition/covidcast/test-data/test-data2.csv")
Contributor

There's a testdata folder in the delphi-epidata folder. It may allow for easier reuse of these flat files if needed for other tests.

Contributor Author

True, let me move these files there.

Contributor

Another note on this: the directory structure here assumes that the base working directory is above the repos folder, which is inconsistent with test_utils.py.

Contributor Author

Hm, I've never even looked at that file before. Looks like it's all for covid_hosp tests, which I haven't really touched.

Wait, it looks like test_utils.py does some complicated directory traversal up the tree until it can find testdata.
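
For reference, that traversal amounts to something like this (a sketch of the idea, not the actual test_utils.py code):

from pathlib import Path

def find_testdata(start: Path) -> Path:
    """Walk upward from `start` until a directory containing `testdata` is found."""
    for parent in [start, *start.parents]:
        if (parent / "testdata").is_dir():
            return parent / "testdata"
    raise FileNotFoundError(f"no testdata directory above {start}")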

Contributor

Yeah you’re right. Let me dig through the source to see if that’s used elsewhere.

dshemetov and others added 19 commits August 23, 2022 14:19
* move meta AB tests to test_covidcast_meta from test_covidcast_meta_cache_updater
* move test data files to testdata directory
* remove unused db access in test_covidcast_meta_cache_updater
* Revert the spacing in test_covidcast_meta_cache_updater for cleaner diffs
* incorporation of test improvements from parallel branch

Updates to include improvements from "merged key dimension table" branch (`krivard/v4-rpp-mergeddim-leftjoin`), specifically at commit hash `fbf878e`.
Changes include:
 - unit/integration testing refactoring and other improvements.
 - percona dbms now used in db docker image, plus changes for resulting compatibility issues.
 - db schema names are specified in ddl files.
 - removal of obsolete index hint guessing method.
 - updated comments.

This changeset is essentially just a port of the excellent work @krivard did to refactor and otherwise improve the test architecture, as she applied it to the branch mentioned above. Most of the files were simply copied over from the other branch to this one to create this PR; I only really made edits to these files (and most edits were to strip out "mergedkey" stuff):
 - src/ddl/v4_schema.sql
 - src/acquisition/covidcast/database.py
 - src/acquisition/covidcast/test_utils.py
 - integrations/acquisition/covidcast/test_covidcast_meta_caching.py
 - integrations/server/covidcast/test_covidcast_meta.py

* small cleanup to edit i made earlier

* documentation clarification

Co-authored-by: Katie Mazaitis <[email protected]>

Co-authored-by: Katie Mazaitis <[email protected]>
… columns (#963)

* renamed v4 db objects: load, latest, and history tables, and their id columns

src/ddl/migrations/v4_renaming.sql - contains the SQL to do the renaming of a live v4 system

source code changes were done by these 4 shell commands:
  find ./src ./tests ./integrations -type f -exec sed -i 's/signal_history/epimetric_full/g' {} \;
  find ./src ./tests ./integrations -type f -exec sed -i 's/signal_latest/epimetric_latest/g' {} \;
  find ./src ./tests ./integrations -type f -exec sed -i 's/signal_load/epimetric_load/g' {} \;
  find ./src ./tests ./integrations -type f -exec sed -i 's/signal_data_id/epimetric_id/g' {} \;

all tests pass on the codebase as it stands in this commit;
they also pass on a `delphi_web_epidata` docker image that was created before this commit and then had the migration file changes applied to it.

* whoops, ran the renaming commands on the migration script

* whitespace nit

Co-authored-by: Katie Mazaitis <[email protected]>

Co-authored-by: Katie Mazaitis <[email protected]>
Co-authored-by: Katie Mazaitis <[email protected]>
DBLoadStateException: halt acquisition when unexpected data found in load table (sketched after this commit list)
Remove remaining wip pieces in tests
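
A guess at what that DBLoadStateException guard looks like (illustrative only; the check and message are my assumptions, though the epimetric_load table name comes from the renaming commit above):

class DBLoadStateException(Exception):
    """Raised when the load table is not empty at the start of acquisition."""

def assert_load_table_empty(cursor):
    # Acquisition expects an empty load table; anything left over means a
    # previous run did not finish cleanly, so halt rather than mix batches.
    cursor.execute("SELECT COUNT(1) FROM epimetric_load")
    (count,) = cursor.fetchone()
    if count > 0:
        raise DBLoadStateException(
            f"unexpected data found in load table ({count} rows); halting acquisition")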
@dshemetov
Contributor Author

dshemetov commented Sep 8, 2022

I'm going to merge this into the jit_computations branch. With all the v4 updates to the tests, keeping both branches in sync duplicates the work (and I spent a lot of time merging the v4 updates in today).

@dshemetov dshemetov merged commit cc2da33 into jit_computations Oct 4, 2022
@dshemetov dshemetov deleted the ds/jit-update-meta branch October 4, 2022 22:32
@dshemetov dshemetov restored the ds/jit-update-meta branch October 7, 2022 19:40
@dshemetov dshemetov deleted the ds/jit-update-meta branch October 7, 2022 19:41
@dshemetov dshemetov mentioned this pull request Oct 7, 2022