Consider deduping CPR entries using their Archive Repository timestamp #1480

Closed
krivard opened this issue Jan 20, 2022 · 3 comments
Labels
future-solution Solutions to problems we don't have yet but still dread

Comments


krivard commented Jan 20, 2022

Occasionally, the Community Profile Report (CPR) publishes multiple xlsx files with the same nominal publish date.
#1479 handles this case with naive deduping, since to date this has happened only once (2021-09-13), and the two files were identical.

If we ever see duplicate publish dates where the file contents differ, we will need to revisit that solution.

The core dataset does not provide a timestamp for any attachment. However, the Archive Repository lists the upload time of every change made to the dataset, including adding or removing attachments. By diffing the Metadata Updates column in sequential rows, we can determine which attachments were added in each upload entry.

For the files labeled 2021-09-13, the Archive Repository shows that the pdf and the first xlsx file were uploaded at 2021 Sep 14 10:14:55 AM, and the second xlsx file was uploaded at 2021 Sep 15 12:38:36 PM (in the same upload entry as the pdf and xlsx files labeled 2021-09-14).
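The diffing step can be sketched in Python roughly as follows. This is a minimal sketch, not the indicator's code: it assumes each Archive Repository row lists every attachment present after that upload, and the function name `attachments_added` and all filenames are hypothetical. Only the two timestamps come from the example above.

```python
from typing import Iterable, List, Set, Tuple


def attachments_added(
    entries: Iterable[Tuple[str, Iterable[str]]]
) -> List[Tuple[str, List[str]]]:
    """Return (upload_time, newly added filenames) per entry, oldest first.

    Assumption: each entry lists all attachments present after that upload,
    so the set difference with the previous entry gives the additions.
    """
    previous: Set[str] = set()
    added = []
    for upload_time, filenames in entries:
        current = set(filenames)
        added.append((upload_time, sorted(current - previous)))
        previous = current
    return added


# Hypothetical filenames; timestamps are the two quoted in this issue.
history = [
    ("2021-09-14 10:14:55", ["cpr_20210913.pdf", "cpr_20210913.xlsx"]),
    ("2021-09-15 12:38:36", [
        "cpr_20210913.pdf", "cpr_20210913.xlsx", "cpr_20210913_v2.xlsx",
        "cpr_20210914.pdf", "cpr_20210914.xlsx",
    ]),
]
result = attachments_added(history)
```

With upload times attached to each file this way, two xlsx files sharing a nominal publish date could be ordered by when they were actually uploaded.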

@krivard krivard added the future-solution Solutions to problems we don't have yet but still dread label Jan 20, 2022

krivard commented Jan 27, 2022

Here's a file that maps each upload timestamp to its filename: dsew_upload_times.json.txt

Created using:

$ curl -sL "https://healthdata.gov/resource/6hii-ae4f.json" >dsew.json
$ jq 'map({date:.update_date, updates:(.metadata_updates |fromjson |.updates)}) |map(select(.updates|objects)) |map(select(.updates|has("attachments"))) |map({date:.date, file:(.updates.attachments|map(.filename) |map(select(test("xlsx")))[0] )})' <dsew.json >dsew_upload_times.json
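For anyone who would rather do this step in Python, here is a rough equivalent of the jq filter above. It is a sketch under the same assumptions the filter makes: `metadata_updates` is a JSON-encoded string holding an object, and each attachment has a `filename`. The function name `xlsx_upload_times` and the sample rows are made up for illustration.

```python
import json


def xlsx_upload_times(rows):
    """Mirror the jq filter: one {date, file} record per row with attachments."""
    out = []
    for row in rows:
        # metadata_updates is assumed to be a JSON string, as in `fromjson`.
        updates = json.loads(row["metadata_updates"]).get("updates")
        # Keep only rows whose updates value is an object with "attachments".
        if not isinstance(updates, dict) or "attachments" not in updates:
            continue
        # test("xlsx") is an unanchored match, so a substring check suffices.
        xlsx = [a["filename"] for a in updates["attachments"]
                if "xlsx" in a["filename"]]
        out.append({"date": row["update_date"],
                    "file": xlsx[0] if xlsx else None})
    return out


# Made-up rows shaped like the fields the jq filter reads.
sample_rows = [
    {"update_date": "2021-09-14T10:14:55", "metadata_updates": json.dumps(
        {"updates": {"attachments": [{"filename": "example.pdf"},
                                     {"filename": "example.xlsx"}]}})},
    {"update_date": "2021-09-14T11:00:00", "metadata_updates": json.dumps({})},
    {"update_date": "2021-09-14T12:00:00", "metadata_updates": json.dumps(
        {"updates": {"description": "text change only"}})},
]
upload_times = xlsx_upload_times(sample_rows)
```

As with the jq version, only the first xlsx filename in each entry is kept, so an entry that adds several xlsx files at once would need extra handling.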

I did some additional calculations to get the average upload lag, and it's 1.3 days (ignoring outliers of 7 days or more, but keeping Mondays -- dsew doesn't upload on weekends, so Mondays are always 3-4 days).
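One way such a lag figure could be computed is sketched below. The actual calculation is not shown in this thread, so everything here is an assumption: lag is taken as the upload time minus midnight of the nominal publish date, the function name `mean_upload_lag` is hypothetical, and the sample dates are invented.

```python
from datetime import date, datetime, time
from statistics import mean


def mean_upload_lag(pairs, outlier_days=7.0):
    """Average lag in days between nominal publish date and upload time.

    Lags of `outlier_days` or more are dropped before averaging.
    Returns None if nothing is left to average.
    """
    lags = []
    for publish_date, uploaded_at in pairs:
        start = datetime.combine(publish_date, time.min)
        lag_days = (uploaded_at - start).total_seconds() / 86400
        if lag_days < outlier_days:
            lags.append(lag_days)
    return mean(lags) if lags else None


# Invented (publish date, upload time) pairs for illustration only.
sample_pairs = [
    (date(2022, 1, 3), datetime(2022, 1, 4, 12, 0)),   # 1.5 days
    (date(2022, 1, 4), datetime(2022, 1, 6, 12, 0)),   # 2.5 days
    (date(2022, 1, 5), datetime(2022, 1, 13, 12, 0)),  # 8.5 days, dropped
]
average_lag = mean_upload_lag(sample_pairs)
```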


neul3 commented Apr 14, 2022

Hi Katie, I'll just use this account and shift over once I get an Andrew email. Feel free to assign the task here.

nmdefries commented

dsew-cpr has been removed in #1871. We never saw a case where duplicate publish dates were uploaded with differing contents.
