
Add search result ordering by download count #1182


Closed
di wants to merge 2 commits from the order-by-downloads branch

Conversation

@di (Member) commented May 10, 2016

This PR adds search result ordering by download count as discussed in #702.

@dstufft (Member) commented May 14, 2016

So I like this, but there's one problem with it: I think it's going to trigger half a million queries when reindexing against the production database.
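For illustration, a minimal sketch of the kind of per-project lookup that makes reindexing expensive; the models and session here are hypothetical stand-ins, not the PR's actual code:

    # db is a SQLAlchemy session; Project/Release are illustrative models.
    # One extra query per project: SQLAlchemy lazy-loads each project's
    # releases to sum its downloads, so N projects means roughly N queries.
    documents = []
    for project in db.query(Project).all():
        total = sum(release.downloads for release in project.releases)
        documents.append({"name": project.name, "downloads": total})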

di force-pushed the order-by-downloads branch from 510e933 to 256c313 on June 9, 2016 04:04
@di (Member, Author) commented Jun 9, 2016

@dstufft I added a commit here that should significantly cut down the number of queries required for reindexing. However, I noticed that in the month since I opened this PR, download statistics were removed from PyPI (https://bitbucket.org/pypa/pypi/commits/f54).
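A minimal sketch of the batching idea (not necessarily the commit's exact approach), assuming a Release model with name and downloads columns:

    from sqlalchemy import func

    # One aggregate query for every project's total, instead of one
    # lazy-loading query per project while building search documents.
    totals = dict(
        db.query(Release.name, func.sum(Release.downloads))
        .group_by(Release.name)
        .all()
    )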

I'm not sure how much PyPI and Warehouse have in common at this point, but I did notice that in the prod DB snapshot I have, some projects report 0 downloads for their latest releases, e.g.:

    >>> [r.downloads for r in db.query(Project).filter(Project.name=='Django').first().releases][:10]
    [0, 0, 0, 20177, 270956, 264277, 3568, 13574, 10831, 6499]

So I'm wondering if this is still worth adding!

@dstufft (Member) commented Jun 9, 2016

Download counts are temporarily disabled on PyPI because they broke and I haven't fixed them yet.

The current state of our statistics pipeline is that we're archiving all of the download events to an S3-like storage that DreamHost provides, and we're writing the download events into a BigQuery database that Google provides. It doesn't exist yet, but there will probably be some sort of cronjob that takes data from BigQuery and injects it into the existing downloads field.
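A rough sketch of what that cronjob might look like, assuming the google-cloud-bigquery client; the table name, columns, and helper are placeholders, since none of this exists yet:

    from google.cloud import bigquery

    client = bigquery.Client()
    # Placeholder table/columns; the real pipeline isn't built yet.
    totals = client.query(
        "SELECT file.project AS name, COUNT(*) AS downloads "
        "FROM `the-psf.pypi.downloads` GROUP BY name"
    )
    for row in totals:
        # Inject the aggregated counts back into the existing
        # downloads field in the relational database.
        update_project_downloads(row.name, row.downloads)  # hypothetical helper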

What does that mean for this PR? If the total number of downloads is something we think we want to sort by, then this PR is still useful, as that's still going to work. However, BigQuery offers us some additional queries we could run in the future (like daily/weekly/monthly counts) that we might decide give us better numbers (or we might not).

An additional consideration is what effect this will have on #701. Maybe the answer is that we need to periodically do a full reindex. Or maybe we need a Celery cron task that takes the N items from Elasticsearch that were indexed the longest time ago and reindexes them.
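A rough sketch of that second option, assuming Celery and the elasticsearch-py client, with a hypothetical indexed_at timestamp stored on each document:

    from celery import shared_task
    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    @shared_task
    def reindex_stalest(n=1000):
        # Pull the N documents that were indexed the longest time ago,
        # sorting on a hypothetical indexed_at field in the mapping.
        stale = es.search(
            index="projects",
            body={"size": n, "sort": [{"indexed_at": "asc"}], "_source": ["name"]},
        )
        for hit in stale["hits"]["hits"]:
            reindex_project.delay(hit["_source"]["name"])  # hypothetical task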

@di (Member, Author) commented Jun 9, 2016

> However, BigQuery offers us some additional queries we could run in the future (like daily/weekly/monthly counts) that we might decide give us better numbers (or we might not).

I think that the "average" daily counts (for the last release) in this PR are about as useful as, if not more useful than, the daily/weekly/monthly stats we could get from BigQuery. However, using BigQuery would remove the need to reindex to get the latest download stats.
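Something like the following, assuming a release object with a downloads total and a created upload timestamp; this is an illustration, not the PR's exact code:

    import datetime

    def avg_daily_downloads(release):
        # Total downloads divided by days since upload, floored at one day.
        age = datetime.datetime.utcnow() - release.created
        return release.downloads / max(age.days, 1)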

@dstufft (Member) commented Jun 9, 2016

We can get average daily counts for any release from BigQuery.

@di (Member, Author) commented Jun 9, 2016

@dstufft Gotcha. It doesn't seem like we're using BigQuery in Warehouse at the moment; any thoughts/direction on adding it? Do you know if it's possible to emulate locally?

@dstufft (Member) commented Jun 9, 2016

@di We're not using it in Warehouse at the moment; I have a branch locally that halfway starts to add it.

I don't think it's possible to emulate locally, but for the record, the BigQuery data is completely public. See https://mail.python.org/pipermail/distutils-sig/2016-May/028986.html.

di force-pushed the order-by-downloads branch from 256c313 to ff27374 on July 11, 2016 17:35
@jaddison commented

> The current state of our statistics pipeline is that we're archiving all of the download events to an S3-like storage that DreamHost provides

I'm late to the party here, but if you have the download data in S3 (or S3-like storage), I assume it's stored in datetime-based filenames. Couldn't you just run a cron job every 15 minutes or every hour to grab the most recent, unprocessed download logs, and then run Elasticsearch bulk in-place document updates (scripted field updates) via https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html#bulk-update?
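A sketch of what those bulk scripted updates could look like with the elasticsearch-py helpers; the index name, using project names as document IDs, and the counts dict are assumptions:

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch()

    def bump_download_counts(counts):
        # counts: {"Django": 1234, ...}, parsed from the newest log files.
        actions = (
            {
                "_op_type": "update",
                "_index": "projects",
                "_id": name,
                # Scripted in-place increment, per the bulk-update docs linked above.
                "script": {
                    "source": "ctx._source.downloads += params.n",
                    "params": {"n": n},
                },
            }
            for name, n in counts.items()
        )
        bulk(es, actions)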

I assume that BigQuery is used for other purposes (statistics-related?), but if not, it could be removed from the infrastructure to simplify things.

Lots of assumptions in my statements, I'm willing to admit. Apologies if this is just noise.

@di (Member, Author) commented Oct 27, 2017

This is sufficiently old and dusty, so I'm going to close this as it's not a priority.

di closed this on Oct 27, 2017
di deleted the order-by-downloads branch on April 27, 2018 17:05