
Add search result ordering by download count #1182


Closed
di wants to merge 2 commits from the order-by-downloads branch

Conversation

@di (Member) commented May 10, 2016

This PR adds search result ordering by download count as discussed in #702.

@dstufft (Member) commented May 14, 2016

So I like this, but there's one problem with it: I think it's going to trigger half a million queries when reindexing against the production database.
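For illustration, a minimal sketch of the kind of per-project lookup that makes reindexing expensive; the models and session here are hypothetical stand-ins, not the PR's actual code:

    # db is a SQLAlchemy session; Project/Release are illustrative models.
    # One extra query per project: SQLAlchemy lazy-loads each project's
    # releases to sum its downloads, so N projects means roughly N queries.
    documents = []
    for project in db.query(Project).all():
        total = sum(release.downloads for release in project.releases)
        documents.append({"name": project.name, "downloads": total})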

di force-pushed the order-by-downloads branch from 510e933 to 256c313 on June 9, 2016 04:04
@di (Member, Author) commented Jun 9, 2016

@dstufft I added a commit here that should significantly cut down the number of queries required for reindexing. However, I noticed that in the month since I opened this PR, download statistics were removed from PyPI (https://bitbucket.org/pypa/pypi/commits/f54).
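A minimal sketch of the batching idea (not necessarily the commit's exact approach), assuming a Release model with name and downloads columns:

    from sqlalchemy import func

    # One aggregate query for every project's total, instead of one
    # lazy-loading query per project while building search documents.
    totals = dict(
        db.query(Release.name, func.sum(Release.downloads))
        .group_by(Release.name)
        .all()
    )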

I'm not sure how much PyPI and Warehouse have in common at this point, but I did notice that in the prod DB snapshot I have, some projects report 0 downloads for their latest releases, e.g.:

    >>> [r.downloads for r in db.query(Project).filter(Project.name=='Django').first().releases][:10]
    [0, 0, 0, 20177, 270956, 264277, 3568, 13574, 10831, 6499]

So I'm wondering if this is still worth adding!

@dstufft (Member) commented Jun 9, 2016

Download counts are temporarily disabled on PyPI because they broke and I haven't fixed them yet.

The current state of our statistics pipeline is that we're archiving all of the download events to an S3-like storage that DreamHost provides, and we're writing the download events into a BigQuery database that Google provides. It doesn't exist yet, but there will probably be some sort of cronjob that takes data from BigQuery and injects it into the existing downloads field.
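A rough sketch of what that cronjob might look like, assuming the google-cloud-bigquery client; the table name, columns, and helper are placeholders, since none of this exists yet:

    from google.cloud import bigquery

    client = bigquery.Client()
    # Placeholder table/columns; the real pipeline isn't built yet.
    totals = client.query(
        "SELECT file.project AS name, COUNT(*) AS downloads "
        "FROM `the-psf.pypi.downloads` GROUP BY name"
    )
    for row in totals:
        # Inject the aggregated counts back into the existing
        # downloads field in the relational database.
        update_project_downloads(row.name, row.downloads)  # hypothetical helper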

What does that mean for this PR? If the total number of downloads is something we think we want to sort by, then this PR is still useful, as that's still going to work. However, BigQuery offers us some additional queries we could run in the future (like daily/weekly/monthly counts) that we might decide give us better numbers (or we might not).

An additional consideration is what effect this will have on #701. Maybe the answer is that we need to periodically do a full reindex. Or maybe we need a Celery cron task that takes the N items from Elasticsearch that were indexed the longest time ago and reindexes them.
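A rough sketch of that second option, assuming Celery and the elasticsearch-py client, with a hypothetical indexed_at timestamp stored on each document:

    from celery import shared_task
    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    @shared_task
    def reindex_stalest(n=1000):
        # Pull the N documents that were indexed the longest time ago,
        # sorting on a hypothetical indexed_at field in the mapping.
        stale = es.search(
            index="projects",
            body={"size": n, "sort": [{"indexed_at": "asc"}], "_source": ["name"]},
        )
        for hit in stale["hits"]["hits"]:
            reindex_project.delay(hit["_source"]["name"])  # hypothetical task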

@di (Member, Author) commented Jun 9, 2016

> However, BigQuery offers us some additional queries we could run in the future (like daily/weekly/monthly counts) that we might decide give us better numbers (or we might not).

I think that the "average" daily counts (for the last release) in this PR are about as useful as, if not more useful than, the daily/weekly/monthly stats we could get from BigQuery. However, using BigQuery would remove the need to reindex to get the latest download stats.
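Something like the following, assuming a release object with a downloads total and a created upload timestamp; this is an illustration, not the PR's exact code:

    import datetime

    def avg_daily_downloads(release):
        # Total downloads divided by days since upload, floored at one day.
        age = datetime.datetime.utcnow() - release.created
        return release.downloads / max(age.days, 1)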

@dstufft (Member) commented Jun 9, 2016

We can get average daily counts for any release from BigQuery.

@di (Member, Author) commented Jun 9, 2016

@dstufft Gotcha. It doesn't seem like we're using BigQuery in Warehouse at the moment; any thoughts/direction on adding it? Do you know if it's possible to emulate locally?

@dstufft (Member) commented Jun 9, 2016

@di We're not using it in Warehouse at the moment; I have a branch locally that halfway starts to add it.

I don't think it's possible to emulate locally, but for the record, the BigQuery data is completely public. See https://mail.python.org/pipermail/distutils-sig/2016-May/028986.html.

di force-pushed the order-by-downloads branch from 256c313 to ff27374 on July 11, 2016 17:35
@jaddison commented

> The current state of our statistics pipeline is that we're archiving all of the download events to an S3-like storage that DreamHost provides

I'm late to the party here, but if you have the download data in S3 (or S3-like storage), I assume it's stored in datetime-based filenames. Couldn't you just run a cron job every 15 minutes or every hour to grab the most recent, unprocessed download logs, and then run Elasticsearch bulk in-place document updates (scripted field updates) via https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html#bulk-update?
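A sketch of what those bulk scripted updates could look like with the elasticsearch-py helpers; the index name, using project names as document IDs, and the counts dict are assumptions:

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch()

    def bump_download_counts(counts):
        # counts: {"Django": 1234, ...}, parsed from the newest log files.
        actions = (
            {
                "_op_type": "update",
                "_index": "projects",
                "_id": name,
                # Scripted in-place increment, per the bulk-update docs linked above.
                "script": {
                    "source": "ctx._source.downloads += params.n",
                    "params": {"n": n},
                },
            }
            for name, n in counts.items()
        )
        bulk(es, actions)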

I assume that BigQuery is used for other purposes (statistics-related?), but if not, it could be removed from the infrastructure to simplify things.

Lots of assumptions in my statements, I'm willing to admit. Apologies if this is just noise.

@di (Member, Author) commented Oct 27, 2017

This is sufficiently old and dusty, so I'm going to close this as it's not a priority.

di closed this on Oct 27, 2017
di deleted the order-by-downloads branch on April 27, 2018 17:05