Add search result ordering by download count #1182
Conversation
So I like this, but there's one problem with it: I think it's going to trigger half a million queries when reindexing against the production database.
@dstufft I added a commit here that should significantly cut down the number of queries required for reindexing. However, I noticed that in the month since I opened this PR, download statistics were removed from PyPI (https://bitbucket.org/pypa/pypi/commits/f54). I'm not sure how much PyPI and Warehouse have in common at this point, but I did notice that in the prod DB snapshot I have, some projects report 0 downloads for their latest releases, e.g.:
So I'm wondering if this is still worth adding!
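The query-reduction idea in that commit can be illustrated with a minimal sketch: instead of issuing one download-count query per project during reindexing, fetch all (project, downloads) rows in a single pass and do dictionary lookups. The function names here are hypothetical, not Warehouse's actual API.

```python
# Hypothetical sketch: batch download counts for reindexing instead of
# querying once per project. Not Warehouse's real code.

def build_download_index(rows):
    """Collapse (project_name, downloads) rows into a lookup table."""
    counts = {}
    for name, downloads in rows:
        counts[name] = counts.get(name, 0) + downloads
    return counts

def attach_downloads(projects, counts):
    """Decorate reindex documents with a precomputed download total."""
    return [
        {"name": name, "downloads": counts.get(name, 0)}
        for name in projects
    ]
```

With this shape, reindexing needs one aggregated database query up front rather than one query per indexed project.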
Download counts are temporarily disabled on PyPI because they broke and I haven't fixed them yet. The current state of our statistics pipeline is that we're archiving all of the download events to S3-like storage that DreamHost provides, and we're writing download events into a BigQuery database that Google provides. It doesn't exist yet, but there's probably going to be some sort of cron job that will take data from BigQuery and inject it into the existing downloads field.
What does that mean for this PR? It means that if total number of downloads is something we think we want to sort by, then this PR is still useful, as that's still going to work. However, BigQuery offers us some additional queries we could run in the future (like daily/weekly/monthly counts) that maybe we think will give us better numbers (or maybe we don't).
An additional consideration is what effect this will have on #701. Maybe the answer is that we need to periodically do a full reindex. Or maybe we need a Celery cron task that will take the N items from Elasticsearch that were indexed the longest time ago and reindex them.
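The "reindex the stalest N documents" idea mentioned above amounts to a sorted Elasticsearch query. A minimal sketch of the request body, assuming a hypothetical `last_indexed` timestamp field in the mapping (Warehouse's real mapping may differ):

```python
# Sketch: select the N least recently indexed documents so a periodic
# task can reindex them. Field and batch size are assumptions.

def stalest_docs_query(n=500):
    """Elasticsearch body returning the n least recently indexed docs."""
    return {
        "size": n,
        "sort": [{"last_indexed": {"order": "asc"}}],
        "query": {"match_all": {}},
    }
```

A Celery beat schedule could run this every few minutes, so the whole index converges on fresh download counts without a full reindex.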
I think that the "average" daily counts (for the last release) in this PR are about as useful, if not more useful, than the daily/weekly/monthly stats we could get from BigQuery. However, it would remove the need to reindex to get the latest stats (for downloads).
We can get average daily counts for any release from BigQuery.
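Whichever backend supplies the numbers, the quantity being discussed is simple: total downloads divided by days since the release. A minimal sketch (the function name and signature are illustrative, not from this PR):

```python
from datetime import date

def average_daily_downloads(total_downloads, released_on, today=None):
    """Total downloads divided by days since release (at least one day)."""
    today = today or date.today()
    days = max((today - released_on).days, 1)
    return total_downloads / days
```

Clamping the divisor to one day avoids division by zero for releases published today.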
@dstufft Gotcha. It doesn't seem like we're using BigQuery in Warehouse at the moment; any thoughts/direction on adding it? Do you know if it's possible to emulate locally?
@di We're not using it in Warehouse at the moment; I have a branch locally that halfway starts to add it. I don't think it's possible to emulate locally, but for the record, the BigQuery data is completely public. See https://mail.python.org/pipermail/distutils-sig/2016-May/028986.html.
I'm late to the party here, but if you have the download data in S3 (or S3-like storage), I assume it is stored in datetime-based filenames. Could you not just run a cron job every 15 minutes or every hour to grab the most recent unprocessed download logs, and then run Elasticsearch bulk in-place document updates (scripted field updating) via https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html#bulk-update? I assume BigQuery is used for other purposes (statistics-related?), but if not, it could be removed from the infrastructure to simplify things. Lots of assumptions in my statements, I'm willing to admit. Apologies if this is just noise.
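The scripted bulk-update suggestion can be sketched as building Elasticsearch "update" actions that increment a downloads field in place, so documents never need a full reindex. The index and field names are assumptions; the action shape below is the one accepted by `elasticsearch.helpers.bulk` (a scripted update per document):

```python
# Sketch: build scripted bulk-update actions that add fresh download
# counts to existing documents. Index/field names are assumptions.

def bulk_download_updates(counts, index="projects"):
    """Yield one scripted update action per (doc_id, new_downloads) pair."""
    for doc_id, downloads in counts:
        yield {
            "_op_type": "update",
            "_index": index,
            "_id": doc_id,
            "script": {
                "source": "ctx._source.downloads += params.n",
                "params": {"n": downloads},
            },
        }
```

A cron job would parse each new log file into `(project, count)` pairs and feed the resulting actions to the bulk helper in one request per batch.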
This is sufficiently old and dusty, so I'm going to close this as it's not a priority.
This PR adds search result ordering by download count as discussed in #702.
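Conceptually, ordering search results by download count means ranking text matches and then breaking ties (or boosting) by a downloads field on the document. A minimal sketch of such a search body, assuming hypothetical `name`/`summary`/`downloads` fields rather than Warehouse's actual mapping:

```python
# Sketch: search body ordered by relevance, then download count.
# Field names are assumptions about the document mapping.

def search_body(terms):
    """Match on name/summary, ordered by score then download count."""
    return {
        "query": {
            "multi_match": {
                "query": terms,
                "fields": ["name", "summary"],
            }
        },
        "sort": ["_score", {"downloads": {"order": "desc"}}],
    }
```

This is why the thread's concern matters: the `downloads` value is baked into each document at index time, so stale counts mean stale ordering until the document is reindexed or updated in place.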