Investigate replacing search with MeiliSearch #8002
Comments
Hi @erlend-sh, thanks for sharing, that looks really slick and the results look really good! A few questions for you after a quick glance:
|
As of right now the only option is self-hosting, but we are indeed working on a hosted option. If you can email me we can discuss that further. One of my colleagues will come by soon to answer your other questions. |
Hi! I am the author of the MeiliSearch PyPI demo (and a member of the MeiliSearch dev team) and I am here to discuss any details, because this seems like a great and exciting idea 🎉 Regarding your questions:
MeiliSearch provides both offset and limit parameters on a search request. They behave consistently and would naturally solve #4006. They are not used in the demo, but they can certainly be used for pagination.
MeiliSearch provides filtering, and faceted search will be released very soon. We would need to re-index all the content from PyPI packages to take into account the filters you want to apply, but the functionality would be provided by default by MeiliSearch. There is no need to implement anything new; we just need to be sure that the corresponding fields are present in the MeiliSearch index.
MeiliSearch uses ranking rules, which are fully customizable. I have played with these ranking rules to try to provide the most relevant results, but I actually didn't have in mind an exact match on the package name (it is present in the ranking rules, but with a very low relevance) because we wanted to provide keyword and concept search instead. For example, when you look up "machine learning" you get a package called "machine learning", which is far from being the ideal package to use; I wanted to see, for example, "tensorflow" in the first results. The same goes for "web framework": I want to see Django, Flask, etc. Anyway, this is fully customizable, and we can discuss what the priorities are and adapt the ranking rules. MeiliSearch is also typo-tolerant, which needs to be taken into account for this point.
For the PyPI demo we get the last month's downloads from BigQuery (once a month, and we cache the result), and we use them as a ranking rule. The thing is that MeiliSearch uses a bucket sort algorithm to apply the ranking rules, which caused a few problems when categorizing packages, because download numbers are very specific to each package. So instead of basing the results simply on download numbers, we created a "fame" field that assigns a score to each package based on downloads. For example, the 100 most downloaded packages from last month receive the highest fame score. This means that the bucket sort algorithm will consider those 100 packages as a single bucket and apply the rest of the rules inside it. As a result, a package with fewer downloads can still rank as more pertinent within the same bucket, but a fame score of 9 will always beat a fame score of 8. This is also customizable.
I am not sure I understand your question :( I hope this more or less answers your questions, but please do not hesitate to ask for clarification or any further information you would like to have! |
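To make the "fame" idea above concrete, here is a minimal sketch in plain Python. It is not the demo's actual code: the function names, the bucket size of 100, the maximum score of 9, and the assumption that fame is applied before text relevance are all illustrative. Sorting by a tuple of keys stands in for applying ranking rules one after another inside each bucket.

```python
from typing import Dict, List, Tuple


def assign_fame(
    downloads: Dict[str, int], bucket_size: int = 100, max_score: int = 9
) -> Dict[str, int]:
    """Map each package to a coarse fame score based on its download rank.

    Hypothetical helper: bucket_size and max_score are assumptions.
    """
    # Rank from most to least downloaded; break ties by name for determinism.
    ranked = sorted(downloads, key=lambda name: (-downloads[name], name))
    fame = {}
    for position, name in enumerate(ranked):
        # Every block of `bucket_size` packages shares one score, floored at 0.
        fame[name] = max(max_score - position // bucket_size, 0)
    return fame


def rank(hits: List[Tuple[str, float]], fame: Dict[str, int]) -> List[str]:
    """Order (name, relevance) hits by fame bucket, then by relevance."""
    ordered = sorted(hits, key=lambda hit: (-fame[hit[0]], -hit[1], hit[0]))
    return [name for name, _ in ordered]
```

With a bucket size of 2, two heavily downloaded packages share the top score, and a less relevant hit from that bucket still outranks a more relevant hit from a lower bucket.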
Sorry for the delay! Replies inline:
I'm not seeing how just the offset/limit parameters alone would solve #4006. The issue is that URLs with the same query but different offsets expire from the cache at different times, which may cause a project to appear more than once (or not at all). Ideally we'd need a way to know all the projects that would be in the result for a given query, so we could purge all queries that contain a given project if that project changes. That said, I don't think this is something that MeiliSearch needs to solve for us to adopt it, as it sounds like as-is this has feature parity with our current search provider. Just wanted to see if there was any potential solution for this problem.
Sounds like faceted search is what we'd need here for classifiers.
Right, I think the difference here is that the query
Interesting, thanks for sharing. FYI, right now we compute the "zscore" to determine which packages are "trending", which we can order results by (as well as the ability to sort by date last updated). I assume the download counts can be updated the same way as updating any other search metadata?
Sorry! What I mean is: right now PyPI works pretty well if you disable JavaScript in your browser. We probably want this to continue to be true. I see that the demo uses JavaScript extensively to provide auto-updating searching, but it seems like if I disable JavaScript entirely, using the input as a regular search field (typing a query and submitting with 'enter') doesn't work. I feel like this might be just the nature of the demo though, so my question is whether JavaScript is a requirement or not. |
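The cache-purging idea raised for #4006 in this comment (knowing which cached queries contain a given project) can be sketched as a reverse index from project to cached search URLs. This is only an illustration: the class name, method names, and URL format are hypothetical and are not part of Warehouse or MeiliSearch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class QueryPurgeIndex:
    """Hypothetical reverse index: project name -> cached search URLs."""

    def __init__(self) -> None:
        self._urls_by_project: Dict[str, Set[str]] = defaultdict(set)

    def record(self, url: str, projects: Iterable[str]) -> None:
        """Remember that the cached response for `url` listed `projects`."""
        for project in projects:
            self._urls_by_project[project].add(url)

    def urls_to_purge(self, project: str) -> Set[str]:
        """Return (and forget) every cached URL that listed `project`."""
        return self._urls_by_project.pop(project, set())
```

A sketch like this only covers projects that already appeared in a cached result. A project that would newly enter a query's results after a change is not tracked, so it does not fully solve the problem described above.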
Hi @erlend-sh, @eskombro, any updates here? |
I'm quite sure MeiliSearch can work fine without JS, but @eskombro will have to confirm. Once that's cleared up, what are the next steps here? Could a fork of pypi.org be set up as a staging site? Like |
Indeed, the demo was built to show the search-as-you-type experience, and this requires JavaScript to be enabled on the client. But as you point out, that is just the nature of the demo. It doesn't have a "results" page you can navigate to. Disabling JS shouldn't have any impact on the way MeiliSearch behaves, and this would be just a front-end matter we can solve easily with a different implementation.
Exactly! |
I think we were waiting on a few things to move forward:
I think next steps would be determining where we need to host Meilisearch, and then starting to develop this in a feature branch. We could potentially spin up a staging site -- what would be the goal for that? We also have https://test.pypi.org/ which we could use, as long as everything Meilisearch-related is behind a feature flag. |
Okay, great. We will set up a hosted instance of MeiliSearch shortly. Can you get that feature branch started and link it here? |
Feature branch is here: https://github.com/pypa/warehouse/tree/meilisearch |
Thanks @di! I have already forked the project and set up the environment, and I'm exploring it a bit. I think I can start working on this next week. How would you like to proceed with this feature development? |
Right now the codebase definitely assumes that we are using Elasticsearch, so first steps would be to create a generic This should allow us to configure which service we're using by changing a single environment variable, Then, we would implement a We'd also need to update the development environment to be able to use MeiliSearch locally as well: I'm assuming that there's a docker image we'd be able to use for this? |
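One way to picture the plan in the comment above (a generic search interface, with the backend chosen by a single environment variable) is the sketch below. Every name in it is hypothetical: `SearchService`, `make_search_service`, and the `SEARCH_BACKEND` variable are placeholders for illustration, not the names used in Warehouse, and the backend classes are stubs that do not talk to any real service.

```python
import os
from abc import ABC, abstractmethod
from typing import Dict, List, Mapping, Optional, Type


class SearchService(ABC):
    """Hypothetical backend-agnostic interface; not Warehouse's actual API."""

    @abstractmethod
    def search(self, query: str, offset: int = 0, limit: int = 20) -> List[str]:
        """Return project names matching `query` for one page of results."""


class ElasticsearchService(SearchService):
    def search(self, query: str, offset: int = 0, limit: int = 20) -> List[str]:
        # Stub: a real implementation would query Elasticsearch here.
        raise NotImplementedError("Elasticsearch backend not implemented")


class MeiliSearchService(SearchService):
    def search(self, query: str, offset: int = 0, limit: int = 20) -> List[str]:
        # Stub: a real implementation would query MeiliSearch here.
        raise NotImplementedError("MeiliSearch backend not implemented")


# Registry of available backends, keyed by the environment variable's value.
BACKENDS: Dict[str, Type[SearchService]] = {
    "elasticsearch": ElasticsearchService,
    "meilisearch": MeiliSearchService,
}


def make_search_service(environ: Optional[Mapping[str, str]] = None) -> SearchService:
    """Pick a backend from the (assumed) SEARCH_BACKEND environment variable."""
    environ = os.environ if environ is None else environ
    name = environ.get("SEARCH_BACKEND", "elasticsearch").lower()
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError(f"Unknown search backend: {name!r}") from None
```

Defaulting to Elasticsearch keeps current behaviour unless the variable is set, which fits the idea of keeping everything MeiliSearch-related behind a switch.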
any updates on this? |
An update here: MeiliSearch set us up with a demo of their hosted service and have offered to provide it to us as an in-kind donation, so I think it makes sense to move forward with steps in #8002 (comment) so we can try it out for PyPI. |
Originally posted by @erlend-sh in #3486 (comment)