Skip to content

Search method in XML-RPC API treats {"summary": "fix code"} as {"summary": ["fix", "code"]} #1886

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jwodder opened this issue Apr 2, 2017 · 8 comments

Comments

@jwodder
Copy link
Contributor

jwodder commented Apr 2, 2017

As stated in the title, the search method of PyPI/Warehouse's XML-RPC API splits the values of its spec argument on whitespace, resulting in the method returning (for the example in the title) packages with either "fix" or "code" in their summary, not just those with both "fix" and "code" (let alone just those that contain the exact string "fix code"). This happens even if the spec value is passed as a one-element list (e.g., {"summary": ["fix code"]}), and while I haven't tested every field, it is not limited to just "summary".

On the off chance that this is the intended behavior, the documentation should at least be updated.

@di di added the bug 🐛 label Jan 18, 2018
@brainwane brainwane added this to the 2: End User MVP milestone Jan 30, 2018
@brainwane
Copy link
Contributor

We discussed this in today's Warehouse bug triage meeting and decided this is probably a small fix, we need to compare it to behavior on legacy PyPI, and we'll address it in the End User MVP milestone (probably in March). Thanks, @jwodder!

@brainwane
Copy link
Contributor

@waseem18 I saw you're working on this? Thanks! Hope you'll join us in Freenode IRC in #pypa-dev if you have any questions.

@waseem18
Copy link
Contributor

@brainwane Yeah I've started working on this an will put a WIP PR soon.

@waseem18
Copy link
Contributor

waseem18 commented Feb 15, 2018

Cause of this bug :

In xmlrpc.py line number 86 and 88 we're using match query to query elasticsearch for the given value.

In match query what happens is when the value of the given field has space(s) it tokenizes the given value thereby searching for each of the split tokens.

"query": { "match" : { "summary" : "fix code" } }

Above is a match query with space in the value of the field summary. So it tokenizes the value to ["fix", "code"] and searches for fix or code.

Resolution :

Use Term query when the value of the given field has space(s) and match query when the value doesn't have any space(s).

#2850 is similar to this issue. Using Term query / match_phrase when the entered value has double quotes.

@di @brainwane Will be happy to get your feedback on this.

@di
Copy link
Member

di commented Feb 15, 2018

@waseem18 Using the term query makes sense, especially considering the API allows for multiple strings to be provided.

Any reason why we wouldn't just use it for queries without whitespace as well? Seems like it wouldn't make much difference if the user is just searching for a single word.

For #2850, it seems like we might actually want to do Phrase Matching instead, but let's just focus on this issue for now.

@waseem18
Copy link
Contributor

waseem18 commented Feb 15, 2018

Good catch @di

Using term query for values without space wouldn't make much difference. I'll go with it 👍

@brainwane
Copy link
Contributor

@waseem18 Thank you for the fix!

If you have time to help us with testing #2898, that would be great! And if you want a next issue to work with, would you consider helping me start on #2930?

@waseem18
Copy link
Contributor

Sure @brainwane I'll look into the comment you put on #2898

Regarding #2930 I'd be happy to help you on that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants