-
Notifications
You must be signed in to change notification settings - Fork 160
Search order implementation with additional tests. #366
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I'm also planning to simplify the index in a subsequent PR: we could remove |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dart SDK API index
What is this?
} | ||
|
||
void removeLowScores(double fraction) { | ||
final double maxValue = values.values.fold(0.0, max); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a big weird: Basically if I search for "foobar" and none of the packages are relevant, though our character n-grams returns some low-score results, we find the max here, and someone passes like fraction = 0.9
(meaning should be in top 10%) then we get results, even though none are relevant.
I think a better way of doing this is to always ensure that if a package is perfect match, then it's score needs to be 1.0
then you can just say
removeWhere((key) => values[key] < 0.9);
Then it's even possible to have optimizations which pushes this filter down to the things calling addValues
so you can avoid work if you know already it can't pass the filter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may happen automatically, because we weight longer N-grams and prefixes more, and if there is a 6+ character match, we will most likely remove the low-quality matches anyway.
I'll experiment with this, and do a follow-up in a subsequent PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe slightly said in another way: If a search query has only results with poor scores, we should display none of them - not even the highest scored one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that makes sense. I've added a low-score filter, but it is very conservative now. I would still experiment with it more, because it is not always the package name that people are searching for, and the rules around it may become complex.
if (value != null) { | ||
val[key] = value; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
side note: not the most efficient code our "great json serializer source gen" is doing :-/
app/lib/search/index_simple.dart
Outdated
..addValues(_readmeIndex.search(text), 0.06); | ||
textScore.removeLowScores(0.05); | ||
} else if (packagePrefix != null) { | ||
textScore = new Score()..addValues(_nameIndex.search(packagePrefix), 0.8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this _nameIndex.search(packagePrefix)
really searching for a prefix or does it match anywhere in the name?
If it doesn't search for a prefix, then we shouldn't use the name "prefix".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this was tricky to write (it is there in the older version too), and now it may not be needed at all. The idea was that if you don't specify any text query, yet specify a package name prefix, we still need to order the results by something, and at that point we only had this kind of "something".
I'll remove this, and disallow queries that have no text but filter on package prefix and order on text match score.
app/lib/search/index_simple.dart
Outdated
Set<String> urls, int compare(PackageDocument a, PackageDocument b)) { | ||
final List<PackageScore> list = urls | ||
.map((url) => | ||
new PackageScore(url: url, package: _documents[url].package)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear why you are not adding a , score: values[url]
here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the use case for e.g. sort by updated
, where we don't have scores (no similarity, no health/popularity).
textScore?.getKeys()?.toSet() ?? _documents.keys.toSet(); | ||
|
||
// filter on package prefix | ||
if (query.packagePrefix != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For future optimizations: There might be possibilities where one can avoid constructing the big maps in Score
if we know only a small subset of keys will survive.
i.e. Instead of building up big datastructures and removing from them later, we could try filtering early on and construct smaller datastructures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, will address it in a follow-up.
app/lib/search/index_simple.dart
Outdated
..addValues(textScore?.values, 0.85) | ||
..addValues(getPopularityScore(urls), 0.10) | ||
..addValues(getHealthScore(urls), 0.05); | ||
results = _flattenFromValues(overallScore.values); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of 'flatten' we could use 'rank', because what it really is doing is ordering/ranking the results.
switch (query.order ?? SearchOrder.overall) { | ||
case SearchOrder.overall: | ||
final Score overallScore = new Score() | ||
..addValues(textScore?.values, 0.85) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could avoid creating a copy of textScore
and modify the copy, by just modify the textScore
itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case I think it is cleaner to keep it separate, as it is explicit how the text match score relates to the rest.
app/lib/search/index_simple.dart
Outdated
results = _flattenFromValues(textScore.values); | ||
break; | ||
case SearchOrder.updated: | ||
results = _flattenFromUrls(urls, _compareUpdated); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name of the function is misleading, it has notting to do with urls, it's ordering/ranking based on updated timestamp.
#190 (we shall eventually do code search, including Dart SDK APIs). |
b08dfba
to
02f3f5b
Compare
02f3f5b
to
8717b8c
Compare
Merging this now, further comments will be addressed in a subsequent PR. |
..addValues(_nameIndex.search(text), 0.82) | ||
..addValues(_descrIndex.search(text), 0.12) | ||
..addValues(_readmeIndex.search(text), 0.06); | ||
// removes scores that are less than 5% of the best |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is confusing, since less than 5% of the best
could be interpreted as we drop the lowest 95%
. Maybe just consider removing these comments if the code itself is self-explaining.
There is one side-effect for the text-search: there is a clipping where we remove low score values from the results, and this happens earlier (before adding the quality and health scores). The result is fewer results, which have roughly the same score as before.
Some examples: