Skip to content

Search order implementation with additional tests. #366

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 21, 2017

Conversation

isoos
Copy link
Collaborator

@isoos isoos commented Sep 19, 2017

There is one side-effect for the text-search: there is a clipping where we remove low score values from the results, and this happens earlier (before adding the quality and health scores). The result is fewer results, which have roughly the same score as before.

Some examples:

@isoos
Copy link
Collaborator Author

isoos commented Sep 19, 2017

I'm also planning to simplify the index in a subsequent PR: we could remove url as an id and use only package. When we introduce the Dart SDK API index, it is likely that it will get into a different structure than PackageDocument, so there is no reason to have it around now, and it will simplify some of the lookups in this current change.

Copy link
Member

@mkustermann mkustermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dart SDK API index

What is this?

}

void removeLowScores(double fraction) {
final double maxValue = values.values.fold(0.0, max);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a big weird: Basically if I search for "foobar" and none of the packages are relevant, though our character n-grams returns some low-score results, we find the max here, and someone passes like fraction = 0.9 (meaning should be in top 10%) then we get results, even though none are relevant.

I think a better way of doing this is to always ensure that if a package is perfect match, then it's score needs to be 1.0 then you can just say

removeWhere((key) => values[key] < 0.9);

Then it's even possible to have optimizations which pushes this filter down to the things calling addValues so you can avoid work if you know already it can't pass the filter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may happen automatically, because we weight longer N-grams and prefixes more, and if there is a 6+ character match, we will most likely remove the low-quality matches anyway.
I'll experiment with this, and do a follow-up in a subsequent PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe slightly said in another way: If a search query has only results with poor scores, we should display none of them - not even the highest scored one.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes sense. I've added a low-score filter, but it is very conservative now. I would still experiment with it more, because it is not always the package name that people are searching for, and the rules around it may become complex.

if (value != null) {
val[key] = value;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

side note: not the most efficient code our "great json serializer source gen" is doing :-/

..addValues(_readmeIndex.search(text), 0.06);
textScore.removeLowScores(0.05);
} else if (packagePrefix != null) {
textScore = new Score()..addValues(_nameIndex.search(packagePrefix), 0.8);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this _nameIndex.search(packagePrefix) really searching for a prefix or does it match anywhere in the name?

If it doesn't search for a prefix, then we shouldn't use the name "prefix".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this was tricky to write (it is there in the older version too), and now it may not be needed at all. The idea was that if you don't specify any text query, yet specify a package name prefix, we still need to order the results by something, and at that point we only had this kind of "something".

I'll remove this, and disallow queries that have no text but filter on package prefix and order on text match score.

Set<String> urls, int compare(PackageDocument a, PackageDocument b)) {
final List<PackageScore> list = urls
.map((url) =>
new PackageScore(url: url, package: _documents[url].package))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear why you are not adding a , score: values[url] here!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the use case for e.g. sort by updated, where we don't have scores (no similarity, no health/popularity).

textScore?.getKeys()?.toSet() ?? _documents.keys.toSet();

// filter on package prefix
if (query.packagePrefix != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future optimizations: There might be possibilities where one can avoid constructing the big maps in Score if we know only a small subset of keys will survive.

i.e. Instead of building up big datastructures and removing from them later, we could try filtering early on and construct smaller datastructures.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, will address it in a follow-up.

..addValues(textScore?.values, 0.85)
..addValues(getPopularityScore(urls), 0.10)
..addValues(getHealthScore(urls), 0.05);
results = _flattenFromValues(overallScore.values);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of 'flatten' we could use 'rank', because what it really is doing is ordering/ranking the results.

switch (query.order ?? SearchOrder.overall) {
case SearchOrder.overall:
final Score overallScore = new Score()
..addValues(textScore?.values, 0.85)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could avoid creating a copy of textScore and modify the copy, by just modify the textScore itself.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case I think it is cleaner to keep it separate, as it is explicit how the text match score relates to the rest.

results = _flattenFromValues(textScore.values);
break;
case SearchOrder.updated:
results = _flattenFromUrls(urls, _compareUpdated);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of the function is misleading, it has notting to do with urls, it's ordering/ranking based on updated timestamp.

@isoos
Copy link
Collaborator Author

isoos commented Sep 20, 2017

Dart SDK API index

What is this?

#190 (we shall eventually do code search, including Dart SDK APIs).

@isoos isoos force-pushed the search_order_impl branch 2 times, most recently from b08dfba to 02f3f5b Compare September 20, 2017 10:17
@isoos
Copy link
Collaborator Author

isoos commented Sep 21, 2017

Merging this now, further comments will be addressed in a subsequent PR.

@isoos isoos merged commit 003f527 into dart-lang:master Sep 21, 2017
..addValues(_nameIndex.search(text), 0.82)
..addValues(_descrIndex.search(text), 0.12)
..addValues(_readmeIndex.search(text), 0.06);
// removes scores that are less than 5% of the best
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is confusing, since less than 5% of the best could be interpreted as we drop the lowest 95%. Maybe just consider removing these comments if the code itself is self-explaining.

@isoos isoos deleted the search_order_impl branch September 22, 2017 15:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants