Search order implementation with additional tests. #366

isoos · 2017-09-19T16:10:35Z

This has one side effect on text search: the clipping step that removes low-score values from the results now happens earlier, before the quality and health scores are added. The outcome is fewer results, each with roughly the same score as before.

Some examples:

isoos · 2017-09-19T16:13:46Z

I'm also planning to simplify the index in a subsequent PR: we could remove url as an id and use only package. When we introduce the Dart SDK API index, it is likely that it will get into a different structure than PackageDocument, so there is no reason to have it around now, and it will simplify some of the lookups in this current change.

mkustermann

Dart SDK API index

What is this?

mkustermann · 2017-09-20T09:14:25Z

app/lib/search/index_simple.dart

+  }
+
+  void removeLowScores(double fraction) {
+    final double maxValue = values.values.fold(0.0, max);


I think this is a bit weird: if I search for "foobar" and none of the packages are relevant, but our character n-grams still return some low-score results, we find the max here, and if someone passes e.g. fraction = 0.9 (meaning a result should be in the top 10%), then we get results even though none are relevant.

I think a better way of doing this is to always ensure that if a package is a perfect match, then its score is 1.0. Then you can just say

removeWhere((key) => values[key] < 0.9);

Then it's even possible to have optimizations that push this filter down into the callers of addValues, so you can avoid work when you already know a value can't pass the filter.

This may happen automatically, because we weight longer N-grams and prefixes more, and if there is a 6+ character match, we will most likely remove the low-quality matches anyway.
I'll experiment with this, and do a follow-up in a subsequent PR.

Or, put slightly differently: if a search query has only results with poor scores, we should display none of them, not even the highest-scored one.

Yeah, that makes sense. I've added a low-score filter, but it is very conservative now. I would still experiment with it more, because it is not always the package name that people are searching for, and the rules around it may become complex.
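
To make the difference between the two filters concrete, here is a minimal sketch. The Score class below is a hypothetical stand-in, not the class from this PR: it assumes removeLowScores cuts at fraction times the best score (only the first line of that method is visible in the diff), and removeBelow is an invented name for the absolute-threshold variant suggested above.

```dart
import 'dart:math' show max;

/// Hypothetical stand-in for the PR's Score class; only what the sketch needs.
class Score {
  final Map<String, double> values;

  /// Copies the input so each Score can be filtered independently.
  Score(Map<String, double> values) : values = Map.of(values);

  /// Relative filter (assumed behavior): drops entries scoring below
  /// `fraction` of the current best score.
  void removeLowScores(double fraction) {
    final double maxValue = values.values.fold<double>(0.0, max);
    final double cutoff = maxValue * fraction;
    values.removeWhere((key, value) => value < cutoff);
  }

  /// Absolute filter (invented name): drops entries below a fixed threshold,
  /// which only makes sense if a perfect match is normalized to 1.0.
  void removeBelow(double threshold) {
    values.removeWhere((key, value) => value < threshold);
  }
}

void main() {
  // A query where every hit is weak: the best score is only 0.10.
  final weakHits = {'a': 0.10, 'b': 0.02};

  final relative = Score(weakHits)..removeLowScores(0.9);
  final absolute = Score(weakHits)..removeBelow(0.9);

  print(relative.values.keys.toList()); // [a]  (best of a bad bunch survives)
  print(absolute.values.keys.toList()); // []   (nothing is relevant enough)
}
```

With the relative filter the weak top hit always survives; with the absolute one, a query with only poor matches returns nothing.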

mkustermann · 2017-09-20T09:19:01Z

app/lib/shared/search_service.g.dart

+      if (value != null) {
+        val[key] = value;
+      }
+    }


Side note: this is not the most efficient code that our "great JSON serializer source gen" is producing :-/

mkustermann · 2017-09-20T09:23:59Z

app/lib/search/index_simple.dart

+        ..addValues(_readmeIndex.search(text), 0.06);
+      textScore.removeLowScores(0.05);
+    } else if (packagePrefix != null) {
+      textScore = new Score()..addValues(_nameIndex.search(packagePrefix), 0.8);


Is this _nameIndex.search(packagePrefix) really searching for a prefix or does it match anywhere in the name?

If it doesn't search for a prefix, then we shouldn't use the name "prefix".

Yeah, this was tricky to write (it is there in the older version too), and now it may not be needed at all. The idea was that if you don't specify any text query, yet specify a package name prefix, we still need to order the results by something, and at that point we only had this kind of "something".

I'll remove this, and disallow queries that have no text but filter on package prefix and order on text match score.

mkustermann · 2017-09-20T09:27:14Z

app/lib/search/index_simple.dart

+      Set<String> urls, int compare(PackageDocument a, PackageDocument b)) {
+    final List<PackageScore> list = urls
+        .map((url) =>
+            new PackageScore(url: url, package: _documents[url].package))


It's not clear why you are not also passing score: values[url] here!

This is the use case for e.g. sort by updated, where we don't have scores (no similarity, no health/popularity).

mkustermann · 2017-09-20T09:31:20Z

app/lib/search/index_simple.dart

+        textScore?.getKeys()?.toSet() ?? _documents.keys.toSet();
+
+    // filter on package prefix
+    if (query.packagePrefix != null) {


For future optimizations: there may be cases where we can avoid constructing the big maps in Score if we know that only a small subset of keys will survive.

That is, instead of building up big data structures and removing entries from them later, we could try filtering early on and construct smaller data structures.

Good point, will address it in a follow-up.
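
A rough sketch of what filtering early could look like, assuming the raw index hits arrive as a map from key to score. The weightedScores helper and its keep predicate are hypothetical names for illustration, not part of this PR.

```dart
/// Builds a weighted score map, skipping keys rejected by [keep] up front,
/// so the result never contains entries that would be removed later.
Map<String, double> weightedScores(Map<String, double> raw, double weight,
    bool Function(String key) keep) {
  final result = <String, double>{};
  raw.forEach((key, value) {
    if (keep(key)) {
      result[key] = value * weight;
    }
  });
  return result;
}

void main() {
  // Hypothetical name-index hits for some query.
  final nameHits = {'http': 1.0, 'http_parser': 0.7, 'json': 0.2};
  const prefix = 'http';

  // The package-prefix filter is applied while the map is being built.
  final scores =
      weightedScores(nameHits, 0.8, (key) => key.startsWith(prefix));

  print(scores.keys.toList()); // [http, http_parser]
}
```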

mkustermann · 2017-09-20T09:33:43Z

app/lib/search/index_simple.dart

+          ..addValues(textScore?.values, 0.85)
+          ..addValues(getPopularityScore(urls), 0.10)
+          ..addValues(getHealthScore(urls), 0.05);
+        results = _flattenFromValues(overallScore.values);


Instead of 'flatten' we could use 'rank', because what it really is doing is ordering/ranking the results.
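
For readers following along, a minimal sketch of the weighted combination followed by a ranking step. It assumes addValues adds value times weight per key and tolerates a null map (the diff calls it with textScore?.values); the Score class and the rankByScore name are illustrative, not the PR's code.

```dart
/// Hypothetical stand-in for the PR's Score class.
class Score {
  final Map<String, double> values = {};

  /// Assumed behavior: accumulates `value * weight` per key; null is a no-op.
  void addValues(Map<String, double>? newValues, double weight) {
    if (newValues == null) return;
    newValues.forEach((key, value) {
      values[key] = (values[key] ?? 0.0) + value * weight;
    });
  }
}

/// Orders keys by descending score: this is ranking rather than flattening.
List<String> rankByScore(Map<String, double> values) {
  final entries = values.entries.toList()
    ..sort((a, b) => b.value.compareTo(a.value));
  return entries.map((entry) => entry.key).toList();
}

void main() {
  final overall = Score()
    ..addValues({'a': 1.0, 'b': 0.5}, 0.85) // text match
    ..addValues({'a': 0.2, 'b': 1.0}, 0.10) // popularity
    ..addValues({'a': 0.0, 'b': 1.0}, 0.05); // health

  // a = 0.85 + 0.02 + 0.00 = 0.87, b = 0.425 + 0.10 + 0.05 = 0.575
  print(rankByScore(overall.values)); // [a, b]
}
```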

mkustermann · 2017-09-20T09:34:25Z

app/lib/search/index_simple.dart

+    switch (query.order ?? SearchOrder.overall) {
+      case SearchOrder.overall:
+        final Score overallScore = new Score()
+          ..addValues(textScore?.values, 0.85)


You could avoid creating a copy of textScore and modifying the copy by just modifying textScore itself.

In this case I think it is cleaner to keep it separate, as it is explicit how the text match score relates to the rest.

mkustermann · 2017-09-20T09:35:50Z

app/lib/search/index_simple.dart

+        results = _flattenFromValues(textScore.values);
+        break;
+      case SearchOrder.updated:
+        results = _flattenFromUrls(urls, _compareUpdated);


The name of the function is misleading: it has nothing to do with URLs; it orders/ranks the results based on the updated timestamp.
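
One possible shape for a more descriptive name, as a sketch only. PackageDocument here is a trimmed-down hypothetical with an assumed updated field, and rankWithComparator and compareUpdated are illustrative names.

```dart
/// Hypothetical, trimmed-down document: only the fields this sketch needs.
class PackageDocument {
  final String package;
  final DateTime updated;
  PackageDocument(this.package, this.updated);
}

/// Newest first.
int compareUpdated(PackageDocument a, PackageDocument b) =>
    b.updated.compareTo(a.updated);

/// Ranks documents with the given comparator; no score is involved.
List<String> rankWithComparator(Iterable<PackageDocument> docs,
    int Function(PackageDocument a, PackageDocument b) compare) {
  final sorted = docs.toList()..sort(compare);
  return sorted.map((doc) => doc.package).toList();
}

void main() {
  final docs = [
    PackageDocument('old_pkg', DateTime.utc(2016, 1, 1)),
    PackageDocument('new_pkg', DateTime.utc(2017, 9, 1)),
  ];
  print(rankWithComparator(docs, compareUpdated)); // [new_pkg, old_pkg]
}
```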

isoos · 2017-09-20T09:44:34Z

Dart SDK API index

What is this?

#190 (we shall eventually do code search, including Dart SDK APIs).

isoos · 2017-09-21T07:55:43Z

Merging this now; further comments will be addressed in a subsequent PR.

mkustermann · 2017-09-21T07:58:22Z

app/lib/search/index_simple.dart

+        ..addValues(_nameIndex.search(text), 0.82)
+        ..addValues(_descrIndex.search(text), 0.12)
+        ..addValues(_readmeIndex.search(text), 0.06);
+      // removes scores that are less than 5% of the best


This comment is confusing, since "less than 5% of the best" could be interpreted as dropping the lowest 95%. Maybe consider just removing these comments if the code itself is self-explanatory.

googlebot added the cla: yes label Sep 19, 2017

kevmoo approved these changes Sep 19, 2017

mkustermann reviewed Sep 20, 2017

isoos force-pushed the search_order_impl branch 2 times, most recently from b08dfba to 02f3f5b Compare September 20, 2017 10:17

Search order implementation with additional tests.

8717b8c

isoos force-pushed the search_order_impl branch from 02f3f5b to 8717b8c Compare September 20, 2017 10:28

isoos merged commit 003f527 into dart-lang:master Sep 21, 2017

mkustermann reviewed Sep 21, 2017

isoos deleted the search_order_impl branch September 22, 2017 15:31
