RDoc-3337 What affects vector search results #2040

Danielle9897 · 2025-05-18T12:33:28Z

Related issue:
https://issues.hibernatingrhinos.com/issue/RDoc-3337/Vector-search-queries-return-no-results-until-the-auto-index-is-rebuilt

Lwiel · 2025-05-19T08:55:11Z

...Documentation.Pages/ai-integration/vector-search/what-affects-vector-search-results.markdown

+    Slight variations in graph structure or search parameters can lead to different results.
+  * While HNSW offers fast search performance at scale and quickly finds points that are likely to be among the nearest neighbors,
+    it does not guarantee exact results — only approximate matches are returned.  
+    This behavior is expected in all ANN algorithms, not just HNSW or RavenDB.  


Do you mean "not just HNSW" or "not just the HNSW used by RavenDB"?

ayende · 2025-05-21T09:03:22Z

...Documentation.Pages/ai-integration/vector-search/what-affects-vector-search-results.markdown

+* This article explains why vector search results might not always return what you expect, even when relevant documents exist.
+  It applies to both [dynamic vector search queries](../../ai-integration/vector-search/vector-search-using-dynamic-query) and
+  [static-index vector search queries](../../ai-integration/vector-search/vector-search-using-static-index).
+


Aren't we missing a discussion here of the fact that we also have exact nearest-neighbor search?

I know you mention that below, but it should appear as soon as you mention this limitation.

ayende · 2025-05-21T09:04:24Z

...Documentation.Pages/ai-integration/vector-search/what-affects-vector-search-results.markdown

+* **Insertion order effects**:  
+
+  * Because the HNSW graph is append-only and built incrementally,  
+    the order in which documents are added, updated, or deleted can affect the final graph structure.  


The order of updates and deletes doesn't matter, only the order of inserts, because of the soft delete.

ayende · 2025-05-21T09:05:00Z

...Documentation.Pages/ai-integration/vector-search/what-affects-vector-search-results.markdown

+  * HNSW uses a greedy search strategy to perform approximate nearest-neighbor (ANN) searches:  
+    The search starts at the top layer from an entry point.  
+    The algorithm then descends through the layers, always choosing the neighbor closest to the query vector.  
+  * The algorithm doesn't exhaustively explore all possible paths, so it can miss the true global nearest neighbors -  


Explain here why we do that: it usually allows finding the right results very quickly.
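
For readers of this thread, the trade-off in the hunk above can be illustrated with a minimal sketch. It is not RavenDB's implementation: it uses a single layer, 1-D "vectors", and hypothetical names (`POINTS`, `GRAPH`, `greedy_search`, `exact_search`) chosen only for illustration. The greedy walk inspects a few nodes and stops at the first node with no closer neighbor, which is why it is fast and also why it can miss the true nearest neighbor.

```python
# Toy data (hypothetical): 1-D "vectors" keyed by node id, plus a neighbor graph.
POINTS = {0: 0.0, 1: 4.0, 2: 10.0, 3: 5.5}
GRAPH = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}


def distance(a, b):
    """Distance between two 1-D points."""
    return abs(a - b)


def greedy_search(points, graph, entry, query):
    """Walk the graph, always moving to the neighbor closest to the query.

    Stops as soon as no neighbor is closer than the current node,
    so it may end in a local minimum.
    """
    current = entry
    while True:
        best = min(graph[current], key=lambda n: distance(points[n], query))
        if distance(points[best], query) >= distance(points[current], query):
            return current
        current = best


def exact_search(points, query):
    """Compare the query against every point (slow at scale, but exact)."""
    return min(points, key=lambda n: distance(points[n], query))


approximate = greedy_search(POINTS, GRAPH, entry=0, query=6.0)
exact = exact_search(POINTS, query=6.0)
print(approximate, exact)
```

In this toy graph the greedy walk stops at node 1 because its only unvisited neighbor (node 2) is farther from the query, while the exhaustive scan finds node 3.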

ayende · 2025-05-21T09:06:15Z

...Documentation.Pages/ai-integration/vector-search/what-affects-vector-search-results.markdown

+They help keep memory usage and indexing time under control, but may also limit the graph’s ability to precisely represent all possible proximity relationships.
+
+* **Number of edges**:  
+


Need to mention that this is the _M_ parameter in the HNSW paper.

ayende · 2025-05-21T09:07:56Z

...Documentation.Pages/ai-integration/vector-search/what-affects-vector-search-results.markdown

+    This parameter (commonly referred to as _efConstruction_) controls how many neighboring vectors are considered during this process.
+    It defines the size of the candidate pool - the number of potential links evaluated for each insertion.  
+    From the candidate pool, HNSW selects up to the configured _number of edges_ for each node.
+  * A **larger** candidate pool increases the chance of finding better-connected neighbors,  


Explain the downside (higher indexing time, bigger index)

Added the downside of a larger value to both the "number of candidates" and "number of edges" sections.
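
As a rough illustration of how the candidate pool and the number of edges interact, here is a simplified sketch. It assumes a single layer, 1-D "vectors", and a plain "keep the closest `m`" rule instead of the neighbor-selection heuristic from the HNSW paper. The names `search_layer`, `select_edges`, `ef`, and `m` are illustrative and not RavenDB APIs.

```python
import heapq

# Toy data (hypothetical): 1-D "vectors" keyed by node id, plus a neighbor graph.
POINTS = {0: 0.0, 1: 4.0, 2: 10.0, 3: 5.5}
GRAPH = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}


def search_layer(points, graph, entry, query, ef):
    """Best-first search that keeps at most `ef` candidates (the candidate pool)."""
    d0 = abs(points[entry] - query)
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexplored node first
    results = [(-d0, entry)]     # max-heap via negation: farthest kept node on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0]:
            break                # nothing left that can improve the pool
        for n in graph[node]:
            if n in visited:
                continue
            visited.add(n)
            dn = abs(points[n] - query)
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, n))
                heapq.heappush(results, (-dn, n))
                if len(results) > ef:
                    heapq.heappop(results)   # drop the farthest pool member
    return sorted((-d, n) for d, n in results)


def select_edges(pool, m):
    """Keep up to `m` of the closest pool members as edges for the new node."""
    return [node for _, node in pool[:m]]


small_pool = search_layer(POINTS, GRAPH, entry=0, query=6.0, ef=1)
large_pool = search_layer(POINTS, GRAPH, entry=0, query=6.0, ef=2)
print(select_edges(small_pool, m=1), select_edges(large_pool, m=1))
```

With a pool of one, the new vector would be linked to node 1; with a pool of two, the search explores further and links it to the closer node 3. The larger pool costs more distance computations per insertion, which matches the indexing-time downside mentioned in the comment above.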

ayende · 2025-05-21T09:08:38Z

...Documentation.Pages/ai-integration/vector-search/what-affects-vector-search-results.markdown

+    even if they would otherwise be among the top candidates.  
+    Use this to filter out marginal matches, especially when minimum semantic relevance is important.
+  * This param can be set directly in the query. For example, see this [Query example](../../ai-integration/vector-search/vector-search-using-dynamic-query#querying-raw-text).  
+    If not explicitly set, the value is taken from the [Indexing.Corax.VectorSearch.DefaultMinimumSimilarity](../../server/configuration/indexing-configuration#indexing.corax.vectorsearch.defaultminimumsimilarity) configuration key.


Mention the default value here?
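
To make the effect of a minimum-similarity threshold concrete, here is a small sketch. It assumes cosine similarity and uses made-up threshold values (0.8 and 0.5) purely for illustration. It does not reflect RavenDB's default or its internal scoring, and `filter_by_min_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def filter_by_min_similarity(query, candidates, min_similarity):
    """Return ids of candidates scoring at or above the threshold, best first."""
    scored = [(cosine_similarity(query, vec), doc_id)
              for doc_id, vec in candidates.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True)
            if score >= min_similarity]


# Hypothetical document vectors.
docs = {"a": (1.0, 0.0), "b": (1.0, 1.0), "c": (0.0, 1.0)}
query = (1.0, 0.0)

strict = filter_by_min_similarity(query, docs, min_similarity=0.8)
lenient = filter_by_min_similarity(query, docs, min_similarity=0.5)
print(strict, lenient)
```

Document "b" scores roughly 0.71, so it is returned under the lenient threshold but dropped under the strict one, even though it is the second-closest match.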

ayende · 2025-05-21T09:09:07Z

...Documentation.Pages/ai-integration/vector-search/what-affects-vector-search-results.markdown

+      it is typically accurate and strongly recommended in most scenarios due to its performance.
+    * **Exact search**:  
+      Performs a full comparison against all indexed vectors to guarantee the closest matches.  
+      This method is more accurate but much slower - learn more in [Using exact search](../../ai-integration/vector-search/what-affects-vector-search-results#using-exact-search) below.


That depends on the size of the index. With a small index, exact search is reasonably performant.

Done: removed this part from here and expanded on it in the dedicated section below.

RDoc-3337 What affects vector search results

dbc5622

Danielle9897 requested review from arekpalinski, maciejaszyk and Lwiel May 18, 2025 12:33

Lwiel reviewed May 19, 2025

arekpalinski approved these changes May 19, 2025

maciejaszyk approved these changes May 19, 2025

RDoc-3337 fix review comment

22944a5

Danielle9897 requested a review from Lwiel May 20, 2025 05:56

Lwiel approved these changes May 20, 2025

ayende reviewed May 21, 2025

Danielle9897 marked this pull request as draft May 21, 2025 09:54

RDoc-3337 fix Oren's comments

9bbea14

Danielle9897 marked this pull request as ready for review May 21, 2025 15:35

Danielle9897 requested a review from ayende May 21, 2025 15:35

ayende approved these changes May 22, 2025

ppekrol merged commit be8d424 into ravendb:master May 29, 2025
1 of 2 checks passed
