RDoc-3337 What affects vector search results #2040
base: master
Slight variations in graph structure or search parameters can lead to different results.
* While HNSW offers fast search performance at scale and quickly finds points that are likely to be among the nearest neighbors,
it does not guarantee exact results; only approximate matches are returned.
This behavior is expected in all ANN algorithms, not just HNSW or RavenDB.
not just HNSW
or not just HNSW used by RavenDB
done
* This article explains why vector search results might not always return what you expect, even when relevant documents exist.
It applies to both [dynamic vector search queries](../../ai-integration/vector-search/vector-search-using-dynamic-query) and
[static-index vector search queries](../../ai-integration/vector-search/vector-search-using-static-index).
Aren't we missing here a discussion on the fact that we also have Exact nearest search?
I know you mention that below, but it should be as soon as you mention this limitation
done
* **Insertion order effects**:

* Because the HNSW graph is append-only and built incrementally,
the order in which documents are added, updated, or deleted can affect the final graph structure.
The order of updates / deletes doesn't matter - only inserts, because of the soft delete
done
* HNSW uses a greedy search strategy to perform approximate nearest-neighbor (ANN) searches:
The search starts at the top layer from an entry point.
The algorithm then descends through the layers, always choosing the neighbor closest to the query vector.
* The algorithm doesn't exhaustively explore all possible paths, so it can miss the true global nearest neighbors -
Explain here why we do that: because it usually allows finding the right things very quickly.
done
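The greedy-search point discussed in this thread can be illustrated with a minimal sketch. It is not RavenDB's or Corax's implementation: it assumes a single-layer graph (real HNSW descends through several layers and tracks more than one candidate), and the points, edges, and function names below are hypothetical. It shows how always stepping to the closest neighbor can stop at a local optimum, while a full scan finds the true nearest point.

```python
import math

# Hypothetical toy data: five 2-D points and a hand-built, single-layer neighbor graph.
POINTS = [(0, 0), (4, 0), (7, 0), (0, 10), (9, 1)]
NEIGHBORS = {0: [1, 3], 1: [0, 2], 2: [1], 3: [0, 4], 4: [3]}
QUERY = (9, 0)


def greedy_search(points, neighbors, entry, query):
    """Walk the graph, always moving to the neighbor closest to the query.

    Stops as soon as no neighbor is closer than the current node.
    """
    current = entry
    while True:
        best = min(neighbors[current], key=lambda n: math.dist(points[n], query))
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current
        current = best


def exact_search(points, query):
    """Compare the query against every point (brute force)."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], query))


print(greedy_search(POINTS, NEIGHBORS, 0, QUERY))  # 2 (distance 2.0, a local optimum)
print(exact_search(POINTS, QUERY))                 # 4 (distance 1.0, the true nearest)
```

In this toy graph, node 4 is only reachable through node 3, which lies far from the query, so the greedy walk never visits it. The trade-off is that the greedy walk inspects a handful of nodes instead of all of them.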
They help keep memory usage and indexing time under control, but may also limit the graph’s ability to precisely represent all possible proximity relationships.

* **Number of edges**:
Need to mention that this is m in the HNSW paper
done
This parameter (commonly referred to as _efConstruction_) controls how many neighboring vectors are considered during this process.
It defines the size of the candidate pool - the number of potential links evaluated for each insertion.
From the candidate pool, HNSW selects up to the configured _number of edges_ for each node.
* A **larger** candidate pool increases the chance of finding better-connected neighbors,
Explain the downside (higher indexing time, bigger index)
Added the downside of a larger value to both the "number of candidates" and "number of edges" sections.
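The relationship between the candidate pool and the number of edges can be sketched as follows. This is a simplification under stated assumptions: it keeps the closest candidates only, whereas the HNSW paper also describes a selection heuristic that favors diverse links, and the function name, pool contents, and the use of a list slice to stand in for a smaller pool are all hypothetical.

```python
import math


def select_links(new_vector, pool, number_of_edges):
    """Keep up to `number_of_edges` closest vectors from the candidate pool.

    `pool` is a list of (node_id, vector) pairs; in HNSW its size is bounded
    by the number of candidates considered at insertion time.
    """
    ranked = sorted(pool, key=lambda item: math.dist(item[1], new_vector))
    return [node_id for node_id, _ in ranked[:number_of_edges]]


# Hypothetical candidates found while inserting a new vector at the origin.
pool = [("a", (1, 0)), ("b", (3, 0)), ("c", (0, 2)), ("d", (5, 5))]

print(select_links((0, 0), pool, 2))      # ['a', 'c']
print(select_links((0, 0), pool[:2], 2))  # ['a', 'b'] (smaller pool never saw 'c')
```

With the smaller pool, the closer neighbor `c` is never evaluated, so the node ends up linked to `b` instead. A larger pool avoids this at the cost of more distance computations per insertion.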
even if they would otherwise be among the top candidates.
Use this to filter out marginal matches, especially when minimum semantic relevance is important.
* This parameter can be set directly in the query. For example, see this [Query example](../../ai-integration/vector-search/vector-search-using-dynamic-query#querying-raw-text).
If not explicitly set, the value is taken from the [Indexing.Corax.VectorSearch.DefaultMinimumSimilarity](../../server/configuration/indexing-configuration#indexing.corax.vectorsearch.defaultminimumsimilarity) configuration key.
Mention the default value here?
done
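The effect of a minimum similarity threshold, as described in the hunk above, can be sketched like this. It assumes cosine similarity as the metric and uses an illustrative threshold of 0.5, which is not taken from the RavenDB defaults; the function names and sample vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(query, docs, k, minimum_similarity):
    """Return up to k document names, dropping any below the threshold."""
    scored = sorted(
        ((cosine_similarity(query, vec), name) for name, vec in docs.items()),
        reverse=True,
    )
    # A document inside the top k is still discarded if its score is too low.
    return [name for score, name in scored[:k] if score >= minimum_similarity]


docs = {"a": (1, 0), "b": (1, 1), "c": (0, 1)}

print(top_matches((1, 0), docs, k=3, minimum_similarity=0.5))  # ['a', 'b']
print(top_matches((1, 0), docs, k=3, minimum_similarity=0.0))  # ['a', 'b', 'c']
```

Document `c` is within the top three in both calls, but the 0.5 threshold removes it because its similarity to the query is 0.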
it is typically accurate and strongly recommended in most scenarios due to its performance.
* **Exact search**:
Performs a full comparison against all indexed vectors to guarantee the closest matches.
This method is more accurate but much slower - learn more in [Using exact search](../../ai-integration/vector-search/what-affects-vector-search-results#using-exact-search) below.
That actually depends on the size of the index. If you have a small one, that is reasonably performant, actually.
done:
- removed this part from here
- and expanded on that in the dedicated section below
Related issue:
https://issues.hibernatingrhinos.com/issue/RDoc-3337/Vector-search-queries-return-no-results-until-the-auto-index-is-rebuilt