RDoc-3337 What affects vector search results #2040

Open: wants to merge 3 commits into master

Conversation

Slight variations in graph structure or search parameters can lead to different results.
* While HNSW offers fast search performance at scale and quickly finds points that are likely to be among the nearest neighbors,
it does not guarantee exact results — only approximate matches are returned.
This behavior is expected in all ANN algorithms, not just HNSW or RavenDB.

not just HNSW or not just HNSW used by RavenDB

Member Author

done

Danielle9897 requested a review from Lwiel on May 20, 2025, 05:56
* This article explains why vector search results might not always return what you expect, even when relevant documents exist.
It applies to both [dynamic vector search queries](../../ai-integration/vector-search/vector-search-using-dynamic-query) and
[static-index vector search queries](../../ai-integration/vector-search/vector-search-using-static-index).

Member

Aren't we missing a discussion here of the fact that we also have exact nearest-neighbor search?

Member

I know you mention that below, but it should appear as soon as you mention this limitation.

Member Author

done

* **Insertion order effects**:

* Because the HNSW graph is append-only and built incrementally,
the order in which documents are added, updated, or deleted can affect the final graph structure.

Member

The order of updates and deletes doesn't matter, only the order of inserts, because of the soft delete.

Member Author

done
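
To illustrate the insertion-order point discussed in this thread, here is a minimal Python sketch. It is not RavenDB or HNSW code: it assumes a simplified single-layer graph in which each new point links to its `m` nearest already-inserted points, with no pruning and no deletes. `build_graph`, `values`, and `order` are hypothetical names.

```python
def build_graph(values, order, m=1):
    """Insert 1-D points in the given order and return the resulting edge set.

    Simplification (assumption): each new point links only to its m nearest
    points that were inserted before it. Real HNSW is multi-layered and prunes.
    """
    edges = set()
    inserted = []
    for idx in order:
        # Only points that already exist in the graph can become neighbors.
        nearest = sorted(inserted, key=lambda j: abs(values[j] - values[idx]))[:m]
        for j in nearest:
            edges.add(frozenset((idx, j)))
        inserted.append(idx)
    return edges


values = [0.0, 1.0, 3.0, 7.0]

# Same four points, two different insertion orders.
graph_a = build_graph(values, order=[0, 1, 2, 3])
graph_b = build_graph(values, order=[3, 0, 1, 2])

print(graph_a == graph_b)  # prints False: the edge sets differ
```

In this toy setup, the first order links point 3 to point 2, while the second order links point 3 to point 0, because point 0 was the only candidate when the edge was created.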

* HNSW uses a greedy search strategy to perform approximate nearest-neighbor (ANN) searches:
The search starts at the top layer from an entry point.
The algorithm then descends through the layers, always choosing the neighbor closest to the query vector.
* The algorithm doesn't exhaustively explore all possible paths, so it can miss the true global nearest neighbors -

Member

Explain here why we do that: it usually finds the right results very quickly.

Member Author

done
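
The trade-off raised in this thread can be sketched with a small Python example. It assumes a hand-built, single-layer graph (the `points` and `graph` dictionaries are made up for illustration) and a plain greedy walk, which is a simplification of what HNSW does across layers.

```python
import math

# Hypothetical 2-D vectors and a hand-built neighbor graph.
points = {"a": (0, 0), "b": (4, 0), "c": (0, 2), "d": (4, 4)}
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}


def greedy_search(entry, query):
    """Walk to the closest neighbor until no neighbor improves the distance."""
    current = entry
    while True:
        best = min(graph[current], key=lambda n: math.dist(points[n], query))
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current  # local minimum: stop without exploring further
        current = best


def exact_search(query):
    """Compare the query against every vector."""
    return min(points, key=lambda n: math.dist(points[n], query))


query = (4, 3.5)
approx = greedy_search("a", query)  # visits only a few nodes
exact = exact_search(query)         # visits all nodes
print(approx, exact)
```

Here the greedy walk stops at `b`, a local minimum, while the exhaustive comparison finds `d`. The greedy walk touched fewer nodes, which is the reason for using it at scale.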

They help keep memory usage and indexing time under control, but may also limit the graph’s ability to precisely represent all possible proximity relationships.

* **Number of edges**:

Member

Need to mention that this is M in the HNSW paper.

Member Author

done

This parameter (commonly referred to as _efConstruction_) controls how many neighboring vectors are considered during this process.
It defines the size of the candidate pool - the number of potential links evaluated for each insertion.
From the candidate pool, HNSW selects up to the configured _number of edges_ for each node.
* A **larger** candidate pool increases the chance of finding better-connected neighbors,

Member

Explain the downside (higher indexing time, bigger index)

Member Author

Added the downside of a larger value to both the "number of candidates" and "number of edges" sections.
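
As a rough illustration of how the candidate pool and the number of edges interact, the sketch below collects candidates with a beam search of width `ef_construction` and then keeps the `m` closest. This is a single-layer simplification under stated assumptions, not RavenDB's implementation: the graph is hand-built, the neighbor selection is a plain "closest first" cut, and `search_layer` and `choose_edges` are hypothetical names.

```python
import heapq
import math

# Hypothetical existing graph that a new vector is about to be inserted into.
points = {"a": (0, 0), "b": (4, 0), "c": (0, 2), "d": (4, 4)}
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}


def search_layer(entry, query, ef):
    """Beam search: return up to ef node ids found, nearest first."""
    d0 = math.dist(points[entry], query)
    visited = {entry}
    to_expand = [(d0, entry)]  # min-heap of nodes still to expand
    pool = [(-d0, entry)]      # max-heap (negated distances) of the best ef nodes
    while to_expand:
        d, node = heapq.heappop(to_expand)
        if len(pool) >= ef and d > -pool[0][0]:
            break  # nothing left that can improve a full pool
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = math.dist(points[nb], query)
            if len(pool) < ef or dn < -pool[0][0]:
                heapq.heappush(to_expand, (dn, nb))
                heapq.heappush(pool, (-dn, nb))
                if len(pool) > ef:
                    heapq.heappop(pool)  # drop the current worst candidate
    return [n for _, n in sorted((-neg, n) for neg, n in pool)]


def choose_edges(new_vector, ef_construction, m, entry="a"):
    """Pick up to m edges for a new vector from a pool of ef_construction candidates."""
    return search_layer(entry, new_vector, ef_construction)[:m]


new_vector = (4, 3.5)
small_pool = choose_edges(new_vector, ef_construction=1, m=2)
large_pool = choose_edges(new_vector, ef_construction=3, m=2)
print(small_pool, large_pool)
```

With a pool of one candidate the new vector is linked only to `b`. With a pool of three it is linked to `d` and `b`, at the cost of visiting more nodes during insertion.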

even if they would otherwise be among the top candidates.
Use this to filter out marginal matches, especially when minimum semantic relevance is important.
* This param can be set directly in the query. For example, see this [Query example](../../ai-integration/vector-search/vector-search-using-dynamic-query#querying-raw-text).
If not explicitly set, the value is taken from the [Indexing.Corax.VectorSearch.DefaultMinimumSimilarity](../../server/configuration/indexing-configuration#indexing.corax.vectorsearch.defaultminimumsimilarity) configuration key.

Member

Mention the default value here?

Member Author

done
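
A minimal sketch of how a minimum-similarity threshold filters results, assuming cosine similarity as the metric. The threshold values, the `candidates` dictionary, and the function names are illustrative only and do not reflect RavenDB's default or its API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def filter_by_min_similarity(query, candidates, min_similarity):
    """Keep only candidates at or above the threshold, best match first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in candidates.items()]
    kept = [(doc_id, score) for doc_id, score in scored if score >= min_similarity]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


query = (1.0, 0.0)
candidates = {"a": (1.0, 0.0), "b": (1.0, 1.0), "c": (0.0, 1.0)}

strict = [doc_id for doc_id, _ in filter_by_min_similarity(query, candidates, 0.75)]
loose = [doc_id for doc_id, _ in filter_by_min_similarity(query, candidates, 0.5)]
print(strict, loose)
```

Document `b` has a similarity of about 0.71 to the query, so it is dropped by the stricter threshold and kept by the looser one, even though it is the second-closest candidate in both cases.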

it is typically accurate and strongly recommended in most scenarios due to its performance.
* **Exact search**:
Performs a full comparison against all indexed vectors to guarantee the closest matches.
This method is more accurate but much slower - learn more in [Using exact search](../../ai-integration/vector-search/what-affects-vector-search-results#using-exact-search) below.

Member

That actually depends on the size of the index. With a small index, exact search is reasonably performant.

Member Author

done:

  • removed this part from here
  • and expanded on that in the dedicated section below
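
For reference, exact search amounts to a brute-force comparison against every indexed vector, so its cost grows linearly with the index size. The sketch below assumes Euclidean distance and an in-memory dictionary of made-up vectors. `exact_search` is a hypothetical helper, not a RavenDB API.

```python
import heapq
import math


def exact_search(vectors, query, k):
    """Return the ids of the k vectors closest to the query, nearest first.

    Every vector is compared with the query, so the result is exact,
    but the work is proportional to the number of indexed vectors.
    """
    return heapq.nsmallest(k, vectors, key=lambda doc_id: math.dist(vectors[doc_id], query))


vectors = {"a": (0, 0), "b": (4, 0), "c": (0, 2), "d": (4, 4)}
top_two = exact_search(vectors, query=(4, 3.5), k=2)
print(top_two)
```

On an index this small the full scan is trivial, which matches the reviewer's point that the performance penalty depends on index size.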

Danielle9897 marked this pull request as draft on May 21, 2025, 09:54
Danielle9897 marked this pull request as ready for review on May 21, 2025, 15:35
Danielle9897 requested a review from ayende on May 21, 2025, 15:35
5 participants