Getting started / Search: Add new section (GenAI, unedited) #264
base: main
Walkthrough
Introduces a new Search documentation section under docs/start/query/search with pages for full-text, geospatial, vector, and hybrid search, adds a section index, and updates the toctree link in docs/start/query/index.md to point to the new Search index.
Sequence Diagram(s)
sequenceDiagram
autonumber
actor User
participant SQL as SQL Engine
participant VEC as Vector Index (HNSW)
participant TXT as Full-Text Index (BM25)
participant RES as Result Merger
User->>SQL: Submit hybrid search (CTEs: KNN_MATCH + MATCH)
SQL->>VEC: Run kNN on embeddings
SQL->>TXT: Run BM25 keyword search
VEC-->>SQL: Top-K vector results with _score
TXT-->>SQL: Text results with _score
SQL->>RES: Join results on id, compute hybrid_score
RES-->>User: Ranked rows by hybrid_score
note over RES: Fusion: weighted sum or RRF
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/start/query/index.md (1)
27-31: Broken ref: grid card uses undefined `search-overview` anchor.
The new index file defines `(start-search)=`, not `search-overview`. Update the card link to point to the new label.
Apply this diff:
-:::{grid-item-card} Search
-:link: search-overview
+:::{grid-item-card} Search
+:link: start-search
 :link-type: ref
 Based on Apache Lucene, CrateDB offers native BM25 term search and vector search, all using SQL. By combining it, also using SQL, you can implement powerful single-query hybrid search.
 :::
This aligns with the new `docs/start/query/search/index.md`. (cratedb.com)
🧹 Nitpick comments (20)
docs/start/query/search/fulltext.md (3)
94-101: Clarify that MATCH requires a FULLTEXT index and show nested-field indexing.
Examples that call `MATCH(payload['comment'], ...)` will only work if that field is indexed using FULLTEXT. Consider adding a quick index DDL before the example, or switch the predicate to target an index identifier.
Apply this augmentation right before the “Search Nested JSON” example:
+Before querying a nested field with MATCH, ensure it is FULLTEXT‑indexed:
+
+```sql
+CREATE TABLE feedback (
+  id INTEGER,
+  payload OBJECT(DYNAMIC),
+  INDEX comment_ft USING FULLTEXT (payload['comment'])
+);
+```
Then update the query to target the index:
-WHERE MATCH(payload['comment'], 'battery life');
+WHERE MATCH(comment_ft, 'battery life');
References: Full-text MATCH must target fulltext-indexed columns; examples of index identifiers and per‑query options. (cratedb.com)
Also applies to: 102-107
99-100: DDL style nit: prefer explicit index clause for clarity.
The inline `TEXT INDEX USING FULLTEXT WITH (analyzer='english')` is fine, but most CrateDB docs demonstrate named FULLTEXT indexes for discoverability and multi-column patterns. Consider:
-CREATE TABLE docs ( id INTEGER, text TEXT INDEX USING FULLTEXT WITH (analyzer = 'english') );
+CREATE TABLE docs (
+  id INTEGER,
+  text TEXT,
+  INDEX text_ft USING FULLTEXT (text) WITH (analyzer = 'english')
+);
This also pairs nicely with the updated MATCH examples targeting `text_ft`. (cratedb.com)
140-144: Add working cross-references/links for “Learn More”.
The bullets are placeholders. Convert them to Sphinx/MyST refs pointing to CrateDB reference pages (MATCH, analyzers, fulltext indices) and/or guide pages so readers can click through.
Example (adjust labels to your docs build):
-* Full-text Search Data Model
-* MATCH Clause Documentation
-* How CrateDB Differs from Elasticsearch
-* Tutorial: Full-text Search on Logs
+* {ref}`crate-reference:fulltext` (MATCH predicate)
+* {ref}`crate-reference:create-analyzer` (Custom analyzers)
+* {ref}`crate-guide:feature/search/fts/analyzer` (Analyzer guide)
+* {ref}`crate-guide:feature/search/fts/index` (Full-text search tutorials)
Refs: MATCH predicate; analyzer docs. (cratedb.com)
docs/start/query/search/geo.md (2)
111-114: Call out that exact functions bypass indexes.
You mention cost of exact computations; explicitly note that `within(...)`, `intersects(...)`, and `distance(...)` do not use the geo index and can be slow on large result sets. Encourage combining them with prefilters or using MATCH first.
Suggested addition after the paragraph:
+Note: `within(...)`, `intersects(...)`, and `distance(...)` are exact and
+operate on the stored shapes without using the geo index; apply on narrowed
+result sets or prefer `MATCH` for broad filtering.
Reference: Geo exact queries guidance. (cratedb.com)
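The "narrow first, then compute exactly" advice can be illustrated outside the database. The following is a minimal Python sketch, not CrateDB code: a cheap bounding-box prefilter followed by an exact haversine distance check. The function names, the sample points, and the 111 km-per-degree approximation are assumptions made for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two (lon, lat) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(points, center, radius_km):
    """Two-stage filter: coarse bounding box first, exact distance second."""
    clon, clat = center
    # Stage 1: coarse prefilter. One degree of latitude is roughly 111 km;
    # a degree of longitude shrinks with cos(latitude). This box is approximate.
    dlat = radius_km / 111.0
    dlon = radius_km / (111.0 * max(math.cos(math.radians(clat)), 1e-6))
    candidates = [
        (name, lon, lat)
        for name, lon, lat in points
        if abs(lat - clat) <= dlat and abs(lon - clon) <= dlon
    ]
    # Stage 2: the expensive exact check runs only on the narrowed candidates.
    return [
        name
        for name, lon, lat in candidates
        if haversine_km(clon, clat, lon, lat) <= radius_km
    ]


# Hypothetical sample data: (name, longitude, latitude)
POINTS = [
    ("Berlin", 13.405, 52.52),
    ("Potsdam", 13.064, 52.391),
    ("Vienna", 16.373, 48.208),
]
NEARBY = within_radius(POINTS, center=(13.405, 52.52), radius_km=50)
```

In the docs, the same shape would be an index-backed `MATCH` (or a structured filter) doing stage 1 and the exact function doing stage 2.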
119-124: Index type quoting/style nit.
CrateDB examples typically use double quotes for index type literals (e.g., `"quadtree"`) or omit quotes. Align with reference style for consistency across docs.
-  area GEO_SHAPE INDEX USING 'quadtree'
+  area GEO_SHAPE INDEX USING "quadtree"
Reference style example. (cratedb.com)
docs/start/query/search/index.md (1)
4-11: Nice, minimal toctree; consider adding short intro text.
Optional: add one sentence below the H1 to orient readers (what “Search” covers: full‑text, geo, vector, hybrid).
docs/start/query/search/hybrid.md (2)
43-71: Make the SQL runnable; avoid ellipses and consider broader join.
- Replace `[0.2, 0.1, ..., 0.3]` with a concrete vector; ellipses will break copy‑paste.
- Optional: many apps want items that match only one modality. Consider a FULL OUTER JOIN with COALESCE and default scores for missing sides.
Apply this diff to the vector literal:
-  WHERE KNN_MATCH(embedding, [0.2, 0.1, ..., 0.3], 10)
+  WHERE KNN_MATCH(embedding, [0.2, 0.1, 0.7, 0.3], 10)
Alternative join pattern (illustrative):
WITH vector_results AS (
  SELECT id, _score AS vector_score
  FROM documents
  WHERE knn_match(embedding, [0.2, 0.1, 0.7, 0.3], 50)
),
bm25_results AS (
  SELECT id, _score AS bm25_score
  FROM documents
  WHERE match(content, 'knn search')
)
SELECT
  COALESCE(b.id, v.id) AS id,
  COALESCE(bm25_score, 0.0) AS bm25_score,
  COALESCE(vector_score, 0.0) AS vector_score,
  0.5 * COALESCE(bm25_score, 0.0) + 0.5 * COALESCE(vector_score, 0.0) AS hybrid_score
FROM bm25_results b
FULL OUTER JOIN vector_results v ON v.id = b.id
ORDER BY hybrid_score DESC
LIMIT 10;
References: `knn_match` usage and `_score`; fulltext `MATCH` in WHERE. (cratedb.com)
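To make the FULL OUTER JOIN plus COALESCE idea concrete, here is a small Python sketch of the same fusion logic over two in-memory score maps. It is an illustration only: `weighted_fusion`, the sample ids, and the scores are hypothetical, and the 0.5/0.5 weights simply mirror the illustrative SQL.

```python
def weighted_fusion(bm25_scores, vector_scores, w_bm25=0.5, w_vector=0.5, limit=10):
    """Combine two {id: score} maps; an id missing on one side scores 0.0 there."""
    ids = set(bm25_scores) | set(vector_scores)  # FULL OUTER JOIN on id
    rows = []
    for doc_id in ids:
        bm25 = bm25_scores.get(doc_id, 0.0)   # COALESCE(bm25_score, 0.0)
        vec = vector_scores.get(doc_id, 0.0)  # COALESCE(vector_score, 0.0)
        rows.append((doc_id, w_bm25 * bm25 + w_vector * vec))
    # ORDER BY hybrid_score DESC, ties broken by id for determinism, then LIMIT
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows[:limit]


# Hypothetical scores: "a" matches only the keyword side, "c" only the vector side
BM25 = {"a": 1.0, "b": 0.4}
VECTOR = {"b": 0.8, "c": 0.6}
RANKED = weighted_fusion(BM25, VECTOR)
```

With an inner join, "a" and "c" would drop out entirely. Raw BM25 and vector scores can also sit on different scales, so a real query may want to normalize each side before weighting.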
73-93: RRF section: optionally include the formula for clarity.
If space permits, add a one-liner: `RRF(d) = Σ_i 1 / (k + rank_i(d))`, with a typical `k` like 60. Helps readers reproduce the numbers.
Happy to add a runnable SQL example computing RRF from two rank lists.
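As a quick check of the formula, the sketch below computes RRF in Python from two rank lists. The document ids are hypothetical. The rank pairs (1, 1), (7, 2), and (8, 3) are borrowed from the RRF table excerpt quoted later in this review, and with `k = 60` they come out at roughly 0.0328, 0.0311, and 0.0306.

```python
def rrf_scores(rank_lists, k=60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every list that ranks the item."""
    scores = {}
    for ranks in rank_lists:
        for doc_id, rank in ranks.items():  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


# Hypothetical ids; one dict per retrieval method, mapping id -> rank
BM25_RANKS = {"doc_a": 1, "doc_b": 7, "doc_c": 8}
VECTOR_RANKS = {"doc_a": 1, "doc_b": 2, "doc_c": 3}
FUSED = rrf_scores([BM25_RANKS, VECTOR_RANKS])
```

Because RRF only uses ranks, it sidesteps the score-scale mismatch that a weighted sum has to handle.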
docs/start/query/search/vector.md (12)
13-19: Fix table formatting and temper “immediately searchable” claim.
- Add a header row so the Markdown table renders reliably.
- “Immediately searchable” is misleading for near-real-time systems. Suggest calling out the default refresh interval instead.
- Escaping underscores in plain table text is unnecessary.
-| FLOAT\_VECTOR | Store embeddings up to 2048 dimensions |
-| ------------------- | ------------------------------------------------------------ |
-| KNN\_MATCH | SQL-native k-nearest neighbor function with `_score` support |
-| VECTOR\_SIMILARITY | Compute similarity scores between vectors in queries |
-| Real-time indexing | Fresh vectors are immediately searchable |
-| Hybrid queries | Combine vector search with filters, full-text, and JSON |
+| Feature | Description |
+|------------------------|--------------------------------------------------------------|
+| FLOAT_VECTOR | Store embeddings up to 2048 dimensions |
+| KNN_MATCH | SQL-native k-nearest neighbor function with `_score` support |
+| VECTOR_SIMILARITY | Compute similarity scores between vectors in queries |
+| Near real-time indexing| Fresh vectors become searchable after a short refresh (≈1s) |
+| Hybrid queries | Combine vector search with filters, full-text, and JSON |
Note: Please verify the dimension limit (“up to 2048”) against the current CrateDB version you target. If that limit varies by version, consider adding a short “Compatibility” note.
22-31: Add a minimal DDL so readers know the expected schema and vector length.
KNN examples are clearer when the column type and vector dimensionality are explicit.
 ### K-Nearest Neighbors (KNN) Search
+```sql
+-- Example schema (4-dimensional vectors)
+CREATE TABLE word_embeddings (
+  id INT,
+  text TEXT,
+  embedding FLOAT_VECTOR(4)
+);
+```
+
 SELECT text, _score
 FROM word_embeddings
 WHERE KNN_MATCH(embedding, [0.3, 0.6, 0.0, 0.9], 3)
 ORDER BY _score DESC;
If you prefer not to add the DDL here, add a one-liner note stating “embedding is FLOAT_VECTOR(4)”. Also, if “2048” above is not guaranteed, avoid mixing dimensions across samples.
35-41: Keep vector dimensionality consistent across examples.
This example switches to a 3-D vector. Either declare `features FLOAT_VECTOR(3)` or keep all examples 4-D for continuity.
 WHERE category = 'shoes'
-  AND KNN_MATCH(features, [0.2, 0.1, 0.3], 5)
+  AND KNN_MATCH(features, [0.2, 0.1, 0.3, 0.4], 5)
 ORDER BY _score DESC;
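On the client side, the dimensionality point could be enforced before a query vector is ever bound as a parameter. A minimal Python sketch, assuming the column is declared as `FLOAT_VECTOR(4)`; `check_dimension` and `EMBEDDING_DIM` are hypothetical names, not part of any CrateDB client:

```python
def check_dimension(vector, expected_dim):
    """Return the vector as a list of floats, or raise if its length is wrong."""
    values = [float(x) for x in vector]
    if len(values) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(values)}")
    return values


EMBEDDING_DIM = 4  # assumption: the column is declared as FLOAT_VECTOR(4)
QUERY = check_dimension([0.3, 0.6, 0.0, 0.9], EMBEDDING_DIM)
```

A 3-element vector such as the one in the `shoes` example would fail this check against a 4-D column, which is exactly the inconsistency flagged above.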
45-50: Clarify placeholder usage and avoid redundant sorting signals.
- Define what `[q_vector]` stands for (e.g., a 4-D array bound as a parameter).
- Since you compute `score` with VECTOR_SIMILARITY, order by that to make intent explicit.
-SELECT id, VECTOR_SIMILARITY(emb, [q_vector]) AS score
+-- q_vector is a 4-D array matching emb's FLOAT_VECTOR(4)
+SELECT id, VECTOR_SIMILARITY(emb, [q_vector]) AS score
 FROM items
-WHERE KNN_MATCH(emb, [q_vector], 10)
-ORDER BY score DESC;
+WHERE KNN_MATCH(emb, [q_vector], 10)
+ORDER BY score DESC;
Optionally add: “Higher scores indicate greater similarity” (assuming cosine or dot-product semantics in your target version).
58-63: Cap examples with LIMIT for reproducibility.
Most prior examples use small k; adding LIMIT mirrors real usage and avoids long result sets in docs output.
 SELECT id, title
 FROM documents
 WHERE KNN_MATCH(embedding, [query_emb], 5)
 ORDER BY _score DESC;
+-- LIMIT 5; -- optional; ORDER BY with KNN k=5 usually yields ≤ 5 rows
66-73: Minor: Keep dimensions and naming aligned with earlier samples.
If you settle on 4-D throughout, update `[user_emb]` to a 4-element vector for consistency, or add a note that `feature_vec` is `FLOAT_VECTOR(4)`.
-  AND KNN_MATCH(feature_vec, [user_emb], 4)
+  AND KNN_MATCH(feature_vec, [user_emb], 4) -- where user_emb is a 4-D vector matching feature_vec
75-83: Consistency: add LIMIT and/or clarify vector length in chat example.
Optional but keeps examples uniform and avoids confusion.
 WHERE KNN_MATCH(vec, [query_emb], 3)
 ORDER BY _score DESC;
+-- LIMIT 3;
95-104: Make “HNSW index” guidance actionable and name concrete tuning knobs.
The tips are good but abstract. Add a small DDL showing how to create an HNSW index and mention tuning parameters (e.g., `ef_construction`, `m`, and query-time `ef_search`/`num_candidates`), plus when/where they’re set.
 ## Performance & Indexing Tips
@@
-| Create HNSW index when supported | Enables fast ANN queries via Lucene |
+| Create HNSW index for vectors | Enables fast ANN queries via Lucene HNSW |
@@
-| Tune `KNN_MATCH` | Adjust neighbor count per shard or globally |
+| Tune ANN parameters | Adjust k in `KNN_MATCH` and query-time knobs (e.g., ef) |
+### Example: Create an HNSW index
+```sql
+-- Verify syntax/params against your target CrateDB version
+CREATE INDEX idx_items_emb_hnsw
+ON items (emb)
+USING hnsw
+WITH (m = 16, ef_construction = 128);
+```
+
+### Example: Tune query-time parameters
+```sql
+-- Pseudocode; replace with the correct setting mechanism for your version
+SET SESSION search_ann_ef = 100;
+SELECT id, _score
+FROM items
+WHERE KNN_MATCH(emb, [qvec], 10)
+ORDER BY _score DESC;
+```
Please double-check the exact parameter names and how they’re set in the current release before merging.
105-114: Add minimal version support note.
State the minimum CrateDB version that ships FLOAT_VECTOR/KNN_MATCH so users know whether they can follow along.
 ## When to Use CrateDB for Vector Search
+
+> Note: Vector search features (FLOAT_VECTOR, KNN_MATCH, VECTOR_SIMILARITY) require CrateDB ≥ X.Y. Confirm version compatibility before use.
115-124: Cross-link “Hybrid search” to the sibling page in this PR.
Make it easy to jump to the new Hybrid guide.
-| Hybrid search | Combine ANN search with full-text, geo, JSON |
+| Hybrid search | Combine ANN search with full-text, geo, JSON (see [Hybrid search](../hybrid.md)) |
125-131: Add direct links for function references.
You mention a “`KNN_MATCH` & `VECTOR_SIMILARITY` reference” but there’s no URL. Link to the canonical SQL reference pages.
 * [Vector Search Guide](https://cratedb.com/docs/guide/feature/search/vector/index.html)
-* `KNN_MATCH` & `VECTOR_SIMILARITY` reference
+* `KNN_MATCH` & `VECTOR_SIMILARITY` reference: add links to the official SQL docs
 * [Intro Blog: Vector support & KNN search in CrateDB](https://cratedb.com/blog/unlocking-the-power-of-vector-support-and-knn-search-in-cratedb)
 * [LangChain & Vector Store integration](https://cratedb.com/docs/guide/domain/ml/index.html)
If you want, I can locate and insert the exact doc URLs.
3-10: Minor: add a quick “How it works” sentence.
One sentence on how `_score` is produced (e.g., cosine similarity) helps readers reason about ordering, thresholds, and anomaly logic.
 CrateDB supports **native vector search**, enabling you to perform **similarity-based retrieval** directly in SQL, without needing a separate vector database or search engine.
@@
-Vector search retrieves the most semantically similar items to a query vector using **Approximate Nearest Neighbor (ANN)** algorithms (e.g., HNSW via Lucene). CrateDB provides unified SQL support for this via `KNN_MATCH`.
+Vector search retrieves the most semantically similar items to a query vector using **Approximate Nearest Neighbor (ANN)** algorithms (e.g., HNSW via Lucene). CrateDB exposes this via `KNN_MATCH`, which computes an internal `_score` (higher = more similar) usable in `ORDER BY`.
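To see why the direction of the score matters for ordering and for anomaly logic, here is a brute-force Python sketch. It is not how CrateDB computes `_score`: the formula `1 / (1 + squared Euclidean distance)` is an assumption chosen for illustration, and the actual metric should be confirmed against the target version. All names and sample vectors are hypothetical.

```python
def similarity(a, b):
    """Assumed score: 1 / (1 + squared Euclidean distance); higher = more similar."""
    squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + squared)


def knn(rows, query, k):
    """Brute-force top-k: the k rows most similar to the query, best first."""
    scored = [(row_id, similarity(vec, query)) for row_id, vec in rows.items()]
    scored.sort(key=lambda row: row[1], reverse=True)
    return scored[:k]


def least_similar(rows, query):
    """Outlier candidate: score every row, then take the lowest score."""
    scored = [(row_id, similarity(vec, query)) for row_id, vec in rows.items()]
    return min(scored, key=lambda row: row[1])


# Hypothetical sensor embeddings and a "normal" reference pattern
ROWS = {"s1": [0.1, 0.1], "s2": [0.2, 0.1], "s3": [0.9, 0.8]}
NORMAL = [0.1, 0.1]

# Re-sorting a top-1 result ascending cannot change which row it is
TOP1_ASC = sorted(knn(ROWS, NORMAL, 1), key=lambda row: row[1])
OUTLIER = least_similar(ROWS, NORMAL)
```

Here `TOP1_ASC` still holds the nearest row (`s1`), while `OUTLIER` is `s3`: restricting to k = 1 first and sorting afterwards never surfaces the least similar row.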
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
docs/start/query/index.md (1 hunks)
docs/start/query/search/fulltext.md (1 hunks)
docs/start/query/search/geo.md (1 hunks)
docs/start/query/search/hybrid.md (1 hunks)
docs/start/query/search/index.md (1 hunks)
docs/start/query/search/vector.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/start/query/search/fulltext.md
[grammar] ~15-~15: There might be a mistake here.
Context: ... | | --------------------- | --------------...
(QB_NEW_EN)
[grammar] ~16-~16: There might be a mistake here.
Context: ...-------------------------------------- | | Full-text indexing | Tokenized, lan...
(QB_NEW_EN)
[grammar] ~17-~17: There might be a mistake here.
Context: ...language-aware search on any text | | SQL + search | Combine struct...
(QB_NEW_EN)
[grammar] ~18-~18: There might be a mistake here.
Context: ...uctured filters with keyword queries | | JSON support | Search within ...
(QB_NEW_EN)
[grammar] ~19-~19: There might be a mistake here.
Context: ...in nested object fields | | Real-time ingestion | Search new dat...
(QB_NEW_EN)
[grammar] ~20-~20: There might be a mistake here.
Context: ...data immediately—no sync delay | | Scalable architecture | Built to handl...
(QB_NEW_EN)
[grammar] ~110-~110: There might be a mistake here.
Context: ... It Helps | | -------------------------------- | ---...
(QB_NEW_EN)
[grammar] ~111-~111: There might be a mistake here.
Context: ...-------------------------------------- | | Use TEXT
with FULLTEXT
index | Ena...
(QB_NEW_EN)
[grammar] ~112-~112: There might be a mistake here.
Context: ...bles tokenized search | | Index only needed fields | Red...
(QB_NEW_EN)
[grammar] ~113-~113: There might be a mistake here.
Context: ...uce indexing overhead | | Pick appropriate analyzer | Mat...
(QB_NEW_EN)
[grammar] ~114-~114: There might be a mistake here.
Context: ...ch the language and context | | Use MATCH()
not LIKE
| Ful...
(QB_NEW_EN)
[grammar] ~115-~115: There might be a mistake here.
Context: ...l-text is more performant and relevant | | Combine with filters | Boo...
(QB_NEW_EN)
[grammar] ~130-~130: There might be a mistake here.
Context: ... | | --------------------- | --------------...
(QB_NEW_EN)
[grammar] ~131-~131: There might be a mistake here.
Context: ...-------------------------------------- | | Language analyzers | Built-in suppo...
(QB_NEW_EN)
[grammar] ~132-~132: There might be a mistake here.
Context: ...rt for many languages | | JSON object support | Index and sear...
(QB_NEW_EN)
[grammar] ~133-~133: There might be a mistake here.
Context: ...ch nested fields | | SQL + full-text | Unified querie...
(QB_NEW_EN)
[grammar] ~134-~134: There might be a mistake here.
Context: ...s for structured and unstructured data | | Distributed execution | Fast, scalable...
(QB_NEW_EN)
[grammar] ~135-~135: There might be a mistake here.
Context: ... search across nodes | | Aggregations | Group and anal...
(QB_NEW_EN)
[grammar] ~140-~140: There might be a mistake here.
Context: ...earn More * Full-text Search Data Model * MATCH Clause Documentation * How CrateDB...
(QB_NEW_EN)
[grammar] ~141-~141: There might be a mistake here.
Context: ... Data Model * MATCH Clause Documentation * How CrateDB Differs from Elasticsearch *...
(QB_NEW_EN)
[grammar] ~142-~142: There might be a mistake here.
Context: ...* How CrateDB Differs from Elasticsearch * Tutorial: Full-text Search on Logs ## S...
(QB_NEW_EN)
docs/start/query/search/vector.md
[grammar] ~13-~13: There might be a mistake here.
Context: ... 2048 dimensions | | ------------------- | ----------------...
(QB_NEW_EN)
[grammar] ~14-~14: There might be a mistake here.
Context: ...-------------------------------------- | | KNN_MATCH | SQL-native k-nea...
(QB_NEW_EN)
[grammar] ~15-~15: There might be a mistake here.
Context: ...eighbor function with _score
support | | VECTOR_SIMILARITY | Compute similari...
(QB_NEW_EN)
[grammar] ~16-~16: There might be a mistake here.
Context: ...res between vectors in queries | | Real-time indexing | Fresh vectors ar...
(QB_NEW_EN)
[grammar] ~17-~17: There might be a mistake here.
Context: ...diately searchable | | Hybrid queries | Combine vector s...
(QB_NEW_EN)
[grammar] ~97-~97: There might be a mistake here.
Context: ... | | ---------------------------------- | -...
(QB_NEW_EN)
[grammar] ~98-~98: There might be a mistake here.
Context: ...-------------------------------------- | | Use FLOAT_VECTOR
| E...
(QB_NEW_EN)
[grammar] ~99-~99: There might be a mistake here.
Context: ...ixed-size arrays up to 2048 dimensions | | Create HNSW index when supported | E...
(QB_NEW_EN)
[grammar] ~100-~100: There might be a mistake here.
Context: ...queries via Lucene | | Consistent vector length | A...
(QB_NEW_EN)
[grammar] ~101-~101: There might be a mistake here.
Context: ...st match column definition | | Pre-filter with structured filters | R...
(QB_NEW_EN)
[grammar] ~102-~102: There might be a mistake here.
Context: ...overhead | | Tune KNN_MATCH
| A...
(QB_NEW_EN)
[grammar] ~117-~117: There might be a mistake here.
Context: ...on | | ------------------ | -----------------...
(QB_NEW_EN)
[grammar] ~118-~118: There might be a mistake here.
Context: ...-------------------------------------- | | FLOAT_VECTOR | Native support fo...
(QB_NEW_EN)
[grammar] ~119-~119: There might be a mistake here.
Context: ...pport for high-dimensional arrays | | KNN_MATCH | Core SQL predicat...
(QB_NEW_EN)
[grammar] ~120-~120: There might be a mistake here.
Context: ...predicate for vector similarity search | | VECTOR_SIMILARITY | Compute proximity...
(QB_NEW_EN)
[grammar] ~121-~121: There might be a mistake here.
Context: ...roximity scores in SQL | | Lucene HNSW ANN | Efficient graph-b...
(QB_NEW_EN)
[grammar] ~122-~122: There might be a mistake here.
Context: ... graph-based search engine | | Hybrid search | Combine ANN searc...
(QB_NEW_EN)
[grammar] ~128-~128: There might be a mistake here.
Context: ...N_MATCH&
VECTOR_SIMILARITY` reference * [Intro Blog: Vector support & KNN search ...
(QB_NEW_EN)
[grammar] ~129-~129: There might be a mistake here.
Context: ...: Vector support & KNN search in CrateDB](https://cratedb.com/blog/unlocking-the-power-of-vector-support-and-knn-search-in-cratedb) * [LangChain & Vector Store integration](ht...
(QB_NEW_EN)
docs/start/query/search/hybrid.md
[grammar] ~21-~21: There might be a mistake here.
Context: ...cally: * BM25 for keyword relevance * kNN for semantic proximity in vector s...
(QB_NEW_EN)
[grammar] ~26-~26: There might be a mistake here.
Context: ...x combination** (weighted sum of scores) * Reciprocal Rank Fusion (RRF) ## Suppo...
(QB_NEW_EN)
[grammar] ~31-~31: There might be a mistake here.
Context: ...ion | | --------------------- | ------------- ...
(QB_NEW_EN)
[grammar] ~32-~32: There might be a mistake here.
Context: ...-------------------------------------- | | Vector search | KNN_MATCH()
...
(QB_NEW_EN)
[grammar] ~33-~33: There might be a mistake here.
Context: ...ctors closest to a given vector | | Full-text search | MATCH()
...
(QB_NEW_EN)
[grammar] ~34-~34: There might be a mistake here.
Context: ...ene's BM25 scoring | | Geospatial search | MATCH()
...
(QB_NEW_EN)
[grammar] ~79-~79: There might be a mistake here.
Context: ... | | ------------- | ----------- | --------...
(QB_NEW_EN)
[grammar] ~80-~80: There might be a mistake here.
Context: ...-------------------------------------- | | 0.7440 | 1.0000 | 0.5734 ...
(QB_NEW_EN)
[grammar] ~81-~81: There might be a mistake here.
Context: ...tch(float_vector, float_vector, int) | | 0.4868 | 0.5512 | 0.4439 ...
(QB_NEW_EN)
[grammar] ~82-~82: There might be a mistake here.
Context: ...ng On Multiple Columns | | 0.4716 | 0.5694 | 0.4064 ...
(QB_NEW_EN)
[grammar] ~87-~87: There might be a mistake here.
Context: ... | | ----------- | ---------- | -----------...
(QB_NEW_EN)
[grammar] ~88-~88: There might be a mistake here.
Context: ...-------------------------------------- | | 0.03278 | 1 | 1 ...
(QB_NEW_EN)
[grammar] ~89-~89: There might be a mistake here.
Context: ...tch(float_vector, float_vector, int) | | 0.03105 | 7 | 2 ...
(QB_NEW_EN)
[grammar] ~90-~90: There might be a mistake here.
Context: ...ng On Multiple Columns | | 0.03057 | 8 | 3 ...
(QB_NEW_EN)
[grammar] ~97-~97: There might be a mistake here.
Context: ... | | ------------------------- | ----------...
(QB_NEW_EN)
[grammar] ~98-~98: There might be a mistake here.
Context: ...-------------------------------------- | | 🔍 Improved relevance | Combines s...
(QB_NEW_EN)
[grammar] ~99-~99: There might be a mistake here.
Context: ...d-based matches | | ⚙️ Pure SQL | No DSLs or ...
(QB_NEW_EN)
[grammar] ~100-~100: There might be a mistake here.
Context: ...—runs directly in CrateDB | | ⚡ High performance | Built on Ap...
(QB_NEW_EN)
[grammar] ~101-~101: There might be a mistake here.
Context: ...CrateDB’s distributed SQL engine | | 🔄 Flexible ranking | Use scoring...
(QB_NEW_EN)
[grammar] ~104-~104: There might be a mistake here.
Context: ...RF, etc.) based on use case needs | ## Usage in Applications Hybrid search is pa...
(QB_NEW_EN)
[grammar] ~108-~108: There might be a mistake here.
Context: ...arly effective for: * Knowledge bases * Product or document search * **Multili...
(QB_NEW_EN)
[grammar] ~109-~109: There might be a mistake here.
Context: ...e bases** * Product or document search * Multilingual content search * **FAQ bo...
(QB_NEW_EN)
[grammar] ~110-~110: There might be a mistake here.
Context: ...search** * Multilingual content search * FAQ bots and semantic assistants * **A...
(QB_NEW_EN)
[grammar] ~111-~111: There might be a mistake here.
Context: ...h** * FAQ bots and semantic assistants * AI-powered search experiences It allo...
(QB_NEW_EN)
docs/start/query/search/geo.md
[style] ~22-~22: To form a complete sentence, be sure to include a subject.
Context: ...e point using latitude and longitude. * Can be inserted as: * An array: `[longitu...
(MISSING_IT_THERE)
[grammar] ~22-~22: There might be a mistake here.
Context: ...ude and longitude. * Can be inserted as: * An array: [longitude, latitude]
* A ...
(QB_NEW_EN)
[grammar] ~29-~29: There might be a mistake here.
Context: ...WKT formats. * Supported geometry types: * Point
, MultiPoint
* LineString
, `MultiL...
(QB_NEW_EN)
[grammar] ~30-~30: There might be a mistake here.
Context: ... Supported geometry types: * Point
, MultiPoint
* LineString
, MultiLineString
* Polygon
, `Mult...
(QB_NEW_EN)
[grammar] ~31-~31: There might be a mistake here.
Context: ...Point
, MultiPoint
* LineString
, MultiLineString
* Polygon
, MultiPolygon
* `GeometryCollection...
(QB_NEW_EN)
[grammar] ~34-~34: There might be a mistake here.
Context: ...GeometryCollection
* Insertable using: * A GeoJSON object * A WKT string ## In...
(QB_NEW_EN)
[grammar] ~35-~35: There might be a mistake here.
Context: ...* Insertable using: * A GeoJSON object * A WKT string ## Inserting Spatial Data ...
(QB_NEW_EN)
[grammar] ~103-~103: There might be a mistake here.
Context: ... | | ------------------- | ----------------...
(QB_NEW_EN)
[grammar] ~104-~104: There might be a mistake here.
Context: ...-------------------------------------- | | geohash
(default) | Hash-based prefi...
(QB_NEW_EN)
[grammar] ~105-~105: There might be a mistake here.
Context: ... for point-based queries | | quadtree
| Space-partitioni...
(QB_NEW_EN)
[grammar] ~106-~106: There might be a mistake here.
Context: ...ng recursive quadrant splits | | bkdtree
| Lucene BKD tree ...
(QB_NEW_EN)
🔇 Additional comments (5)
docs/start/query/search/fulltext.md (1)
1-147: Overall: solid, but tighten correctness around MATCH usage.
Great structure and examples. The key fixes above will ensure all snippets run unmodified on CrateDB.
docs/start/query/search/geo.md (2)
60-68: Function list looks good; confirm availability.
`distance`, `within`, `intersects`, `area`, `geohash` are supported. `latitude(point)` and `longitude(point)` exist and return coordinates.
References: Scalar function docs. (cratedb.com)
101-108: Index types table is accurate.
The `geohash` (default), `quadtree`, and `bkdtree` options are correct.
Reference: Geo guide synopsis. (cratedb.com)
docs/start/query/index.md (1)
41-49: Toctree change looks correct.
Switching to `search/index` matches the new section layout.
docs/start/query/search/hybrid.md (1)
31-37: Capabilities table is accurate; good cross-linking note.
Vector via `knn_match`, full-text via `match`, geo via `match` with spatial relations are correctly described.
References: knn_match function; geo MATCH predicate. (cratedb.com)
```sql
CREATE TABLE docs ( id INTEGER, text TEXT INDEX USING FULLTEXT WITH (analyzer = 'english') );
```

To use a specific analyzer in a query:

```sql
SELECT * FROM docs WHERE MATCH(text, 'power outage') USING 'english';
```
🛠️ Refactor suggestion
Fix MATCH analyzer syntax; use `USING <match_type> WITH (analyzer=...)`.
CrateDB specifies the analyzer at query time via `WITH (analyzer='...')` paired with a match type after `USING`. The current example `USING 'english'` is invalid. Update the example to one of the supported forms (e.g., use the default `best_fields` or a phrase query).
Apply this diff:
-```sql
-SELECT * FROM docs WHERE MATCH(text, 'power outage') USING 'english';
-```
+```sql
+-- Use a specific analyzer at query time
+-- (best_fields is the default match type)
+SELECT *
+FROM docs
+WHERE MATCH(text, 'power outage')
+ USING best_fields WITH (analyzer = 'english');
+```
Reference: MATCH usage and per‑query analyzer options. (cratedb.com)
🤖 Prompt for AI Agents
In docs/start/query/search/fulltext.md around lines 98 to 106, the example uses
invalid MATCH analyzer syntax ("USING 'english'"); update it to specify the
match type and pass the analyzer with WITH (...) — e.g., replace the single-line
query with a multi-line SQL that uses a match type (such as best_fields) and
appends WITH (analyzer = 'english') to the USING clause so the query becomes:
SELECT ... WHERE MATCH(...) USING best_fields WITH (analyzer = 'english');
sqlCopierModifier-- Find parks that intersect with a given region
SELECT name
FROM parks
WHERE MATCH(area) AGAINST('INTERSECTS POLYGON ((...))');
```
Remove stray token and fix MATCH syntax for geospatial queries.
- Delete the artifact `sqlCopierModifier--`.
- CrateDB’s geospatial MATCH doesn’t use `AGAINST(...)` (that’s MySQL). Use `match(column, query_term) [USING intersects|disjoint|within]`.
Apply this diff:
-```sql
-sqlCopierModifier-- Find parks that intersect with a given region
-SELECT name
-FROM parks
-WHERE MATCH(area) AGAINST('INTERSECTS POLYGON ((...))');
-```
+```sql
+-- Find parks that intersect with a given region
+SELECT name
+FROM parks
+WHERE match(area, 'POLYGON ((...))') USING intersects;
+```
References: Geo MATCH predicate syntax and match types (intersects/disjoint/within). (cratedb.com)
🤖 Prompt for AI Agents
In docs/start/query/search/geo.md around lines 74-78, remove the stray token
"sqlCopierModifier--" and replace the MySQL-style MATCH ... AGAINST(...) usage
with CrateDB's geospatial MATCH predicate: change the snippet to use a proper
fenced code block and a SQL comment, and update the WHERE clause to "WHERE
match(area, 'POLYGON ((...))') USING intersects;" so it uses match(column,
query_term) USING intersects syntax instead of AGAINST(...).
Anomaly Detection

```sql
SELECT *
FROM events
WHERE type = 'sensor'
  AND KNN_MATCH(vector_repr, [normal_pattern_emb], 1)
ORDER BY _score ASC
LIMIT 1;
```
Correct the anomaly detection example (current form is contradictory).
`KNN_MATCH(..., 1)` returns the single most similar neighbor. Ordering by `_score ASC` afterwards cannot yield an outlier; it still returns the top-1 nearest. To surface anomalies (least similar to a “normal” prototype), compute similarity and sort ascending (or compute distance and sort descending) without restricting via `KNN_MATCH`.
-Anomaly Detection
+### Anomaly Detection
```sql
-SELECT *
-FROM events
-WHERE type = 'sensor'
- AND KNN_MATCH(vector_repr, [normal_pattern_emb], 1)
-ORDER BY _score ASC
-LIMIT 1;
+-- Find the least similar (potential outlier) relative to a "normal" embedding
+SELECT *, VECTOR_SIMILARITY(vector_repr, [normal_pattern_emb]) AS score
+FROM events
+WHERE type = 'sensor'
+ORDER BY score ASC
+LIMIT 1;
If your version exposes a distance metric (lower = more similar), invert the sort accordingly. Optionally, use a two-stage approach: prefilter by metadata, then order by similarity across the candidate set.
<details>
<summary>🤖 Prompt for AI Agents</summary>
In docs/start/query/search/vector.md around lines 84 to 94, the anomaly
detection example is contradictory because KNN_MATCH(..., 1) always returns
the single most similar neighbor, so ORDER BY _score ASC cannot surface an
outlier. Replace the KNN_MATCH usage with an explicit similarity (or distance)
computation (e.g., VECTOR_SIMILARITY or VECTOR_DISTANCE) in the SELECT, then
ORDER BY that computed metric and apply LIMIT 1. Adjust the sort direction to
the metric's semantics: if higher similarity means more similar, use ASC to
surface the least similar row; if lower distance means more similar, use DESC.
Optionally note a two-stage approach: prefilter candidates by metadata, then
sort by the vector metric for the final selection.
</details>
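The sort-direction point can be checked outside the database. The following is a minimal sketch, assuming a hypothetical three-row data set, a hypothetical "normal" prototype, and an illustrative similarity of 1 / (1 + squared Euclidean distance). It makes no claim about how CrateDB scores vectors internally; it only shows that the nearest row and the least similar row sit at opposite ends of the ordering.

```shell
#!/bin/bash
# Hypothetical data: an id followed by a 2-dimensional embedding.
events=$(mktemp)
cat > "$events" <<'EOF'
a,1.0,1.0
b,1.1,0.9
c,5.0,-3.0
EOF

# Hypothetical "normal" prototype embedding.
normal_x=1.0
normal_y=1.0

# Score each row with an illustrative similarity, 1 / (1 + squared distance),
# then sort ascending so the least similar row comes first.
scored=$(awk -F, -v px="$normal_x" -v py="$normal_y" '{
    d2 = ($2 - px) ^ 2 + ($3 - py) ^ 2
    printf "%.6f %s\n", 1 / (1 + d2), $1
  }' "$events" | LC_ALL=C sort -n)

# First row of the ascending order: the potential outlier.
outlier=$(printf '%s\n' "$scored" | head -n 1 | cut -d' ' -f2)
# Last row of the ascending order: what a top-1 nearest-neighbor lookup returns.
nearest=$(printf '%s\n' "$scored" | tail -n 1 | cut -d' ' -f2)

echo "least similar: $outlier, most similar: $nearest"
rm -f "$events"
```

With a distance metric instead of a similarity, the same rows would be found by reversing the sort.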
<!-- This is an auto-generated comment by CodeRabbit -->
It looks like the idea, layout, and structure of the content below, and maybe also the content itself, have been derived in one way or another from the existing introductory section about CrateDB's search features.
A good alternative would be to invest in improving the existing canonical pages instead of duplicating the same topic once more, based on content that has apparently been created using GenAI / LLMs (large language models) and is therefore blurry and possibly incorrect.
Maybe the most important ideas from the new content can be curated and merged into the existing pages, together with suggestions on how to improve their guidance?
Please clarify.
@coderabbitai: Please compare the existing content in folder `/docs/guide/feature/search/` against the new folder `/docs/guide/start/search/`, and evaluate the value and uniqueness of the new content, also considering my previous comment. Thanks!
🧩 Analysis chain
@amotl I'll examine both search content folders to compare their value and uniqueness as you've requested.
🏁 Script executed:
#!/bin/bash
# First, let's explore the existing search content structure
echo "=== Existing search content structure in /docs/guide/feature/search/ ==="
fd . docs/feature/search --type f | head -20
echo -e "\n=== Contents of existing search index ==="
if [ -f "docs/feature/search/index.md" ]; then
  cat docs/feature/search/index.md
elif [ -f "docs/feature/search/index.rst" ]; then
  cat docs/feature/search/index.rst
fi
echo -e "\n=== List all files in existing search directory ==="
ls -la docs/feature/search/ 2>/dev/null || echo "Directory not found"
Length of output: 2789
🏁 Script executed:
#!/bin/bash
# Let's examine the content of each file in the existing search folder
echo "=== Examining existing search documentation files ==="
for file in docs/feature/search/*; do
  if [ -f "$file" ]; then
    echo -e "\n==================== $file ===================="
    head -50 "$file"
    echo -e "\n[... file continues ...]"
    echo -e "Total lines: $(wc -l < "$file")"
  fi
done
Length of output: 2230
🏁 Script executed:
#!/bin/bash
# Let's also check if there are any other search-related files in the feature directory
echo "=== Searching for search-related content in feature directory ==="
fd search docs/feature --type f
echo -e "\n=== Searching for any files containing 'search' in their path under docs/feature ==="
find docs/feature -name "*search*" -type f 2>/dev/null
Length of output: 410
🏁 Script executed:
#!/bin/bash
# Let's examine the new search content structure we're comparing against
echo "=== New search content structure in /docs/start/query/search/ ==="
ls -la docs/start/query/search/ 2>/dev/null
echo -e "\n=== File sizes comparison ==="
echo "New search content:"
wc -l docs/start/query/search/*.md 2>/dev/null
if [ -d "docs/feature/search" ]; then
  echo -e "\nExisting search content:"
  wc -l docs/feature/search/*.md docs/feature/search/*.rst 2>/dev/null
fi
Length of output: 1297
🏁 Script executed:
#!/bin/bash
# Let's examine the content of the subdirectories in the existing search folder
echo "=== Examining FTS (Full-Text Search) content ==="
for file in docs/feature/search/fts/*; do
  if [ -f "$file" ]; then
    echo -e "\n--- $file ---"
    head -30 "$file"
    echo -e "\nTotal lines: $(wc -l < "$file")"
  fi
done
echo -e "\n=== Examining Geo Search content ==="
for file in docs/feature/search/geo/*; do
  if [ -f "$file" ]; then
    echo -e "\n--- $file ---"
    head -30 "$file"
    echo -e "\nTotal lines: $(wc -l < "$file")"
  fi
done
echo -e "\n=== Examining Vector Search content ==="
for file in docs/feature/search/vector/*; do
  if [ -f "$file" ]; then
    echo -e "\n--- $file ---"
    head -30 "$file"
    echo -e "\nTotal lines: $(wc -l < "$file")"
  fi
done
echo -e "\n=== Examining Hybrid Search content ==="
for file in docs/feature/search/hybrid/*; do
  if [ -f "$file" ]; then
    echo -e "\n--- $file ---"
    head -30 "$file"
    echo -e "\nTotal lines: $(wc -l < "$file")"
  fi
done
Length of output: 8441
Based on my analysis of both content sets, here's my comprehensive comparison:
Content Analysis Summary
Existing Content Structure (`docs/feature/search/`):
- Comprehensive reference documentation totaling 1,617 lines
- Well-organized into specialized subdirectories (fts/, geo/, vector/, hybrid/)
- Deep technical content with detailed explanations
- Includes advanced topics like analyzers, tokenizers, filters (273 lines)
- Contains practical tutorials (Netflix catalog example)
- Covers fuzzy search options and configuration details
New Content Structure (`docs/start/query/search/`):
- Introductory/getting-started content totaling 532 lines (33% of existing)
- Flat file structure with simplified organization
- Basic explanations and quick-start examples
- Focuses on common use cases and basic patterns
Key Differences & Overlap Assessment
1. Target Audience & Purpose
- Existing: Comprehensive reference for users wanting deep technical knowledge
- New: Getting-started guide for users new to CrateDB's search features
2. Content Depth
- Existing:
- Detailed configuration options
- Advanced analyzer concepts (273 lines dedicated to analyzers alone)
- Production considerations and best practices
- Comprehensive SQL reference examples
- New:
- Basic query patterns
- Simple use case examples
- GenAI-generated content with potential inaccuracies (as you noted)
3. Value Assessment
The new content appears to be a simplified, introductory version rather than unique content. While there's value in having beginner-friendly documentation, the concerns you raised are valid:
- Duplication: Covers the same core topics (full-text, geo, vector, hybrid search)
- Quality concerns: Being GenAI-generated, it may contain inaccuracies or "blurry" information
- Maintenance overhead: Two sets of documentation on the same topics
Recommendation
Based on the architectural pattern from your learnings about integration anchors and the content analysis, I recommend:
- Consolidate rather than duplicate: Improve the existing canonical documentation in `/docs/feature/search/` with better introductory sections
- Extract valuable elements: Curate any genuinely useful quick-start examples from the new content and integrate them into the existing structure
- Maintain single source of truth: The existing content is more comprehensive, technically accurate, and better organized
The toctree change from `../../feature/search/index` to `search/index` suggests an attempt to relocate rather than complement the existing content, which supports your concern about unnecessary duplication.
🤔 This analysis aligns with your instinct that the new content doesn't add significant unique value and may create maintenance burden through duplication.
🏁 Script executed:
#!/bin/bash
# Let's get the total line count and structure comparison
echo "=== Content Volume Comparison ==="
echo "Existing search documentation:"
echo "- Main index: $(wc -l < docs/feature/search/index.md) lines"
echo "- FTS index: $(wc -l < docs/feature/search/fts/index.md) lines"
echo "- FTS analyzer: $(wc -l < docs/feature/search/fts/analyzer.md) lines"
echo "- FTS options: $(wc -l < docs/feature/search/fts/options.md) lines"
echo "- FTS tutorial: $(wc -l < docs/feature/search/fts/tutorial.md) lines"
echo "- Geo index: $(wc -l < docs/feature/search/geo/index.md) lines"
echo "- Vector index: $(wc -l < docs/feature/search/vector/index.md) lines"
echo "- Hybrid index: $(wc -l < docs/feature/search/hybrid/index.md) lines"
existing_total=$(($(wc -l < docs/feature/search/index.md) + $(wc -l < docs/feature/search/fts/index.md) + $(wc -l < docs/feature/search/fts/analyzer.md) + $(wc -l < docs/feature/search/fts/options.md) + $(wc -l < docs/feature/search/fts/tutorial.md) + $(wc -l < docs/feature/search/geo/index.md) + $(wc -l < docs/feature/search/vector/index.md) + $(wc -l < docs/feature/search/hybrid/index.md)))
echo "- Existing total: $existing_total lines"
echo ""
echo "New search documentation:"
echo "- Index: $(wc -l < docs/start/query/search/index.md) lines"
echo "- Fulltext: $(wc -l < docs/start/query/search/fulltext.md) lines"
echo "- Geo: $(wc -l < docs/start/query/search/geo.md) lines"
echo "- Vector: $(wc -l < docs/start/query/search/vector.md) lines"
echo "- Hybrid: $(wc -l < docs/start/query/search/hybrid.md) lines"
new_total=$(($(wc -l < docs/start/query/search/index.md) + $(wc -l < docs/start/query/search/fulltext.md) + $(wc -l < docs/start/query/search/geo.md) + $(wc -l < docs/start/query/search/vector.md) + $(wc -l < docs/start/query/search/hybrid.md)))
echo "- New total: $new_total lines"
echo ""
echo "Ratio: New content is $(echo "scale=1; $new_total * 100 / $existing_total" | bc)% the size of existing content"
Length of output: 1582
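The per-file sums in the script above could also be computed per directory. The following is a minimal sketch, assuming two hypothetical helpers (`total_lines` and `percent_of`) and throwaway demo directories in place of the real documentation folders. Note that `printf "%.1f"` rounds, whereas `bc` with `scale=1` truncates, so the last digit may differ from the script above.

```shell
#!/bin/bash
# Sum the line counts of all Markdown/reST files below a directory.
# `total_lines` is a hypothetical helper, not part of the scripts above.
total_lines() {
  local dir=$1
  find "$dir" -type f \( -name '*.md' -o -name '*.rst' \) -exec cat {} + | wc -l | tr -d ' '
}

# Percentage of `part` relative to `whole`, rounded to one decimal place.
# Assumes `whole` is non-zero.
percent_of() {
  awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.1f", part * 100 / whole }'
}

# Demonstrate on throwaway directories so the sketch runs anywhere.
demo=$(mktemp -d)
mkdir -p "$demo/existing/fts" "$demo/new"
printf 'a\nb\nc\n' > "$demo/existing/index.md"
printf 'd\n' > "$demo/existing/fts/index.md"
printf 'x\ny\n' > "$demo/new/index.md"

existing_total=$(total_lines "$demo/existing")
new_total=$(total_lines "$demo/new")
ratio=$(percent_of "$new_total" "$existing_total")
echo "existing=$existing_total new=$new_total ratio=${ratio}%"
rm -rf "$demo"
```

Against the repository, the two calls would take `docs/feature/search` and `docs/start/query/search` as arguments.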
🏁 Script executed:
#!/bin/bash
# Let's check for content overlap by looking at key terms and concepts
echo "=== Content Analysis - Key Terms Frequency ==="
echo "Checking 'MATCH' function mentions:"
echo "Existing FTS: $(grep -c 'MATCH' docs/feature/search/fts/index.md)"
echo "New fulltext: $(grep -c 'MATCH' docs/start/query/search/fulltext.md)"
echo -e "\nChecking 'KNN_MATCH' function mentions:"
echo "Existing vector: $(grep -c 'KNN_MATCH' docs/feature/search/vector/index.md)"
echo "New vector: $(grep -c 'KNN_MATCH' docs/start/query/search/vector.md)"
echo -e "\nChecking 'HNSW' mentions:"
echo "Existing vector: $(grep -c 'HNSW' docs/feature/search/vector/index.md)"
echo "New vector: $(grep -c 'HNSW' docs/start/query/search/vector.md)"
echo -e "\nChecking 'BM25' mentions:"
echo "Existing FTS: $(grep -c 'BM25' docs/feature/search/fts/index.md)"
echo "New fulltext: $(grep -c 'BM25' docs/start/query/search/fulltext.md)"
echo "Existing hybrid: $(grep -c 'BM25' docs/feature/search/hybrid/index.md)"
echo "New hybrid: $(grep -c 'BM25' docs/start/query/search/hybrid.md)"
echo -e "\nChecking code examples:"
echo "Existing FTS SQL examples: $(grep -c '```sql' docs/feature/search/fts/index.md)"
echo "New fulltext SQL examples: $(grep -c '```sql' docs/start/query/search/fulltext.md)"
Length of output: 1684
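One caveat when reading the keyword counts above: `grep -c` reports the number of matching lines rather than the number of occurrences, and the pattern `MATCH` also matches inside `KNN_MATCH`. The following is a minimal sketch on a hypothetical three-line sample file (not taken from the repository) that shows the three ways of counting side by side.

```shell
#!/bin/bash
# Hypothetical sample file; not taken from the repository.
sample=$(mktemp)
cat > "$sample" <<'EOF'
WHERE MATCH(title, 'crate') AND MATCH(body, 'sql')
WHERE KNN_MATCH(embedding, [0.1, 0.2], 3)
No search predicate on this line.
EOF

# grep -c counts matching lines, so the first line counts only once.
line_count=$(grep -c 'MATCH' "$sample")

# grep -o prints each occurrence on its own line, so wc -l counts occurrences.
occurrence_count=$(grep -o 'MATCH' "$sample" | wc -l | tr -d ' ')

# Adding -w requires word boundaries, which excludes KNN_MATCH
# because the underscore is a word character.
standalone_count=$(grep -ow 'MATCH' "$sample" | wc -l | tr -d ' ')

echo "lines=$line_count occurrences=$occurrence_count standalone=$standalone_count"
rm -f "$sample"
```

The three numbers differ on this sample, so keyword counts of this kind are best treated as rough indicators of coverage.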
Based on the comprehensive analysis, here are the key findings and recommendations:
Quantitative Analysis
Content Volume:
- Existing documentation: 1,734 lines (comprehensive reference)
- New documentation: 532 lines (~31% of existing content)
Technical Accuracy Concerns:
- New fulltext content mentions `MATCH` 11 times but completely omits BM25 (0 mentions vs 16 in existing)
- This confirms your concern about GenAI-generated content being "blurry and possibly incorrect"
- The new content has more basic SQL examples (9 vs 4) but lacks the technical depth
Value Assessment
Minimal Unique Value:
- Topic Duplication: All four main topics (fulltext, geo, vector, hybrid) are covered in both sets
- Quality Regression: Missing key technical concepts like BM25 scoring in fulltext search
- Architectural Inconsistency: Flat structure vs. well-organized subdirectories with specialized content
Existing Content Strengths:
- Comprehensive coverage with advanced topics (analyzers, tokenizers, configuration)
- Technical accuracy with proper BM25 references
- Practical tutorials (Netflix catalog example)
- Professional organization with logical subdirectories
Recommendations
1. Consolidate, Don't Duplicate
- Keep the canonical `/docs/feature/search/` documentation as the primary source
- Extract any genuinely useful quick-start examples from the new content and integrate them into existing "Getting Started" sections within the feature documentation
2. Improve Existing Documentation
- Add more introductory sections to `/docs/feature/search/fts/index.md` and related files
- Include the better SQL examples from the new content (where technically accurate)
- Consider adding a "Quick Start" section to each existing topic
3. Maintain Single Source of Truth
- Remove the new search folder to avoid maintenance overhead
- Revert the toctree change from `search/index` back to `../../feature/search/index`
- Focus improvement efforts on the existing, technically superior documentation
4. Quality Control
- Given the BM25 omission and other potential inaccuracies in GenAI content, thorough technical review would be required before any integration
Your instinct is correct: this appears to be unnecessary duplication that creates maintenance burden without adding substantial value. The existing documentation is more comprehensive, technically accurate, and better organized.
🤔 The analysis strongly supports consolidating efforts on improving the existing canonical documentation rather than maintaining parallel content sets.
🧠 Learnings used
Learnt from: amotl
PR: crate/cratedb-guide#238
File: docs/integrate/azure-functions/learn.rst:1-1
Timestamp: 2025-08-09T16:19:43.966Z
Learning: In the CrateDB Guide documentation, main integration anchors (e.g., `azure-functions`) are intentionally placed in the `index.md` files of their respective integration folders, while detailed tutorials use the `-learn` suffix (e.g., `azure-functions-learn`) in their `learn.rst` or `learn.md` files. This is a deliberate architectural pattern for the documentation restructuring.
daec064
to
2e93901
Compare
About
This patch pulls more content into the "Getting Started" section, about CrateDB's search features this time.
Sources
Preview
Thoughts
Caveats
Warning
Fragments of this content might have been generated using GenAI / LLMs. In this spirit, the patch needs special attention during review, and possibly also some mitigations to tone down overconfidence and jargon, and to improve coherence and correctness.
The content has been copied 1:1 from a GitBook instance with only minor copy-editing about markup syntax differences.
References