@amotl amotl commented Aug 23, 2025

About

This patch pulls more content into the "Getting Started" section, about CrateDB's search features this time.

Sources

Preview

Thoughts

It looks like this section has been significantly derived from an existing section, see #264 (review).

Caveats

Warning

Fragments of this content might have been generated using GenAI / LLMs. The patch therefore needs special attention during review, and possibly some mitigations to tone down overconfidence and jargon and to improve coherence and correctness.

The content has been copied 1:1 from a GitBook instance with only minor copy-editing about markup syntax differences.

References

coderabbitai bot commented Aug 23, 2025

Warning

Rate limit exceeded

@amotl has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 12 minutes and 44 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between daec064 and 2e93901.

📒 Files selected for processing (6)
  • docs/start/query/index.md (1 hunks)
  • docs/start/query/search/fulltext.md (1 hunks)
  • docs/start/query/search/geo.md (1 hunks)
  • docs/start/query/search/hybrid.md (1 hunks)
  • docs/start/query/search/index.md (1 hunks)
  • docs/start/query/search/vector.md (1 hunks)

Walkthrough

Introduces a new Search documentation section under docs/start/query/search with pages for full-text, geospatial, vector, and hybrid search, adds a section index, and updates the toctree link in docs/start/query/index.md to point to the new Search index.

Changes

| Cohort / File(s) | Summary |
| ---------------- | ------- |
| Search docs: new section and pages (docs/start/query/search/index.md, docs/start/query/search/fulltext.md, docs/start/query/search/geo.md, docs/start/query/search/vector.md, docs/start/query/search/hybrid.md) | Adds Search section index and four new pages covering full-text, geospatial, vector, and hybrid search, including SQL examples, concepts, and tips. Note: a stray token appears in a geo example code block. |
| Navigation update (docs/start/query/index.md) | Updates toctree entry to reference search/index instead of ../../feature/search/index. |

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor User
    participant SQL as SQL Engine
    participant VEC as Vector Index (HNSW)
    participant TXT as Full-Text Index (BM25)
    participant RES as Result Merger

    User->>SQL: Submit hybrid search (CTEs: KNN_MATCH + MATCH)
    SQL->>VEC: Run kNN on embeddings
    SQL->>TXT: Run BM25 keyword search
    VEC-->>SQL: Top-K vector results with _score
    TXT-->>SQL: Text results with _score
    SQL->>RES: Join results on id, compute hybrid_score
    RES-->>User: Ranked rows by hybrid_score
    note over RES: Fusion: weighted sum or RRF

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

guidance

Suggested reviewers

  • hammerhead
  • karynzv
  • surister
  • bmunkholm

Poem

A rabbit taps keys with a whiskered grin,
New search trails mapped—let queries begin!
Vectors and words hop side by side,
Geo burrows guide where cities hide.
Docs bloom like clover, crisp and bright—
Hybrid scores soar, what a delight!
Thump-thump: shipped just right.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/start/query/index.md (1)

27-31: Broken ref: grid card uses undefined search-overview anchor.

The new index file defines (start-search)=, not search-overview. Update the card link to point to the new label.

Apply this diff:

-:::{grid-item-card} Search
-:link: search-overview
+:::{grid-item-card} Search
+:link: start-search
 :link-type: ref
 Based on Apache Lucene, CrateDB offers native BM25 term search and vector search, all using SQL. By combining it, also using SQL, you can implement powerful single-query hybrid search.
 :::

This aligns with the new docs/start/query/search/index.md. (cratedb.com)

🧹 Nitpick comments (20)
docs/start/query/search/fulltext.md (3)

94-101: Clarify that MATCH requires a FULLTEXT index and show nested-field indexing.

Examples that call MATCH(payload['comment'], ...) will only work if that field is indexed using FULLTEXT. Consider adding a quick index DDL before the example or switch the predicate to target an index identifier.

Apply this augmentation right before the “Search Nested JSON” example:

+Before querying a nested field with MATCH, ensure it is FULLTEXT‑indexed:
+
+```sql
+CREATE TABLE feedback (
+  id INTEGER,
+  payload OBJECT(DYNAMIC),
+  INDEX comment_ft USING FULLTEXT (payload['comment'])
+);
+```

Then update the query to target the index:

-WHERE MATCH(payload['comment'], 'battery life');
+WHERE MATCH(comment_ft, 'battery life');

References: Full-text MATCH must target fulltext-indexed columns; examples of index identifiers and per‑query options. (cratedb.com)

Also applies to: 102-107


99-100: DDL style nit: prefer explicit index clause for clarity.

The inline TEXT INDEX USING FULLTEXT WITH (analyzer='english') is fine, but most CrateDB docs demonstrate named FULLTEXT indexes for discoverability and multi-column patterns. Consider:

-CREATE TABLE docs ( id INTEGER, text TEXT INDEX USING FULLTEXT WITH (analyzer = 'english') ); 
+CREATE TABLE docs (
+  id   INTEGER,
+  text TEXT,
+  INDEX text_ft USING FULLTEXT (text) WITH (analyzer = 'english')
+);

This also pairs nicely with the updated MATCH examples targeting text_ft. (cratedb.com)


140-144: Add working cross-references/links for “Learn More”.

The bullets are placeholders. Convert them to Sphinx/MyST refs pointing to CrateDB reference pages (MATCH, analyzers, fulltext indices) and/or guide pages so readers can click through.

Example (adjust labels to your docs build):

-* Full-text Search Data Model
-* MATCH Clause Documentation
-* How CrateDB Differs from Elasticsearch
-* Tutorial: Full-text Search on Logs
+* {ref}`crate-reference:fulltext` (MATCH predicate)
+* {ref}`crate-reference:create-analyzer` (Custom analyzers)
+* {ref}`crate-guide:feature/search/fts/analyzer` (Analyzer guide)
+* {ref}`crate-guide:feature/search/fts/index` (Full-text search tutorials)

Refs: MATCH predicate; analyzer docs. (cratedb.com)

docs/start/query/search/geo.md (2)

111-114: Call out that exact functions bypass indexes.

You mention cost of exact computations; explicitly note that within(...), intersects(...), and distance(...) do not use the geo index and can be slow on large result sets. Encourage combining them with prefilters or using MATCH first.

Suggested addition after the paragraph:

+Note: `within(...)`, `intersects(...)`, and `distance(...)` are exact and
+operate on the stored shapes without using the geo index; apply on narrowed
+result sets or prefer `MATCH` for broad filtering.

Reference: Geo exact queries guidance. (cratedb.com)
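
The prefilter advice above can be sketched in SQL. This is a minimal illustration, not text from the page under review: the table `parks`, its `GEO_SHAPE` column `area`, and the polygon coordinates are hypothetical. The `MATCH` predicate is the index-backed filter, and the exact `intersects(...)` check is applied with the same query shape. How the planner orders the two predicates may depend on the CrateDB version, so verify the behavior before relying on it.

```sql
-- Hypothetical schema: parks(name TEXT, area GEO_SHAPE)
SELECT name
FROM parks
-- Index-backed, approximate filter on the geo index
WHERE MATCH(area, 'POLYGON ((13.0 52.3, 13.8 52.3, 13.8 52.7, 13.0 52.7, 13.0 52.3))') USING intersects
  -- Exact check on the stored shape, intended for the narrowed candidates
  AND intersects(area, 'POLYGON ((13.0 52.3, 13.8 52.3, 13.8 52.7, 13.0 52.7, 13.0 52.3))');
```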


119-124: Index type quoting/style nit.

CrateDB examples typically use double quotes for index type literals (e.g., "quadtree") or omit quotes. Align with reference style for consistency across docs.

-  area GEO_SHAPE INDEX USING 'quadtree'
+  area GEO_SHAPE INDEX USING "quadtree"

Reference style example. (cratedb.com)

docs/start/query/search/index.md (1)

4-11: Nice, minimal toctree; consider adding short intro text.

Optional: add one sentence below the H1 to orient readers (what “Search” covers: full‑text, geo, vector, hybrid).

docs/start/query/search/hybrid.md (2)

43-71: Make the SQL runnable; avoid ellipses and consider broader join.

  • Replace [0.2, 0.1, ..., 0.3] with a concrete vector; ellipses will break copy‑paste.
  • Optional: many apps want items that match only one modality. Consider a FULL OUTER JOIN with COALESCE and default scores for missing sides.

Apply this diff to the vector literal:

-        WHERE KNN_MATCH(embedding, [0.2, 0.1, ..., 0.3], 10)
+        WHERE KNN_MATCH(embedding, [0.2, 0.1, 0.7, 0.3], 10)

Alternative join pattern (illustrative):

WITH
  vector_results AS (
    SELECT id, _score AS vector_score
    FROM documents
    WHERE knn_match(embedding, [0.2, 0.1, 0.7, 0.3], 50)
  ),
  bm25_results AS (
    SELECT id, _score AS bm25_score
    FROM documents
    WHERE match(content, 'knn search')
  )
SELECT
  COALESCE(b.id, v.id) AS id,
  COALESCE(bm25_score, 0.0) AS bm25_score,
  COALESCE(vector_score, 0.0) AS vector_score,
  0.5 * COALESCE(bm25_score, 0.0) + 0.5 * COALESCE(vector_score, 0.0) AS hybrid_score
FROM bm25_results b
FULL OUTER JOIN vector_results v ON v.id = b.id
ORDER BY hybrid_score DESC
LIMIT 10;

References: knn_match usage and _score; fulltext MATCH in WHERE. (cratedb.com)


73-93: RRF section: optionally include the formula for clarity.

If space permits, add a one-liner: RRF(d) = Σ_i 1 / (k + rank_i(d)), with a typical k like 60. Helps readers reproduce the numbers.

Happy to add a runnable SQL example computing RRF from two rank lists.
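
One possible shape for such an example is sketched below. It reuses the `documents` table with the `content` and `embedding` columns from the hybrid query above, and assumes that `content` is FULLTEXT-indexed and `embedding` is declared as `FLOAT_VECTOR(4)`. The CTE names, the candidate counts, and k = 60 are illustrative choices; check the `RANK()` window function over `_score` against your target CrateDB version.

```sql
WITH
  bm25_ranked AS (
    -- Rank keyword matches by BM25 score (rank 1 = best)
    SELECT id, RANK() OVER (ORDER BY _score DESC) AS bm25_rank
    FROM documents
    WHERE MATCH(content, 'knn search')
  ),
  vector_ranked AS (
    -- Rank nearest neighbours by vector score (rank 1 = closest)
    SELECT id, RANK() OVER (ORDER BY _score DESC) AS vector_rank
    FROM documents
    WHERE KNN_MATCH(embedding, [0.2, 0.1, 0.7, 0.3], 50)
  )
SELECT
  b.id,
  b.bm25_rank,
  v.vector_rank,
  -- RRF(d) = sum over rank lists of 1 / (k + rank), with k = 60
  1.0 / (60 + b.bm25_rank) + 1.0 / (60 + v.vector_rank) AS rrf_score
FROM bm25_ranked b
JOIN vector_ranked v ON v.id = b.id
ORDER BY rrf_score DESC
LIMIT 10;
```

With k = 60, a document ranked first in both lists scores 1/61 + 1/61 ≈ 0.0328. The inner join keeps only documents that appear in both lists; a FULL OUTER JOIN variant would need a default rank for the missing side.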

docs/start/query/search/vector.md (12)

13-19: Fix table formatting and temper “immediately searchable” claim.

  • Add a header row so the Markdown table renders reliably.
  • “Immediately searchable” is misleading for near-real-time systems. Suggest calling out the default refresh interval instead.
  • Escaping underscores in plain table text is unnecessary.
- | FLOAT\_VECTOR       | Store embeddings up to 2048 dimensions                       |
- | ------------------- | ------------------------------------------------------------ |
- | KNN\_MATCH          | SQL-native k-nearest neighbor function with `_score` support |
- | VECTOR\_SIMILARITY  | Compute similarity scores between vectors in queries         |
- | Real-time indexing  | Fresh vectors are immediately searchable                     |
- | Hybrid queries      | Combine vector search with filters, full-text, and JSON      |
+| Feature                | Description                                                  |
+|------------------------|--------------------------------------------------------------|
+| FLOAT_VECTOR           | Store embeddings up to 2048 dimensions                       |
+| KNN_MATCH              | SQL-native k-nearest neighbor function with `_score` support |
+| VECTOR_SIMILARITY      | Compute similarity scores between vectors in queries         |
+| Near real-time indexing| Fresh vectors become searchable after a short refresh (≈1s)  |
+| Hybrid queries         | Combine vector search with filters, full-text, and JSON      |

Note: Please verify the dimension limit (“up to 2048”) against the current CrateDB version you target. If that limit varies by version, consider adding a short “Compatibility” note.


22-31: Add a minimal DDL so readers know the expected schema and vector length.

KNN examples are clearer when the column type and vector dimensionality are explicit.

 ### K-Nearest Neighbors (KNN) Search

+```sql
+-- Example schema (4-dimensional vectors)
+CREATE TABLE word_embeddings (
+  id INT,
+  text TEXT,
+  embedding FLOAT_VECTOR(4)
+);
+```
+
 ```sql
 SELECT text, _score
 FROM word_embeddings
 WHERE KNN_MATCH(embedding, [0.3, 0.6, 0.0, 0.9], 3)
 ORDER BY _score DESC;
 ```

If you prefer not to add the DDL here, add a one-liner note stating “embedding is FLOAT_VECTOR(4)”. Also, if “2048” above is not guaranteed, avoid mixing dimensions across samples.


35-41: Keep vector dimensionality consistent across examples.

This example switches to a 3-D vector. Either declare features FLOAT_VECTOR(3) or keep all examples 4-D for continuity.

 WHERE category = 'shoes'
-  AND KNN_MATCH(features, [0.2, 0.1, 0.3], 5)
+  AND KNN_MATCH(features, [0.2, 0.1, 0.3, 0.4], 5)
 ORDER BY _score DESC;
45-50: Clarify placeholder usage and avoid redundant sorting signals.

  • Define what [q_vector] stands for (e.g., a 4-D array bound as a parameter).
  • Since you compute score with VECTOR_SIMILARITY, order by that to make intent explicit.
-SELECT id, VECTOR_SIMILARITY(emb, [q_vector]) AS score
+-- q_vector is a 4-D array matching emb's FLOAT_VECTOR(4)
+SELECT id, VECTOR_SIMILARITY(emb, [q_vector]) AS score
 FROM items
-WHERE KNN_MATCH(emb, [q_vector], 10)
-ORDER BY score DESC;
+WHERE KNN_MATCH(emb, [q_vector], 10)
+ORDER BY score DESC;

Optionally add: “Higher scores indicate greater similarity” (assuming cosine or dot-product semantics in your target version).


58-63: Cap examples with LIMIT for reproducibility.

Most prior examples use small k; adding LIMIT mirrors real usage and avoids long result sets in docs output.

 SELECT id, title
 FROM documents
 WHERE KNN_MATCH(embedding, [query_emb], 5)
 ORDER BY _score DESC;
+-- LIMIT 5;  -- optional; ORDER BY with KNN k=5 usually yields ≤ 5 rows

66-73: Minor: Keep dimensions and naming aligned with earlier samples.

If you settle on 4-D throughout, update [user_emb] to a 4-element vector for consistency, or add a note that feature_vec is FLOAT_VECTOR(4).

-  AND KNN_MATCH(feature_vec, [user_emb], 4)
+  AND KNN_MATCH(feature_vec, [user_emb], 4)
 -- where user_emb is a 4-D vector matching feature_vec

75-83: Consistency: add LIMIT and/or clarify vector length in chat example.

Optional but keeps examples uniform and avoids confusion.

 WHERE KNN_MATCH(vec, [query_emb], 3)
 ORDER BY _score DESC;
+-- LIMIT 3;

95-104: Make “HNSW index” guidance actionable and name concrete tuning knobs.

The tips are good but abstract. Add a small DDL showing how to create an HNSW index and mention tuning parameters (e.g., ef_construction, m, and query-time ef_search/num_candidates), plus when/where they’re set.

 ## Performance & Indexing Tips
@@
-| Create HNSW index when supported   | Enables fast ANN queries via Lucene                     |
+| Create HNSW index for vectors      | Enables fast ANN queries via Lucene HNSW                |
@@
-| Tune `KNN_MATCH`                   | Adjust neighbor count per shard or globally             |
+| Tune ANN parameters                | Adjust k in `KNN_MATCH` and query-time knobs (e.g., ef) |

+### Example: Create an HNSW index
+```sql
+-- Verify syntax/params against your target CrateDB version
+CREATE INDEX idx_items_emb_hnsw
+ON items (emb)
+USING hnsw
+WITH (m = 16, ef_construction = 128);
+```
+
+### Example: Tune query-time parameters
+```sql
+-- Pseudocode; replace with the correct setting mechanism for your version
+SET SESSION search_ann_ef = 100;
+SELECT id, _score
+FROM items
+WHERE KNN_MATCH(emb, [qvec], 10)
+ORDER BY _score DESC;
+```

Please double-check the exact parameter names and how they’re set in the current release before merging.


105-114: Add minimal version support note.

State the minimum CrateDB version that ships FLOAT_VECTOR/KNN_MATCH so users know whether they can follow along.

 ## When to Use CrateDB for Vector Search
+
+> Note: Vector search features (FLOAT_VECTOR, KNN_MATCH, VECTOR_SIMILARITY) require CrateDB ≥ X.Y. Confirm version compatibility before use.

115-124: Cross-link “Hybrid search” to the sibling page in this PR.

Make it easy to jump to the new Hybrid guide.

-| Hybrid search      | Combine ANN search with full-text, geo, JSON    |
+| Hybrid search      | Combine ANN search with full-text, geo, JSON (see [Hybrid search](../hybrid.md)) |

125-131: Add direct links for function references.

You mention a “KNN_MATCH & VECTOR_SIMILARITY reference” but there’s no URL. Link to the canonical SQL reference pages.

 * [Vector Search Guide](https://cratedb.com/docs/guide/feature/search/vector/index.html) 
-* `KNN_MATCH` & `VECTOR_SIMILARITY` reference
+* `KNN_MATCH` & `VECTOR_SIMILARITY` reference: add links to the official SQL docs
 * [Intro Blog: Vector support & KNN search in CrateDB](https://cratedb.com/blog/unlocking-the-power-of-vector-support-and-knn-search-in-cratedb)
 * [LangChain & Vector Store integration](https://cratedb.com/docs/guide/domain/ml/index.html)

If you want, I can locate and insert the exact doc URLs.


3-10: Minor: add a quick “How it works” sentence.

One sentence on how _score is produced (e.g., cosine similarity) helps readers reason about ordering, thresholds, and anomaly logic.

 CrateDB supports **native vector search**, enabling you to perform **similarity-based retrieval** directly in SQL, without needing a separate vector database or search engine.
 
@@
-Vector search retrieves the most semantically similar items to a query vector using **Approximate Nearest Neighbor (ANN)** algorithms (e.g., HNSW via Lucene). CrateDB provides unified SQL support for this via `KNN_MATCH`.
+Vector search retrieves the most semantically similar items to a query vector using **Approximate Nearest Neighbor (ANN)** algorithms (e.g., HNSW via Lucene). CrateDB exposes this via `KNN_MATCH`, which computes an internal `_score` (higher = more similar) usable in `ORDER BY`.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 055ae17 and daec064.

📒 Files selected for processing (6)
  • docs/start/query/index.md (1 hunks)
  • docs/start/query/search/fulltext.md (1 hunks)
  • docs/start/query/search/geo.md (1 hunks)
  • docs/start/query/search/hybrid.md (1 hunks)
  • docs/start/query/search/index.md (1 hunks)
  • docs/start/query/search/vector.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/start/query/search/fulltext.md

[grammar] ~15-~15: There might be a mistake here.
Context: ... | | --------------------- | --------------...

(QB_NEW_EN)


[grammar] ~16-~16: There might be a mistake here.
Context: ...-------------------------------------- | | Full-text indexing | Tokenized, lan...

(QB_NEW_EN)


[grammar] ~17-~17: There might be a mistake here.
Context: ...language-aware search on any text | | SQL + search | Combine struct...

(QB_NEW_EN)


[grammar] ~18-~18: There might be a mistake here.
Context: ...uctured filters with keyword queries | | JSON support | Search within ...

(QB_NEW_EN)


[grammar] ~19-~19: There might be a mistake here.
Context: ...in nested object fields | | Real-time ingestion | Search new dat...

(QB_NEW_EN)


[grammar] ~20-~20: There might be a mistake here.
Context: ...data immediately—no sync delay | | Scalable architecture | Built to handl...

(QB_NEW_EN)


[grammar] ~110-~110: There might be a mistake here.
Context: ... It Helps | | -------------------------------- | ---...

(QB_NEW_EN)


[grammar] ~111-~111: There might be a mistake here.
Context: ...-------------------------------------- | | Use TEXT with FULLTEXT index | Ena...

(QB_NEW_EN)


[grammar] ~112-~112: There might be a mistake here.
Context: ...bles tokenized search | | Index only needed fields | Red...

(QB_NEW_EN)


[grammar] ~113-~113: There might be a mistake here.
Context: ...uce indexing overhead | | Pick appropriate analyzer | Mat...

(QB_NEW_EN)


[grammar] ~114-~114: There might be a mistake here.
Context: ...ch the language and context | | Use MATCH() not LIKE | Ful...

(QB_NEW_EN)


[grammar] ~115-~115: There might be a mistake here.
Context: ...l-text is more performant and relevant | | Combine with filters | Boo...

(QB_NEW_EN)


[grammar] ~130-~130: There might be a mistake here.
Context: ... | | --------------------- | --------------...

(QB_NEW_EN)


[grammar] ~131-~131: There might be a mistake here.
Context: ...-------------------------------------- | | Language analyzers | Built-in suppo...

(QB_NEW_EN)


[grammar] ~132-~132: There might be a mistake here.
Context: ...rt for many languages | | JSON object support | Index and sear...

(QB_NEW_EN)


[grammar] ~133-~133: There might be a mistake here.
Context: ...ch nested fields | | SQL + full-text | Unified querie...

(QB_NEW_EN)


[grammar] ~134-~134: There might be a mistake here.
Context: ...s for structured and unstructured data | | Distributed execution | Fast, scalable...

(QB_NEW_EN)


[grammar] ~135-~135: There might be a mistake here.
Context: ... search across nodes | | Aggregations | Group and anal...

(QB_NEW_EN)


[grammar] ~140-~140: There might be a mistake here.
Context: ...earn More * Full-text Search Data Model * MATCH Clause Documentation * How CrateDB...

(QB_NEW_EN)


[grammar] ~141-~141: There might be a mistake here.
Context: ... Data Model * MATCH Clause Documentation * How CrateDB Differs from Elasticsearch *...

(QB_NEW_EN)


[grammar] ~142-~142: There might be a mistake here.
Context: ...* How CrateDB Differs from Elasticsearch * Tutorial: Full-text Search on Logs ## S...

(QB_NEW_EN)

docs/start/query/search/vector.md

[grammar] ~13-~13: There might be a mistake here.
Context: ... 2048 dimensions | | ------------------- | ----------------...

(QB_NEW_EN)


[grammar] ~14-~14: There might be a mistake here.
Context: ...-------------------------------------- | | KNN_MATCH | SQL-native k-nea...

(QB_NEW_EN)


[grammar] ~15-~15: There might be a mistake here.
Context: ...eighbor function with _score support | | VECTOR_SIMILARITY | Compute similari...

(QB_NEW_EN)


[grammar] ~16-~16: There might be a mistake here.
Context: ...res between vectors in queries | | Real-time indexing | Fresh vectors ar...

(QB_NEW_EN)


[grammar] ~17-~17: There might be a mistake here.
Context: ...diately searchable | | Hybrid queries | Combine vector s...

(QB_NEW_EN)


[grammar] ~97-~97: There might be a mistake here.
Context: ... | | ---------------------------------- | -...

(QB_NEW_EN)


[grammar] ~98-~98: There might be a mistake here.
Context: ...-------------------------------------- | | Use FLOAT_VECTOR | E...

(QB_NEW_EN)


[grammar] ~99-~99: There might be a mistake here.
Context: ...ixed-size arrays up to 2048 dimensions | | Create HNSW index when supported | E...

(QB_NEW_EN)


[grammar] ~100-~100: There might be a mistake here.
Context: ...queries via Lucene | | Consistent vector length | A...

(QB_NEW_EN)


[grammar] ~101-~101: There might be a mistake here.
Context: ...st match column definition | | Pre-filter with structured filters | R...

(QB_NEW_EN)


[grammar] ~102-~102: There might be a mistake here.
Context: ...overhead | | Tune KNN_MATCH | A...

(QB_NEW_EN)


[grammar] ~117-~117: There might be a mistake here.
Context: ...on | | ------------------ | -----------------...

(QB_NEW_EN)


[grammar] ~118-~118: There might be a mistake here.
Context: ...-------------------------------------- | | FLOAT_VECTOR | Native support fo...

(QB_NEW_EN)


[grammar] ~119-~119: There might be a mistake here.
Context: ...pport for high-dimensional arrays | | KNN_MATCH | Core SQL predicat...

(QB_NEW_EN)


[grammar] ~120-~120: There might be a mistake here.
Context: ...predicate for vector similarity search | | VECTOR_SIMILARITY | Compute proximity...

(QB_NEW_EN)


[grammar] ~121-~121: There might be a mistake here.
Context: ...roximity scores in SQL | | Lucene HNSW ANN | Efficient graph-b...

(QB_NEW_EN)


[grammar] ~122-~122: There might be a mistake here.
Context: ... graph-based search engine | | Hybrid search | Combine ANN searc...

(QB_NEW_EN)


[grammar] ~128-~128: There might be a mistake here.
Context: ...N_MATCH&VECTOR_SIMILARITY` reference * [Intro Blog: Vector support & KNN search ...

(QB_NEW_EN)


[grammar] ~129-~129: There might be a mistake here.
Context: ...: Vector support & KNN search in CrateDB](https://cratedb.com/blog/unlocking-the-power-of-vector-support-and-knn-search-in-cratedb) * [LangChain & Vector Store integration](ht...

(QB_NEW_EN)

docs/start/query/search/hybrid.md

[grammar] ~21-~21: There might be a mistake here.
Context: ...cally: * BM25 for keyword relevance * kNN for semantic proximity in vector s...

(QB_NEW_EN)


[grammar] ~26-~26: There might be a mistake here.
Context: ...x combination** (weighted sum of scores) * Reciprocal Rank Fusion (RRF) ## Suppo...

(QB_NEW_EN)


[grammar] ~31-~31: There might be a mistake here.
Context: ...ion | | --------------------- | ------------- ...

(QB_NEW_EN)


[grammar] ~32-~32: There might be a mistake here.
Context: ...-------------------------------------- | | Vector search | KNN_MATCH() ...

(QB_NEW_EN)


[grammar] ~33-~33: There might be a mistake here.
Context: ...ctors closest to a given vector | | Full-text search | MATCH() ...

(QB_NEW_EN)


[grammar] ~34-~34: There might be a mistake here.
Context: ...ene's BM25 scoring | | Geospatial search | MATCH() ...

(QB_NEW_EN)


[grammar] ~79-~79: There might be a mistake here.
Context: ... | | ------------- | ----------- | --------...

(QB_NEW_EN)


[grammar] ~80-~80: There might be a mistake here.
Context: ...-------------------------------------- | | 0.7440 | 1.0000 | 0.5734 ...

(QB_NEW_EN)


[grammar] ~81-~81: There might be a mistake here.
Context: ...tch(float_vector, float_vector, int) | | 0.4868 | 0.5512 | 0.4439 ...

(QB_NEW_EN)


[grammar] ~82-~82: There might be a mistake here.
Context: ...ng On Multiple Columns | | 0.4716 | 0.5694 | 0.4064 ...

(QB_NEW_EN)


[grammar] ~87-~87: There might be a mistake here.
Context: ... | | ----------- | ---------- | -----------...

(QB_NEW_EN)


[grammar] ~88-~88: There might be a mistake here.
Context: ...-------------------------------------- | | 0.03278 | 1 | 1 ...

(QB_NEW_EN)


[grammar] ~89-~89: There might be a mistake here.
Context: ...tch(float_vector, float_vector, int) | | 0.03105 | 7 | 2 ...

(QB_NEW_EN)


[grammar] ~90-~90: There might be a mistake here.
Context: ...ng On Multiple Columns | | 0.03057 | 8 | 3 ...

(QB_NEW_EN)


[grammar] ~97-~97: There might be a mistake here.
Context: ... | | ------------------------- | ----------...

(QB_NEW_EN)


[grammar] ~98-~98: There might be a mistake here.
Context: ...-------------------------------------- | | 🔍 Improved relevance | Combines s...

(QB_NEW_EN)


[grammar] ~99-~99: There might be a mistake here.
Context: ...d-based matches | | ⚙️ Pure SQL | No DSLs or ...

(QB_NEW_EN)


[grammar] ~100-~100: There might be a mistake here.
Context: ...—runs directly in CrateDB | | ⚡ High performance | Built on Ap...

(QB_NEW_EN)


[grammar] ~101-~101: There might be a mistake here.
Context: ...CrateDB’s distributed SQL engine | | 🔄 Flexible ranking | Use scoring...

(QB_NEW_EN)


[grammar] ~104-~104: There might be a mistake here.
Context: ...RF, etc.) based on use case needs | ## Usage in Applications Hybrid search is pa...

(QB_NEW_EN)


[grammar] ~108-~108: There might be a mistake here.
Context: ...arly effective for: * Knowledge bases * Product or document search * **Multili...

(QB_NEW_EN)


[grammar] ~109-~109: There might be a mistake here.
Context: ...e bases** * Product or document search * Multilingual content search * **FAQ bo...

(QB_NEW_EN)


[grammar] ~110-~110: There might be a mistake here.
Context: ...search** * Multilingual content search * FAQ bots and semantic assistants * **A...

(QB_NEW_EN)


[grammar] ~111-~111: There might be a mistake here.
Context: ...h** * FAQ bots and semantic assistants * AI-powered search experiences It allo...

(QB_NEW_EN)

docs/start/query/search/geo.md

[style] ~22-~22: To form a complete sentence, be sure to include a subject.
Context: ...e point using latitude and longitude. * Can be inserted as: * An array: `[longitu...

(MISSING_IT_THERE)


[grammar] ~22-~22: There might be a mistake here.
Context: ...ude and longitude. * Can be inserted as: * An array: [longitude, latitude] * A ...

(QB_NEW_EN)


[grammar] ~29-~29: There might be a mistake here.
Context: ...WKT formats. * Supported geometry types: * Point, MultiPoint * LineString, `MultiL...

(QB_NEW_EN)


[grammar] ~30-~30: There might be a mistake here.
Context: ... Supported geometry types: * Point, MultiPoint * LineString, MultiLineString * Polygon, `Mult...

(QB_NEW_EN)


[grammar] ~31-~31: There might be a mistake here.
Context: ...Point, MultiPoint * LineString, MultiLineString * Polygon, MultiPolygon * `GeometryCollection...

(QB_NEW_EN)


[grammar] ~34-~34: There might be a mistake here.
Context: ...GeometryCollection * Insertable using: * A GeoJSON object * A WKT string ## In...

(QB_NEW_EN)


[grammar] ~35-~35: There might be a mistake here.
Context: ...* Insertable using: * A GeoJSON object * A WKT string ## Inserting Spatial Data ...

(QB_NEW_EN)


[grammar] ~103-~103: There might be a mistake here.
Context: ... | | ------------------- | ----------------...

(QB_NEW_EN)


[grammar] ~104-~104: There might be a mistake here.
Context: ...-------------------------------------- | | geohash (default) | Hash-based prefi...

(QB_NEW_EN)


[grammar] ~105-~105: There might be a mistake here.
Context: ... for point-based queries | | quadtree | Space-partitioni...

(QB_NEW_EN)


[grammar] ~106-~106: There might be a mistake here.
Context: ...ng recursive quadrant splits | | bkdtree | Lucene BKD tree ...

(QB_NEW_EN)

🔇 Additional comments (5)
docs/start/query/search/fulltext.md (1)

1-147: Overall: solid, but tighten correctness around MATCH usage.

Great structure and examples. The key fixes above will ensure all snippets run unmodified on CrateDB.

docs/start/query/search/geo.md (2)

60-68: Function list looks good; confirm availability.

distance, within, intersects, area, geohash are supported. latitude(point) and longitude(point) exist and return coordinates.

References: Scalar function docs. (cratedb.com)


101-108: Index types table is accurate.

The geohash (default), quadtree, and bkdtree options are correct.

Reference: Geo guide synopsis. (cratedb.com)

docs/start/query/index.md (1)

41-49: Toctree change looks correct.

Switching to search/index matches the new section layout.

docs/start/query/search/hybrid.md (1)

31-37: Capabilities table is accurate; good cross-linking note.

Vector via knn_match, full-text via match, geo via match with spatial relations are correctly described.

References: knn_match function; geo MATCH predicate. (cratedb.com)

Comment on lines +98 to +106
```sql
CREATE TABLE docs ( id INTEGER, text TEXT INDEX USING FULLTEXT WITH (analyzer = 'english') );
```

To use a specific analyzer in a query:

```sql
SELECT * FROM docs WHERE MATCH(text, 'power outage') USING 'english';
```

🛠️ Refactor suggestion

Fix MATCH analyzer syntax; use USING <match_type> WITH (analyzer=...).

CrateDB specifies the analyzer at query time via WITH (analyzer='...') paired with a match type after USING. The current example USING 'english' is invalid. Update the example to one of the supported forms (e.g., use the default best_fields or a phrase query).

Apply this diff:

-```sql
-SELECT * FROM docs WHERE MATCH(text, 'power outage') USING 'english';
-```
+```sql
+-- Use a specific analyzer at query time
+-- (best_fields is the default match type)
+SELECT *
+FROM docs
+WHERE MATCH(text, 'power outage')
+  USING best_fields WITH (analyzer = 'english');
+```

Reference: MATCH usage and per‑query analyzer options. (cratedb.com)

🤖 Prompt for AI Agents
In docs/start/query/search/fulltext.md around lines 98 to 106, the example uses
invalid MATCH analyzer syntax ("USING 'english'"); update it to specify the
match type and pass the analyzer with WITH (...) — e.g., replace the single-line
query with a multi-line SQL that uses a match type (such as best_fields) and
appends WITH (analyzer = 'english') to the USING clause so the query becomes:
SELECT ... WHERE MATCH(...) USING best_fields WITH (analyzer = 'english');

Comment on lines +74 to +78
```sql
sqlCopierModifier-- Find parks that intersect with a given region
SELECT name
FROM parks
WHERE MATCH(area) AGAINST('INTERSECTS POLYGON ((...))');
```

⚠️ Potential issue

Remove stray token and fix MATCH syntax for geospatial queries.

  • Delete the artifact sqlCopierModifier--.
  • CrateDB’s geospatial MATCH doesn’t use AGAINST(...) (that’s MySQL). Use match(column, query_term) [USING intersects|disjoint|within].

Apply this diff:

-```sql
-sqlCopierModifier-- Find parks that intersect with a given region
-SELECT name
-FROM parks
-WHERE MATCH(area) AGAINST('INTERSECTS POLYGON ((...))');
-```
+```sql
+-- Find parks that intersect with a given region
+SELECT name
+FROM parks
+WHERE match(area, 'POLYGON ((...))') USING intersects;
+```

References: Geo MATCH predicate syntax and match types (intersects/disjoint/within). (cratedb.com)

🤖 Prompt for AI Agents
In docs/start/query/search/geo.md around lines 74-78, remove the stray token
"sqlCopierModifier--" and replace the MySQL-style MATCH ... AGAINST(...) usage
with CrateDB's geospatial MATCH predicate: change the snippet to use a proper
fenced code block and a SQL comment, and update the WHERE clause to "WHERE
match(area, 'POLYGON ((...))') USING intersects;" so it uses match(column,
query_term) USING intersects syntax instead of AGAINST(...).

Comment on lines +84 to +94
Anomaly Detection

```sql
SELECT *
FROM events
WHERE type = 'sensor'
AND KNN_MATCH(vector_repr, [normal_pattern_emb], 1)
ORDER BY _score ASC
LIMIT 1;
```


⚠️ Potential issue

Correct the anomaly detection example (current form is contradictory).

KNN_MATCH(..., 1) returns the single most similar neighbor. Ordering by _score ASC afterwards cannot yield an outlier; it still returns the top-1 nearest. To surface anomalies (least similar to a “normal” prototype), compute similarity (or distance) and sort ascending without restricting via KNN_MATCH.

-Anomaly Detection
+### Anomaly Detection
 ```sql
-SELECT *
-FROM events
-WHERE type = 'sensor'
-  AND KNN_MATCH(vector_repr, [normal_pattern_emb], 1)
-ORDER BY _score ASC
-LIMIT 1;
+-- Find the least similar (potential outlier) relative to a "normal" embedding
+SELECT *, VECTOR_SIMILARITY(vector_repr, [normal_pattern_emb]) AS score
+FROM events
+WHERE type = 'sensor'
+ORDER BY score ASC
+LIMIT 1;
 ```

If your version exposes a distance metric (lower = more similar), invert the sort accordingly. Optionally, use a two-stage approach: prefilter by metadata, then order by similarity across the candidate set.
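
The reasoning can be illustrated without a database. The following is a minimal sketch that treats `KNN_MATCH(..., 1)` as a plain top-1 filter, as described above; the event names and similarity scores are made up for illustration (higher means more similar).

```shell
# Hypothetical similarity scores against a "normal" embedding (higher = more similar).
scores='evt_a 0.98
evt_b 0.91
evt_c 0.12'

# Stage 1 mimics a top-1 nearest-neighbour filter: only the best match survives.
top1=$(printf '%s\n' "$scores" | LC_ALL=C sort -k2,2 -rn | head -n 1)

# Sorting that single surviving row ascending cannot change which row it is.
top1_sorted_asc=$(printf '%s\n' "$top1" | LC_ALL=C sort -k2,2 -n | head -n 1)

# Ranking the whole candidate set ascending surfaces the least similar row.
outlier=$(printf '%s\n' "$scores" | LC_ALL=C sort -k2,2 -n | head -n 1)

echo "top-1 then ASC: $top1_sorted_asc"
echo "full ASC:       $outlier"
```

Only the second ranking returns `evt_c`, the row least similar to the prototype, which is why the corrected query orders the full filtered set instead of the top-1 result.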

<details>
<summary>🤖 Prompt for AI Agents</summary>

In docs/start/query/search/vector.md around lines 84 to 94, the anomaly
detection example is contradictory because KNN_MATCH(..., 1) will always return
the single most similar neighbor so ORDER BY _score ASC cannot surface an
outlier; replace the KNN_MATCH usage with an explicit similarity (or distance)
computation (e.g., VECTOR_SIMILARITY or VECTOR_DISTANCE) in the SELECT, then
ORDER BY that computed metric so the least similar row comes first: use ASC for
a similarity score (higher = more similar) or DESC for a distance (lower = more
similar), and apply LIMIT 1; optionally note a two-stage approach: prefilter
candidates by metadata, then sort by the vector metric for final selection.


</details>


Member Author

@amotl amotl Aug 23, 2025


It looks like the idea, layout and structure of the content below has been derived from the existing introductory section about CrateDB's search features in one way or another, maybe also its content.

A good alternative could be to invest in improving the existing canonical pages instead of duplicating the same topic once again, based on content that has apparently been created using GenAI / LLMs (large language models) and is therefore blurry and possibly incorrect.

Maybe the most important ideas from the new content can be curated and merged into the existing pages, together with suggestions on how to improve their guidance?

Please clarify.

Member Author


@coderabbitai: Please compare the existing content in folder /docs/guide/feature/search/ against the new folder /docs/guide/start/search/, and evaluate the outcome about its value and uniqueness, also considering my previous comment. Thanks!


🧩 Analysis chain

@amotl I'll examine both search content folders to compare their value and uniqueness as you've requested.


🏁 Script executed:

#!/bin/bash
# First, let's explore the existing search content structure
echo "=== Existing search content structure in /docs/guide/feature/search/ ==="
fd . docs/feature/search --type f | head -20

echo -e "\n=== Contents of existing search index ==="
if [ -f "docs/feature/search/index.md" ]; then
    cat docs/feature/search/index.md
elif [ -f "docs/feature/search/index.rst" ]; then
    cat docs/feature/search/index.rst
fi

echo -e "\n=== List all files in existing search directory ==="
ls -la docs/feature/search/ 2>/dev/null || echo "Directory not found"

Length of output: 2789


🏁 Script executed:

#!/bin/bash
# Let's examine the content of each file in the existing search folder
echo "=== Examining existing search documentation files ==="

for file in docs/feature/search/*; do
    if [ -f "$file" ]; then
        echo -e "\n==================== $file ===================="
        head -50 "$file"
        echo -e "\n[... file continues ...]"
        echo -e "Total lines: $(wc -l < "$file")"
    fi
done

Length of output: 2230


🏁 Script executed:

#!/bin/bash
# Let's also check if there are any other search-related files in the feature directory
echo "=== Searching for search-related content in feature directory ==="
fd search docs/feature --type f

echo -e "\n=== Searching for any files containing 'search' in their path under docs/feature ==="
find docs/feature -name "*search*" -type f 2>/dev/null

Length of output: 410


🏁 Script executed:

#!/bin/bash
# Let's examine the new search content structure we're comparing against
echo "=== New search content structure in /docs/start/query/search/ ==="
ls -la docs/start/query/search/ 2>/dev/null

echo -e "\n=== File sizes comparison ==="
echo "New search content:"
wc -l docs/start/query/search/*.md 2>/dev/null

if [ -d "docs/feature/search" ]; then
    echo -e "\nExisting search content:"
    wc -l docs/feature/search/*.md docs/feature/search/*.rst 2>/dev/null
fi

Length of output: 1297


🏁 Script executed:

#!/bin/bash
# Let's examine the content of the subdirectories in the existing search folder
echo "=== Examining FTS (Full-Text Search) content ==="
for file in docs/feature/search/fts/*; do
    if [ -f "$file" ]; then
        echo -e "\n--- $file ---"
        head -30 "$file"
        echo -e "\nTotal lines: $(wc -l < "$file")"
    fi
done

echo -e "\n=== Examining Geo Search content ==="
for file in docs/feature/search/geo/*; do
    if [ -f "$file" ]; then
        echo -e "\n--- $file ---"
        head -30 "$file"
        echo -e "\nTotal lines: $(wc -l < "$file")"
    fi
done

echo -e "\n=== Examining Vector Search content ==="
for file in docs/feature/search/vector/*; do
    if [ -f "$file" ]; then
        echo -e "\n--- $file ---"
        head -30 "$file"
        echo -e "\nTotal lines: $(wc -l < "$file")"
    fi
done

echo -e "\n=== Examining Hybrid Search content ==="
for file in docs/feature/search/hybrid/*; do
    if [ -f "$file" ]; then
        echo -e "\n--- $file ---"
        head -30 "$file"
        echo -e "\nTotal lines: $(wc -l < "$file")"
    fi
done

Length of output: 8441


Based on my analysis of both content sets, here's my comprehensive comparison:

Content Analysis Summary

Existing Content Structure (docs/feature/search/):

  • Comprehensive reference documentation totaling 1,617 lines
  • Well-organized into specialized subdirectories (fts/, geo/, vector/, hybrid/)
  • Deep technical content with detailed explanations
  • Includes advanced topics like analyzers, tokenizers, filters (273 lines)
  • Contains practical tutorials (Netflix catalog example)
  • Covers fuzzy search options and configuration details

New Content Structure (docs/start/query/search/):

  • Introductory/getting-started content totaling 532 lines (33% of existing)
  • Flat file structure with simplified organization
  • Basic explanations and quick-start examples
  • Focuses on common use cases and basic patterns

Key Differences & Overlap Assessment

1. Target Audience & Purpose

  • Existing: Comprehensive reference for users wanting deep technical knowledge
  • New: Getting-started guide for users new to CrateDB's search features

2. Content Depth

  • Existing:
    • Detailed configuration options
    • Advanced analyzer concepts (273 lines dedicated to analyzers alone)
    • Production considerations and best practices
    • Comprehensive SQL reference examples
  • New:
    • Basic query patterns
    • Simple use case examples
    • GenAI-generated content with potential inaccuracies (as you noted)

3. Value Assessment

The new content appears to be a simplified, introductory version rather than unique content. While there's value in having beginner-friendly documentation, the concerns you raised are valid:

  • Duplication: Covers the same core topics (full-text, geo, vector, hybrid search)
  • Quality concerns: Being GenAI-generated, it may contain inaccuracies or "blurry" information
  • Maintenance overhead: Two sets of documentation on the same topics

Recommendation

Based on the architectural pattern from your learnings about integration anchors and the content analysis, I recommend:

  1. Consolidate rather than duplicate: Improve the existing canonical documentation in /docs/feature/search/ with better introductory sections
  2. Extract valuable elements: Curate any genuinely useful quick-start examples from the new content and integrate them into the existing structure
  3. Maintain single source of truth: The existing content is more comprehensive, technically accurate, and better organized

The toctree change from ../../feature/search/index to search/index suggests an attempt to relocate rather than complement the existing content, which supports your concern about unnecessary duplication.

🤔 This analysis aligns with your instinct that the new content doesn't add significant unique value and may create maintenance burden through duplication.


🏁 Script executed:

#!/bin/bash
# Let's get the total line count and structure comparison
echo "=== Content Volume Comparison ==="
echo "Existing search documentation:"
echo "- Main index: $(wc -l < docs/feature/search/index.md) lines"
echo "- FTS index: $(wc -l < docs/feature/search/fts/index.md) lines"  
echo "- FTS analyzer: $(wc -l < docs/feature/search/fts/analyzer.md) lines"
echo "- FTS options: $(wc -l < docs/feature/search/fts/options.md) lines"
echo "- FTS tutorial: $(wc -l < docs/feature/search/fts/tutorial.md) lines"
echo "- Geo index: $(wc -l < docs/feature/search/geo/index.md) lines"
echo "- Vector index: $(wc -l < docs/feature/search/vector/index.md) lines"
echo "- Hybrid index: $(wc -l < docs/feature/search/hybrid/index.md) lines"

existing_total=$(($(wc -l < docs/feature/search/index.md) + $(wc -l < docs/feature/search/fts/index.md) + $(wc -l < docs/feature/search/fts/analyzer.md) + $(wc -l < docs/feature/search/fts/options.md) + $(wc -l < docs/feature/search/fts/tutorial.md) + $(wc -l < docs/feature/search/geo/index.md) + $(wc -l < docs/feature/search/vector/index.md) + $(wc -l < docs/feature/search/hybrid/index.md)))

echo "- Existing total: $existing_total lines"
echo ""
echo "New search documentation:"
echo "- Index: $(wc -l < docs/start/query/search/index.md) lines"
echo "- Fulltext: $(wc -l < docs/start/query/search/fulltext.md) lines"
echo "- Geo: $(wc -l < docs/start/query/search/geo.md) lines" 
echo "- Vector: $(wc -l < docs/start/query/search/vector.md) lines"
echo "- Hybrid: $(wc -l < docs/start/query/search/hybrid.md) lines"

new_total=$(($(wc -l < docs/start/query/search/index.md) + $(wc -l < docs/start/query/search/fulltext.md) + $(wc -l < docs/start/query/search/geo.md) + $(wc -l < docs/start/query/search/vector.md) + $(wc -l < docs/start/query/search/hybrid.md)))

echo "- New total: $new_total lines"
echo ""
echo "Ratio: New content is $(echo "scale=1; $new_total * 100 / $existing_total" | bc)% the size of existing content"

Length of output: 1582


🏁 Script executed:

#!/bin/bash
# Let's check for content overlap by looking at key terms and concepts
echo "=== Content Analysis - Key Terms Frequency ==="

echo "Checking 'MATCH' function mentions:"
echo "Existing FTS: $(grep -c 'MATCH' docs/feature/search/fts/index.md)"
echo "New fulltext: $(grep -c 'MATCH' docs/start/query/search/fulltext.md)"

echo -e "\nChecking 'KNN_MATCH' function mentions:"
echo "Existing vector: $(grep -c 'KNN_MATCH' docs/feature/search/vector/index.md)"
echo "New vector: $(grep -c 'KNN_MATCH' docs/start/query/search/vector.md)"

echo -e "\nChecking 'HNSW' mentions:"
echo "Existing vector: $(grep -c 'HNSW' docs/feature/search/vector/index.md)"
echo "New vector: $(grep -c 'HNSW' docs/start/query/search/vector.md)"

echo -e "\nChecking 'BM25' mentions:"
echo "Existing FTS: $(grep -c 'BM25' docs/feature/search/fts/index.md)"
echo "New fulltext: $(grep -c 'BM25' docs/start/query/search/fulltext.md)"
echo "Existing hybrid: $(grep -c 'BM25' docs/feature/search/hybrid/index.md)"
echo "New hybrid: $(grep -c 'BM25' docs/start/query/search/hybrid.md)"

echo -e "\nChecking code examples:"
echo "Existing FTS SQL examples: $(grep -c '```sql' docs/feature/search/fts/index.md)"
echo "New fulltext SQL examples: $(grep -c '```sql' docs/start/query/search/fulltext.md)"

Length of output: 1684


Based on the comprehensive analysis, here are the key findings and recommendations:

Quantitative Analysis

Content Volume:

  • Existing documentation: 1,734 lines (comprehensive reference)
  • New documentation: 532 lines (~31% of existing content)
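
The two summaries in this thread quote different totals for the existing documentation (1,617 and 1,734 lines), which is why the ratio appears once as 33% and once as about 31%. A minimal sketch of the arithmetic, taking the totals from the comments as given rather than re-measuring them; the helper name `pct` is made up for illustration and uses integer math only, so it truncates instead of rounding.

```shell
# Totals as quoted in this thread; not re-measured here.
new_total=532
existing_first=1617
existing_second=1734

# Print "$1 as a percentage of $2" with one decimal place (truncated),
# using shell integer arithmetic so that bc is not required.
pct() {
  tenths=$(( $1 * 1000 / $2 ))
  echo "$(( tenths / 10 )).$(( tenths % 10 ))"
}

pct "$new_total" "$existing_first"    # against the first quoted total
pct "$new_total" "$existing_second"   # against the second quoted total
```

Either way, the new section is roughly a third of the size of the existing one, so the discrepancy does not change the conclusion.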

Technical Accuracy Concerns:

  • New fulltext content mentions MATCH 11 times but completely omits BM25 (0 mentions vs 16 in existing)
  • This confirms your concern about GenAI-generated content being "blurry and possibly incorrect"
  • The new content has more basic SQL examples (9 vs 4) but lacks the technical depth
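
One caveat on the counts above: `grep -c` reports the number of matching lines, not individual occurrences, and the pattern `MATCH` also matches inside `KNN_MATCH`. A small sketch on a made-up sample file (the content is hypothetical, not taken from the docs) shows how the three ways of counting can differ.

```shell
# Hypothetical sample; not the actual documentation content.
sample=$(mktemp)
cat > "$sample" <<'EOF'
WHERE MATCH(title, 'x') OR MATCH(body, 'x')
WHERE KNN_MATCH(v, [0.1], 3)
no predicate here
EOF

# Number of lines containing the pattern (what grep -c reports).
lines=$(grep -c 'MATCH' "$sample")

# Every occurrence, including the one inside KNN_MATCH.
hits=$(grep -o 'MATCH' "$sample" | wc -l | tr -d ' ')

# Whole-word occurrences only; the underscore in KNN_MATCH prevents a match.
words=$(grep -ow 'MATCH' "$sample" | wc -l | tr -d ' ')

echo "lines=$lines hits=$hits words=$words"
rm -f "$sample"
```

The term frequencies are therefore best read as a rough signal of coverage rather than as exact mention counts.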

Value Assessment

Minimal Unique Value:

  1. Topic Duplication: All four main topics (fulltext, geo, vector, hybrid) are covered in both sets
  2. Quality Regression: Missing key technical concepts like BM25 scoring in fulltext search
  3. Architectural Inconsistency: Flat structure vs. well-organized subdirectories with specialized content

Existing Content Strengths:

  • Comprehensive coverage with advanced topics (analyzers, tokenizers, configuration)
  • Technical accuracy with proper BM25 references
  • Practical tutorials (Netflix catalog example)
  • Professional organization with logical subdirectories

Recommendations

1. Consolidate, Don't Duplicate

  • Keep the canonical /docs/feature/search/ documentation as the primary source
  • Extract any genuinely useful quick-start examples from the new content and integrate them into existing "Getting Started" sections within the feature documentation

2. Improve Existing Documentation

  • Add more introductory sections to /docs/feature/search/fts/index.md and related files
  • Include the better SQL examples from the new content (where technically accurate)
  • Consider adding a "Quick Start" section to each existing topic

3. Maintain Single Source of Truth

  • Remove the new search folder to avoid maintenance overhead
  • Revert the toctree change from search/index back to ../../feature/search/index
  • Focus improvement efforts on the existing, technically superior documentation

4. Quality Control

  • Given the BM25 omission and other potential inaccuracies in GenAI content, thorough technical review would be required before any integration

Your instinct is correct: this appears to be unnecessary duplication that creates maintenance burden without adding substantial value. The existing documentation is more comprehensive, technically accurate, and better organized.

🤔 The analysis strongly supports consolidating efforts on improving the existing canonical documentation rather than maintaining parallel content sets.


🧠 Learnings used
Learnt from: amotl
PR: crate/cratedb-guide#238
File: docs/integrate/azure-functions/learn.rst:1-1
Timestamp: 2025-08-09T16:19:43.966Z
Learning: In the CrateDB Guide documentation, main integration anchors (e.g., `azure-functions`) are intentionally placed in the `index.md` files of their respective integration folders, while detailed tutorials use the `-learn` suffix (e.g., `azure-functions-learn`) in their `learn.rst` or `learn.md` files. This is a deliberate architectural pattern for the documentation restructuring.

@amotl amotl added the question Further information is requested label Aug 24, 2025
Base automatically changed from getting-started to main August 26, 2025 22:09
@amotl amotl force-pushed the getting-started-search branch from daec064 to 2e93901 Compare August 26, 2025 22:12
