Skip to content

Conversation

ebembi-crdb
Copy link
Contributor

@ebembi-crdb ebembi-crdb commented Sep 8, 2025

Algolia Search Migration: Jekyll to Python

Replaces the Jekyll Algolia gem with a custom Python indexing
system that provides intelligent content extraction,
incremental updates, and production-ready CI/CD integration.

Key Benefits

  • 3x Faster: Incremental updates (2-3 min) vs full rebuilds
    (15-20 min)
  • 75% Smaller Index: 40K records vs 157K with intelligent bloat
    removal
  • True Incremental: Only index changed content with deletion
    support
  • TeamCity Ready: Zero-configuration deployment with smart
    decision logic
  • Production Parity: 96%+ search quality match with existing
    index

Performance Improvements

Metric Jekyll Gem New Python System Improvement
Index Size ~157K records ~40K records 75% reduction
Full Rebuild 15-20 minutes 8-10 minutes 50% faster
Incremental Not supported 2-3 minutes New capability
Content Quality Includes UI bloat Intelligent filtering

Intelligent Features

Smart Decision Logic

Automatically chooses full vs incremental indexing based on:

  • Git commits affecting source files
  • Configuration changes (_config_cockroachdb.yml)
  • State file age and integrity
  • Force full override capability

Intelligent Bloat Removal

  • Removes: 117K+ duplicate records, UI spam, table bloat,
    download repetition
  • Preserves: All SQL commands, technical docs, error messages,
    release notes
  • Pattern-based filtering instead of naive content extraction

Dynamic Version Detection

Automatically reads from _config_cockroachdb.yml

versions:
stable: v25.3 # Detected and used for filtering

Files Changed

New Production Files

  • algolia_indexing_wrapper.py - Smart orchestration for
    TeamCity
  • algolia_index_intelligent_bloat_removal.py - Core indexer
    with bloat removal
  • algolia_parity_test.py - Production validation suite
  • README_ALGOLIA_MIGRATION.md - Comprehensive documentation

Modified Files

  • _config_cockroachdb.yml - Version configuration for dynamic
    detection
  • Gemfile - Updated dependencies

Removed Legacy Files

  • algolia_index_prod_match.py - Development prototype
  • check_ranking_parity.py - Superseded by parity test
  • compare_to_prod_explain.py - Development analysis tool
  • test_all_files.py - Development validation

TeamCity Integration

Simple Deployment

Build Steps

  1. bundle exec jekyll build --config _config_cockroachdb.yml
  2. python3 algolia_indexing_wrapper.py

Environment Variables

ALGOLIA_APP_ID=7RXZLDVR5F
ALGOLIA_ADMIN_API_KEY=
ALGOLIA_INDEX_ENVIRONMENT=staging|production

Zero-Configuration Operation

  • First run: Automatically does full indexing
  • Subsequent runs: Smart incremental based on content changes
  • Force full: ALGOLIA_FORCE_FULL=true override
  • State persistence: External files (no git commits)

Comprehensive Testing

  • 100% Test Coverage: 10 wrapper scenarios, incremental
    validation, parity testing
  • Production Validation: 96%+ search overlap, 90%+ URL
    coverage, full field compatibility
  • Performance benchmarks exceed all targets

Copy link

netlify bot commented Sep 8, 2025

Deploy Preview for cockroachdb-api-docs canceled.

Name Link
🔨 Latest commit 8d7812f
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-api-docs/deploys/68befa22104e430008d2b165

Copy link

netlify bot commented Sep 8, 2025

Deploy Preview for cockroachdb-interactivetutorials-docs canceled.

Name Link
🔨 Latest commit 8d7812f
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-interactivetutorials-docs/deploys/68befa2260eb330008567e0c

Copy link

github-actions bot commented Sep 8, 2025

Files changed:

Copy link

netlify bot commented Sep 8, 2025

Deploy Preview for cockroachdb-api-docs canceled.

Name Link
🔨 Latest commit e482b2f
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-api-docs/deploys/68c025934d05ff00083441f0

Copy link

netlify bot commented Sep 8, 2025

Deploy Preview for cockroachdb-interactivetutorials-docs canceled.

Name Link
🔨 Latest commit e482b2f
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-interactivetutorials-docs/deploys/68c0259369b07f0008f2a47f

Copy link

netlify bot commented Sep 8, 2025

Netlify Preview

Name Link
🔨 Latest commit 8d7812f
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-docs/deploys/68befa229dd97600088c579e
😎 Deploy Preview https://deploy-preview-20302--cockroachdb-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Copy link

netlify bot commented Sep 8, 2025

Netlify Preview

Name Link
🔨 Latest commit e482b2f
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-docs/deploys/68c02593f60dc900087ea5be
😎 Deploy Preview https://deploy-preview-20302--cockroachdb-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@ebembi-crdb ebembi-crdb requested a review from a team as a code owner September 9, 2025 12:38
| **`check_ranking_parity.py`** | Production parity verification | ❌ Optional validation |
| **`compare_to_prod_explain.py`** | Index comparison analysis | ❌ Optional analysis |
| **`test_all_files.py`** | File processing validation | ❌ Dev only |
| **`algolia_index_prod_match.py`** | Legacy production matcher | ❌ Reference only |
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean going forward? Will this become obsolete when the new indexing version is released?

**Indexing Rules:**
- ✅ Always include: `/releases/`, `/cockroachcloud/`, `/advisories/`, `/molt/`
- ✅ Include stable version files: Files containing `v25.3`
- ❌ Exclude old versions: `v24.x`, `v23.x`, etc.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would corrections to old versions of code be indexed if changes are detected?

## 🧠 Intelligent Bloat Removal

### What Gets Removed
- **85K+ Duplicate Records**: Content deduplication using MD5 hashing
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this process run every time? Will we need to run this after the first re-indexing?

2. **Force Override**: `ALGOLIA_FORCE_FULL=true`
3. **Corrupted State**: Invalid state file
4. **Stale State**: State file >7 days old
5. **Content Changes**: Git commits affecting source files
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean? Affecting which source files?


## 📊 Performance Metrics

### Size Optimization
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do these numbers represent?

- **Cause**: First run or state file was deleted
- **Solution**: Normal - will do full indexing automatically

**❌ "Git commits detected"**
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what changes result in incremental indexing?

- ✅ Comprehensive test coverage (100% pass rate)
- ✅ Performance optimization and bloat removal

### Phase 2: Staging Deployment (Next)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this complete?

### 5. **Zero-Downtime Deployment**
Incremental indexing allows continuous updates without search interruption.

## 📞 Support
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List contacts or a channel for support.

@@ -6,7 +6,7 @@ algolia:
- search.html
- src/current/v23.1/**
- v23.1/**
index_name: cockroachcloud_docs
index_name: stage_cockroach_docs
search_api_key: 372a10456f4ed7042c531ff3a658771b
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might consider making this an env var rather than including directly in the config in plain text.

]

# Content that should ALWAYS be preserved (even if short)
self.preserve_patterns = [
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure we've captured all SQL commands and keywords?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants