High-Value Link Scraper

Project Status: WIP

A web scraper that identifies and prioritizes high-value links on web pages, focusing on extracting relevant contacts and specific files.

Table of Contents

  • Features
  • Why Crawlee?
  • Environment Setup
  • API Documentation
  • Link Scoring Configuration
  • Performance Tuning
  • More Scaling Considerations
  • Testing
  • API Testing
  • Future Improvements
  • License

Features

  • Custom heuristic-based link scoring and classification
  • Efficient bot detection handling via Crawlee
  • Sharded SQLite database for optimized performance
  • RESTful API with rate limiting and pagination
  • Comprehensive test coverage

Why Crawlee?

After evaluating several options including ScrapingBee and Browserless.io, I chose Crawlee for the following reasons:

  • Open-source with active community
  • Built-in anti-blocking features
  • Automatic proxy rotation and scaling
  • Seamless switching between HTTP and browser-based scraping
  • TypeScript-first development
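
For illustration, a minimal Crawlee crawler for this use case might look like the sketch below (the handler body and selector are illustrative assumptions, not the project's actual implementation):

import { CheerioCrawler } from 'crawlee';

// Minimal sketch: crawl a page over plain HTTP and collect its outgoing links.
const crawler = new CheerioCrawler({
  maxRequestRetries: 3,
  async requestHandler({ request, $ }) {
    $('a[href]').each((_, el) => {
      const href = $(el).attr('href');
      const anchorText = $(el).text().trim();
      // In this project, each link would be scored and persisted at this point.
      console.log({ parentUrl: request.url, href, anchorText });
    });
  },
});

await crawler.run(['https://example.com']);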

Environment Setup

  1. Install dependencies

    pnpm install
  2. Set up environment variables

    # Production (.env)
    SQLITE_DB_NAME=links.db
    PORT=8000
    DEBUG=info
    
    # Development (.env.local)
    SQLITE_DB_NAME=links_local.db
    PORT=8008
    DEBUG=info
    
    # Test (.env.test)
    SQLITE_DB_NAME=links_test.db
    PORT=3000
    DEBUG=debug
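
One way to load the matching file per environment is sketched below (this assumes dotenv is used; the project may wire this up differently):

import dotenv from 'dotenv';

// Pick the env file based on NODE_ENV, falling back to .env for production.
const envFile =
  process.env.NODE_ENV === 'test' ? '.env.test'
  : process.env.NODE_ENV === 'development' ? '.env.local'
  : '.env';

dotenv.config({ path: envFile });

console.log(`Using ${process.env.SQLITE_DB_NAME} on port ${process.env.PORT}`);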

API Documentation

GET /links

Retrieve scraped links with filtering and pagination.

GET /links?minScore=0.7&keyword=budget&page=1

Query Parameters:

  • minScore (number, optional): Minimum relevance score (0-1). Defaults to 0.
  • keyword (string, optional): Filter by keyword
  • page (number, optional): Page number for pagination. Defaults to 1
  • parentUrl (string, optional): Filter by parent URL

Sample Response:

{
  "error": false,
  "message": "Successfully retrieved links",
  "data": {
    "page": 1,
    "totalPages": 5,
    "totalResultsCount": 48,
    "results": [
      {
        "id": "01HXYZABCDEFGHJKLMNOPQRST",
        "url": "https://example.com/budget",
        "anchor_text": "Annual Budget",
        "score": 0.85,
        "keywords": ["budget", "finance"],
        "parent_url": "https://example.com",
        "type": "document",
        "crawled_at": "2025-02-13 22:18:38"
      },
      {...}
    ]
  }
}
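
Calling the endpoint from Node.js (18+) could look like this; the base URL assumes the production PORT from the environment setup above:

// Fetch high-scoring "budget" links, first page.
const params = new URLSearchParams({ minScore: '0.7', keyword: 'budget', page: '1' });
const res = await fetch(`http://localhost:8000/links?${params}`);
const { data } = await res.json();

console.log(`Page ${data.page}/${data.totalPages}:`, data.results.length, 'links');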

GET /links/:id

Retrieve a specific link by ID.

GET /links/01HXYZABCDEFGHJKLMNOPQRST

Sample Response:

{
  "error": false,
  "message": "Successfully retrieved link",
  "data": {
    "id": "01JKZQNHWV5KSW9JFA05J40PNW",
    "url": "https://vercel.com/contact/sales?utm_source=next-site&utm_medium=footer&utm_campaign=home",
    "anchor_text": "Contact Sales",
    "score": 3,
    "keywords": [
        "contact"
    ],
    "parent_url": "https://www.nextjs.org",
    "type": "contact",
    "crawled_at": "2025-02-13 13:23:44"
  }
}

POST /scrape

Trigger a new scrape job.

POST /scrape
Content-Type: application/json

{
  "url": "https://example.com"
}

Sample Response:

{
  "data": {
    "processed": 42,
    "estimatedScore": 25.5
  },
  "message": "Scrape job completed",
  "error": false
}
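
Triggering a scrape from code might look like this (the base URL again assumes the production PORT):

// Kick off a scrape job for a single URL.
const res = await fetch('http://localhost:8000/scrape', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://example.com' }),
});
const { data, message } = await res.json();
console.log(message, '- links processed:', data.processed);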

Link Scoring Configuration

The scraper uses a weighted keyword system to score links. Configure weights in src/core/scrapper.ts:

private readonly KEYWORD_WEIGHTS = {
  acfr: 3, // Highest priority
  budget: 2.5, // High priority
  "finance director": 2, // Medium-high priority
  contact: 2, // Medium-high priority
  document: 1.5, // Medium priority
};

To modify scoring:

  1. Add new keywords with weights (1-3 recommended)
  2. Higher weights increase priority
  3. Compound terms (e.g., "finance director") are supported
  4. Restart service after changes
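
Conceptually, a link's score is the sum of the weights of the keywords it matches. The sketch below illustrates the idea; scoreLink is a hypothetical helper, not the actual implementation in src/core/scrapper.ts:

const KEYWORD_WEIGHTS: Record<string, number> = {
  acfr: 3,
  budget: 2.5,
  'finance director': 2,
  contact: 2,
  document: 1.5,
};

// Hypothetical helper: sum the weights of every keyword found in the
// anchor text or URL (case-insensitive).
function scoreLink(anchorText: string, url: string): number {
  const haystack = `${anchorText} ${url}`.toLowerCase();
  return Object.entries(KEYWORD_WEIGHTS)
    .filter(([keyword]) => haystack.includes(keyword))
    .reduce((total, [, weight]) => total + weight, 0);
}

// e.g. scoreLink('Annual Budget (ACFR)', 'https://example.com/budget') => 5.5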

Performance Tuning

Database Optimization

  • Uses table sharding based on score ranges:
    • High: score >= 0.7
    • Medium: 0.3 <= score < 0.7
    • Low: score < 0.3
  • Implements connection pooling
  • WAL journal mode enabled
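
A sketch of how shard selection and WAL mode could be wired up with better-sqlite3 (the shard table names here are assumptions):

import Database from 'better-sqlite3';

const db = new Database(process.env.SQLITE_DB_NAME ?? 'links.db');
db.pragma('journal_mode = WAL'); // enable write-ahead logging

// Route a link to a shard table based on its score range.
function getShardTable(score: number): string {
  if (score >= 0.7) return 'links_high';
  if (score >= 0.3) return 'links_medium';
  return 'links_low';
}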

Scraping Optimization

Production settings in src/config/scale.ts:

const scale = {
  rateLimiting: {
    windowMs: 60_000,
    maxRequests: 1000,
  },
  database: {
    poolSize: 100,
    timeout: 30_000,
  },
  scraping: {
    maxConcurrent: 100,
    maxRequestRetries: 3,
  },
};
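
These values can be fed straight into middleware, for example with express-rate-limit (a sketch that assumes scale is exported from src/config/scale.ts and that express-rate-limit is the limiter in use):

import express from 'express';
import rateLimit from 'express-rate-limit';
import scale from './config/scale';

const app = express();

// Apply the configured request window and ceiling to every route.
app.use(
  rateLimit({
    windowMs: scale.rateLimiting.windowMs,
    max: scale.rateLimiting.maxRequests,
  })
);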

Memory Management

  • Adjust maxConcurrent based on available RAM
  • Consider implementing Redis for queue management at scale

More Scaling Considerations

  1. Database: Switch to PostgreSQL, reusing the same SQL schema
  2. Queue: Add BullMQ with Redis for job management (see the sketch after this list)
  3. Cache: Add Redis caching for frequent queries
  4. Cluster: Use PM2 for process management
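
For the queue item above, a BullMQ setup might look roughly like this (a sketch; the queue name, job name, and Redis connection are hypothetical):

import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

// Producer: enqueue scrape jobs instead of running them inline in the request handler.
const scrapeQueue = new Queue('scrape', { connection });
await scrapeQueue.add('scrape-url', { url: 'https://example.com' });

// Consumer: a worker process picks jobs up and runs the scraper.
new Worker(
  'scrape',
  async (job) => {
    // const scraper = new LinkScraper();
    // await scraper.scrape(job.data.url);
    console.log('Scraping', job.data.url);
  },
  { connection }
);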

Testing

# Run all tests
pnpm test

# Watch mode
pnpm test:watch

# Single test file
pnpm test src/__tests__/core/scrapper.test.ts

macOS Testing Note

When running tests on macOS, you might encounter a prompt:

"headless_shell wants to use your confidential information stored in 'Chromium Safe Storage' in your keychain"

This is related to Chromium's security features. You can handle it in one of two ways:

  1. Recommended: Allow access when prompted

    • Click "Allow" or "Always Allow"
    • This is the most secure approach
  2. Alternative: Disable keychain prompts

    # Add to your shell profile (.zshrc, .bashrc, etc.)
    export CRAWLEE_HEADLESS=1
    export PLAYWRIGHT_SKIP_BROWSER_KEYCHAIN=1

    Then restart your terminal or run:

    source ~/.zshrc  # or your shell profile

API Testing

Import our Postman Collection for API testing and examples.

Future Improvements

Testing Enhancements

  • Add integration tests for POST /scrape route

    // Planned test structure
    describe('POST /scrape', () => {
      test('should handle valid URLs', async () => {
        const response = await request(app)
          .post('/scrape')
          .send({ url: 'https://example.com' });
        
        expect(response.status).toBe(202);
        expect(response.body.data.processed).toBeGreaterThan(0);
      });
    });

Database Management

  • Implement Prisma ORM
    • Type-safe database queries

    • Automated migration management

    • Better schema versioning

    • Example schema:

      model Link {
        id         String   @id
        url        String   @unique
        anchorText String
        score      Float
        keywords   String[]
        parentUrl  String
        type       LinkType
        crawledAt  DateTime @default(now())
      }
      
      enum LinkType {
        DOCUMENT
        CONTACT
        GENERAL
      }
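
    • Example query (a sketch of what a type-safe Prisma query could look like once the ORM is in place; the filter values are illustrative):

      import { PrismaClient } from '@prisma/client';

      const prisma = new PrismaClient();

      // Fetch high-value links for a given parent URL, newest first.
      const links = await prisma.link.findMany({
        where: { score: { gte: 0.7 }, parentUrl: 'https://example.com' },
        orderBy: { crawledAt: 'desc' },
        take: 20,
      });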

CLI Tool

  • Add command-line interface for crawler

    # Planned usage
    pnpm crawl https://example.com --min-score 0.7 --output json

    // Planned implementation
    #!/usr/bin/env node
    import { program } from 'commander';
    import { LinkScraper } from './core/scrapper';

    program
      .argument('<url>', 'URL to crawl')
      .option('--min-score <number>', 'Minimum score threshold', '0.5')
      .option('--output <format>', 'Output format (json|csv)', 'json')
      .action(async (url, options) => {
        const scraper = new LinkScraper();
        const results = await scraper.scrape(url);
        // Output handling
      });

    program.parse();

License

MIT
