High-Value Link Scraper

Project Status: WIP

A web scraper that identifies and prioritizes high-value links on web pages, focusing on extracting relevant contacts and specific files.

Table of Contents

  • Features
  • Why Crawlee?
  • Environment Setup
  • API Documentation
  • Link Scoring Configuration
  • Performance Tuning
  • More Scaling Considerations
  • Testing
  • API Testing
  • Future Improvements
  • License

Features

  • Custom heuristic-based link scoring and classification
  • Efficient bot detection handling via Crawlee
  • Sharded SQLite database for optimized performance
  • RESTful API with rate limiting and pagination
  • Comprehensive test coverage

Why Crawlee?

After evaluating several options including ScrapingBee and Browserless.io, I chose Crawlee for the following reasons:

  • Open-source with active community
  • Built-in anti-blocking features
  • Automatic proxy rotation and scaling
  • Seamless switching between HTTP and browser-based scraping
  • TypeScript-first development
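
For illustration, a minimal Crawlee crawler for this use case might look like the sketch below (the handler body and selector are illustrative assumptions, not the project's actual implementation):

import { CheerioCrawler } from 'crawlee';

// Minimal sketch: crawl a page over plain HTTP and collect its outgoing links.
const crawler = new CheerioCrawler({
  maxRequestRetries: 3,
  async requestHandler({ request, $ }) {
    $('a[href]').each((_, el) => {
      const href = $(el).attr('href');
      const anchorText = $(el).text().trim();
      // In this project, each link would be scored and persisted at this point.
      console.log({ parentUrl: request.url, href, anchorText });
    });
  },
});

await crawler.run(['https://example.com']);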

Environment Setup

  1. Install dependencies

    pnpm install
  2. Set up environment variables

    # Production (.env)
    SQLITE_DB_NAME=links.db
    PORT=8000
    DEBUG=info
    
    # Development (.env.local)
    SQLITE_DB_NAME=links_local.db
    PORT=8008
    DEBUG=info
    
    # Test (.env.test)
    SQLITE_DB_NAME=links_test.db
    PORT=3000
    DEBUG=debug
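
One way to load the matching file per environment is sketched below (this assumes dotenv is used; the project may wire this up differently):

import dotenv from 'dotenv';

// Pick the env file based on NODE_ENV, falling back to .env for production.
const envFile =
  process.env.NODE_ENV === 'test' ? '.env.test'
  : process.env.NODE_ENV === 'development' ? '.env.local'
  : '.env';

dotenv.config({ path: envFile });

console.log(`Using ${process.env.SQLITE_DB_NAME} on port ${process.env.PORT}`);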

API Documentation

GET /links

Retrieve scraped links with filtering and pagination.

GET /links?minScore=0.7&keyword=budget&page=1

Query Parameters:

  • minScore (number, optional): Minimum relevance score (0-1). Defaults to 0.
  • keyword (string, optional): Filter by keyword
  • page (number, optional): Page number for pagination. Defaults to 1
  • parentUrl (string, optional): Filter by parent URL

Sample Response:

{
  "error": false,
  "message": "Successfully retrieved links",
  "data": {
    "page": 1,
    "totalPages": 5,
    "totalResultsCount": 48,
    "results": [
      {
        "id": "01HXYZABCDEFGHJKLMNOPQRST",
        "url": "https://example.com/budget",
        "anchor_text": "Annual Budget",
        "score": 0.85,
        "keywords": ["budget", "finance"],
        "parent_url": "https://example.com",
        "type": "document",
        "crawled_at": "2025-02-13 22:18:38"
      },
      {...}
    ]
  }
}
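
Calling the endpoint from Node.js (18+) could look like this; the base URL assumes the production PORT from the environment setup above:

// Fetch high-scoring "budget" links, first page.
const params = new URLSearchParams({ minScore: '0.7', keyword: 'budget', page: '1' });
const res = await fetch(`http://localhost:8000/links?${params}`);
const { data } = await res.json();

console.log(`Page ${data.page}/${data.totalPages}:`, data.results.length, 'links');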

GET /links/:id

Retrieve a specific link by ID.

GET /links/01HXYZABCDEFGHJKLMNOPQRST

Sample Response:

{
  "error": false,
  "message": "Successfully retrieved link",
  "data": {
    "id": "01JKZQNHWV5KSW9JFA05J40PNW",
    "url": "https://vercel.com/contact/sales?utm_source=next-site&utm_medium=footer&utm_campaign=home",
    "anchor_text": "Contact Sales",
    "score": 3,
    "keywords": [
        "contact"
    ],
    "parent_url": "https://www.nextjs.org",
    "type": "contact",
    "crawled_at": "2025-02-13 13:23:44"
  }
}

POST /scrape

Trigger a new scrape job.

POST /scrape
Content-Type: application/json

{
  "url": "https://example.com"
}

Sample Response:

{
  "data": {
    "processed": 42,
    "estimatedScore": 25.5
  },
  "message": "Scrape job completed",
  "error": false
}
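
Triggering a scrape from code might look like this (the base URL again assumes the production PORT):

// Kick off a scrape job for a single URL.
const res = await fetch('http://localhost:8000/scrape', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://example.com' }),
});
const { data, message } = await res.json();
console.log(message, '- links processed:', data.processed);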

Link Scoring Configuration

The scraper uses a weighted keyword system to score links. Configure weights in src/core/scrapper.ts:

private readonly KEYWORD_WEIGHTS = {
  acfr: 3, // Highest priority
  budget: 2.5, // High priority
  "finance director": 2, // Medium-high priority
  contact: 2, // Medium-high priority
  document: 1.5, // Medium priority
};

To modify scoring:

  1. Add new keywords with weights (1-3 recommended)
  2. Higher weights increase priority
  3. Compound terms (e.g., "finance director") are supported
  4. Restart service after changes
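
Conceptually, a link's score is the sum of the weights of the keywords it matches. The sketch below illustrates the idea; scoreLink is a hypothetical helper, not the actual implementation in src/core/scrapper.ts:

const KEYWORD_WEIGHTS: Record<string, number> = {
  acfr: 3,
  budget: 2.5,
  'finance director': 2,
  contact: 2,
  document: 1.5,
};

// Hypothetical helper: sum the weights of every keyword found in the
// anchor text or URL (case-insensitive).
function scoreLink(anchorText: string, url: string): number {
  const haystack = `${anchorText} ${url}`.toLowerCase();
  return Object.entries(KEYWORD_WEIGHTS)
    .filter(([keyword]) => haystack.includes(keyword))
    .reduce((total, [, weight]) => total + weight, 0);
}

// e.g. scoreLink('Annual Budget (ACFR)', 'https://example.com/budget') => 5.5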

Performance Tuning

Database Optimization

  • Uses table sharding based on score ranges:
    • High: score >= 0.7
    • Medium: 0.3 <= score < 0.7
    • Low: score < 0.3
  • Implements connection pooling
  • WAL journal mode enabled
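
A sketch of how shard selection and WAL mode could be wired up with better-sqlite3 (the shard table names here are assumptions):

import Database from 'better-sqlite3';

const db = new Database(process.env.SQLITE_DB_NAME ?? 'links.db');
db.pragma('journal_mode = WAL'); // enable write-ahead logging

// Route a link to a shard table based on its score range.
function getShardTable(score: number): string {
  if (score >= 0.7) return 'links_high';
  if (score >= 0.3) return 'links_medium';
  return 'links_low';
}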

Scraping Optimization

Production settings in src/config/scale.ts:

const scale = {
  rateLimiting: {
    windowMs: 60_000,
    maxRequests: 1000,
  },
  database: {
    poolSize: 100,
    timeout: 30_000,
  },
  scraping: {
    maxConcurrent: 100,
    maxRequestRetries: 3,
  },
};
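
These values can be fed straight into middleware, for example with express-rate-limit (a sketch that assumes scale is exported from src/config/scale.ts and that express-rate-limit is the limiter in use):

import express from 'express';
import rateLimit from 'express-rate-limit';
import scale from './config/scale';

const app = express();

// Apply the configured request window and ceiling to every route.
app.use(
  rateLimit({
    windowMs: scale.rateLimiting.windowMs,
    max: scale.rateLimiting.maxRequests,
  })
);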

Memory Management

  • Adjust maxConcurrent based on available RAM
  • Consider implementing Redis for queue management at scale

More Scaling Considerations

  1. Database: Switch to PostgreSQL, reusing the same SQL schema
  2. Queue: Add BullMQ with Redis for job management (see the sketch after this list)
  3. Cache: Add Redis caching for frequent queries
  4. Cluster: Use PM2 for process management
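
For the queue item above, a BullMQ setup might look roughly like this (a sketch; the queue name, job name, and Redis connection are hypothetical):

import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

// Producer: enqueue scrape jobs instead of running them inline in the request handler.
const scrapeQueue = new Queue('scrape', { connection });
await scrapeQueue.add('scrape-url', { url: 'https://example.com' });

// Consumer: a worker process picks jobs up and runs the scraper.
new Worker(
  'scrape',
  async (job) => {
    // const scraper = new LinkScraper();
    // await scraper.scrape(job.data.url);
    console.log('Scraping', job.data.url);
  },
  { connection }
);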

Testing

# Run all tests
pnpm test

# Watch mode
pnpm test:watch

# Single test file
pnpm test src/__tests__/core/scrapper.test.ts

macOS Testing Note

When running tests on macOS, you might encounter a prompt:

"headless_shell wants to use your confidential information stored in 'Chromium Safe Storage' in your keychain"

This is related to Chromium's security features. You can handle it in one of two ways:

  1. Recommended: Allow access when prompted

    • Click "Allow" or "Always Allow"
    • This is the most secure approach
  2. Alternative: Disable keychain prompts

    # Add to your shell profile (.zshrc, .bashrc, etc.)
    export CRAWLEE_HEADLESS=1
    export PLAYWRIGHT_SKIP_BROWSER_KEYCHAIN=1

    Then restart your terminal or run:

    source ~/.zshrc  # or your shell profile

API Testing

Import our Postman Collection for API testing and examples.

Future Improvements

Testing Enhancements

  • Add integration tests for POST /scrape route

    // Planned test structure
    describe('POST /scrape', () => {
      test('should handle valid URLs', async () => {
        const response = await request(app)
          .post('/scrape')
          .send({ url: 'https://example.com' });
        
        expect(response.status).toBe(202);
        expect(response.body.data.processed).toBeGreaterThan(0);
      });
    });

Database Management

  • Implement Prisma ORM
    • Type-safe database queries

    • Automated migration management

    • Better schema versioning

    • Example schema:

      model Link {
        id         String   @id
        url        String   @unique
        anchorText String
        score      Float
        keywords   String[]
        parentUrl  String
        type       LinkType
        crawledAt  DateTime @default(now())
      }
      
      enum LinkType {
        DOCUMENT
        CONTACT
        GENERAL
      }
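
    • Example query (a sketch of what a type-safe Prisma query could look like once the ORM is in place; the filter values are illustrative):

      import { PrismaClient } from '@prisma/client';

      const prisma = new PrismaClient();

      // Fetch high-value links for a given parent URL, newest first.
      const links = await prisma.link.findMany({
        where: { score: { gte: 0.7 }, parentUrl: 'https://example.com' },
        orderBy: { crawledAt: 'desc' },
        take: 20,
      });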

CLI Tool

  • Add command-line interface for crawler

    # Planned usage
    pnpm crawl https://example.com --min-score 0.7 --output json

    // Planned implementation
    #!/usr/bin/env node
    import { program } from 'commander';
    import { LinkScraper } from './core/scrapper';

    program
      .argument('<url>', 'URL to crawl')
      .option('--min-score <number>', 'Minimum score threshold', '0.5')
      .option('--output <format>', 'Output format (json|csv)', 'json')
      .action(async (url, options) => {
        const scraper = new LinkScraper();
        const results = await scraper.scrape(url);
        // Output handling
      });

    program.parse();

License

MIT
