A web scraper that identifies and prioritizes high-value links on web pages, focusing on extracting relevant contacts and specific files.
- Features
- Why Crawlee?
- Environment Setup
- API Documentation
- Link Scoring Configuration
- Performance Tuning
- Testing
- Future Improvements
- License
## Features

- Custom heuristic-based link scoring and classification
- Efficient bot detection handling via Crawlee
- Sharded SQLite database for optimized performance
- RESTful API with rate limiting and pagination
- Comprehensive test coverage
## Why Crawlee?

After evaluating several options including ScrapingBee and Browserless.io, I chose Crawlee for the following reasons:
- Open-source with active community
- Built-in anti-blocking features
- Automatic proxy rotation and scaling
- Seamless switching between HTTP and browser-based scraping
- TypeScript-first development
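To illustrate the last two points, here is a minimal sketch of a Crawlee crawler; the selectors and handler logic are illustrative assumptions, not this project's actual implementation:

```typescript
// Minimal sketch: Crawlee lets the same handler style drive either an HTTP-based
// crawler (CheerioCrawler) or a headless-browser crawler (PlaywrightCrawler).
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  maxRequestRetries: 3,
  async requestHandler({ request, $, enqueueLinks, log }) {
    // Collect anchors from the page; scoring/classification would happen here.
    const anchorCount = $('a[href]').length;
    log.info(`${request.url} -> ${anchorCount} links found`);

    // Follow same-domain links; Crawlee handles deduplication and queuing.
    await enqueueLinks({ strategy: 'same-domain' });
  },
});

await crawler.run(['https://example.com']);
```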
## Environment Setup

- Install dependencies:

  ```bash
  pnpm install
  ```

- Set up environment variables:

  ```bash
  # Production (.env)
  SQLITE_DB_NAME=links.db
  PORT=8000
  DEBUG=info

  # Development (.env.local)
  SQLITE_DB_NAME=links_local.db
  PORT=8008
  DEBUG=info

  # Test (.env.test)
  SQLITE_DB_NAME=links_test.db
  PORT=3000
  DEBUG=debug
  ```
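As a sketch of how these variables might be consumed at startup (assuming dotenv; the config helper below is hypothetical):

```typescript
// Hypothetical config loader: picks the env file based on NODE_ENV and
// falls back to the production defaults shown above.
import dotenv from 'dotenv';

const envFile =
  process.env.NODE_ENV === 'test' ? '.env.test'
  : process.env.NODE_ENV === 'development' ? '.env.local'
  : '.env';

dotenv.config({ path: envFile });

export const config = {
  dbName: process.env.SQLITE_DB_NAME ?? 'links.db',
  port: Number(process.env.PORT ?? 8000),
  debug: process.env.DEBUG ?? 'info',
};
```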
## API Documentation

### `GET /links`

Retrieve scraped links with filtering and pagination.

```http
GET /links?minScore=0.7&keyword=budget&page=1
```
Query Parameters:

- `minScore` (number, optional): Minimum relevance score (0-1). Defaults to 0.
- `keyword` (string, optional): Filter by keyword.
- `page` (number, optional): Page number for pagination. Defaults to 1.
- `parentUrl` (string, optional): Filter by parent URL.
Sample Response:
```json
{
  "error": false,
  "message": "Successfully retrieved links",
  "data": {
    "page": 1,
    "totalPages": 5,
    "totalResultsCount": 48,
    "results": [
      {
        "id": "01HXYZABCDEFGHJKLMNOPQRST",
        "url": "https://example.com/budget",
        "anchor_text": "Annual Budget",
        "score": 0.85,
        "keywords": ["budget", "finance"],
        "parent_url": "https://example.com",
        "type": "document",
        "crawled_at": "2025-02-13 22:18:38"
      },
      {...}
    ]
  }
}
```
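For example, a client might call the endpoint like this (a hypothetical usage sketch; the host and port are assumptions based on the environment settings above):

```typescript
// Hypothetical client call against a locally running instance.
const params = new URLSearchParams({ minScore: '0.7', keyword: 'budget', page: '1' });
const res = await fetch(`http://localhost:8000/links?${params}`);
const body = await res.json();

if (!body.error) {
  for (const link of body.data.results) {
    console.log(link.score, link.url);
  }
}
```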
### `GET /links/:id`

Retrieve a specific link by ID.

```http
GET /links/01HXYZABCDEFGHJKLMNOPQRST
```
Sample Response:
```json
{
  "error": false,
  "message": "Successfully retrieved link",
  "data": {
    "id": "01JKZQNHWV5KSW9JFA05J40PNW",
    "url": "https://vercel.com/contact/sales?utm_source=next-site&utm_medium=footer&utm_campaign=home",
    "anchor_text": "Contact Sales",
    "score": 3,
    "keywords": [
      "contact"
    ],
    "parent_url": "https://www.nextjs.org",
    "type": "contact",
    "crawled_at": "2025-02-13 13:23:44"
  }
}
```
### `POST /scrape`

Trigger a new scrape job.

```http
POST /scrape
Content-Type: application/json

{
  "url": "https://example.com"
}
```
Sample Response:
```json
{
  "data": {
    "processed": 42,
    "estimatedScore": 25.5
  },
  "message": "Scrape job completed",
  "error": false
}
```
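A hypothetical way to trigger a scrape from a script and then query the results (again assuming a local instance on port 8000):

```typescript
// Hypothetical client call: kick off a scrape, then fetch the highest-value links.
const scrapeRes = await fetch('http://localhost:8000/scrape', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://example.com' }),
});
const { data } = await scrapeRes.json();
console.log(`Processed ${data.processed} links`);

const links = await fetch('http://localhost:8000/links?minScore=0.7').then((r) => r.json());
```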
## Link Scoring Configuration

The scraper uses a weighted keyword system to score links. Configure weights in `src/core/scrapper.ts`:
```typescript
private readonly KEYWORD_WEIGHTS = {
  acfr: 3, // Highest priority
  budget: 2.5, // High priority
  "finance director": 2, // Medium-high priority
  contact: 2, // Medium-high priority
  document: 1.5, // Medium priority
};
```
To modify scoring:
- Add new keywords with weights (1-3 recommended)
- Higher weights increase priority
- Compound terms (e.g., "finance director") are supported
- Restart service after changes
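As an illustration of how such weights can drive a heuristic score, here is a sketch only; the function and field names are assumptions, not the project's actual code:

```typescript
// Hypothetical scoring helper: sums the weights of every configured keyword
// found in a link's anchor text or URL.
const KEYWORD_WEIGHTS = {
  acfr: 3,
  budget: 2.5,
  "finance director": 2,
  contact: 2,
  document: 1.5,
} as const;

type Keyword = keyof typeof KEYWORD_WEIGHTS;

function scoreLink(anchorText: string, url: string): { score: number; keywords: Keyword[] } {
  const haystack = `${anchorText} ${url}`.toLowerCase();
  const keywords = (Object.keys(KEYWORD_WEIGHTS) as Keyword[]).filter((keyword) =>
    haystack.includes(keyword),
  );
  const score = keywords.reduce((total, keyword) => total + KEYWORD_WEIGHTS[keyword], 0);
  return { score, keywords };
}

// e.g. scoreLink('Contact Sales', 'https://vercel.com/contact/sales')
```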
## Performance Tuning

- Uses table sharding based on score ranges:
  - High: score >= 0.7
  - Medium: 0.3 <= score < 0.7
  - Low: score < 0.3
- Implements connection pooling
- WAL journal mode enabled
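A sketch of how score-based sharding could be wired up (the table names and the better-sqlite3 driver are assumptions; the actual implementation may differ):

```typescript
// Hypothetical shard routing: each score band maps to its own table.
import Database from 'better-sqlite3';

const db = new Database(process.env.SQLITE_DB_NAME ?? 'links.db');
db.pragma('journal_mode = WAL'); // matches the WAL setting described above

function shardFor(score: number): string {
  if (score >= 0.7) return 'links_high';
  if (score >= 0.3) return 'links_medium';
  return 'links_low';
}

function insertLink(link: { id: string; url: string; score: number }) {
  const table = shardFor(link.score);
  db.prepare(`INSERT INTO ${table} (id, url, score) VALUES (?, ?, ?)`).run(
    link.id,
    link.url,
    link.score,
  );
}
```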
Production settings live in `src/config/scale.ts`:
```typescript
const scale = {
  rateLimiting: {
    windowMs: 60_000,
    maxRequests: 1000,
  },
  database: {
    poolSize: 100,
    timeout: 30_000,
  },
  scraping: {
    maxConcurrent: 100,
    maxRequestRetries: 3,
  },
};
```
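For instance, the `rateLimiting` block could back an Express middleware. The sketch below assumes express-rate-limit and an exported `scale` object, neither of which this README confirms:

```typescript
// Hypothetical wiring of the rateLimiting settings into Express middleware.
import express from 'express';
import rateLimit from 'express-rate-limit';
import { scale } from './config/scale'; // assumed export

const app = express();

app.use(
  rateLimit({
    windowMs: scale.rateLimiting.windowMs, // 60 seconds
    max: scale.rateLimiting.maxRequests,   // 1000 requests per window per IP
  }),
);
```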
- Adjust `maxConcurrent` based on available RAM
- Consider implementing Redis for queue management at scale
- Database: Switching to PostgreSQL using same SQL schema
- Queue: Adding BullMQ with Redis for job management
- Cache: Implementing Redis caching for frequent queries
- Cluster: Using PM2 for process management
## Testing

```bash
# Run all tests
pnpm test

# Watch mode
pnpm test:watch

# Single test file
pnpm test src/__tests__/core/scrapper.test.ts
```
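A sketch of the kind of unit test that might live in `src/__tests__/core/scrapper.test.ts`; the vitest runner and the `scoreLink` helper (sketched in the Link Scoring Configuration section) are assumptions, so adapt the names to the real API:

```typescript
// Hypothetical unit test for the scoring heuristic.
import { describe, expect, test } from 'vitest';
import { scoreLink } from '../../core/scrapper'; // assumed export

describe('link scoring', () => {
  test('weights budget links above unrelated links', () => {
    const budget = scoreLink('Annual Budget', 'https://example.com/budget');
    const other = scoreLink('Careers', 'https://example.com/careers');
    expect(budget.score).toBeGreaterThan(other.score);
  });
});
```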
When running tests on macOS, you might encounter a prompt:

"headless_shell wants to use your confidential information stored in 'Chromium Safe Storage' in your keychain"

This is related to Chromium's security features. You can handle it in one of two ways:

- Recommended: Allow access when prompted
  - Click "Allow" or "Always Allow"
  - This is the most secure approach

- Alternative: Disable keychain prompts

  ```bash
  # Add to your shell profile (.zshrc, .bashrc, etc.)
  export CRAWLEE_HEADLESS=1
  export PLAYWRIGHT_SKIP_BROWSER_KEYCHAIN=1
  ```

  Then restart your terminal or run:

  ```bash
  source ~/.zshrc # or your shell profile
  ```
Import our Postman Collection for API testing and examples.
## Future Improvements

- Add integration tests for the `POST /scrape` route:

  ```typescript
  // Planned test structure
  describe('POST /scrape', () => {
    test('should handle valid URLs', async () => {
      const response = await request(app)
        .post('/scrape')
        .send({ url: 'https://example.com' });

      expect(response.status).toBe(202);
      expect(response.body.data.processed).toBeGreaterThan(0);
    });
  });
  ```
- Implement Prisma ORM
  - Type-safe database queries
  - Automated migration management
  - Better schema versioning
  - Example schema:

  ```prisma
  model Link {
    id         String   @id
    url        String   @unique
    anchorText String
    score      Float
    keywords   String[]
    parentUrl  String
    type       LinkType
    crawledAt  DateTime @default(now())
  }

  enum LinkType {
    DOCUMENT
    CONTACT
    GENERAL
  }
  ```
- Add a command-line interface for the crawler:

  ```bash
  # Planned usage
  pnpm crawl https://example.com --min-score 0.7 --output json
  ```

  ```typescript
  #!/usr/bin/env node
  // Planned implementation
  import { program } from 'commander';
  import { LinkScraper } from './core/scrapper';

  program
    .argument('<url>', 'URL to crawl')
    .option('--min-score <number>', 'Minimum score threshold', '0.5')
    .option('--output <format>', 'Output format (json|csv)', 'json')
    .action(async (url, options) => {
      const scraper = new LinkScraper();
      const results = await scraper.scrape(url);
      // Output handling
    });

  program.parse();
  ```