A powerful web scraping library built with Playwright that provides a declarative, step-by-step approach to web automation and data extraction.
- 🚀 **Declarative Scraping**: Define scraping workflows using JSON templates
- 🔄 **Pagination Support**: Built-in support for next-button and scroll-based pagination
- 📊 **Data Collection**: Extract text, HTML, values, and files from web pages
- 🔗 **Multi-tab Support**: Handle multiple tabs and complex navigation flows
- 📄 **PDF Generation**: Save pages as PDFs or trigger print-to-PDF actions
- 📥 **File Downloads**: Download files with automatic directory creation
- 🔁 **Looping & Iteration**: ForEach loops for processing multiple elements
- 📡 **Streaming Results**: Real-time result processing with callbacks
- 🎯 **Error Handling**: Graceful error handling with configurable termination
- 🔧 **Flexible Selectors**: Support for ID, class, tag, and XPath selectors
```bash
# Using pnpm (recommended)
pnpm add stepwright

# Using npm
npm install stepwright

# Using yarn
yarn add stepwright
```
```typescript
import { runScraper } from 'stepwright';

const templates = [
  {
    tab: 'example',
    steps: [
      {
        id: 'navigate',
        action: 'navigate',
        value: 'https://example.com'
      },
      {
        id: 'get_title',
        action: 'data',
        object_type: 'tag',
        object: 'h1',
        key: 'title',
        data_type: 'text'
      }
    ]
  }
];

const results = await runScraper(templates);
console.log(results);
```
The repository includes basic examples demonstrating core functionality:
- **Basic Usage (TypeScript)**: `examples/basic-usage.ts` - simple navigation and data extraction
- **Basic Usage (JavaScript)**: `examples/basic-usage.js` - the same example in JavaScript
Run the examples:
```bash
# Run all examples
./examples/run-examples.sh

# Or run individual examples
node examples/basic-usage.js
npx tsx examples/basic-usage.ts
```
For more complex scenarios, check out:
- **Advanced Usage (TypeScript)**: `examples/advanced-usage.ts` - pagination, file downloads, and multi-tab handling
- **Advanced Usage (JavaScript)**: `examples/advanced-usage.js` - the same advanced features in JavaScript
`runScraper(templates, options?)` is the main function for executing scraping templates.

Parameters:

- `templates`: Array of `TabTemplate` objects
- `options`: Optional `RunOptions` object

Returns: `Promise<Record<string, any>[]>`
`runScraperWithCallback(templates, onResult, options?)` executes scraping with streaming results via a callback.

Parameters:

- `templates`: Array of `TabTemplate` objects
- `onResult`: Callback function invoked for each result
- `options`: Optional `RunOptions` object
```typescript
interface TabTemplate {
  tab: string;
  initSteps?: BaseStep[];    // Steps executed once before pagination
  perPageSteps?: BaseStep[]; // Steps executed for each page
  steps?: BaseStep[];        // Legacy single steps array
  pagination?: PaginationConfig;
}
```
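As a sketch of how these fields fit together (the URL and selectors below are hypothetical), a template can navigate once in `initSteps`, then repeat `perPageSteps` on every page the paginator reaches:

```typescript
const template: TabTemplate = {
  tab: 'listing',
  // Executed once, before pagination starts
  initSteps: [
    { id: 'open', action: 'navigate', value: 'https://example.com/listing' }
  ],
  // Executed again for each page
  perPageSteps: [
    {
      id: 'titles',
      action: 'data',
      object_type: 'class',
      object: 'listing-title', // hypothetical selector
      key: 'title',
      data_type: 'text'
    }
  ],
  pagination: {
    strategy: 'next',
    nextButton: { object_type: 'class', object: 'next-page', wait: 2000 },
    maxPages: 10
  }
};
```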
```typescript
interface BaseStep {
  id: string;
  description?: string;
  object_type?: SelectorType; // 'id' | 'class' | 'tag' | 'xpath'
  object?: string;
  action: 'navigate' | 'input' | 'click' | 'data' | 'scroll' | 'download' | 'foreach' | 'open' | 'savePDF' | 'printToPDF';
  value?: string;
  key?: string;
  data_type?: DataType; // 'text' | 'html' | 'value' | 'default'
  wait?: number;
  terminateonerror?: boolean;
  subSteps?: BaseStep[];
}
```
```typescript
interface RunOptions {
  browser?: LaunchOptions;
  onResult?: (result: Record<string, any>, index: number) => void | Promise<void>;
}
```
**`navigate`**: Navigate to a URL.

```typescript
{
  id: 'go_to_page',
  action: 'navigate',
  value: 'https://example.com'
}
```
**`input`**: Fill form fields.

```typescript
{
  id: 'search',
  action: 'input',
  object_type: 'id',
  object: 'search-box',
  value: 'search term'
}
```
**`click`**: Click on elements.

```typescript
{
  id: 'submit',
  action: 'click',
  object_type: 'class',
  object: 'submit-button'
}
```
**`data`**: Extract data from elements.

```typescript
{
  id: 'get_title',
  action: 'data',
  object_type: 'tag',
  object: 'h1',
  key: 'title',
  data_type: 'text'
}
```
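Selectors are not limited to IDs, classes, and tags; `object_type: 'xpath'` accepts an XPath expression. A sketch with a hypothetical XPath:

```typescript
{
  id: 'get_price',
  action: 'data',
  object_type: 'xpath',
  object: "//div[@class='price']/span", // hypothetical XPath
  key: 'price',
  data_type: 'text'
}
```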
**`foreach`**: Process multiple elements.

```typescript
{
  id: 'process_items',
  action: 'foreach',
  object_type: 'class',
  object: 'item',
  subSteps: [
    // Steps to execute for each item
  ]
}
```
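For instance, to collect a title from every matched item (a sketch; the selectors are hypothetical, and sub-steps are assumed to run against each matched element in turn):

```typescript
{
  id: 'collect_items',
  action: 'foreach',
  object_type: 'class',
  object: 'item',
  subSteps: [
    // Assumed: each sub-step is evaluated relative to the current item
    {
      id: 'item_title',
      action: 'data',
      object_type: 'tag',
      object: 'h2', // hypothetical selector
      key: 'item_title',
      data_type: 'text'
    }
  ]
}
```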
**`download`**: Download a file; directories in the destination path are created automatically.

```typescript
{
  id: 'download_file',
  action: 'download',
  object_type: 'class',
  object: 'download-link',
  value: './downloads/file.pdf',
  key: 'downloaded_file'
}
```
**`savePDF`**: Save the current page as a PDF.

```typescript
{
  id: 'save_pdf',
  action: 'savePDF',
  value: './output/page.pdf',
  key: 'pdf_file'
}
```
**`printToPDF`**: Trigger a print-to-PDF action via an element, such as a print button.

```typescript
{
  id: 'print_pdf',
  action: 'printToPDF',
  object_type: 'id',
  object: 'print-button',
  value: './output/printed.pdf',
  key: 'printed_file'
}
```
Next-button pagination clicks the configured element to advance between pages:

```typescript
pagination: {
  strategy: 'next',
  nextButton: {
    object_type: 'class',
    object: 'next-page',
    wait: 2000
  },
  maxPages: 10
}
```
Scroll-based pagination scrolls the page by a configurable offset, pausing between scrolls:

```typescript
pagination: {
  strategy: 'scroll',
  scroll: {
    offset: 800,
    delay: 1500
  },
  maxPages: 5
}
```
Route traffic through a proxy:

```typescript
const results = await runScraper(templates, {
  browser: {
    proxy: {
      server: 'http://proxy-server:8080',
      username: 'user',
      password: 'pass'
    }
  }
});
```
Pass any Playwright launch options, for example to run headed and slow down actions:

```typescript
const results = await runScraper(templates, {
  browser: {
    headless: false,
    slowMo: 1000,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  }
});
```
Process each result as it arrives instead of waiting for the full run to finish:

```typescript
await runScraperWithCallback(templates, async (result, index) => {
  console.log(`Result ${index}:`, result);
  // Process result immediately
}, {
  browser: { headless: true }
});
```
Use data collected under a `key` in subsequent steps via `{{key}}` placeholders:

```typescript
{
  id: 'save_with_title',
  action: 'savePDF',
  value: './output/{{meeting_title}}.pdf',
  key: 'meeting_pdf'
}
```
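The `{{meeting_title}}` placeholder assumes an earlier step stored a value under that key, for example (hypothetical selector):

```typescript
{
  id: 'get_meeting_title',
  action: 'data',
  object_type: 'class',
  object: 'meeting-title', // hypothetical selector
  key: 'meeting_title',    // later referenced as {{meeting_title}}
  data_type: 'text'
}
```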
Steps can be configured to terminate on error:

```typescript
{
  id: 'critical_step',
  action: 'click',
  object_type: 'id',
  object: 'important-button',
  terminateonerror: true
}
```
```bash
# Install dependencies
pnpm install

# Build the project
pnpm build

# Run tests
pnpm test

# Run tests in watch mode
pnpm test:watch

# Lint code
pnpm lint

# Format code
pnpm format
```
```bash
# Run all tests
pnpm test

# Run tests with coverage
pnpm test:coverage

# Run specific test file
pnpm test scraper.test.ts
```
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Run the test suite
6. Submit a pull request
MIT License - see LICENSE file for details.
- 🐛 Issues: GitHub Issues