Framework-Island/stepwright

StepWright

StepWright is a web scraping library built on Playwright. It provides a declarative, step-by-step approach to web automation and data extraction.

Features

  • 🚀 Declarative Scraping: Define scraping workflows using JSON templates
  • 🔄 Pagination Support: Built-in support for next button and scroll-based pagination
  • 📊 Data Collection: Extract text, HTML, values, and files from web pages
  • 🔗 Multi-tab Support: Handle multiple tabs and complex navigation flows
  • 📄 PDF Generation: Save pages as PDFs or trigger print-to-PDF actions
  • 📥 File Downloads: Download files with automatic directory creation
  • 🔁 Looping & Iteration: ForEach loops for processing multiple elements
  • 📡 Streaming Results: Real-time result processing with callbacks
  • 🎯 Error Handling: Graceful error handling with configurable termination
  • 🔧 Flexible Selectors: Support for ID, class, tag, and XPath selectors

Installation

# Using pnpm (recommended)
pnpm add stepwright

# Using npm
npm install stepwright

# Using yarn
yarn add stepwright

Quick Start

Basic Usage

import { runScraper } from 'stepwright';

const templates = [
  {
    tab: 'example',
    steps: [
      {
        id: 'navigate',
        action: 'navigate',
        value: 'https://example.com'
      },
      {
        id: 'get_title',
        action: 'data',
        object_type: 'tag',
        object: 'h1',
        key: 'title',
        data_type: 'text'
      }
    ]
  }
];

// Top-level await requires an ES module; otherwise wrap these lines in an async function.
const results = await runScraper(templates);
console.log(results);

Examples

Basic Examples

The repository includes basic examples demonstrating core functionality:

  • Basic Usage (TypeScript): examples/basic-usage.ts - Simple navigation and data extraction
  • Basic Usage (JavaScript): examples/basic-usage.js - Same example in JavaScript

Run the examples:

# Run all examples
./examples/run-examples.sh

# Or run individual examples
node examples/basic-usage.js
npx tsx examples/basic-usage.ts

Advanced Examples

For more complex scenarios, check out:

  • Advanced Usage (TypeScript): examples/advanced-usage.ts - Pagination, file downloads, and multi-tab handling
  • Advanced Usage (JavaScript): examples/advanced-usage.js - Same advanced features in JavaScript

API Reference

Core Functions

runScraper(templates, options?)

Runs the given templates and resolves with the collected results.

Parameters:

  • templates: Array of TabTemplate objects
  • options: Optional RunOptions object

Returns: Promise<Record<string, any>[]>
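
Because the return value is a plain array of records, it can be post-processed with ordinary array methods. The sketch below shows one possible way to flatten such an array into CSV text. The toCsv helper and the sample data are hypothetical and are not part of StepWright.

```typescript
// A scraped record, matching the documented return type of runScraper.
type ScrapedRecord = Record<string, any>;

// Hypothetical helper: flatten results into CSV text for the chosen columns.
function toCsv(results: ScrapedRecord[], columns: string[]): string {
  const escapeField = (value: unknown): string => {
    const text = value === undefined || value === null ? '' : String(value);
    // Quote fields that contain commas, quotes, or line breaks.
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };
  const header = columns.map(escapeField).join(',');
  const rows = results.map((record) =>
    columns.map((column) => escapeField(record[column])).join(','),
  );
  return [header, ...rows].join('\n');
}

// Sample data standing in for the output of runScraper(templates).
const sampleResults: ScrapedRecord[] = [
  { title: 'Example Domain' },
  { title: 'Hello, "world"' },
];
const csv = toCsv(sampleResults, ['title']);
```

Keys missing from a record become empty fields, so records with differing shapes can share one column list.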

runScraperWithCallback(templates, onResult, options?)

Execute scraping with streaming results via callback.

Parameters:

  • templates: Array of TabTemplate objects
  • onResult: Callback function for each result
  • options: Optional RunOptions object

Types

TabTemplate

interface TabTemplate {
  tab: string;
  initSteps?: BaseStep[];      // Steps executed once before pagination
  perPageSteps?: BaseStep[];   // Steps executed for each page
  steps?: BaseStep[];          // Legacy single steps array
  pagination?: PaginationConfig;
}

BaseStep

interface BaseStep {
  id: string;
  description?: string;
  object_type?: SelectorType;  // 'id' | 'class' | 'tag' | 'xpath'
  object?: string;
  action: 'navigate' | 'input' | 'click' | 'data' | 'scroll' | 'download' | 'foreach' | 'open' | 'savePDF' | 'printToPDF';
  value?: string;
  key?: string;
  data_type?: DataType;        // 'text' | 'html' | 'value' | 'default'
  wait?: number;
  terminateonerror?: boolean;
  subSteps?: BaseStep[];
}
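
To illustrate how object_type and object might combine into a single selector, the sketch below maps each SelectorType to a selector string in the syntax Playwright accepts. The toSelector helper is hypothetical, and the library's actual resolution logic may differ.

```typescript
// Local copy of the selector types documented above.
type SelectorType = 'id' | 'class' | 'tag' | 'xpath';

// Hypothetical helper: build a Playwright-compatible selector string.
function toSelector(objectType: SelectorType, object: string): string {
  switch (objectType) {
    case 'id':
      return `#${object}`; // CSS id selector
    case 'class':
      return `.${object}`; // CSS class selector
    case 'tag':
      return object; // a bare tag name is already a valid CSS selector
    case 'xpath':
      return `xpath=${object}`; // explicit XPath engine prefix
  }
}

const searchBox = toSelector('id', 'search-box');
const heading = toSelector('xpath', '//h1');
```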

RunOptions

interface RunOptions {
  browser?: LaunchOptions;
  onResult?: (result: Record<string, any>, index: number) => void | Promise<void>;
}

Step Actions

Navigate

Navigate to a URL.

{
  id: 'go_to_page',
  action: 'navigate',
  value: 'https://example.com'
}

Input

Fill form fields.

{
  id: 'search',
  action: 'input',
  object_type: 'id',
  object: 'search-box',
  value: 'search term'
}

Click

Click on elements.

{
  id: 'submit',
  action: 'click',
  object_type: 'class',
  object: 'submit-button'
}

Data Extraction

Extract data from elements.

{
  id: 'get_title',
  action: 'data',
  object_type: 'tag',
  object: 'h1',
  key: 'title',
  data_type: 'text'
}

ForEach Loop

Run a list of sub-steps for every element that matches the selector.

{
  id: 'process_items',
  action: 'foreach',
  object_type: 'class',
  object: 'item',
  subSteps: [
    // Steps to execute for each item
  ]
}
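
As a sketch of what a populated subSteps array might look like, the step below collects a title and an HTML body from every element with class item. The class names item-title and item-body and the keys are hypothetical. It also assumes that sub-step selectors are resolved relative to the current item, which is not stated above.

```typescript
const processItems = {
  id: 'process_items',
  action: 'foreach',
  object_type: 'class',
  object: 'item',
  subSteps: [
    {
      id: 'item_title',
      action: 'data',
      object_type: 'class',
      object: 'item-title', // hypothetical class name
      key: 'title',
      data_type: 'text',
    },
    {
      id: 'item_body',
      action: 'data',
      object_type: 'class',
      object: 'item-body', // hypothetical class name
      key: 'body_html',
      data_type: 'html',
    },
  ],
};
```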

File Operations

Download

{
  id: 'download_file',
  action: 'download',
  object_type: 'class',
  object: 'download-link',
  value: './downloads/file.pdf',
  key: 'downloaded_file'
}

Save PDF

{
  id: 'save_pdf',
  action: 'savePDF',
  value: './output/page.pdf',
  key: 'pdf_file'
}

Print to PDF

{
  id: 'print_pdf',
  action: 'printToPDF',
  object_type: 'id',
  object: 'print-button',
  value: './output/printed.pdf',
  key: 'printed_file'
}

Pagination

Next Button Pagination

pagination: {
  strategy: 'next',
  nextButton: {
    object_type: 'class',
    object: 'next-page',
    wait: 2000
  },
  maxPages: 10
}

Scroll Pagination

pagination: {
  strategy: 'scroll',
  scroll: {
    offset: 800,
    delay: 1500
  },
  maxPages: 5
}
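
To show where a pagination block sits, here is a sketch of a complete template that combines initSteps, perPageSteps, and next-button pagination, following the TabTemplate interface above. The URL, tab name, and keys are placeholders.

```typescript
const paginatedTemplate = {
  tab: 'listing',
  // Executed once, before pagination starts.
  initSteps: [
    { id: 'open_listing', action: 'navigate', value: 'https://example.com/articles' },
  ],
  // Executed for each page.
  perPageSteps: [
    {
      id: 'get_heading',
      action: 'data',
      object_type: 'tag',
      object: 'h1',
      key: 'heading',
      data_type: 'text',
    },
  ],
  pagination: {
    strategy: 'next',
    nextButton: { object_type: 'class', object: 'next-page', wait: 2000 },
    maxPages: 3,
  },
};
```

The template would then be passed to runScraper inside an array, in the same way as the Quick Start example.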

Advanced Features

Proxy Support

const results = await runScraper(templates, {
  browser: {
    proxy: {
      server: 'http://proxy-server:8080',
      username: 'user',
      password: 'pass'
    }
  }
});

Custom Browser Options

const results = await runScraper(templates, {
  browser: {
    headless: false,
    slowMo: 1000,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  }
});

Streaming Results

await runScraperWithCallback(templates, async (result, index) => {
  console.log(`Result ${index}:`, result);
  // Process result immediately
}, {
  browser: { headless: true }
});
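
A streaming callback is a natural place to persist results as they arrive. The sketch below builds a callback that serializes each result as one JSON line and hands it to a caller-supplied write function. The makeJsonlSink helper is hypothetical, and the OnResult type is a local copy of the callback shape shown in RunOptions.

```typescript
// Callback shape documented for onResult in RunOptions.
type OnResult = (result: Record<string, any>, index: number) => void | Promise<void>;

// Hypothetical helper: emit each result as one JSON line through `write`.
function makeJsonlSink(write: (line: string) => void): OnResult {
  return (result, index) => {
    // Nest the record so its own keys cannot collide with `index`.
    write(JSON.stringify({ index, result }) + '\n');
  };
}

// Demonstration with an in-memory buffer instead of a real file or stream.
const lines: string[] = [];
const sink = makeJsonlSink((line) => lines.push(line));
sink({ title: 'Example Domain' }, 0);
```

In real use the returned function would be passed as the onResult argument, with write bound to a file or stream writer.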

Data Placeholders

Reference previously collected values in later steps with {{key}} placeholders:

{
  id: 'save_with_title',
  action: 'savePDF',
  value: './output/{{meeting_title}}.pdf',
  key: 'meeting_pdf'
}
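
To make the substitution concrete, the sketch below shows one way {{key}} placeholders could be resolved against previously collected data. The fillPlaceholders helper is hypothetical. Details such as whitespace tolerance and the treatment of unknown keys are assumptions, not documented StepWright behavior.

```typescript
// Hypothetical helper: replace {{key}} with the matching collected value.
function fillPlaceholders(template: string, collected: Record<string, unknown>): string {
  return template.replace(/\{\{\s*([\w.-]+)\s*\}\}/g, (match: string, key: string) => {
    const value = collected[key];
    // Assumption: leave unknown placeholders untouched rather than emit "undefined".
    return value === undefined || value === null ? match : String(value);
  });
}

const pdfPath = fillPlaceholders('./output/{{meeting_title}}.pdf', {
  meeting_title: 'Budget Review',
});
```

When a placeholder ends up in a file path, the collected value may contain characters such as slashes that are worth sanitizing before use.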

Error Handling

Set terminateonerror to true on a step to stop the run when that step fails:

{
  id: 'critical_step',
  action: 'click',
  object_type: 'id',
  object: 'important-button',
  terminateonerror: true
}
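
The control flow implied by terminateonerror can be sketched as a loop that rethrows only for critical steps. The runSteps function and its execute parameter are hypothetical stand-ins. That failures of non-critical steps are recorded and skipped is an assumption, not confirmed library behavior.

```typescript
// Minimal local step shape; only the fields this sketch needs.
interface Step {
  id: string;
  terminateonerror?: boolean;
}

// Hypothetical runner: `execute` stands in for whatever performs a step.
async function runSteps(
  steps: Step[],
  execute: (step: Step) => Promise<void>,
): Promise<string[]> {
  const failed: string[] = [];
  for (const step of steps) {
    try {
      await execute(step);
    } catch (error) {
      // Critical steps abort the whole run by rethrowing.
      if (step.terminateonerror) throw error;
      // Assumption: other failures are recorded and the run continues.
      failed.push(step.id);
    }
  }
  return failed;
}
```

The returned list of failed step ids gives the caller a way to report partial failures without losing the data collected by the steps that succeeded.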

Development

Setup

# Install dependencies
pnpm install

# Build the project
pnpm build

# Run tests
pnpm test

# Run tests in watch mode
pnpm test:watch

# Lint code
pnpm lint

# Format code
pnpm format

Testing

# Run all tests
pnpm test

# Run tests with coverage
pnpm test:coverage

# Run specific test file
pnpm test scraper.test.ts

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite
  6. Submit a pull request

License

MIT License - see LICENSE file for details.

Support