Reddit Scraper

A Python Reddit scraper with dual-mode architecture: simple requests for small jobs, async + proxy rotation for large-scale scraping. Features captcha solving, rich CLI, and smart job-size detection.

Python 3.8+ | License: MIT | Code Style: Black

Features

Multiple Scraping Methods

  • JSON Endpoint Scraper - Fast scraping using Reddit's .json endpoints (no authentication required); see the sketch below
  • Advanced Requests Scraper - Custom pagination and bulk scraping capabilities
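
These endpoints are ordinary Reddit pages with .json appended. As a minimal illustration of the underlying request (plain requests outside this project's scraper classes; a descriptive User-Agent is sent because Reddit throttles the default one):

import requests

# Any subreddit listing is available as JSON by appending .json to the URL.
url = "https://www.reddit.com/r/python/hot.json"
headers = {"User-Agent": "RedditScraper/1.0.0"}  # Reddit throttles the default requests UA

response = requests.get(url, headers=headers, params={"limit": 10}, timeout=10)
response.raise_for_status()

for child in response.json()["data"]["children"]:
    post = child["data"]
    print(post["score"], post["title"])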

Advanced Capabilities

  • Proxy Rotation - Automatic proxy switching with health monitoring
  • Captcha Solving - Automated captcha handling using Capsolver API
  • User Agent Rotation - Realistic browser simulation
  • Rate Limiting - Respectful request throttling
  • Rich CLI Interface - Beautiful command-line interface with progress bars
  • Multiple Export Formats - JSON and CSV output with full comment thread data

Installation

Using uv (Recommended)

git clone https://github.com/proxidize/reddit-scraper.git
cd reddit-scraper

uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

uv pip install -e .

Using pip

git clone https://github.com/proxidize/reddit-scraper.git
cd reddit-scraper

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

pip install -e .

Development

Setup for Development

pip install -e .[dev]     # with pip

uv pip install -e .[dev]  # or with uv

Running Tests

python tests/run_tests.py                    # full suite via the bundled runner

pytest tests/ -v --cov=reddit_scraper        # full suite with coverage

pytest tests/unit/ -v -m unit                # fast unit tests only
pytest tests/integration/ -v -m integration  # integration tests (may hit external APIs)
pytest tests/ -v -m "not slow"               # skip tests marked slow

pytest tests/ --cov=reddit_scraper --cov-report=html   # HTML coverage report

Test Markers

  • unit - Fast unit tests
  • integration - Integration tests that may hit external APIs
  • slow - Slow tests that should be skipped in CI

Docker Support

Building and Running with Docker

docker build -t reddit-scraper .

# Interactive mode, mounting your config into the container
docker run -v $(pwd)/config.json:/app/config.json reddit-scraper interactive --config config.json

# One-off scrape
docker run -v $(pwd)/config.json:/app/config.json reddit-scraper json subreddit python --limit 10 --config config.json

# Mount an output directory as well to persist results
docker run -v $(pwd)/config.json:/app/config.json -v $(pwd)/output:/app/output reddit-scraper json subreddit python --limit 10 --output output/posts.json --config config.json

Quick Start

1. Interactive Mode (Recommended)

python3 -m reddit_scraper.cli interactive

python3 -m reddit_scraper.cli interactive --config config.json

2. Direct Commands

python3 -m reddit_scraper.cli json subreddit python --limit 10

python3 -m reddit_scraper.cli json subreddit technology --config config.json --limit 50

Note: If you've installed the package with pip install -e ., you can invoke reddit-scraper directly instead of python3 -m reddit_scraper.cli.

Configuration

The scraper uses a JSON configuration file to manage all settings including proxies, captcha solvers, and scraping preferences.

Copy config.example.json to config.json and edit:

{
  "proxies": [
    {
      "host": "proxy1.example.com",
      "port": 8080,
      "username": "your_proxy_username",
      "password": "your_proxy_password",
      "proxy_type": "http"
    },
    {
      "host": "proxy2.example.com",
      "port": 1080,
      "username": "your_proxy_username",
      "password": "your_proxy_password",
      "proxy_type": "socks5"
    }
  ],
  "captcha_solvers": [
    {
      "api_key": "CAP-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
      "provider": "capsolver",
      "site_keys": {
        "reddit.com": "6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-",
        "www.reddit.com": "6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-"
      }
    }
  ],
  "scraping": {
    "default_delay": 1.0,
    "max_retries": 3,
    "requests_per_minute": 60,
    "user_agent": "RedditScraper/1.0.0",
    "rotate_user_agents": true
  }
}
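
Within the project, get_config_manager loads and validates this file; as a rough standalone sketch of what that validation involves (a hypothetical helper, not the project's actual loader):

import json

REQUIRED_SCRAPING_KEYS = {"default_delay", "max_retries", "requests_per_minute"}

def load_config(path="config.json"):
    """Load the JSON config and sanity-check the sections the scraper relies on."""
    with open(path) as f:
        config = json.load(f)

    missing = REQUIRED_SCRAPING_KEYS - set(config.get("scraping", {}))
    if missing:
        raise ValueError(f"'scraping' section is missing keys: {sorted(missing)}")

    for proxy in config.get("proxies", []):
        if not {"host", "port"} <= set(proxy):
            raise ValueError(f"proxy entry missing host/port: {proxy}")

    return config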

Key Features

  • Multiple Proxies: Add multiple HTTP and SOCKS5 proxies for automatic rotation
  • Captcha Solving: Integrate with Capsolver for automated captcha handling with custom site keys
  • Input Validation: Automatic validation of subreddit names, usernames, and other inputs
  • Flexible Configuration: Easy JSON-based configuration management with validation
  • Health Monitoring: Built-in proxy health checking and performance monitoring

Setup Commands

cp config.example.json config.json   # start from the example config

nano config.json                     # add your proxies and captcha API key

python3 -m reddit_scraper.cli status --config config.json   # verify the config loads

Data Validation & Processing

The scraper includes robust input validation and data processing capabilities:

Input Validation

  • Subreddit Names: Validates format, length (1-21 chars), and checks for reserved names
  • Usernames: Validates Reddit username format (3-20 chars, alphanumeric plus underscore/hyphen); both rules are sketched after this list
  • Post IDs: Ensures proper Reddit post ID format
  • URLs: Validates and normalizes Reddit URLs
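
As a rough illustration, the subreddit and username rules can be expressed as simple patterns (a sketch of the stated rules, not the project's actual validator code):

import re

# Patterns encode the rules above; the project's validators may be stricter
# (e.g. reserved-name checks for subreddits).
SUBREDDIT_RE = re.compile(r"^[A-Za-z0-9_]{1,21}$")
USERNAME_RE = re.compile(r"^[A-Za-z0-9_-]{3,20}$")

def is_valid_subreddit(name: str) -> bool:
    return bool(SUBREDDIT_RE.match(name))

def is_valid_username(name: str) -> bool:
    return bool(USERNAME_RE.match(name))

assert is_valid_subreddit("python")
assert not is_valid_subreddit("invalid-name!")  # hyphen and '!' are rejected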

Data Processing

  • Comment Threading: Maintains proper parent-child relationships in comment trees (see the sketch after this list)
  • Data Cleaning: Removes unnecessary metadata while preserving essential information
  • Field Standardization: Consistent field names and data types across all scraped content
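
To make the threading concrete, here is a minimal sketch of nesting a flat comment list using Reddit's parent_id convention; the field names match the sample output shown later, but the helper itself is hypothetical:

def build_comment_tree(flat_comments, post_id):
    """Nest a flat comment list into parent-child trees via Reddit's parent_id.

    parent_id carries a type prefix: t3_<post_id> points at the post itself,
    t1_<comment_id> at a parent comment.
    """
    by_id = {c["id"]: {**c, "replies": []} for c in flat_comments}
    roots = []
    for comment in by_id.values():
        if comment["parent_id"] == f"t3_{post_id}":
            roots.append(comment)                           # top-level comment
        else:
            parent = by_id.get(comment["parent_id"].split("_", 1)[1])
            if parent:
                parent["replies"].append(comment)           # nested reply
    return roots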

Error Handling

from reddit_scraper import JSONScraper, ValidationError

scraper = JSONScraper()

try:
    posts = scraper.scrape_subreddit("invalid-name!", "hot", 10)
except ValidationError as e:
    print(f"Validation error: {e}")

Available Commands

Interactive Mode

python3 -m reddit_scraper.cli interactive [--config CONFIG_FILE]

Core Scraping Commands

JSON Scraper (Fastest)

python3 -m reddit_scraper.cli json subreddit SUBREDDIT_NAME [--config CONFIG_FILE] [options]

python3 -m reddit_scraper.cli json user USERNAME [options]

python3 -m reddit_scraper.cli json comments SUBREDDIT POST_ID [options]

python3 -m reddit_scraper.cli json subreddit-with-comments SUBREDDIT_NAME [options]

Comment Scraping

Extract rich comment data with full thread structure:

python3 -m reddit_scraper.cli json subreddit-with-comments python --limit 10 --include-comments --comment-limit 20 --output posts_with_comments.json

python3 -m reddit_scraper.cli json comments python POST_ID --sort best --output single_post_comments.json

python3 -m reddit_scraper.cli json user username --limit 25 --sort top --output user_posts.json

Comment Data Includes:

  • Author information and scores
  • Full comment text and timestamps
  • Nested reply structure
  • Thread hierarchy and relationships
  • Community engagement metrics

Real Example (Actual Scraped Data):

{
  "title": "A simple home server to wirelessly stream any video file",
  "author": "Enzo10091",
  "score": 8,
  "num_comments": 1,
  "comment_count_scraped": 1,
  "comments": [
    {
      "id": "lwg8h3x",
      "author": "ismail_the_whale",
      "body": "nice, but you really have to clean this up. i guess you're not a python dev.\n\n- use snake_case\n- use a pyproject.toml file",
      "score": 2,
      "created_utc": 1755262448.0,
      "parent_id": "t3_1mqw7zr",
      "replies": []
    }
  ]
}

Advanced Requests Scraper (Best for Bulk)

python3 -m reddit_scraper.cli requests paginated SUBREDDIT_NAME [options]

Utility Commands

System Health & Status

python3 -m reddit_scraper.cli status --config config.json

python3 -m reddit_scraper.cli test-proxies --config config.json --test-urls 3


Global Search

python3 -m reddit_scraper.cli search "python tips" --subreddit python

python3 -m reddit_scraper.cli search "neural networks" --subreddit MachineLearning

Current Reddit Restrictions

Reddit employs several protections against automated scraping:

  • Some subreddits may trigger captcha challenges (r/webscraping, etc.)
  • Large bulk requests may hit rate limits
  • Search endpoints work but may be slower than direct scraping

Recommended approach:

  • Use interactive mode for best success rate
  • Start with popular, stable subreddits like python, technology
  • Use proxies and captcha solving for reliable large-scale scraping
  • Search functionality works well for targeted queries

Working Examples (Tested)

python3 -m reddit_scraper.cli interactive --config config.json

python3 -m reddit_scraper.cli json subreddit python --limit 10
python3 -m reddit_scraper.cli json subreddit technology --config config.json --limit 50

python3 -m reddit_scraper.cli search "python tips" --subreddit python

python3 -m reddit_scraper.cli requests paginated python --max-posts 100

python3 -m reddit_scraper.cli status --config config.json
python3 -m reddit_scraper.cli test-proxies --config config.json

Subreddits that work well:

  • python, programming, technology
  • news, todayilearned
  • entrepreneur, startups

Command Options

Common Options

  • --config, -c - Path to configuration file
  • --output, -o - Output file path
  • --format - Output format (json, csv)
  • --limit - Number of items to fetch
  • --sort - Sort method (hot, new, top, rising, etc.)
  • --delay - Delay between requests (seconds)
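
These options compose; for example, a throttled CSV export of top posts might look like this (illustrative, using only the flags listed above):

python3 -m reddit_scraper.cli json subreddit python --limit 25 --sort top --format csv --output posts.csv --delay 2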

Python API

Basic Usage

from reddit_scraper import JSONScraper, get_config_manager, setup_advanced_features

# Simple mode: no proxies or captcha solving
scraper = JSONScraper()
posts = scraper.scrape_subreddit("python", "hot", 50)

# Advanced mode: wire in proxy rotation and captcha solving from config.json
config_manager = get_config_manager("config.json")
proxy_manager, captcha_solver = setup_advanced_features(config_manager)

advanced_scraper = JSONScraper(
    proxy_manager=proxy_manager,
    captcha_solver=captcha_solver,
    delay=config_manager.get_scraping_config().default_delay
)

posts = advanced_scraper.scrape_subreddit("MachineLearning", "top", 1000)
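
Results can then be exported however you like; a minimal CSV sketch, assuming posts is a list of dicts with the fields shown in the sample output earlier:

import csv

# Hypothetical export helper: keep only the listed fields, ignore the rest.
fields = ["title", "author", "score", "num_comments"]
with open("posts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(posts)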

Proxy Management

from reddit_scraper import ProxyManager

proxy_manager = ProxyManager()
proxy_manager.add_proxy("proxy.example.com", 8080, "user", "pass", "http")  # host, port, username, password, proxy type

proxy_manager.health_check_all()        # probe every proxy before scraping
stats = proxy_manager.get_proxy_stats()
print(f"Healthy proxies: {stats['healthy_proxies']}/{stats['total_proxies']}")

Captcha Solving

from reddit_scraper import CaptchaSolverManager

solver = CaptchaSolverManager("YOUR_CAPSOLVER_API_KEY")

# Verifies the Capsolver account balance, then solves a reCAPTCHA v2 challenge
solution = solver.check_balance_and_solve(
    solver.solver.solve_recaptcha_v2,
    "https://reddit.com",
    "site_key_here"
)

if solution.success:
    print(f"Captcha solved: {solution.solution}")

Best Practices

Ethical Scraping

  • Always respect Reddit's Terms of Service
  • Don't overload Reddit's servers
  • Consider using the official API for commercial use

Rate Limiting

  • Default: 1 second delay between requests (default_delay in config.json)
  • Increase the delay for large-scale operations
  • Monitor proxy health to avoid IP bans; a simple backoff sketch follows this list
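
If you still hit HTTP 429 responses, a standard backoff pattern looks like this (a generic sketch with plain requests, not this project's internal retry logic; the defaults mirror max_retries and default_delay from the example config):

import time
import requests

def get_with_backoff(url, headers, max_retries=3, base_delay=1.0):
    """Retry on HTTP 429, doubling the wait each attempt (1s, 2s, 4s, ...)."""
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code != 429:
            return response
        time.sleep(base_delay * 2 ** attempt)
    return response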

Data Usage

  • Store scraped data responsibly
  • Respect user privacy
  • Don't republish personal information

Troubleshooting

Common Issues

"No healthy proxies available"

reddit-scraper test-proxies

reddit-scraper status

"Captcha solver balance error"

reddit-scraper status

Rate limiting errors

  • Increase --delay parameter
  • Use configuration file with multiple proxies
  • Reduce --limit per request

API Documentation

Capsolver Integration

This project integrates with Capsolver for automated captcha solving, supporting:

  • reCAPTCHA v2/v3
  • hCaptcha
  • FunCaptcha
  • Image-to-text captchas

Reddit API Compatibility

Works with Reddit's public JSON endpoints, so data access is free and requires no API credentials.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

License

This project is for educational and research purposes. Please respect Reddit's Terms of Service and robots.txt.

Blog Post

For a detailed walkthrough of how this Reddit scraper was built, including the challenges faced and solutions implemented, read our comprehensive blog post:

Reddit Scraper: How to Scrape Reddit for Free

The blog post covers:

  • Why Python was chosen for this project
  • How pagination problems were solved
  • Different approaches for small vs large scraping jobs
  • Proxy rotation and error handling strategies
  • Real-world examples and use cases

Support

For issues, questions, or feature requests, please open an issue on GitHub or contact [email protected].


Note: This tool is designed for ethical data collection and research purposes. Always comply with Reddit's Terms of Service and respect rate limits.
