Reddit Scraper

A Python Reddit scraper with dual-mode architecture: simple requests for small jobs, async + proxy rotation for large-scale scraping. Features captcha solving, rich CLI, and smart job-size detection.

Python 3.8+ | License: MIT | Code Style: Black

Features

Multiple Scraping Methods

  • JSON Endpoint Scraper - Fast scraping using Reddit's .json endpoints (no authentication required); see the sketch below
  • Advanced Requests Scraper - Custom pagination and bulk scraping capabilities
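
These endpoints are ordinary Reddit pages with .json appended. As a minimal illustration of the underlying request (plain requests outside this project's scraper classes; a descriptive User-Agent is sent because Reddit throttles the default one):

import requests

# Any subreddit listing is available as JSON by appending .json to the URL.
url = "https://www.reddit.com/r/python/hot.json"
headers = {"User-Agent": "RedditScraper/1.0.0"}  # Reddit throttles the default requests UA

response = requests.get(url, headers=headers, params={"limit": 10}, timeout=10)
response.raise_for_status()

for child in response.json()["data"]["children"]:
    post = child["data"]
    print(post["score"], post["title"])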

Advanced Capabilities

  • Proxy Rotation - Automatic proxy switching with health monitoring
  • Captcha Solving - Automated captcha handling using Capsolver API
  • User Agent Rotation - Realistic browser simulation
  • Rate Limiting - Respectful request throttling
  • Rich CLI Interface - Beautiful command-line interface with progress bars
  • Multiple Export Formats - JSON and CSV output with full comment thread data

Installation

Using uv (Recommended)

git clone https://github.com/proxidize/reddit-scraper.git
cd reddit-scraper

uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

uv pip install -e .

Using pip

git clone https://github.com/proxidize/reddit-scraper.git
cd reddit-scraper

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

pip install -e .

Development

Setup for Development

pip install -e .[dev]     # with pip

uv pip install -e .[dev]  # or with uv

Running Tests

python tests/run_tests.py                    # full suite via the bundled runner

pytest tests/ -v --cov=reddit_scraper        # full suite with coverage

pytest tests/unit/ -v -m unit                # fast unit tests only
pytest tests/integration/ -v -m integration  # integration tests (may hit external APIs)
pytest tests/ -v -m "not slow"               # skip tests marked slow

pytest tests/ --cov=reddit_scraper --cov-report=html   # HTML coverage report

Test Markers

  • unit - Fast unit tests
  • integration - Integration tests that may hit external APIs
  • slow - Slow tests that should be skipped in CI

Docker Support

Building and Running with Docker

docker build -t reddit-scraper .

# Interactive mode, mounting your config into the container
docker run -v $(pwd)/config.json:/app/config.json reddit-scraper interactive --config config.json

# One-off scrape
docker run -v $(pwd)/config.json:/app/config.json reddit-scraper json subreddit python --limit 10 --config config.json

# Mount an output directory as well to persist results
docker run -v $(pwd)/config.json:/app/config.json -v $(pwd)/output:/app/output reddit-scraper json subreddit python --limit 10 --output output/posts.json --config config.json

Quick Start

1. Interactive Mode (Recommended)

python3 -m reddit_scraper.cli interactive

python3 -m reddit_scraper.cli interactive --config config.json

2. Direct Commands

python3 -m reddit_scraper.cli json subreddit python --limit 10

python3 -m reddit_scraper.cli json subreddit technology --config config.json --limit 50

Note: If you've installed the package with pip install -e ., you can invoke reddit-scraper directly instead of python3 -m reddit_scraper.cli.

Configuration

The scraper uses a JSON configuration file to manage all settings including proxies, captcha solvers, and scraping preferences.

Copy config.example.json to config.json and edit:

{
  "proxies": [
    {
      "host": "proxy1.example.com",
      "port": 8080,
      "username": "your_proxy_username",
      "password": "your_proxy_password",
      "proxy_type": "http"
    },
    {
      "host": "proxy2.example.com",
      "port": 1080,
      "username": "your_proxy_username",
      "password": "your_proxy_password",
      "proxy_type": "socks5"
    }
  ],
  "captcha_solvers": [
    {
      "api_key": "CAP-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
      "provider": "capsolver",
      "site_keys": {
        "reddit.com": "6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-",
        "www.reddit.com": "6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-"
      }
    }
  ],
  "scraping": {
    "default_delay": 1.0,
    "max_retries": 3,
    "requests_per_minute": 60,
    "user_agent": "RedditScraper/1.0.0",
    "rotate_user_agents": true
  }
}
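
Within the project, get_config_manager loads and validates this file; as a rough standalone sketch of what that validation involves (a hypothetical helper, not the project's actual loader):

import json

REQUIRED_SCRAPING_KEYS = {"default_delay", "max_retries", "requests_per_minute"}

def load_config(path="config.json"):
    """Load the JSON config and sanity-check the sections the scraper relies on."""
    with open(path) as f:
        config = json.load(f)

    missing = REQUIRED_SCRAPING_KEYS - set(config.get("scraping", {}))
    if missing:
        raise ValueError(f"'scraping' section is missing keys: {sorted(missing)}")

    for proxy in config.get("proxies", []):
        if not {"host", "port"} <= set(proxy):
            raise ValueError(f"proxy entry missing host/port: {proxy}")

    return config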

Key Features

  • Multiple Proxies: Add multiple HTTP and SOCKS5 proxies for automatic rotation
  • Captcha Solving: Integrate with Capsolver for automated captcha handling with custom site keys
  • Input Validation: Automatic validation of subreddit names, usernames, and other inputs
  • Flexible Configuration: Easy JSON-based configuration management with validation
  • Health Monitoring: Built-in proxy health checking and performance monitoring

Setup Commands

cp config.example.json config.json   # start from the example config

nano config.json                     # add your proxies and captcha API key

python3 -m reddit_scraper.cli status --config config.json   # verify the config loads

Data Validation & Processing

The scraper includes robust input validation and data processing capabilities:

Input Validation

  • Subreddit Names: Validates format, length (1-21 chars), and checks for reserved names
  • Usernames: Validates Reddit username format (3-20 chars, alphanumeric plus underscore/hyphen); both rules are sketched after this list
  • Post IDs: Ensures proper Reddit post ID format
  • URLs: Validates and normalizes Reddit URLs
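
As a rough illustration, the subreddit and username rules can be expressed as simple patterns (a sketch of the stated rules, not the project's actual validator code):

import re

# Patterns encode the rules above; the project's validators may be stricter
# (e.g. reserved-name checks for subreddits).
SUBREDDIT_RE = re.compile(r"^[A-Za-z0-9_]{1,21}$")
USERNAME_RE = re.compile(r"^[A-Za-z0-9_-]{3,20}$")

def is_valid_subreddit(name: str) -> bool:
    return bool(SUBREDDIT_RE.match(name))

def is_valid_username(name: str) -> bool:
    return bool(USERNAME_RE.match(name))

assert is_valid_subreddit("python")
assert not is_valid_subreddit("invalid-name!")  # hyphen and '!' are rejected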

Data Processing

  • Comment Threading: Maintains proper parent-child relationships in comment trees (see the sketch after this list)
  • Data Cleaning: Removes unnecessary metadata while preserving essential information
  • Field Standardization: Consistent field names and data types across all scraped content
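
To make the threading concrete, here is a minimal sketch of nesting a flat comment list using Reddit's parent_id convention; the field names match the sample output shown later, but the helper itself is hypothetical:

def build_comment_tree(flat_comments, post_id):
    """Nest a flat comment list into parent-child trees via Reddit's parent_id.

    parent_id carries a type prefix: t3_<post_id> points at the post itself,
    t1_<comment_id> at a parent comment.
    """
    by_id = {c["id"]: {**c, "replies": []} for c in flat_comments}
    roots = []
    for comment in by_id.values():
        if comment["parent_id"] == f"t3_{post_id}":
            roots.append(comment)                           # top-level comment
        else:
            parent = by_id.get(comment["parent_id"].split("_", 1)[1])
            if parent:
                parent["replies"].append(comment)           # nested reply
    return roots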

Error Handling

from reddit_scraper import JSONScraper, ValidationError

scraper = JSONScraper()

try:
    posts = scraper.scrape_subreddit("invalid-name!", "hot", 10)
except ValidationError as e:
    print(f"Validation error: {e}")

Available Commands

Interactive Mode

python3 -m reddit_scraper.cli interactive [--config CONFIG_FILE]

Core Scraping Commands

JSON Scraper (Fastest)

python3 -m reddit_scraper.cli json subreddit SUBREDDIT_NAME [--config CONFIG_FILE] [options]

python3 -m reddit_scraper.cli json user USERNAME [options]

python3 -m reddit_scraper.cli json comments SUBREDDIT POST_ID [options]

python3 -m reddit_scraper.cli json subreddit-with-comments SUBREDDIT_NAME [options]

Comment Scraping

Extract rich comment data with full thread structure:

python3 -m reddit_scraper.cli json subreddit-with-comments python --limit 10 --include-comments --comment-limit 20 --output posts_with_comments.json

python3 -m reddit_scraper.cli json comments python POST_ID --sort best --output single_post_comments.json

python3 -m reddit_scraper.cli json user username --limit 25 --sort top --output user_posts.json

Comment Data Includes:

  • Author information and scores
  • Full comment text and timestamps
  • Nested reply structure
  • Thread hierarchy and relationships
  • Community engagement metrics

Real Example (Actual Scraped Data):

{
  "title": "A simple home server to wirelessly stream any video file",
  "author": "Enzo10091",
  "score": 8,
  "num_comments": 1,
  "comment_count_scraped": 1,
  "comments": [
    {
      "id": "lwg8h3x",
      "author": "ismail_the_whale",
      "body": "nice, but you really have to clean this up. i guess you're not a python dev.\n\n- use snake_case\n- use a pyproject.toml file",
      "score": 2,
      "created_utc": 1755262448.0,
      "parent_id": "t3_1mqw7zr",
      "replies": []
    }
  ]
}

Advanced Requests Scraper (Best for Bulk)

python3 -m reddit_scraper.cli requests paginated SUBREDDIT_NAME [options]

Utility Commands

System Health & Status

python3 -m reddit_scraper.cli status --config config.json

python3 -m reddit_scraper.cli test-proxies --config config.json --test-urls 3


Global Search

python3 -m reddit_scraper.cli search "python tips" --subreddit python

python3 -m reddit_scraper.cli search "neural networks" --subreddit MachineLearning

Current Reddit Restrictions

Reddit employs several protections against automated scraping:

  • Some subreddits may trigger captcha challenges (r/webscraping, etc.)
  • Large bulk requests may hit rate limits
  • Search endpoints work but may be slower than direct scraping

Recommended approach:

  • Use interactive mode for best success rate
  • Start with popular, stable subreddits like python, technology
  • Use proxies and captcha solving for reliable large-scale scraping
  • Search functionality works well for targeted queries

Working Examples (Tested)

python3 -m reddit_scraper.cli interactive --config config.json

python3 -m reddit_scraper.cli json subreddit python --limit 10
python3 -m reddit_scraper.cli json subreddit technology --config config.json --limit 50

python3 -m reddit_scraper.cli search "python tips" --subreddit python

python3 -m reddit_scraper.cli requests paginated python --max-posts 100

python3 -m reddit_scraper.cli status --config config.json
python3 -m reddit_scraper.cli test-proxies --config config.json

Subreddits that work well:

  • python, programming, technology
  • news, todayilearned
  • entrepreneur, startups

Command Options

Common Options

  • --config, -c - Path to configuration file
  • --output, -o - Output file path
  • --format - Output format (json, csv)
  • --limit - Number of items to fetch
  • --sort - Sort method (hot, new, top, rising, etc.)
  • --delay - Delay between requests (seconds)
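
These options compose; for example, a throttled CSV export of top posts might look like this (illustrative, using only the flags listed above):

python3 -m reddit_scraper.cli json subreddit python --limit 25 --sort top --format csv --output posts.csv --delay 2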

Python API

Basic Usage

from reddit_scraper import JSONScraper, get_config_manager, setup_advanced_features

# Simple mode: no proxies or captcha solving
scraper = JSONScraper()
posts = scraper.scrape_subreddit("python", "hot", 50)

# Advanced mode: wire in proxy rotation and captcha solving from config.json
config_manager = get_config_manager("config.json")
proxy_manager, captcha_solver = setup_advanced_features(config_manager)

advanced_scraper = JSONScraper(
    proxy_manager=proxy_manager,
    captcha_solver=captcha_solver,
    delay=config_manager.get_scraping_config().default_delay
)

posts = advanced_scraper.scrape_subreddit("MachineLearning", "top", 1000)
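
Results can then be exported however you like; a minimal CSV sketch, assuming posts is a list of dicts with the fields shown in the sample output earlier:

import csv

# Hypothetical export helper: keep only the listed fields, ignore the rest.
fields = ["title", "author", "score", "num_comments"]
with open("posts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(posts)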

Proxy Management

from reddit_scraper import ProxyManager

proxy_manager = ProxyManager()
proxy_manager.add_proxy("proxy.example.com", 8080, "user", "pass", "http")  # host, port, username, password, proxy type

proxy_manager.health_check_all()        # probe every proxy before scraping
stats = proxy_manager.get_proxy_stats()
print(f"Healthy proxies: {stats['healthy_proxies']}/{stats['total_proxies']}")

Captcha Solving

from reddit_scraper import CaptchaSolverManager

solver = CaptchaSolverManager("YOUR_CAPSOLVER_API_KEY")

# Verifies the Capsolver account balance, then solves a reCAPTCHA v2 challenge
solution = solver.check_balance_and_solve(
    solver.solver.solve_recaptcha_v2,
    "https://reddit.com",
    "site_key_here"
)

if solution.success:
    print(f"Captcha solved: {solution.solution}")

Best Practices

Ethical Scraping

  • Always respect Reddit's Terms of Service
  • Don't overload Reddit's servers
  • Consider using the official API for commercial use

Rate Limiting

  • Default: 1 second delay between requests (default_delay in config.json)
  • Increase the delay for large-scale operations
  • Monitor proxy health to avoid IP bans; a simple backoff sketch follows this list
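
If you still hit HTTP 429 responses, a standard backoff pattern looks like this (a generic sketch with plain requests, not this project's internal retry logic; the defaults mirror max_retries and default_delay from the example config):

import time
import requests

def get_with_backoff(url, headers, max_retries=3, base_delay=1.0):
    """Retry on HTTP 429, doubling the wait each attempt (1s, 2s, 4s, ...)."""
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code != 429:
            return response
        time.sleep(base_delay * 2 ** attempt)
    return response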

Data Usage

  • Store scraped data responsibly
  • Respect user privacy
  • Don't republish personal information

Troubleshooting

Common Issues

"No healthy proxies available"

reddit-scraper test-proxies

reddit-scraper status

"Captcha solver balance error"

reddit-scraper status

Rate limiting errors

  • Increase --delay parameter
  • Use configuration file with multiple proxies
  • Reduce --limit per request

API Documentation

Capsolver Integration

This project integrates with Capsolver for automated captcha solving, supporting:

  • reCAPTCHA v2/v3
  • hCaptcha
  • FunCaptcha
  • Image-to-text captchas

Reddit API Compatibility

Works with Reddit's public JSON endpoints, so data access is free and requires no API credentials.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

License

This project is for educational and research purposes. Please respect Reddit's Terms of Service and robots.txt.

Blog Post

For a detailed walkthrough of how this Reddit scraper was built, including the challenges faced and solutions implemented, read our comprehensive blog post:

Reddit Scraper: How to Scrape Reddit for Free

The blog post covers:

  • Why Python was chosen for this project
  • How pagination problems were solved
  • Different approaches for small vs large scraping jobs
  • Proxy rotation and error handling strategies
  • Real-world examples and use cases

Support

For issues, questions, or feature requests, please open an issue on GitHub or contact [email protected].


Note: This tool is designed for ethical data collection and research purposes. Always comply with Reddit's Terms of Service and respect rate limits.
