
Distributed Hugging Face Downloader

Overview

Proxies impose bandwidth limits on a single machine, so even with multi-threading it is difficult to exceed the per-machine maximum download speed (1 MB/s). To overcome this constraint, we need a distributed downloader.

The system works as follows: a master node retrieves a specified Hugging Face dataset name, fetches all files associated with that dataset, and publishes them to a Redis message queue. Each machine acts as a worker, consuming tasks from Redis. Based on the file names specified in each task, workers use the Hugging Face Python SDK to download the corresponding files.

In theory, if the number of machines matches the number of files, the time to download the entire repository is bounded only by the time needed to fetch its largest single file.

Features

  • Distributed Architecture: Scale downloads across multiple machines
  • Redis Message Queue: Reliable task distribution and coordination
  • Repository Management: Hierarchical task organization (repository → job → tasks)
  • Batch Job Creation: Create multiple download jobs in parallel for different repositories
  • File Integrity Verification: Mandatory verification with automatic re-download of corrupted files
  • 3-Retry System: Automatic retry with proper failure handling for all failed downloads
  • Configuration Management: Flexible config files, environment variables, and CLI options
  • NAS Aggregation: Automatically copy files to centralized NAS storage
  • SSL Bypass: Optional SSL certificate verification bypass for corporate environments
  • Progress Monitoring: Real-time status updates and progress tracking by repository
  • Graceful Shutdown: Workers handle interruptions gracefully and can be stopped immediately
  • Fault Tolerance: System continues working even if individual workers fail
  • Dual Download Methods: Uses both Hugging Face Hub API and direct HTTP downloads

Architecture

Components

  1. Master Node (MasterNode)

    • Discovers all files in a Hugging Face dataset
    • Creates download tasks and publishes them to Redis queue
    • Monitors overall job progress
    • Provides status reporting
  2. Worker Nodes (WorkerNode)

    • Consume download tasks from Redis queue
    • Download files using Hugging Face Hub or direct HTTP
    • Report progress and handle failures
    • Send periodic heartbeats to maintain alive status
    • Copy files to NAS storage (if configured)
  3. Redis Message Queue (RedisClient)

    • Task queue management with username/password authentication support
    • Worker coordination and heartbeat tracking
    • Job status and progress tracking
    • Failed task retry management
  4. Configuration System (ConfigManager)

    • File-based configuration with automatic discovery
    • Environment variable overrides
    • CLI argument integration
    • Validation and error handling
  5. NAS Aggregator (NASAggregator)

    • Automatic file copying to centralized storage
    • Configurable directory structure preservation
    • Asynchronous copying for performance
    • Smart deduplication and error handling
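
To make the flow concrete, here is a minimal sketch of the queue pattern using redis-py; the queue name download_tasks and the task fields are illustrative, not the project's actual schema.

import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)

# Master side: publish one task per file in the repository.
def publish_tasks(repo_id: str, filenames: list[str]) -> None:
    for name in filenames:
        task = {"repo_id": repo_id, "filename": name, "status": "pending"}
        r.lpush("download_tasks", json.dumps(task))  # enqueue at the head

# Worker side: block until a task is available, then decode it.
def consume_one() -> dict:
    _, raw = r.brpop("download_tasks")  # blocking pop from the tail
    return json.loads(raw)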

Data Models

  • DownloadTask: Individual file download task
  • WorkerStatus: Worker health and activity tracking
  • JobStatus: Overall download job progress
  • TaskStatus: Enumeration of task states (pending, in_progress, completed, failed, retrying)
  • AppConfig: Application configuration settings
  • RedisConfig: Redis connection settings
  • HuggingFaceConfig: Hugging Face authentication and caching
  • NASConfig: NAS aggregation settings
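
Since models.py is built on Pydantic, the task model plausibly resembles the following sketch; the actual field names may differ.

from enum import Enum
from pydantic import BaseModel

class TaskStatus(str, Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"
    RETRYING = "retrying"

class DownloadTask(BaseModel):
    task_id: str
    repo_id: str
    filename: str
    expected_size: int | None = None  # used by integrity verification
    retries: int = 0                  # incremented by the 3-retry system
    status: TaskStatus = TaskStatus.PENDING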

Installation

Prerequisites

  • Python 3.10+
  • Redis server (supports Redis 6.0+ ACL authentication)
  • UV package manager

Setup

# Clone the repository
git clone <repository-url>
cd distributed_downloader

# Install dependencies with UV
uv sync

# Or with native TLS support
uv --native-tls sync

# Or install in development mode
uv pip install -e .

# For high-speed downloads (optional but recommended)
uv sync --extra hf_transfer

Redis Setup

# Install and start Redis (Ubuntu/Debian)
sudo apt-get install redis-server
sudo systemctl start redis-server

# Or using Docker
docker run -d -p 6379:6379 redis:latest

# For Redis 6.0+ with ACL authentication
redis-cli
> ACL SETUSER myuser on >mypassword ~* +@all

Configuration

Configuration Files

The system supports multiple configuration methods with the following priority order:

  1. CLI arguments (highest priority)
  2. Environment variables
  3. Configuration files
  4. Default values (lowest priority)
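
As a minimal sketch of this precedence (not the actual ConfigManager), resolution for a single setting might look like:

import configparser
import os

def resolve_redis_host(cli_value: str | None = None,
                       config_path: str = "config.ini") -> str:
    if cli_value is not None:                      # 1. CLI argument
        return cli_value
    if "REDIS_HOST" in os.environ:                 # 2. environment variable
        return os.environ["REDIS_HOST"]
    parser = configparser.ConfigParser()
    if parser.read(config_path) and parser.has_option("redis", "host"):
        return parser.get("redis", "host")         # 3. configuration file
    return "localhost"                             # 4. default value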

Create Configuration File

# Generate a sample configuration file
hf-downloader init-config

# Or write the sample config to a custom path
hf-downloader init-config --output my-config.ini

Configuration File Locations

The system automatically searches for config files in this order:

  1. config.ini (current directory)
  2. distributed_downloader.ini
  3. ~/.config/distributed_downloader/config.ini
  4. ~/.distributed_downloader.ini
  5. /etc/distributed_downloader/config.ini

Sample Configuration (config.ini)

[redis]
host = localhost
port = 6379
password = your_redis_password
# username applies only to Redis 6.0+ ACL authentication
username = your_redis_username
db = 0

[huggingface]
token = your_huggingface_token
cache_dir = /path/to/cache/dir
disable_ssl_verify = false
enable_hf_transfer = true
endpoint = http://hf-mirror.com

[nas]
enabled = true
path = /mnt/nas/huggingface-datasets
copy_after_download = true
preserve_structure = true

[app]
log_level = INFO
output_dir = /path/to/downloads

Environment Variables

# Redis settings
export REDIS_HOST=redis.example.com
export REDIS_PASSWORD=mypassword
export REDIS_USERNAME=myuser

# Hugging Face settings  
export HF_TOKEN=your_token_here
export HF_DISABLE_SSL_VERIFY=true
export HF_HUB_ENABLE_HF_TRANSFER=1
export HF_ENDPOINT=http://hf-mirror.com

# NAS settings
export NAS_ENABLED=true
export NAS_PATH=/mnt/nas/datasets

# Application settings
export LOG_LEVEL=DEBUG
export OUTPUT_DIR=/downloads

Usage

Command Line Interface

The system provides a CLI tool hf-downloader with the following commands:

Start a Download Job (Master)

# Create a download job for a single dataset
hf-downloader master "microsoft/DialoGPT-medium"

# Using configuration file
hf-downloader --config my-config.ini master "nvidia/Llama-Nemotron-VLM-Dataset-v1"

# Specify output directory
hf-downloader master "microsoft/DialoGPT-medium" --output-dir ./my-downloads

# With Redis authentication
hf-downloader --redis-password mypass --redis-username myuser master "dataset-name"

Batch Job Creation (Multiple Repositories)

# Create multiple download jobs in parallel
hf-downloader batch-master "microsoft/DialoGPT-medium,nvidia/Llama-Nemotron-VLM-Dataset-v1,openai/whisper-large-v3"

# From a file (one repository per line)
echo "microsoft/DialoGPT-medium" > datasets.txt
echo "nvidia/Llama-Nemotron-VLM-Dataset-v1" >> datasets.txt
hf-downloader batch-master datasets.txt --from-file

# With custom output directory
hf-downloader batch-master "repo1,repo2,repo3" --output-dir ./batch-downloads

Start Worker Nodes

# Start a worker (run on each machine)
hf-downloader worker

# Start worker with custom configuration
hf-downloader --config /path/to/config.ini worker --worker-id "worker-01"

# With SSL bypass for corporate environments
hf-downloader --disable-ssl-verify worker

# With custom Redis settings
hf-downloader --redis-host 192.168.1.100 --redis-password mypass worker

Monitor Progress

# Show overall system status
hf-downloader status

# Monitor specific job by job ID
hf-downloader status <job-id>

# Monitor repository by name
hf-downloader status "microsoft/DialoGPT-medium"

# Watch status with live updates
hf-downloader status "nvidia/Llama-Nemotron-VLM-Dataset-v1" --watch

# List all repositories and their progress
hf-downloader repos

# Check queue statistics
hf-downloader queue

# List active workers
hf-downloader workers

Advanced Features

NAS Aggregation

The NAS aggregation feature automatically copies downloaded files to a centralized storage location:

[nas]
enabled = true
path = /mnt/nas/huggingface-datasets
# Copy files after a successful download
copy_after_download = true
# Keep the original directory structure
preserve_structure = true

Directory Structure Options:

  • preserve_structure = true: /nas/dataset_name/path/to/file.txt
  • preserve_structure = false: /nas/dataset_name_file.txt
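
The two modes amount to the following path mapping, sketched here with illustrative names:

from pathlib import Path

def nas_destination(nas_root: str, dataset: str, rel_path: str,
                    preserve: bool) -> Path:
    if preserve:
        # /nas/dataset_name/path/to/file.txt
        return Path(nas_root) / dataset / rel_path
    # /nas/dataset_name_file.txt (flattened)
    return Path(nas_root) / f"{dataset}_{Path(rel_path).name}"

# nas_destination("/nas", "dataset_name", "path/to/file.txt", True)
#   -> /nas/dataset_name/path/to/file.txt
# nas_destination("/nas", "dataset_name", "path/to/file.txt", False)
#   -> /nas/dataset_name_file.txt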

Features:

  • Asynchronous copying (doesn't slow down downloads)
  • Smart deduplication (skips existing files with matching size)
  • Automatic directory creation
  • Error tolerance (NAS failures don't break downloads)

SSL Certificate Bypass

For corporate environments with proxy servers or custom certificates:

[huggingface]
disable_ssl_verify = true

Or via environment variable:

export HF_DISABLE_SSL_VERIFY=true

This disables SSL certificate verification for:

  • Hugging Face Hub API calls
  • Direct HTTP file downloads

It also suppresses urllib3 SSL warnings.
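
For reference, this is the kind of setup the option implies, sketched with requests and urllib3 (the project's actual ssl_config.py may differ):

import requests
import urllib3

# Silence the InsecureRequestWarning that unverified requests emit.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# verify=False skips certificate validation entirely; use only behind
# trusted corporate proxies.
resp = requests.get("https://example.com/some/file", verify=False, timeout=30)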

High-Speed Downloads with hf_transfer

For high-bandwidth environments, enable hf_transfer for significantly faster downloads:

[huggingface]
enable_hf_transfer = true

Or via environment variable:

export HF_HUB_ENABLE_HF_TRANSFER=1

Benefits:

  • Rust-based library for maximum performance
  • Multi-threaded transfers with optimal bandwidth utilization
  • Automatic fallback to standard downloads if unavailable
  • No configuration changes required - works transparently

Requirements:

  • High-bandwidth network connection (>100 Mbps recommended)
  • The hf_transfer library is automatically installed with huggingface-hub[hf_transfer]
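
A minimal Python sketch of the same thing outside the CLI; huggingface_hub reads the variable at import time, so it must be set before the import:

import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import hf_hub_download

# Downloads now use the Rust-based transfer path when hf_transfer is installed.
path = hf_hub_download(repo_id="microsoft/DialoGPT-medium", filename="config.json")
print(path)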

Mirror Sites and Custom Endpoints

For users in regions with limited access to Hugging Face or those wanting to use mirror sites:

[huggingface]
endpoint = http://hf-mirror.com

Or via environment variable:

export HF_ENDPOINT=http://hf-mirror.com

Popular Mirror Sites:

  • China: http://hf-mirror.com
  • Custom Enterprise: https://your-company-hf-mirror.com
  • Default: https://huggingface.co (if not specified)

Features:

  • Automatic URL rewriting: All download URLs use the specified endpoint
  • API compatibility: Full compatibility with Hugging Face Hub API
  • Transparent operation: Works with both hf_hub_download and direct HTTP downloads
  • SSL support: Works with both HTTP and HTTPS endpoints
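
The same redirection is available from Python; a minimal sketch using a per-client endpoint:

from huggingface_hub import HfApi

# All API calls from this client go through the mirror instead of huggingface.co.
api = HfApi(endpoint="http://hf-mirror.com")
files = api.list_repo_files("microsoft/DialoGPT-medium")
print(files[:5])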

Redis Authentication

Supports both traditional and modern Redis authentication:

Traditional password-only (Redis < 6.0):

[redis]
password = your_password

Username + password (Redis 6.0+ ACL):

[redis]
username = your_username
password = your_password
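
For reference, the matching redis-py connections look roughly like this (credentials are placeholders):

import redis

# Traditional password-only (Redis < 6.0)
legacy = redis.Redis(host="localhost", port=6379, password="your_password")

# Username + password (Redis 6.0+ ACL)
acl = redis.Redis(host="localhost", port=6379,
                  username="your_username", password="your_password")
acl.ping()  # raises an AuthenticationError if the credentials are wrong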

Example Workflows

Basic Setup

  1. Create configuration file

    hf-downloader init-config
    # Edit config.ini with your settings
  2. Start Redis server

    redis-server
  3. Create download jobs

    # Single repository
    hf-downloader master "microsoft/DialoGPT-medium" -o ./downloads
    
    # Multiple repositories in parallel
    hf-downloader batch-master "microsoft/DialoGPT-medium,nvidia/Llama-Nemotron-VLM-Dataset-v1,openai/whisper-large-v3"
  4. Start workers on multiple machines

    # Machine 1
    hf-downloader worker --worker-id worker-01
    
    # Machine 2
    hf-downloader worker --worker-id worker-02
    
    # Machine N
    hf-downloader worker --worker-id worker-N
  5. Monitor progress

    # List all repositories
    hf-downloader repos
    
    # Monitor specific repository
    hf-downloader status "microsoft/DialoGPT-medium" --watch
    
    # Check overall system status
    hf-downloader status

Corporate Environment Setup

# 1. Create config with SSL bypass and authentication
cat > config.ini << EOF
[redis]
host = redis.corporate.com
username = downloader_user
password = secure_password

[huggingface]
token = hf_your_token_here
disable_ssl_verify = true
enable_hf_transfer = true
endpoint = http://hf-mirror.com

[nas]
enabled = true
path = /corporate/nas/ml-datasets
EOF

# 2. Start batch jobs for multiple datasets
hf-downloader --config config.ini batch-master "dataset1,dataset2,dataset3"

# 3. Start workers across the corporate network
hf-downloader --config config.ini worker

China Mirror Setup

# For users in China using hf-mirror.com
cat > config.ini << EOF
[huggingface]
token = your_hf_token
endpoint = http://hf-mirror.com
enable_hf_transfer = true
disable_ssl_verify = false
EOF

# Start downloading with mirror
hf-downloader --config config.ini batch-master "microsoft/DialoGPT-medium,nvidia/Llama-Nemotron-VLM-Dataset-v1"
hf-downloader --config config.ini worker

High-Performance Setup with NAS

# Configure for maximum throughput with centralized storage
cat > config.ini << EOF
[huggingface]
enable_hf_transfer = true
endpoint = http://hf-mirror.com

[nas]
enabled = true
path = /high-speed-nas/datasets
copy_after_download = true
preserve_structure = true
delete_after_copy = true

[app]
output_dir = /local-ssd/temp-downloads
EOF

# Workers download to local SSD with hf_transfer, then move to NAS
hf-downloader --config config.ini worker

Repository Management & File Integrity

Repository-Based Organization

The system uses a hierarchical approach for better organization:

  • Repository Level: Each HuggingFace repository gets its own job
  • Job Level: Contains all tasks for downloading a specific repository
  • Task Level: Individual file download tasks within a job

File Integrity Verification

Mandatory Verification System:

  • Size Verification: Files must match expected size exactly
  • Corruption Detection: Identifies HTML error pages, empty files, and unreadable content
  • Automatic Re-download: Corrupted or incomplete files are automatically deleted and re-downloaded
  • Read Test: Ensures downloaded files are actually readable

3-Retry System

  • Mandatory Retries: All failed downloads are automatically retried up to 3 times
  • Comprehensive Failure Handling: Covers network errors, corruption, size mismatches, and timeouts
  • Smart Retry Logic: each retry starts a fresh download (see the sketch after this list)
  • Permanent Failure Tracking: Tasks that fail all 3 retries are marked as permanently failed
  • Worker Recovery: Workers automatically reconnect and resume after network issues
  • Immediate Shutdown: Ctrl+C stops workers immediately, cleaning up partial downloads
  • Task Requeuing: Interrupted tasks are requeued for other workers
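
A hedged sketch of the verify-and-retry loop these rules describe; the helper names (verify_file, MAX_RETRIES) are illustrative, not the project's actual API:

import os
from huggingface_hub import hf_hub_download

MAX_RETRIES = 3

def verify_file(path: str, expected_size: int) -> bool:
    # Size check plus a read test; an HTML error page or empty file fails here.
    if os.path.getsize(path) != expected_size:
        return False
    with open(path, "rb") as f:
        f.read(1024)  # ensure the file is actually readable
    return True

def download_with_retries(repo_id: str, filename: str, expected_size: int) -> str:
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            path = hf_hub_download(repo_id=repo_id, filename=filename,
                                   repo_type="dataset",
                                   force_download=attempt > 1)
            if verify_file(path, expected_size):
                return path
            os.remove(path)  # delete the corrupted file before retrying
        except Exception:
            if attempt == MAX_RETRIES:
                raise  # permanently failed after the final retry
    raise RuntimeError(f"{filename}: failed after {MAX_RETRIES} attempts")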

Performance Considerations

  • Scaling: Add more workers to increase download throughput
  • File Size: Large files may bottleneck individual workers
  • Network: Total throughput limited by aggregate network bandwidth
  • Redis: Single Redis instance can handle hundreds of workers
  • Storage: Ensure sufficient disk space on worker machines
  • NAS Performance: Asynchronous copying minimizes download impact

Development

Project Structure

distributed_downloader/
├── __init__.py          # Package initialization
├── models.py            # Data models (Pydantic)
├── config.py            # Configuration management
├── redis_client.py      # Redis operations
├── master.py            # Master node implementation
├── worker.py            # Worker node implementation
├── ssl_config.py        # SSL bypass configuration
├── nas_aggregator.py    # NAS file aggregation
└── cli.py               # Command-line interface

Troubleshooting

Common Issues

  1. Redis Connection Failed

    • Check Redis server is running: redis-cli ping
    • Verify host/port configuration in config file
    • Test authentication: redis-cli -u redis://user:pass@host:port
    • Check firewall rules
  2. SSL Certificate Errors

    • Set disable_ssl_verify = true in config
    • Or use environment variable: export HF_DISABLE_SSL_VERIFY=true
    • Check corporate proxy settings
  3. Hugging Face Authentication

    • Set token in config: token = your_hf_token
    • Or login via CLI: huggingface-cli login
    • Verify token has dataset access permissions
  4. Worker Not Responding to Ctrl+C

    • Current worker versions respond to interruption immediately
    • For old processes: pkill -f "hf-downloader"
    • Force kill: pkill -9 -f "hf-downloader"
  5. NAS Copy Failures

    • Check NAS path exists and is writable
    • Verify network connectivity to NAS
    • Check disk space on NAS
    • Review worker logs for detailed errors
  6. Configuration Not Loading

    • Inspect the file contents: cat config.ini
    • Verify file location in search paths
    • Use --config to specify exact path
    • Check file permissions

Logs

Workers and masters log detailed information. Increase log level for debugging:

# In config file
[app]
log_level = DEBUG

# Or via CLI
hf-downloader --log-level DEBUG worker

# Or via environment
export LOG_LEVEL=DEBUG

Process Management

# Find running processes
ps aux | grep hf-downloader

# Kill specific process
kill <PID>

# Kill all downloader processes
pkill -f "hf-downloader"

# Force kill all
pkill -9 -f "hf-downloader"

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
