
Python Web Crawler (Basic & Advanced)

The full technical guide can be found here 🕮

This repository contains three Python scripts for web crawling:

  • simpleCrawler.py: A minimal, educational crawler that explores pages breadth-first (BFS).
  • advancedCrawler.py: A robust, multithreaded crawler with user-agent rotation, robots.txt compliance, and advanced filtering.
  • crawlerScrape-do.py: A multithreaded crawler that leverages Scrape.do for anti-bot bypass and JavaScript rendering.

All scripts crawl public web pages, respect robots.txt, and export discovered URLs to CSV.


Requirements

  • Python 3.7+
  • requests and beautifulsoup4 libraries. Install with:

    pip install requests beautifulsoup4

  • For crawlerScrape-do.py: a Scrape.do API token (1,000 free API credits/month)


🔍 How to Use Each Script

simpleCrawler.py

A minimal, educational web crawler using BFS.

  1. Set the seed URL and max pages (default: Wikipedia, 10 pages):
    crawl("https://www.wikipedia.org/", max_pages=10)
  2. Run:
    python simpleCrawler.py

Outputs crawled URLs to crawled_urls.csv.
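
For reference, here is a minimal sketch of the BFS loop this script implements (illustrative only; aside from the crawl(...) call shown above, names and details are assumptions, and robots.txt handling is omitted for brevity):

    # Minimal BFS crawler sketch (illustrative, not the repository's exact code).
    from collections import deque
    from urllib.parse import urljoin, urlparse
    import csv

    import requests
    from bs4 import BeautifulSoup

    def crawl(seed_url, max_pages=10):
        visited = set()
        queue = deque([seed_url])
        while queue and len(visited) < max_pages:
            url = queue.popleft()
            if url in visited:
                continue
            try:
                response = requests.get(url, timeout=10)
            except requests.RequestException:
                continue  # Skip pages that fail to load.
            visited.add(url)
            soup = BeautifulSoup(response.text, "html.parser")
            for link in soup.find_all("a", href=True):
                absolute = urljoin(url, link["href"])  # Resolve relative links.
                if urlparse(absolute).scheme in ("http", "https"):
                    queue.append(absolute)
        # Export every crawled URL to CSV, one per row.
        with open("crawled_urls.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["url"])
            writer.writerows([u] for u in sorted(visited))

    crawl("https://www.wikipedia.org/", max_pages=10)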


advancedCrawler.py

A robust, multithreaded crawler with user-agent rotation, robots.txt compliance, and advanced filtering.

  1. Set the seed URL and max pages (default: Wikipedia, 20 pages):
    threaded_crawl("https://www.wikipedia.org/", max_pages=20)
  2. Run:
    python advancedCrawler.py

Features:

  • Multithreading for speed
  • User-agent rotation
  • Logging to crawler.log
  • Skips login, admin, cart, and similar pages
  • Respects robots.txt
  • Saves HTML to pages/ and URLs to crawled_urls.csv
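
Below is a condensed sketch of the two features most specific to this script, user-agent rotation and robots.txt checks (illustrative; USER_AGENTS, allowed_by_robots, and fetch are assumed names, not the script's actual identifiers):

    # Sketch of user-agent rotation plus robots.txt compliance
    # (illustrative; USER_AGENTS and fetch are assumed names).
    import random
    from urllib import robotparser
    from urllib.parse import urlparse

    import requests

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]

    _robots_cache = {}

    def allowed_by_robots(url, user_agent="*"):
        # Fetch and cache one robots.txt parser per host.
        host = urlparse(url).netloc
        parser = _robots_cache.get(host)
        if parser is None:
            parser = robotparser.RobotFileParser()
            parser.set_url(f"https://{host}/robots.txt")
            try:
                parser.read()
            except OSError:
                pass  # Unreachable robots.txt: can_fetch() then errs toward disallowing.
            _robots_cache[host] = parser
        return parser.can_fetch(user_agent, url)

    def fetch(url):
        if not allowed_by_robots(url):
            return None
        # Rotate the User-Agent header on every request.
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        return requests.get(url, headers=headers, timeout=10)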

crawlerScrape-do.py

A multithreaded crawler that uses Scrape.do to bypass anti-bot protections and optionally render JavaScript.

  1. Set your Scrape.do API token and seed URL:
    crawl_with_scrape_do(
        seed_url="https://en.wikipedia.org/",
        token="<your-scrape-do-token>",
        max_pages=10,
        delay=2.5,
        render=False
    )
  2. Run:
    python crawlerScrape-do.py

Features:

  • Uses Scrape.do for requests (handles proxies, CAPTCHAs, JS rendering)
  • Multithreaded crawling
  • Respects robots.txt
  • Saves HTML to pages_scrape_do/ and URLs to crawled_urls_scrape_do.csv
  • Logging to scrape_do_crawler.log
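
At its core, the script routes every fetch through the Scrape.do API. A minimal sketch of that call is shown below, assuming the token, url, and render query parameters from Scrape.do's documentation (verify against the current API reference):

    # Sketch of fetching one page through Scrape.do
    # (parameter names are assumptions based on Scrape.do's docs).
    import requests

    API_BASE = "https://api.scrape.do/"

    def fetch_via_scrape_do(url, token, render=False, timeout=60):
        params = {"token": token, "url": url}
        if render:
            params["render"] = "true"  # Ask Scrape.do to execute JavaScript first.
        response = requests.get(API_BASE, params=params, timeout=timeout)
        response.raise_for_status()
        return response.text  # HTML of the target page, fetched via Scrape.do.

Passing the target URL through params lets requests handle the URL-encoding automatically.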

⚠️ Legal & Ethical Notes

Please ensure that:

  • You crawl only public web pages
  • You do not send excessive automated requests or violate a website's Terms of Service
  • You use Scrape.do responsibly and ethically
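
One simple way to keep request volume polite is a fixed per-request delay, mirroring the delay parameter shown above (a minimal sketch; polite_get is a hypothetical helper):

    import time

    import requests

    def polite_get(url, delay=2.5):
        # Sleep before each request so a site sees at most one hit
        # every `delay` seconds from this caller.
        time.sleep(delay)
        return requests.get(url, timeout=10)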

🚀 Why Use Scrape.do with crawlerScrape-do.py?

  • Rotating premium proxies & geo-targeting
  • Built-in header spoofing
  • Handles redirects, CAPTCHAs, and JavaScript rendering
  • 1,000 free credits/month

👉 Get your free API token here
