Full technical guide can be found here 🕮
This repository contains three Python scripts for web crawling:
- simpleCrawler.py: A minimal, educational web crawler for BFS crawling.
- advancedCrawler.py: A robust, multithreaded crawler with user-agent rotation, robots.txt compliance, and advanced filtering.
- crawlerScrape-do.py: A multithreaded crawler that leverages Scrape.do for anti-bot bypass and JavaScript rendering.
All scripts crawl public web pages, respect robots.txt, and export discovered URLs to CSV.
- Python 3.7+
- requests and beautifulsoup4 libraries. Install with: pip install requests beautifulsoup4
- For crawlerScrape-do.py: a Scrape.do API token (free 1000 API credits/month)
A minimal, educational web crawler using BFS.
- Set the seed URL and max pages (default: Wikipedia, 10 pages):
crawl("https://www.wikipedia.org/", max_pages=10)
- Run:
python simpleCrawler.py
Outputs crawled URLs to crawled_urls.csv.
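To illustrate the BFS idea, here is a minimal sketch. It is not the code in simpleCrawler.py: the names bfs_crawl, get_links, and FAKE_SITE are hypothetical, and the link lookup is passed in as a function so the example runs without network access. In a real crawler, get_links would typically fetch the page with requests and extract href values with BeautifulSoup.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin


def bfs_crawl(seed_url, get_links, max_pages=10):
    """Visit pages breadth-first from seed_url and return them in visit order.

    get_links is a hypothetical callable that takes a URL and returns the
    href values found on that page.
    """
    queue = deque([seed_url])   # FIFO queue gives breadth-first order
    seen = {seed_url}           # URLs already queued, to avoid repeats
    visited = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            # Resolve relative links and drop "#fragment" parts
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Illustrative in-memory "site" standing in for real HTTP requests
FAKE_SITE = {
    "https://example.com/": ["/a", "/b"],
    "https://example.com/a": ["/c", "/b#top"],
    "https://example.com/b": ["/c"],
    "https://example.com/c": [],
}

order = bfs_crawl("https://example.com/", lambda u: FAKE_SITE.get(u, []), max_pages=3)
print(order)
```

Because the queue is first-in, first-out, every page linked from the seed is visited before any page two clicks away, and max_pages caps the total number of pages visited.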
A robust, multithreaded crawler with user-agent rotation, robots.txt compliance, and advanced filtering.
- Set the seed URL and max pages (default: Wikipedia, 20 pages):
threaded_crawl("https://www.wikipedia.org/", max_pages=20)
- Run:
python advancedCrawler.py
Features:
- Multithreading for speed
- User-agent rotation
- Logging to crawler.log
- Skips login/admin/cart/etc. pages
- Respects robots.txt
- Saves HTML to pages/ and URLs to crawled_urls.csv
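Three of the features above (user-agent rotation, skipping login/admin/cart pages, and robots.txt compliance) can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not the logic in advancedCrawler.py: the user-agent strings, the keyword list, and the helper names are hypothetical, and the robots.txt rules are parsed from an inline string so the example runs offline.

```python
import random
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Illustrative values only; the actual script may use different ones
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleCrawler/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleCrawler/1.0",
]
SKIP_KEYWORDS = ("login", "admin", "cart")


def pick_user_agent():
    """Return a randomly chosen User-Agent string for the next request."""
    return random.choice(USER_AGENTS)


def is_skippable(url):
    """True if the URL path contains one of the skip keywords."""
    path = urlparse(url).path.lower()
    return any(keyword in path for keyword in SKIP_KEYWORDS)


def build_robots(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def should_crawl(url, robots, user_agent="*"):
    """Combine the keyword filter with the robots.txt check."""
    return not is_skippable(url) and robots.can_fetch(user_agent, url)


robots = build_robots("User-agent: *\nDisallow: /private/\n")
for candidate in (
    "https://example.com/articles/1",
    "https://example.com/private/data",
    "https://example.com/account/login",
):
    print(candidate, should_crawl(candidate, robots))
```

Against a live site, RobotFileParser can also download the rules itself through set_url() followed by read().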
A multithreaded crawler that uses Scrape.do to bypass anti-bot protections and optionally render JavaScript.
- Set your Scrape.do API token and seed URL:
crawl_with_scrape_do(seed_url="https://en.wikipedia.org/", token="<your-scrape-do-token>", max_pages=10, delay=2.5, render=False)
- Run:
python crawlerScrape-do.py
Features:
- Uses Scrape.do for requests (handles proxies, CAPTCHAs, JS rendering)
- Multithreaded crawling
- Respects robots.txt
- Saves HTML to pages_scrape_do/ and URLs to crawled_urls_scrape_do.csv
- Logging to scrape_do_crawler.log
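As a rough sketch of how requests might be routed through Scrape.do and fetched concurrently, the example below builds the proxied request URL and fans work out over a thread pool. The endpoint https://api.scrape.do/ and the token, url, and render query parameters are assumptions based on Scrape.do's public documentation; check them against the current docs before relying on them. The helper names build_scrape_do_url and fetch_all are hypothetical, and the demo uses a stand-in fetch function instead of real HTTP calls.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode

# Assumed endpoint; verify against the Scrape.do documentation
SCRAPE_DO_ENDPOINT = "https://api.scrape.do/"


def build_scrape_do_url(target_url, token, render=False):
    """Build the URL that asks Scrape.do to fetch target_url on our behalf."""
    params = {"token": token, "url": target_url}  # assumed parameter names
    if render:
        params["render"] = "true"  # assumed flag for JavaScript rendering
    # urlencode percent-encodes the target URL so it survives as one parameter
    return SCRAPE_DO_ENDPOINT + "?" + urlencode(params)


def fetch_all(urls, fetch, workers=4):
    """Run fetch(url) for each URL on a thread pool; return {url: result}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result
        return dict(zip(urls, pool.map(fetch, urls)))


api_url = build_scrape_do_url("https://en.wikipedia.org/", token="TOKEN")
print(api_url)

# Stand-in for a real fetch such as:
#   requests.get(build_scrape_do_url(u, token), timeout=30).text
results = fetch_all(["https://example.com/a", "https://example.com/b"], fetch=len)
print(results)
```

Passing the fetch function in as a parameter keeps the threading logic separate from the HTTP details, so the same helper works with or without Scrape.do.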
Please ensure:
- You crawl only public web pages
- You do not automate excessive requests or violate website Terms of Service
- Use Scrape.do responsibly and ethically
Scrape.do provides:
- Rotating premium proxies & geo-targeting
- Built-in header spoofing
- Handles redirects, CAPTCHAs, and JavaScript rendering
- 1000 free credits/month