Are you concerned that your online content is getting scraped up, without your permission, by search engines or AI training models? Are you torn between sharing openly and feeling like others are freely taking advantage of your work to enrich themselves? RoboNope is here to help.
Web crawling goes back to the early days of the web. In the spirit of cooperation, search engines were supposed to abide by the wishes of a website's owner by looking for, and honoring, the contents of a robots.txt file (if present).
However, compliance was made voluntary, and there have been many reports of crawlers ignoring the wishes of content owners. The voluntary approach hasn't worked out.
Nginx is the most popular web server out there, used by over 33% of servers on the net (see below). Nginx also supports add-on modules to extend its functionality.
RoboNope-nginx is an extension module for Nginx. Its main function is to enforce the access rules specified by content creators. This is commonly done by adding a URL pattern to the disallowed list in a server's robots.txt file.
This module can also serve as a honeypot, randomly serving generated content to bots that ignore robots.txt rules.
Ignore at your peril!
Web crawler technology makes it easy to snag all of a site's content: start at the home page, visit every link on it, then recursively visit every link on those pages, and so on.
Common web crawling tools also make it easy to bypass the content publisher's access-control wishes.
For example, all it takes for Mechanize -- a popular web-scraping library -- to bypass a robots.txt file is:
from mechanize import Browser
br = Browser()
br.set_handle_robots(False) # <- Ignore robots.txt
And in Scrapy, another Python-based crawling tool, it's just a matter of setting ROBOTSTXT_OBEY to False.
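In a Scrapy project, for instance, that's a single line in the project's settings.py file:
# settings.py (Scrapy project settings)
ROBOTSTXT_OBEY = False  # <- Ignore robots.txt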
There are a number of steps content providers can take to limit access to their content:
- Define a robots.txt file.
- Add meta tags to each page: <meta name="robots" content="noindex,nofollow">.
- Add nofollow attributes to links: <a href="http://destination.com/" rel="nofollow">link text</a>.
- Make every link go through a JavaScript filter that checks access.
- Set up password access through .htaccess (see the note after this list).
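A note on the last item: .htaccess files are Apache's mechanism, and nginx does not read them. Under nginx, the rough equivalent is HTTP basic authentication, sketched below with placeholder paths:
location /private/ {
    auth_basic "Restricted";
    auth_basic_user_file /path/to/.htpasswd;
}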
The first three are voluntary and can be ignored. The last two are a pain to set up, maintain, and keep synced with a robots.txt file.
Why not just enforce robots.txt and make it mandatory instead of optional? This is what RoboNope does.
Let's assume your robots.txt file looks like this:
User-agent: *
Allow: /
Disallow: /norobots/
Disallow: /private/
Disallow: /admin/
Disallow: /secret-data/
Disallow: /internal/
User-agent: BadBot
Disallow: /
User-agent: Googlebot
Disallow: /nogoogle/
Disallow: /private/google/
Disallow: /*.pdf$
The first section specifies that all content is allowed, except for paths matching the patterns in the Disallow list.
The next section bans any crawler that identifies itself as BadBot.
In the final section, Google's official Googlebot crawler is told to stay out of specific paths and to skip any URL ending in .pdf.
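As a quick sanity check of how a well-behaved client reads rules like these, here is a minimal sketch using Python's standard urllib.robotparser (which only matches plain path prefixes, so the wildcard /*.pdf$ rule is left out):
from urllib import robotparser

# A trimmed-down version of the rules above (prefix rules only).
rules = """\
User-agent: *
Disallow: /norobots/
Disallow: /private/
Disallow: /admin/

User-agent: BadBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("SomeCrawler", "/index.html"))          # True  - not disallowed
print(rp.can_fetch("SomeCrawler", "/private/index.html"))  # False - /private/ is off-limits
print(rp.can_fetch("BadBot", "/index.html"))                # False - BadBot is banned outright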
Obviously, a misbehaving bot can ignore any and all of these directives, or present itself as a benign crawler by faking its User-agent string.
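For example, assuming the third-party requests library is installed, masquerading as an ordinary browser is a one-liner:
import requests  # third-party: pip install requests

# Claim to be a regular desktop browser instead of a bot.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
response = requests.get("https://example.com/private/index.html", headers=headers)
print(response.status_code)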
With RoboNope-nginx, if someone tries to access any page that matches one of the Disallow rules, they get:
% curl https://{url}/private/index.html
<html>
<head>
<style>
.RRsNdyetNRjW { opacity: 0; position: absolute; top: -9999px; }
</style>
</head>
<body>
<div class="content">
the platform seamlessly monitors integrated data requests. therefore the service manages network traffic. the website intelligently manages robust service endpoints. the system seamlessly analyzes optimized service endpoints. the platform dynamically handles secure network traffic. while our network validates cache entries.
</div>
<a href="https://github.com/norobots/index.html" class="RRsNdyetNRjW">Important Information</a>
</body>
</html>
To a human visitor, this renders as a plain block of text.
The text is randomly generated gibberish, to help those who wish to train their models. To a misbehaving crawler, it also offers a second, tantalizing link to follow. The link is made invisible to humans and its CSS class name is randomly generated:
<a href="/norobots/index.html" class="RRsNdyetNRjW">Important Information</a>
Following that (also banned) link, a crawler may receive a different file:
<html>
<head>
<style>
.RRsNdyetNRjW { opacity: 0; position: absolute; top: -9999px; }
</style>
</head>
<body>
<div class="content">
the platform seamlessly monitors integrated data requests. therefore the service manages network traffic. the website intelligently manages robust service endpoints. the system seamlessly analyzes optimized service endpoints. the platform dynamically handles secure network traffic. while our network validates cache entries.
</div>
<a href="https://github.com/admin/secrets.html" class="RRsNdyetNRjW">Important Information</a>
</body>
</html>
This again contains randomly generated gibberish, plus a link to a different page (randomly selected from whatever has been explicitly disallowed inside robots.txt):
<a href="/admin/secrets.html" class="RRsNdyetNRjW">Important Information</a>
And so on and so forth...
You can, of course, start the chain by explicitly including a link to a banned page on your home page, using similar techniques to hide it from human visitors. Crawlers that recursively follow every link on the home page (and ignore the Disallow rules) will inevitably fall into the honeypot trap and get stuck there.
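For example, a hand-placed seed link on the home page can be hidden the same way the module hides its generated links (the path here is a placeholder; use something under one of your Disallow entries):
<a href="https://github.com/norobots/start.html" style="opacity: 0; position: absolute; top: -9999px;">Important Information</a>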
The downside to this endless cat-and-mouse game is that your web server may get hammered by a misbehaving crawler chasing an endless series of links. As satisfying as this might be, you are paying for all that processing time and traffic.
An alternative is to configure the module to direct crawlers to a single educational resource, instead of an endless loop, by setting the robonope_instructions_url directive in your nginx.conf file. For example, the following links to Google's page introducing developers to robots.txt and good crawling etiquette:
robonope_instructions_url "https://developers.google.com/search/docs/crawling-indexing/robots/intro";
The generated hidden link for the page returned will be:
<a href="https://developers.google.com/search/docs/crawling-indexing/robots/intro" class="wgUxnAjBuYDQ">Important Information</a>
The content will still be randomly generated text, but the link will send the crawler off to learn how to behave properly.
The module can maintain a log of misbehaving requests in a local database (SQLite by default, with DuckDB support a work in progress).
To enable logging, simply set the robonope_db_path directive in your configuration:
robonope_db_path /path/to/robonope.db;
When the database path is not set, logging is disabled.
You can run the sqlite3 CLI to see what it stores:
sqlite3 demo/robonope.db .tables
1|2025-03-16 18:10:55|127.0.0.1|curl/8.6.0|/private/index.html HTTP/1.1
Host|/private/
2|2025-03-16 18:17:17|127.0.0.1|curl/7.86.0|/norobots/index.html HTTP/1.1
Host|/norobots/
...
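The pipe-separated dump doesn't show column names, and the table layout depends on the RoboNope build, so you can ask the sqlite3 CLI to print the schema of your own database:
sqlite3 demo/robonope.db .schema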
According to W3Techs, the top 5 most popular web servers as of March 2025 are:
- nginx (33.8%)
- Apache (26.8%)
- Cloudflare Server (23.2%)
- Litespeed (14.5%)
- Node.js (4.2%)
This first version has been tested with nginx v1.24. If there is demand, versions with the same functionality will be released for other servers, including WordPress and its robots.txt handling.
And of course, community contributions are most welcome!
- Parses and enforces robots.txt rules
- Generates dynamic content for disallowed paths
- Tracks bot requests in SQLite (or DuckDB -- work in progress) when a database path is configured
- Supports both static and dynamic content generation
- Configurable caching for performance
- Honeypot link generation with configurable destination via robonope_instructions_url
- Test suite
- Cross-platform support
- Nginx (1.24.0 or later recommended)
- PCRE library
- OpenSSL
- SQLite3 or DuckDB (under development)
- C compiler (gcc/clang)
- make
# Clone the repository
git clone --recursive https://github.com/raminf/RoboNope-nginx.git
cd RoboNope-nginx
# Build the module
make
This builds the full version of RoboNope, alongside a full copy of Nginx. You can use this to test locally and verify that it does what you want.
When ready, you can build a standalone module to install into your existing server by defining the STANDALONE environment variable before building:
% STANDALONE=1 make
# Start the demo server (runs on port 8080)
make demo-start
# Test with a disallowed URL
curl http://localhost:8080/private/index.html
# or
make demo-test
# View logged requests (if database logging is enabled)
make demo-logs
# Stop the demo server
make demo-stop
You can customize the demo environment using these variables:
# Enable database logging with custom location
DB_PATH=/tmp/robonope.db make demo-start
# Use DuckDB instead of SQLite (work in progress)
DB_ENGINE=duckdb make all demo-start
# Customize the instructions URL for honeypot links
INSTRUCTIONS_URL=https://your-custom-url.com make demo-start
Add to your main nginx.conf:
load_module modules/ngx_http_robonope_module.so;
http {
    # RoboNope configuration
    robonope_enable on;
    robonope_robots_path /path/to/robots.txt;

    # Optional: Enable database logging
    # robonope_db_path /path/to/database;

    # Optional: Set instructions URL for honeypot links
    # robonope_instructions_url "https://your-custom-url.com";

    # Optional rate limiting for disallowed paths
    limit_req_zone $binary_remote_addr zone=robonope_limit:10m rate=1r/s;

    server {
        # Apply rate limiting to disallowed paths
        location ~ ^/(norobots|private|admin|secret-data|internal)/ {
            limit_req zone=robonope_limit burst=5 nodelay;
            robonope_enable on;
        }
    }
}
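After editing nginx.conf, a quick sanity check and reload might look like this (exact paths, host, and port depend on your setup):
# Verify the configuration syntax, then reload
nginx -t
nginx -s reload

# A request for a disallowed path should now return generated honeypot content
curl http://localhost/private/index.html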
This project is licensed under the MIT License - see the LICENSE file for details.
Most of the project, the README, and the artwork were created with AI assistance. Even the name was workshopped with an AI.