feat: add `on_skipped_request` decorator, to process links skipped according to `robots.txt` rules #1166

Mantisus · 2025-04-21T22:23:25Z

Description

This PR supplements feat: add `respect_robots_txt_file` option #1162 by adding an `on_skipped_request` decorator to handle links skipped according to `robots.txt` rules.
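For illustration, the hook pattern described here can be sketched with the standard library alone. This is a minimal sketch and not Crawlee's implementation: `TinyCrawler`, `enqueue`, and the `'robots_txt'` reason string are hypothetical names, the callback is synchronous for brevity, and `urllib.robotparser` stands in for whichever robots.txt parser the crawler actually uses.

```python
from typing import Callable, List, Optional
from urllib.robotparser import RobotFileParser

# Hypothetical callback shape: receives the skipped URL and a reason string.
SkippedCallback = Callable[[str, str], None]


class TinyCrawler:
    """Toy crawler that filters URLs by robots.txt and reports skipped ones."""

    def __init__(self, robots_txt: str, user_agent: str = '*') -> None:
        # Parse the robots.txt content once, up front.
        self._parser = RobotFileParser()
        self._parser.parse(robots_txt.splitlines())
        self._user_agent = user_agent
        self._on_skipped: Optional[SkippedCallback] = None

    def on_skipped_request(self, callback: SkippedCallback) -> SkippedCallback:
        """Register `callback` as the skipped-request hook (usable as a decorator)."""
        self._on_skipped = callback
        return callback

    def enqueue(self, urls: List[str]) -> List[str]:
        """Return the allowed URLs; pass each disallowed URL to the hook."""
        accepted = []
        for url in urls:
            if self._parser.can_fetch(self._user_agent, url):
                accepted.append(url)
            elif self._on_skipped is not None:
                self._on_skipped(url, 'robots_txt')
        return accepted


if __name__ == '__main__':
    crawler = TinyCrawler('User-agent: *\nDisallow: /private/')

    @crawler.on_skipped_request
    def log_skipped(url: str, reason: str) -> None:
        print(f'Skipped {url} ({reason})')

    crawler.enqueue(['https://example.com/', 'https://example.com/private/page'])
```

Registering the hook through a decorator keeps the skip handling next to the other handlers while leaving the filtering logic inside the crawler.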

Issues

- Closes: add on_skipped_request hook #1160
- Related: feat: add respect_robots_txt_file option #1162

Copilot

Pull Request Overview

This PR introduces an on_skipped_request decorator and enhances robots.txt integration across multiple crawler implementations. Key changes include:

- Adding a robots.txt endpoint and constant in server endpoints.
- Integrating robots.txt filtering in both Playwright and Abstract HTTP crawlers.
- Extending unit tests to cover new robots.txt behaviors and the on_skipped_request hook.

Reviewed Changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `tests/unit/server_endpoints.py` | Added a ROBOTS_TXT constant with sample directives for testing cases. |
| `tests/unit/server.py` | Introduced a new endpoint to serve the robots.txt file. |
| `tests/unit/crawlers/*` | Added tests for robots.txt respect and on_skipped_request hook across crawlers. |
| `src/crawlee/crawlers/*` | Updated link extraction logic and added skipped request handling for robots.txt. |
| `src/crawlee/crawlers/_basic/_basic_crawler.py` | Integrated robots.txt check in BasicCrawler with a new on_skipped_request callback. |
| `src/crawlee/_utils/robots.py` | Added a new RobotsTxtFile utility using Protego for parsing robots.txt content. |
| `pyproject.toml` | Added dependency on protego for robots.txt parsing. |

Comments suppressed due to low confidence (1)

src/crawlee/crawlers/_basic/_basic_crawler.py:1000

[nitpick] Consider renaming the parameter 'need_mark' to 'mark_request' for clearer intent in the _handle_skipped_request method.

def _handle_skipped_request(self, request: Request | str, reason: SkippedReason, *, need_mark: bool = False) -> None:

tests/unit/server_endpoints.py

Mantisus · 2025-04-21T22:29:02Z

This PR builds on #1162 and should only be considered after #1162 is merged.

It is split into a separate PR because adding a new handler decorator deserves its own PR and a mention in the release.

… and the handler is executed for `PlaywrightCrawler` (apify#1163)

### Description

- For `PlaywrightCrawler`, cookies should only be saved to the session store when the handler is fully executed, because the browser may continue to set cookies while the handler is running.

### Testing

- Add a test simulating a cookie being set in the browser during `default_handler` execution.
- Update the `test_isolation_cookies` test.

### Description

Adds retry for unprocessed requests in `add_requests_batched` calls. The retry recursively calls `_process_batch`, which first works on the full request batch and then on batches of unprocessed requests until the retry limit is reached or all requests are processed. Each retry happens after a delay that increases linearly with each attempt. Unprocessed requests are not counted in `request_queue.get_total_count`. Add test.

### Issues

- Closes: [Handle unprocessed requests in batch_add_requests](apify/apify-sdk-python#456)
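The retry scheme described in this commit message can be sketched as follows. This is a rough illustration under stated assumptions, not the code from that change: `process_batch`, `add_batch`, `base_delay`, and `max_attempts` are hypothetical names, and `add_batch` is assumed to return the requests it could not process.

```python
import asyncio
from typing import Awaitable, Callable, List

# Hypothetical signature: takes a batch, returns the items NOT processed.
AddBatch = Callable[[List[str]], Awaitable[List[str]]]


async def process_batch(
    add_batch: AddBatch,
    batch: List[str],
    *,
    base_delay: float = 0.01,
    max_attempts: int = 5,
    attempt: int = 1,
) -> List[str]:
    """Return the requests still unprocessed once retries are exhausted."""
    unprocessed = await add_batch(batch)

    # Stop when everything was processed or the retry limit is reached.
    if not unprocessed or attempt >= max_attempts:
        return unprocessed

    # Linear backoff: the wait grows by `base_delay` with each attempt.
    await asyncio.sleep(base_delay * attempt)

    # Recurse on the unprocessed remainder only, not the full batch.
    return await process_batch(
        add_batch,
        unprocessed,
        base_delay=base_delay,
        max_attempts=max_attempts,
        attempt=attempt + 1,
    )
```

Each recursive call receives only the remainder, so later attempts work on progressively smaller batches.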

vdusek

Could we please cover on_skipped_request somewhere in the docs? 🙂

docs/examples/code_examples/on_skipped_request.py

docs/examples/respect_robots_txt_file.mdx

vdusek

LGTM

Mantisus added 12 commits April 17, 2025 15:43

basic_robots_allow

427b00a

add respect robots_txt_file

638b5be

update load

33be1c8

change RobotFileParser to Protego

a44dff1

add tests

538672e

fix

b9b35be

update tests

a49ab66

update TODO comments

46a2356

update docstrings

10077b6

fix docstrings

358b20c

Merge branch 'respect_robots_txt' into on-skipped-request

5282cf9

add on_skipped_request

dfa2087

Mantisus requested a review from Copilot April 21, 2025 22:23

Copilot AI reviewed Apr 21, 2025

tests/unit/server_endpoints.py (outdated)

Mantisus marked this pull request as draft April 21, 2025 22:25

Mantisus self-assigned this Apr 21, 2025

Mantisus and others added 13 commits April 24, 2025 14:37

change staticmethod to classmethod

b92494e

Update src/crawlee/_utils/robots.py

9243cac

Co-authored-by: Vlada Dusek <[email protected]>

add _robots_txt_locks_cache

c50eabe

update pyproject.toml

b6baca8

update docstrings

9c7ad1c

add docs example

c9e6147

chore(deps): update typescript-eslint monorepo to v8.31.0 (apify#1168)

13a4c9f

chore(deps): update dependency setuptools to v79 (apify#1165)

8db45af

fix: Update UnprocessedRequest to match actual data (apify#1155)

00916f6

### Description

Update `UnprocessedRequest` to match actual data. Add test.

### Issues

- Closes: apify#1150

chore(release): Update changelog and package version [skip ci]

ca866cc

chore(release): Update changelog and package version [skip ci]

b960350

Apify Release Bot and others added 6 commits April 24, 2025 14:38

chore(release): Update changelog and package version [skip ci]

cc12d1b

fix: call failed_request_handler for SessionError when session rotation count exceeds maximum (apify#1147)

1b7be37

- Call `failed_request_handler` for `SessionError` when session rotation count exceeds maximum

chore(release): Update changelog and package version [skip ci]

f67f9ca

one lock to rule them all

821f891

Merge branch 'master' into on-skipped-request

d3324f0

fix

fca60a2

Mantisus marked this pull request as ready for review April 24, 2025 15:56

Mantisus requested review from janbuchar and vdusek April 24, 2025 15:58

vdusek reviewed Apr 25, 2025

add docs

35fa23f

Mantisus requested a review from vdusek April 25, 2025 19:13

vdusek reviewed Apr 28, 2025

docs/examples/code_examples/on_skipped_request.py (outdated)

vdusek reviewed Apr 28, 2025

docs/examples/respect_robots_txt_file.mdx (outdated)

Mantisus added 4 commits April 28, 2025 12:16

add file prefix

e85153a

resolve

fdf400c

resolve

45a1bc1

resolve

ce76eb5

vdusek approved these changes Apr 28, 2025

remove filename

493157f
