
If I set the DOWNLOAD_HANDLERS setting, Scrapy-Playwright stops working #332


Closed
Bluenight1 opened this issue Dec 26, 2024 · 4 comments

@Bluenight1

When I add the DOWNLOAD_HANDLERS setting (whether in the Scrapy settings file or in custom_settings):

"DOWNLOAD_HANDLERS":{
"http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
"https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
},

Regardless of whether I comment out the "http" or "https" entry:

"DOWNLOAD_HANDLERS":{
#"http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
"https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
},
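For reference, a minimal settings sketch based on the scrapy-playwright README (both pieces are required together, since the custom handler only works with the asyncio Twisted reactor):

```python
# Minimal scrapy-playwright activation per the project's README:
# the download handlers must be paired with the asyncio reactor.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```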

I get the same result: my spider stalls like this

2024-12-26 16:41:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-12-26 16:41:42 [URL_Spider] INFO: Spider opened: URL_Spider
2024-12-26 16:41:42 [URL_Spider] INFO: Spider opened: URL_Spider
2024-12-26 16:41:42 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-12-26 16:42:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-12-26 16:43:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

And Scrapy-Playwright doesn't seem to be working at all; I don't see logs like these:

[scrapy-playwright] INFO: Starting download handler
[scrapy-playwright] INFO: Launching browser chromium
[scrapy-playwright] INFO: Browser chromium launched

This is my full log:

(venv) PS D:\python\djangogirls\URLSpider> scrapy crawl URL_Spider -a target_id=40
2024-12-26 16:41:42 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: URLSpider)
2024-12-26 16:41:42 [scrapy.utils.log] INFO: Versions: lxml 4.9.0.0, libxml2 2.9.10, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.11.0, Python 3.9.10 (tags/v3.9.10:f2f3f53, Jan 17 2022, 15:14:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 24.3.0 (OpenSSL 3.4.0 22 Oct 2024), cryptography 44.0.0, Platform Windows-10-10.0.22631-SP0
2024-12-26 16:41:42 [scrapy.addons] INFO: Enabled addons:
[]
2024-12-26 16:41:42 [asyncio] DEBUG: Using selector: SelectSelector
2024-12-26 16:41:42 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-12-26 16:41:42 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-12-26 16:41:42 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-12-26 16:41:42 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-12-26 16:41:42 [scrapy.extensions.telnet] INFO: Telnet Password: cd8cd99a6f09939e
2024-12-26 16:41:42 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2024-12-26 16:41:42 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'URLSpider',
'CONCURRENT_REQUESTS': 32,
'CONCURRENT_REQUESTS_PER_DOMAIN': 16,
'CONCURRENT_REQUESTS_PER_IP': 16,
'DOWNLOAD_DELAY': 3,
'FEED_EXPORT_ENCODING': 'utf-8',
'NEWSPIDER_MODULE': 'URLSpider.spiders',
'SPIDER_MODULES': ['URLSpider.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-12-26 16:41:42 [asyncio] DEBUG: Using proactor: IocpProactor
2024-12-26 16:41:42 [scrapy-playwright] INFO: Started loop on separate thread:
2024-12-26 16:41:42 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'URLSpider.middlewares.UrlspiderDownloaderMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-12-26 16:41:42 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'URLSpider.middlewares.UrlspiderSpiderMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-12-26 16:41:42 [scrapy.middleware] INFO: Enabled item pipelines:
['URLSpider.pipelines.SaveToDatabasePipeline']
2024-12-26 16:41:42 [scrapy.core.engine] INFO: Spider opened
2024-12-26 16:41:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-12-26 16:41:42 [URL_Spider] INFO: Spider opened: URL_Spider
2024-12-26 16:41:42 [URL_Spider] INFO: Spider opened: URL_Spider
2024-12-26 16:41:42 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-12-26 16:42:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-12-26 16:43:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-12-26 16:43:57 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force

My Environment

Scrapy==2.12.0
playwright==1.49.1
scrapy-playwright==0.0.42
system=windows 10

I have searched through many similar problems and discussions and tried every method I could find, but have not found a solution. I hope to solve this problem as soon as possible. Thank you very much.

@elacuesta
Member

Looks like a configuration issue. Please share more of your spider code (no need for the full parsing code) and settings file.

@elacuesta added the "needs more info" and "support" labels Dec 28, 2024
@Bluenight1
Author

OK! These are all the settings I have configured:

ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 32
DOWNLOAD_DELAY = 3
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16

DOWNLOADER_MIDDLEWARES = {
"URLSpider.middlewares.UrlspiderDownloaderMiddleware": 543,
'scrapy_playwright.middleware.ScrapyPlaywrightDownloadHandler': None,
"scrapy.downloadermiddlewares.cookies.CookiesMiddleware": 700,
SPIDER_MIDDLEWARES = {
"URLSpider.middlewares.UrlspiderSpiderMiddleware": 543,
}
FEED_EXPORT_ENCODING = "utf-8"
TWISTED_REACTOR="twisted.internet.asyncioreactor.AsyncioSelectorReactor",

PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {
"headless": False,
DOWNLOAD_HANDLERS={
"http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
"https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
},

#Django
import sys
import os
import django

DJANGO_PROJECT_PATH = os.path.abspath(os.path.join(os.path.dirname(__file__), '../../'))
DJANGO_SETTINGS_MODULE = 'mysite.settings'
sys.path.append(DJANGO_PROJECT_PATH)
os.environ['DJANGO_SETTINGS_MODULE'] = DJANGO_SETTINGS_MODULE

django.setup()

And this is the code where the first error occurred (before I added the DOWNLOAD_HANDLERS setting, the spider raised `KeyError: 'playwright_page'` at this location; that's why I added the DOWNLOAD_HANDLERS setting). Is this code sufficient?

import hashlib
import json
from collections import deque
import scrapy
from scrapy.http import Request
from UI_Layer.models import TargetSite
from ..items import UserURLItem

class URL_Spider(scrapy.Spider):
    name = "URL_Spider"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    }

    def __init__(self, target_id=None, *args, **kwargs):
        super(URL_Spider, self).__init__(*args, **kwargs)
        self.target_site = TargetSite.objects.get(id=target_id)
        self.start_url = self.target_site.url
        self.url_queue = deque([self.start_url])
        self.visited_urls = set()
        self.depth_map = {}
        self.sessions = []
        self.cookies_files = []

    def start_requests(self):
        yield scrapy.Request(
            url=f"{self.start_url}/login",
            callback=self.parse_login,
            meta={"playwright": True, "playwright_include_page": True},
            method='GET',
        )

    async def parse_login(self, response):
        page = response.meta["playwright_page"]
        ...

@elacuesta
Member

I assume the lines before the Django code are the settings.py file. If that's the case, is that the actual content of the file? I see some missing closing curly braces, which would be syntax errors (e.g. there is no closing } for DOWNLOADER_MIDDLEWARES or PLAYWRIGHT_LAUNCH_OPTIONS), and also a trailing comma after the closing curly brace for DOWNLOAD_HANDLERS, which would make DOWNLOAD_HANDLERS a tuple instead of a dict and prevent the crawl from running.
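The tuple pitfall described here can be reproduced in plain Python (the names below are illustrative only):

```python
# A trailing comma after a closing brace turns the whole assignment into a
# one-element tuple containing the dict, instead of the dict itself.
HANDLERS_OK = {
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
HANDLERS_BROKEN = {
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
},  # <- note the trailing comma

print(type(HANDLERS_OK))      # <class 'dict'>
print(type(HANDLERS_BROKEN))  # <class 'tuple'>
```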

@Bluenight1
Author

That must have been a copy-paste mistake; I can still spot such an obvious issue. I will provide the complete settings file again (I configured Scrapy inside Django, so there is Django configuration too). Do you have any other ideas?

ROBOTSTXT_OBEY = False

CONCURRENT_REQUESTS = 32

DOWNLOAD_DELAY = 3

CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16

DEFAULT_REQUEST_HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en",
}

SPIDER_MIDDLEWARES = {
"URLSpider.middlewares.UrlspiderSpiderMiddleware": 543,
}

DOWNLOADER_MIDDLEWARES = {
"URLSpider.middlewares.UrlspiderDownloaderMiddleware": 543,
'scrapy_playwright.middleware.ScrapyPlaywrightDownloadHandler': None,
"scrapy.downloadermiddlewares.cookies.CookiesMiddleware": 700,
}

ITEM_PIPELINES = {
"URLSpider.pipelines.SaveToDatabasePipeline": 300,
}

FEED_EXPORT_ENCODING = "utf-8"
TWISTED_REACTOR="twisted.internet.asyncioreactor.AsyncioSelectorReactor",
DOWNLOAD_HANDLERS={
"http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
"https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# ScrapyPlaywright

PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {
"headless": False,
}
import sys
import os
import django

DJANGO_PROJECT_PATH = os.path.abspath(os.path.join(os.path.dirname(__file__), '../../'))
DJANGO_SETTINGS_MODULE = 'mysite.settings'
sys.path.append(DJANGO_PROJECT_PATH)
os.environ['DJANGO_SETTINGS_MODULE'] = DJANGO_SETTINGS_MODULE

django.setup()
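The trailing-comma pitfall elacuesta pointed out for DOWNLOAD_HANDLERS also applies to the TWISTED_REACTOR line in this paste: the comma after the string makes that setting a one-element tuple rather than a string, which can be checked directly:

```python
# Reproducing the TWISTED_REACTOR line exactly as pasted above: the trailing
# comma makes it a tuple, not the string Scrapy expects.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor",

print(type(TWISTED_REACTOR))  # <class 'tuple'>
```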
