
If I set the DOWNLOAD_HANDLERS setting, Scrapy-Playwright stops working #332


Closed
Bluenight1 opened this issue Dec 26, 2024 · 4 comments

@Bluenight1

When I add the DOWNLOAD_HANDLERS setting (whether in the Scrapy settings file or in custom_settings):

"DOWNLOAD_HANDLERS":{
"http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
"https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
},

Regardless of whether I comment out the "http" or "https" entry:

"DOWNLOAD_HANDLERS":{
#"http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
"https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
},
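For reference, a minimal settings sketch based on the scrapy-playwright README (both pieces are required together, since the custom handler only works with the asyncio Twisted reactor):

```python
# Minimal scrapy-playwright activation per the project's README:
# the download handlers must be paired with the asyncio reactor.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```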

I get the same result: my spider stalls like this

2024-12-26 16:41:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-12-26 16:41:42 [URL_Spider] INFO: Spider opened: URL_Spider
2024-12-26 16:41:42 [URL_Spider] INFO: Spider opened: URL_Spider
2024-12-26 16:41:42 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-12-26 16:42:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-12-26 16:43:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

And Scrapy-Playwright doesn't seem to be working at all; I don't see logs like these:

[scrapy-playwright] INFO: Starting download handler
[scrapy-playwright] INFO: Launching browser chromium
[scrapy-playwright] INFO: Browser chromium launched

This is my full log:

(venv) PS D:\python\djangogirls\URLSpider> scrapy crawl URL_Spider -a target_id=40
2024-12-26 16:41:42 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: URLSpider)
2024-12-26 16:41:42 [scrapy.utils.log] INFO: Versions: lxml 4.9.0.0, libxml2 2.9.10, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.11.0, Python 3.9.10 (tags/v3.9.10:f2f3f53, Jan 17 2022, 15:14:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 24.3.0 (OpenSSL 3.4.0 22 Oct 2024), cryptography 44.0.0, Platform Windows-10-10.0.22631-SP0
2024-12-26 16:41:42 [scrapy.addons] INFO: Enabled addons:
[]
2024-12-26 16:41:42 [asyncio] DEBUG: Using selector: SelectSelector
2024-12-26 16:41:42 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-12-26 16:41:42 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-12-26 16:41:42 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-12-26 16:41:42 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-12-26 16:41:42 [scrapy.extensions.telnet] INFO: Telnet Password: cd8cd99a6f09939e
2024-12-26 16:41:42 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2024-12-26 16:41:42 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'URLSpider',
'CONCURRENT_REQUESTS': 32,
'CONCURRENT_REQUESTS_PER_DOMAIN': 16,
'CONCURRENT_REQUESTS_PER_IP': 16,
'DOWNLOAD_DELAY': 3,
'FEED_EXPORT_ENCODING': 'utf-8',
'NEWSPIDER_MODULE': 'URLSpider.spiders',
'SPIDER_MODULES': ['URLSpider.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-12-26 16:41:42 [asyncio] DEBUG: Using proactor: IocpProactor
2024-12-26 16:41:42 [scrapy-playwright] INFO: Started loop on separate thread:
2024-12-26 16:41:42 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'URLSpider.middlewares.UrlspiderDownloaderMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-12-26 16:41:42 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'URLSpider.middlewares.UrlspiderSpiderMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-12-26 16:41:42 [scrapy.middleware] INFO: Enabled item pipelines:
['URLSpider.pipelines.SaveToDatabasePipeline']
2024-12-26 16:41:42 [scrapy.core.engine] INFO: Spider opened
2024-12-26 16:41:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-12-26 16:41:42 [URL_Spider] INFO: Spider opened: URL_Spider
2024-12-26 16:41:42 [URL_Spider] INFO: Spider opened: URL_Spider
2024-12-26 16:41:42 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-12-26 16:42:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-12-26 16:43:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-12-26 16:43:57 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force

My Environment

Scrapy==2.12.0
playwright==1.49.1
scrapy-playwright==0.0.42
system=windows 10

I have searched through many similar problems and discussions and tried every method I could find, but have not found a solution. I hope to solve this problem as soon as possible. Thank you very much.

@elacuesta
Member

Looks like a configuration issue. Please share more of your spider code (no need for the full parsing code) and settings file.

@elacuesta added the "needs more info" and "support" labels Dec 28, 2024
@Bluenight1
Author

OK! These are all the settings I have configured:

ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 32
DOWNLOAD_DELAY = 3
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16

DOWNLOADER_MIDDLEWARES = {
"URLSpider.middlewares.UrlspiderDownloaderMiddleware": 543,
'scrapy_playwright.middleware.ScrapyPlaywrightDownloadHandler': None,
"scrapy.downloadermiddlewares.cookies.CookiesMiddleware": 700,
SPIDER_MIDDLEWARES = {
"URLSpider.middlewares.UrlspiderSpiderMiddleware": 543,
}
FEED_EXPORT_ENCODING = "utf-8"
TWISTED_REACTOR="twisted.internet.asyncioreactor.AsyncioSelectorReactor",

PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {
"headless": False,
DOWNLOAD_HANDLERS={
"http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
"https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
},

#Django
import sys
import os
import django

DJANGO_PROJECT_PATH = os.path.abspath(os.path.join(os.path.dirname(__file__), '../../'))
DJANGO_SETTINGS_MODULE = 'mysite.settings'
sys.path.append(DJANGO_PROJECT_PATH)
os.environ['DJANGO_SETTINGS_MODULE'] = DJANGO_SETTINGS_MODULE

django.setup()

And this is the code where the first error occurred (before I added the DOWNLOAD_HANDLERS setting, the spider raised `KeyError: 'playwright_page'` at this location; that's why I added the DOWNLOAD_HANDLERS setting). Is this code sufficient?

import hashlib
import json
from collections import deque
import scrapy
from scrapy.http import Request
from UI_Layer.models import TargetSite
from ..items import UserURLItem

class URL_Spider(scrapy.Spider):
    name = "URL_Spider"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    }

    def __init__(self, target_id=None, *args, **kwargs):
        super(URL_Spider, self).__init__(*args, **kwargs)
        self.target_site = TargetSite.objects.get(id=target_id)
        self.start_url = self.target_site.url
        self.url_queue = deque([self.start_url])
        self.visited_urls = set()
        self.depth_map = {}
        self.sessions = []
        self.cookies_files = []

    def start_requests(self):
        yield scrapy.Request(
            url=f"{self.start_url}/login",
            callback=self.parse_login,
            meta={"playwright": True, "playwright_include_page": True},
            method='GET',
        )

    async def parse_login(self, response):
        page = response.meta["playwright_page"]
        ...

@elacuesta
Member

I assume the lines before the Django code are the settings.py file. If that's the case, is that the actual content of the file? I see some missing closing curly braces, which would be syntax errors (e.g. there is no closing } for DOWNLOADER_MIDDLEWARES or PLAYWRIGHT_LAUNCH_OPTIONS), and also a trailing comma after the closing curly brace for DOWNLOAD_HANDLERS, which would make DOWNLOAD_HANDLERS a tuple instead of a dict and prevent the crawl from running.
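The tuple pitfall described here can be reproduced in plain Python (the names below are illustrative only):

```python
# A trailing comma after a closing brace turns the whole assignment into a
# one-element tuple containing the dict, instead of the dict itself.
HANDLERS_OK = {
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
HANDLERS_BROKEN = {
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
},  # <- note the trailing comma

print(type(HANDLERS_OK))      # <class 'dict'>
print(type(HANDLERS_BROKEN))  # <class 'tuple'>
```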

@Bluenight1
Author

That must have been a copy-paste mistake; I can still spot such an obvious issue. I will provide the complete settings file again (I configured Scrapy inside Django, so there is Django configuration too). Do you have any other ideas?

ROBOTSTXT_OBEY = False

CONCURRENT_REQUESTS = 32

DOWNLOAD_DELAY = 3

CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16

DEFAULT_REQUEST_HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en",
}

SPIDER_MIDDLEWARES = {
"URLSpider.middlewares.UrlspiderSpiderMiddleware": 543,
}

DOWNLOADER_MIDDLEWARES = {
"URLSpider.middlewares.UrlspiderDownloaderMiddleware": 543,
'scrapy_playwright.middleware.ScrapyPlaywrightDownloadHandler': None,
"scrapy.downloadermiddlewares.cookies.CookiesMiddleware": 700,
}

ITEM_PIPELINES = {
"URLSpider.pipelines.SaveToDatabasePipeline": 300,
}

FEED_EXPORT_ENCODING = "utf-8"
TWISTED_REACTOR="twisted.internet.asyncioreactor.AsyncioSelectorReactor",
DOWNLOAD_HANDLERS={
"http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
"https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# ScrapyPlaywright

PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {
"headless": False,
}
import sys
import os
import django

DJANGO_PROJECT_PATH = os.path.abspath(os.path.join(os.path.dirname(__file__), '../../'))
DJANGO_SETTINGS_MODULE = 'mysite.settings'
sys.path.append(DJANGO_PROJECT_PATH)
os.environ['DJANGO_SETTINGS_MODULE'] = DJANGO_SETTINGS_MODULE

django.setup()
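The trailing-comma pitfall elacuesta pointed out for DOWNLOAD_HANDLERS also applies to the TWISTED_REACTOR line in this paste: the comma after the string makes that setting a one-element tuple rather than a string, which can be checked directly:

```python
# Reproducing the TWISTED_REACTOR line exactly as pasted above: the trailing
# comma makes it a tuple, not the string Scrapy expects.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor",

print(type(TWISTED_REACTOR))  # <class 'tuple'>
```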
