If I set the DOWNLOAD_HANDLERS setting, scrapy-playwright stops working #332
Looks like a configuration issue. Please share more of your spider code (no need for the full parsing code) and settings file.
OK! These are all the configurations I have set up:

ROBOTSTXT_OBEY = False
DOWNLOADER_MIDDLEWARES = {
PLAYWRIGHT_BROWSER_TYPE = "chromium"
# Django
django.setup()

And this is the code where the first error occurred. Before I added the DOWNLOAD_HANDLERS setting, the spider raised a KeyError for playwright_page at this point; that is why I added DOWNLOAD_HANDLERS in the first place. Is this code enough?

import hashlib
class URL_Spider(scrapy.Spider):
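For context, a callback that reads response.meta["playwright_page"] generally needs both the Playwright download handler enabled and "playwright_include_page": True in the request meta; without them that key is simply absent and a KeyError follows. Below is a minimal sketch of that request side, assuming the spider name from the snippet above; the start URL and the parse body are placeholders, only the meta keys are the documented ones.

import hashlib

import scrapy


class URL_Spider(scrapy.Spider):
    name = "URL_Spider"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com",  # placeholder start URL
            meta={
                "playwright": True,
                # Needed if the callback reads response.meta["playwright_page"];
                # without it that key is missing and a KeyError is raised.
                "playwright_include_page": True,
            },
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        title = await page.title()
        await page.close()  # pages passed to the callback must be closed manually
        yield {
            "url": response.url,
            "title": title,
            "fingerprint": hashlib.sha1(response.url.encode()).hexdigest(),
        }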
I assume the lines before the Django code are the …
That was probably a copy-paste problem on my end; I can spot that obvious issue myself. I will provide the complete settings file again (I configured Scrapy inside Django, so there is also Django configuration in it). Do you have any other ideas?

ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 32
DOWNLOAD_DELAY = 3
CONCURRENT_REQUESTS_PER_DOMAIN = 16
DEFAULT_REQUEST_HEADERS = {
SPIDER_MIDDLEWARES = {
DOWNLOADER_MIDDLEWARES = {
ITEM_PIPELINES = {
FEED_EXPORT_ENCODING = "utf-8"
# ScrapyPlaywright
PLAYWRIGHT_BROWSER_TYPE = "chromium"
DJANGO_PROJECT_PATH = os.path.abspath(os.path.join(os.path.dirname(__file__), '../../'))
django.setup()
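For reference, the configuration scrapy-playwright documents as required pairs the DOWNLOAD_HANDLERS entries with the asyncio Twisted reactor. Here is a rough sketch of how the relevant part of a settings.py could look; the handler class paths, the reactor path, and PLAYWRIGHT_BROWSER_TYPE are the documented names, while the sys.path line and the Django settings module name are assumptions made for illustration.

import os
import sys

import django

ROBOTSTXT_OBEY = False

# scrapy-playwright: the download handlers and the asyncio reactor go together
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_BROWSER_TYPE = "chromium"

# Django integration, mirroring the fragment above
DJANGO_PROJECT_PATH = os.path.abspath(os.path.join(os.path.dirname(__file__), "../../"))
sys.path.append(DJANGO_PROJECT_PATH)  # assumption: the Django project root must be importable
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "djangogirls.settings")  # hypothetical module name
django.setup()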
When I add the DOWNLOAD_HANDLERS setting (whether in the project's settings.py or in custom_settings):
"DOWNLOAD_HANDLERS":{
"http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
"https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
},
and no matter whether I comment out the http or the https entry:
"DOWNLOAD_HANDLERS":{
#"http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
"https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
},
I get the same result: the spider stalls like this:
2024-12-26 16:41:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-12-26 16:41:42 [URL_Spider] INFO: Spider opened: URL_Spider
2024-12-26 16:41:42 [URL_Spider] INFO: Spider opened: URL_Spider
2024-12-26 16:41:42 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-12-26 16:42:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-12-26 16:43:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
And scrapy-playwright does not seem to start at all; I never see log lines such as:
[scrapy-playwright] INFO: Starting download handler
[scrapy-playwright] INFO: Launching browser chromium
[scrapy-playwright] INFO: Browser chromium launched
This is my full log:
(venv) PS D:\python\djangogirls\URLSpider> scrapy crawl URL_Spider -a target_id=40
2024-12-26 16:41:42 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: URLSpider)
2024-12-26 16:41:42 [scrapy.utils.log] INFO: Versions: lxml 4.9.0.0, libxml2 2.9.10, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.11.0, Python 3.9.10 (tags/v3.9.10:f2f3f53, Jan 17 2022, 15:14:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 24.3.0 (OpenSSL 3.4.0 22 Oct 2024), cryptography 44.0.0, Platform Windows-10-10.0.22631-SP0
2024-12-26 16:41:42 [scrapy.addons] INFO: Enabled addons:
[]
2024-12-26 16:41:42 [asyncio] DEBUG: Using selector: SelectSelector
2024-12-26 16:41:42 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-12-26 16:41:42 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-12-26 16:41:42 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-12-26 16:41:42 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-12-26 16:41:42 [scrapy.extensions.telnet] INFO: Telnet Password: cd8cd99a6f09939e
2024-12-26 16:41:42 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2024-12-26 16:41:42 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'URLSpider',
'CONCURRENT_REQUESTS': 32,
'CONCURRENT_REQUESTS_PER_DOMAIN': 16,
'CONCURRENT_REQUESTS_PER_IP': 16,
'DOWNLOAD_DELAY': 3,
'FEED_EXPORT_ENCODING': 'utf-8',
'NEWSPIDER_MODULE': 'URLSpider.spiders',
'SPIDER_MODULES': ['URLSpider.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-12-26 16:41:42 [asyncio] DEBUG: Using proactor: IocpProactor
2024-12-26 16:41:42 [scrapy-playwright] INFO: Started loop on separate thread:
2024-12-26 16:41:42 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'URLSpider.middlewares.UrlspiderDownloaderMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-12-26 16:41:42 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'URLSpider.middlewares.UrlspiderSpiderMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-12-26 16:41:42 [scrapy.middleware] INFO: Enabled item pipelines:
['URLSpider.pipelines.SaveToDatabasePipeline']
2024-12-26 16:41:42 [scrapy.core.engine] INFO: Spider opened
2024-12-26 16:41:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-12-26 16:41:42 [URL_Spider] INFO: Spider opened: URL_Spider
2024-12-26 16:41:42 [URL_Spider] INFO: Spider opened: URL_Spider
2024-12-26 16:41:42 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-12-26 16:42:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-12-26 16:43:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-12-26 16:43:57 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
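Since the handler never logs its startup lines, one illustrative way to check whether DOWNLOAD_HANDLERS and the reactor setting actually reach the crawler is to dump them from inside a spider. This is only a sketch, not part of the original report, and the spider name is hypothetical.

import scrapy


class SettingsCheckSpider(scrapy.Spider):
    # hypothetical throwaway spider, used only to print the effective settings
    name = "settings_check"

    def start_requests(self):
        self.logger.info("DOWNLOAD_HANDLERS = %s", self.settings.getdict("DOWNLOAD_HANDLERS"))
        self.logger.info("TWISTED_REACTOR = %s", self.settings.get("TWISTED_REACTOR"))
        return []  # no requests; this spider only logs the settings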
My Environment
Scrapy==2.12.0
playwright==1.49.1
scrapy-playwright==0.0.42
System: Windows 10
I have searched through a large number of similar issues and discussions and tried every method I could find, but I have not found a solution. I hope to resolve this as soon as possible. Thank you very much.