[BUG] Download PDF throws exception on some URLs #1557

@malicialab

Context:

  • Playwright Version: 1.25.2
  • Operating System: Linux Ubuntu
  • Python version: 3.8.10
  • Browser: Firefox, Chromium
  • Extra:

Code Snippet

#!/usr/bin/env python3

import asyncio
from playwright.async_api import async_playwright

tracing_enabled = True
tracing_filepath = "trace.zip"

async def handle_download(download):
    print("Found download for %s" % download.url)
    download_filepath = await download.path()
    print("Downloaded %s from %s" % (download_filepath, download.url))
    return

async def main():
    url1 = "https://www.mandiant.com/sites/default/files/2021-09/mandiant-apt1-report.pdf"
    url2 = "https://057info.hr/doc/o_kolacicima.pdf"
    url = url1
    async with async_playwright() as p:
        browser = await p.firefox.launch(headless=True)
        context = await browser.new_context(accept_downloads=True)
        if tracing_enabled:
            await context.tracing.start(screenshots=True,
                                        snapshots=True,
                                        sources=True)

        page = await context.new_page()
        page.on('download', handle_download)

        print("Visiting %s" % url)
        try:
            response = await page.goto(url, timeout=0)
        except Exception as e:
            print("Got exception %s" % e)

        await page.close()
        if tracing_enabled:
            await context.tracing.stop(path=tracing_filepath)
        await context.close()
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())

Describe the bug

When running the above code with Firefox, url2 downloads correctly, but url1 throws the following exception:

Visiting https://www.mandiant.com/sites/default/files/2021-09/mandiant-apt1-report.pdf
Found download for https://www.mandiant.com/sites/default/files/2021-09/mandiant-apt1-report.pdf
Exception in callback AsyncIOEventEmitter._emit_run.<locals>._callback(<Task finishe...ot NoneType')>) at /home/USER/python-virtual-environments/async/lib/python3.8/site-packages/pyee/_asyncio.py:55
handle: <Handle AsyncIOEventEmitter._emit_run.<locals>._callback(<Task finishe...ot NoneType')>) at /home/USER/python-virtual-environments/async/lib/python3.8/site-packages/pyee/_asyncio.py:55>
Traceback (most recent call last):
File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
self._context.run(self._callback, *self._args)
File "/home/user/python-virtual-environments/async/lib/python3.8/site-packages/pyee/_asyncio.py", line 62, in _callback
self.emit('error', exc)
File "/home/user/python-virtual-environments/async/lib/python3.8/site-packages/pyee/_base.py", line 116, in emit
self._emit_handle_potential_error(event, args[0] if args else None)
File "/home/USER/python-virtual-environments/async/lib/python3.8/site-packages/pyee/_base.py", line 86, in _emit_handle_potential_error
raise error
File "./test.py", line 11, in handle_download
download_filepath = await download.path()
File "/home/USER/python-virtual-environments/async/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 5640, in path
return mapping.from_maybe_impl(await self._impl_obj.path())
File "/home/USER/python-virtual-environments/async/lib/python3.8/site-packages/playwright/_impl/_download.py", line 58, in path
return await self._artifact.path_after_finished()
File "/home/USER/python-virtual-environments/async/lib/python3.8/site-packages/playwright/_impl/_artifact.py", line 36, in path_after_finished
return pathlib.Path(await self._channel.send("pathAfterFinished"))
File "/usr/lib/python3.8/pathlib.py", line 1042, in __new__
self = cls._from_parts(args, init=False)
File "/usr/lib/python3.8/pathlib.py", line 683, in _from_parts
drv, root, parts = self._parse_args(args)
File "/usr/lib/python3.8/pathlib.py", line 667, in _parse_args
a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

The trace seems to indicate that download.path() times out for url1, which may be why the smaller PDF at url2 works. However, I do not know how to handle that timeout: page.goto() is already called with timeout=0, which disables the navigation timeout.

The report is for Firefox, but Chromium raises a similar exception. Chromium also throws an additional exception in goto, but that seems to be expected Chromium behavior according to microsoft/playwright-java#863, and the download still starts if that first exception is caught. WebKit throws a different exception (Frame load interrupted) for both URLs, and the download event is not fired.

To give a little context: in my scenario I am given URLs that may point to either an HTML page or a PDF, and I need to download both. I cannot use 'async with page.expect_download()' since the URL may directly point to a PDF file.
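
For reference, below is a minimal sketch of a workaround I could try. It rests on an unconfirmed assumption: that the failure happens because page.close() runs while handle_download is still awaiting download.path(). The names DownloadTracker, on_download, wait and fetch are hypothetical helpers of mine, not Playwright API; only download.url, download.failure() and download.path() come from Playwright. The idea is to keep a reference to each handler task and await all of them before closing anything.

```python
import asyncio


class DownloadTracker:
    """Hypothetical helper: remembers in-flight downloads so the caller can
    wait for them before closing the page or context."""

    def __init__(self):
        self._pending = []  # tasks for downloads that have not finished yet
        self.results = []   # (url, path or None, failure message or None)

    def on_download(self, download):
        # Synchronous event handler: schedule the wait as a task and keep a
        # reference so wait() can await it later.
        task = asyncio.ensure_future(self._wait_for(download))
        self._pending.append(task)

    async def _wait_for(self, download):
        # failure() resolves once the download ends; it is None on success.
        failure = await download.failure()
        path = None if failure else await download.path()
        self.results.append((download.url, path, failure))

    async def wait(self):
        # Loop in case another download starts while we are waiting.
        while self._pending:
            tasks, self._pending = self._pending, []
            await asyncio.gather(*tasks)


async def fetch(url):
    # Imported here so DownloadTracker can be exercised without Playwright.
    from playwright.async_api import async_playwright

    tracker = DownloadTracker()
    async with async_playwright() as p:
        browser = await p.firefox.launch(headless=True)
        context = await browser.new_context(accept_downloads=True)
        page = await context.new_page()
        page.on("download", tracker.on_download)
        try:
            await page.goto(url, timeout=0)
        except Exception as e:
            print("Got exception %s" % e)
        # Assumption: the download event has already fired by this point.
        await tracker.wait()
        for result in tracker.results:
            print("Download result: %s" % (result,))
        # Downloaded files are removed when the context closes, so any copy
        # (for example with download.save_as) would have to happen before this.
        await context.close()
        await browser.close()
    return tracker.results
```

This would be called as asyncio.run(fetch(url)). I have not verified that it resolves the exception for url1, and it does not cover the case where the download event fires only after wait() has returned.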

Thanks for your time. Let me know if you need further info.
