get_text("rawdict") always returns same values for image xres and yres #4433

poffertje · 2025-04-08T14:41:12Z

Description of the bug

I am using this code to find the DPI of the images contained on a page, but no matter which PDF or which image I check, it always returns the same xres == yres == 96, which I suppose is the library's default.

How to reproduce the bug

import logging
import fitz  # PyMuPDF

log = logging.getLogger(__name__)

with fitz.open(pdf_path) as doc:
    page = doc[0]
    blocks = page.get_text("rawdict", sort=True, clip=page.trimbox)["blocks"]
    images = [block for block in blocks if block["type"] == 1]  # type 1 = image block

    for image in images:
        log.debug(f"xres: {image['xres']}, yres: {image['yres']}")

Output:

08-04-2025 16:35:04 - t_processor - DEBUG - Xres: 96, Yres: 96
08-04-2025 16:35:04 - t_processor - DEBUG - Xres: 96, Yres: 96
08-04-2025 16:35:04 - t_processor - DEBUG - Xres: 96, Yres: 96
08-04-2025 16:35:04 - t_processor - DEBUG - Xres: 96, Yres: 96

Is this the intended behavior?
Is there a different way to get what I want from the library?

Thank you for the great work!

PyMuPDF version

1.25.4

Operating system

Windows

Python version

3.10


JorjMcKie · 2025-04-08T15:16:03Z

As always:
Please provide evidence for a bug by supplying a PDF with an embedded image for which these values are reported wrong.

poffertje · 2025-04-09T07:34:16Z

For privacy reasons I cannot share the exact PDF I was using when I discovered this bug. However, for this washing machine manual I found on my laptop (pdf link), I get the same results on page 0.

09-04-2025 09:32:42 - t_processor - INFO - Processing: .\input\6\wgg246z5nl.pdf
09-04-2025 09:32:42 - t_processor - DEBUG - Xres: 96, Yres: 96
09-04-2025 09:32:42 - t_processor - DEBUG - Xres: 96, Yres: 96
09-04-2025 09:32:42 - t_processor - DEBUG - Xres: 96, Yres: 96
09-04-2025 09:32:42 - t_processor - DEBUG - Xres: 96, Yres: 96
09-04-2025 09:32:42 - t_processor - DEBUG - Xres: 96, Yres: 96
09-04-2025 09:32:42 - t_processor - DEBUG - Xres: 96, Yres: 96

JorjMcKie · 2025-04-09T09:09:34Z

Thanks for the example.
I will open a MuPDF issue and report the link here.

JorjMcKie · 2025-04-09T09:41:38Z

MuPDF issue link https://bugs.ghostscript.com/show_bug.cgi?id=708433

JorjMcKie · 2025-04-09T09:51:09Z

As a workaround, use this code snippet:

import pymupdf

doc = pymupdf.open("WGG246Z5NL.pdf")
for page in doc:
    for img in [b for b in page.get_text("dict")["blocks"] if b["type"] == 1]:
        print(f"{img['xres']=}, {img['yres']=}")
        pix = pymupdf.Pixmap(img["image"])
        print(f"{pix.xres=}, {pix.yres=}")  # <== this is correct!
        print()

poffertje · 2025-04-09T11:45:14Z

Thanks for the help!
Your solution does seem to give a better approximation. However, I am still not sure these values are correct, because Adobe Acrobat reports a different resolution:

For the first picture on the page (the washing machine) your code outputs:
pix.xres=299, pix.yres=299

Meanwhile, Adobe shows that this picture has:

(screenshot not preserved)

which should be roughly 353 DPI.

JorjMcKie · 2025-04-09T12:22:58Z

I don't understand how Adobe computes this. Extract the images via XPDF or via mutool extract input.pdf and you will end up with a couple of images of which none has a resolution higher than 299.

JorjMcKie · 2025-04-16T14:47:02Z

Our MuPDF team has looked at this:
The value 96 is the default value used by all PDF viewers, because the original DPI stored in the image itself is irrelevant for PDF.
Therefore, MuPDF doesn't bother to determine this value and so avoids the overhead of decoding the image. We in PyMuPDF should not have provided these values in the first place.

The recipe I gave you (via making an intermediate Pixmap) is the correct way to find the values embedded in the image.

I also finally learned that Adobe does the trivial computation image-width / bbox-width to come up with the values you showed me. So it, too, ignores any DPI embedded in the image.
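
For anyone who wants the Adobe-style number, here is a minimal sketch of that image-width / bbox-width computation. It assumes the bbox is given in PDF points (72 per inch) and that the image is not rotated or skewed on the page. The helper names effective_dpi and effective_dpi_from_block are hypothetical and not part of PyMuPDF, and the sample block is made-up data rather than output from the PDF discussed above.

```python
POINTS_PER_INCH = 72.0  # assumption: default PDF user space, 72 points per inch


def effective_dpi(width_px, height_px, bbox):
    """Return (x_dpi, y_dpi) of an image as it is placed on the page.

    width_px, height_px: pixel dimensions of the image.
    bbox: (x0, y0, x1, y1) of the displayed image, in points.
    """
    x0, y0, x1, y1 = bbox
    width_in = (x1 - x0) / POINTS_PER_INCH
    height_in = (y1 - y0) / POINTS_PER_INCH
    if width_in <= 0 or height_in <= 0:
        raise ValueError("bbox must have positive width and height")
    return width_px / width_in, height_px / height_in


def effective_dpi_from_block(block):
    """Hypothetical helper: same computation for a dict with
    'width', 'height' and 'bbox' keys."""
    return effective_dpi(block["width"], block["height"], block["bbox"])


# Made-up example: a 600 x 300 pixel image shown at 2 x 1 inches.
sample_block = {
    "type": 1,
    "width": 600,
    "height": 300,
    "bbox": (36.0, 36.0, 180.0, 108.0),
}
print(effective_dpi_from_block(sample_block))  # (300.0, 300.0)
```

The image blocks returned by page.get_text("dict") carry "width", "height" and "bbox" keys, so they should be usable as input here. The result describes how densely the pixels are placed on the page, not the DPI stored inside the image file, which is why it can differ from pix.xres / pix.yres.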

When convenient we will update the documentation accordingly and mention that xres / yres are useless values.

JorjMcKie added example required Waiting for information labels Apr 8, 2025

JorjMcKie self-assigned this Apr 8, 2025

JorjMcKie removed example required Waiting for information labels Apr 9, 2025

JorjMcKie added the upstream bug bug outside this package label Apr 9, 2025

JorjMcKie added not a bug not a bug / user error / unable to reproduce wontfix no intention to resolve and removed upstream bug bug outside this package labels Apr 16, 2025

JorjMcKie closed this as completed Apr 16, 2025

wohali mentioned this issue May 4, 2025

Still unable to retrieve PDF-extracted image xres/yres #4485

Closed
