get_text("rawdict") always returns same values for image xres and yres #4433

Closed
poffertje opened this issue Apr 8, 2025 · 8 comments
Labels: not a bug (not a bug / user error / unable to reproduce), wontfix (no intention to resolve)

@poffertje

Description of the bug

I am using this code to find the DPI of the images on a page, but no matter which PDF or image I check, the function always returns the same values, xres == yres == 96, which I suppose is the library's default.

How to reproduce the bug

import logging

import fitz  # PyMuPDF

log = logging.getLogger(__name__)

with fitz.open(pdf_path) as doc:  # pdf_path: path to the PDF under test
    page = doc[0]
    blocks = page.get_text("rawdict", sort=True, clip=page.trimbox)["blocks"]
    images = [block for block in blocks if block["type"] == 1]  # type 1 = image block

    for image in images:
        log.debug(f"xres: {image['xres']}, yres: {image['yres']}")

Output:

08-04-2025 16:35:04 - t_processor - DEBUG - Xres: 96, Yres: 96
08-04-2025 16:35:04 - t_processor - DEBUG - Xres: 96, Yres: 96
08-04-2025 16:35:04 - t_processor - DEBUG - Xres: 96, Yres: 96
08-04-2025 16:35:04 - t_processor - DEBUG - Xres: 96, Yres: 96

Does this run as intended?
Is there a different way to do what I want with the library?

Thank you for the great work!

PyMuPDF version

1.25.4

Operating system

Windows

Python version

3.10

@JorjMcKie
Collaborator

As always:
Please provide evidence for a bug by supplying a PDF with an embedded image for which these values are reported wrong.

@poffertje
Author

I cannot share the exact PDF I was using when I discovered this because of privacy concerns. However, for this washing machine manual I found on my laptop (pdf link), I get the same results for page 0.

09-04-2025 09:32:42 - t_processor - INFO - Processing: .\input\6\wgg246z5nl.pdf
09-04-2025 09:32:42 - t_processor - DEBUG - Xres: 96, Yres: 96
09-04-2025 09:32:42 - t_processor - DEBUG - Xres: 96, Yres: 96
09-04-2025 09:32:42 - t_processor - DEBUG - Xres: 96, Yres: 96
09-04-2025 09:32:42 - t_processor - DEBUG - Xres: 96, Yres: 96
09-04-2025 09:32:42 - t_processor - DEBUG - Xres: 96, Yres: 96
09-04-2025 09:32:42 - t_processor - DEBUG - Xres: 96, Yres: 96

@JorjMcKie
Collaborator

Thanks for the example.
I will open a MuPDF issue and report the link here.

@JorjMcKie
Collaborator

As a workaround, use this code snippet:

import pymupdf

doc = pymupdf.open("WGG246Z5NL.pdf")
for page in doc:
    for img in [b for b in page.get_text("dict")["blocks"] if b["type"] == 1]:
        print(f"{img['xres']=}, {img['yres']=}")
        pix = pymupdf.Pixmap(img["image"])
        print(f"{pix.xres=}, {pix.yres=}")  # <== this is correct!
        print()

JorjMcKie added the "upstream bug" (bug outside this package) label on Apr 9, 2025
@poffertje
Author

Thanks for the help!
Your solution does seem to give a better approximation. However, I am still not sure whether these values are correct, because Adobe Acrobat reports a different resolution:

For the first picture on the page (the washing machine) your code outputs:
pix.xres=299, pix.yres=299

Meanwhile, Adobe shows different properties for this picture (screenshot not preserved), which should work out to roughly 353 DPI.

@JorjMcKie
Collaborator

I don't understand how Adobe computes this. Extract the images via XPDF or via mutool extract input.pdf and you will end up with a handful of images, none of which has a resolution higher than 299.

@JorjMcKie
Collaborator

Our MuPDF team has looked at this:
The value 96 is the default used by all PDF viewers, because the original DPI stored in the image itself is irrelevant for PDF.
Therefore, MuPDF doesn't bother determining this value, which spares the overhead of decoding the image. We in PyMuPDF should not have provided these values in the first place.

The recipe I gave you (via making an intermediate Pixmap) is the correct way to find the values embedded in the image.

I finally also learned that Adobe does the trivial computation image-width / bbox-width to come up with the values you showed me. So it also ignores any embedded DPI of the image.
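
For anyone who wants numbers comparable to what Adobe displays, here is a minimal sketch of that ratio. It assumes the image block's "width" / "height" keys hold the pixel dimensions, that "bbox" is given in points (1/72 inch), and that the image is not rotated on the page. The helper names effective_dpi and report_effective_dpi are hypothetical and not part of PyMuPDF, and this is not confirmed to be Adobe's exact formula.

```python
from typing import Sequence, Tuple

POINTS_PER_INCH = 72.0  # PDF user space: 1 point = 1/72 inch


def effective_dpi(
    pixel_width: int, pixel_height: int, bbox: Sequence[float]
) -> Tuple[float, float]:
    """Pixels per inch at the size the image is displayed on the page.

    bbox is (x0, y0, x1, y1) in points. Assumes an unrotated image,
    so the bbox matches the displayed width and height.
    """
    x0, y0, x1, y1 = bbox
    width_in = abs(x1 - x0) / POINTS_PER_INCH
    height_in = abs(y1 - y0) / POINTS_PER_INCH
    if width_in == 0 or height_in == 0:
        raise ValueError("bbox has zero width or height")
    return pixel_width / width_in, pixel_height / height_in


def report_effective_dpi(pdf_path: str) -> None:
    """Print the displayed resolution of every image block in a PDF."""
    import pymupdf  # imported here so effective_dpi works without PyMuPDF

    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            blocks = page.get_text("dict")["blocks"]
            for img in (b for b in blocks if b["type"] == 1):
                dpi_x, dpi_y = effective_dpi(img["width"], img["height"], img["bbox"])
                print(f"page {page.number}: {dpi_x:.0f} x {dpi_y:.0f} DPI")
```

For example, a 300 x 150 pixel image displayed in a 72 x 36 point rectangle gives effective_dpi(300, 150, (0, 0, 72, 36)) == (300.0, 300.0). This value describes how densely the pixels are placed on the page, which is a different quantity from the DPI metadata embedded in the image file that the Pixmap recipe reads.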

When convenient we will update the documentation accordingly and mention that xres / yres are useless values.

JorjMcKie added the "not a bug" and "wontfix" labels and removed the "upstream bug" label on Apr 16, 2025