Unable to get_text() - layer/clip nesting too deep #4403

Closed
C-Saunders opened this issue Mar 25, 2025 · 8 comments
Labels: enhancement-upstream to be implemented by MuPDF; fix developed release schedule to be determined; Fixed in next release

Comments

@C-Saunders

Description of the bug

When page.get_text() is called on the first (and only) page of the attached PDF, this error is raised: RuntimeError: code=5: layer/clip nesting too deep

How to reproduce the bug

subset.pdf

import pymupdf

doc = pymupdf.open("./subset.pdf")
doc[0].get_text()

I was able to view and copy/paste text from this page using a couple of different PDF viewers, so it seems like it's not totally malformed?

I saw the same issue on v1.25.3, but did not test other versions. I got the same result on Mac and Linux.
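
Until a fixed version is available, a caller could at least keep a batch job from aborting on such a page. The following is only a minimal sketch: `safe_get_text` is a hypothetical helper (not part of PyMuPDF), and it assumes the failure surfaces as a `RuntimeError` whose message contains "nesting too deep", as in the traceback quoted above.

```python
def safe_get_text(page):
    """Return page.get_text(), or None if MuPDF hits its nesting limit.

    Hypothetical helper: `page` is any object with a get_text() method,
    for example a pymupdf page. The message check is an assumption based
    on the error text reported in this issue.
    """
    try:
        return page.get_text()
    except RuntimeError as exc:
        if "nesting too deep" in str(exc):
            return None  # page could not be processed; caller decides what to do
        raise  # any other RuntimeError is not ours to swallow
```

A caller would then treat `None` as "extraction failed" rather than "page has no text", for example `text = safe_get_text(doc[0])`.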

PyMuPDF version

1.25.4

Operating system

MacOS

Python version

3.11

@JorjMcKie JorjMcKie added the enhancement-upstream to be implemented by MuPDF label Mar 25, 2025
@JorjMcKie
Collaborator

JorjMcKie commented Mar 25, 2025

This is not a restriction set by PyMuPDF, but by the base library, MuPDF.
I will forward this to their issue system and see what they have to say.

In any case, most viewers will at least exhibit an extended response time before showing the page.
The page's /Contents stream has an uncompressed size of 3.3 Megabytes (!) ... which says a lot.

@JorjMcKie
Collaborator

MuPDF issue filed under https://bugs.ghostscript.com/show_bug.cgi?id=708373

@JorjMcKie
Collaborator

I just looked up the nesting depth limit in MuPDF and found 1024 (!). It is probably not a big deal to increase this. But for a one-pager showing just a few words, this does look crazy ...

@JorjMcKie JorjMcKie self-assigned this Mar 25, 2025
@C-Saunders
Author

Thanks for the quick response!

For my specific use-case, I'm mostly interested in whether there is any text (ideally, whether there are any non-whitespace characters), rather than trying to get all the text on the page. Do you have any ideas/suggestions for a work-around that might give me that info while avoiding this issue?

@JorjMcKie
Collaborator

You can try this snippet. It reads the 3.3 MB page /Contents stream and looks up all the text objects. These are byte substrings wrapped between "BT" and "ET".
It then simply checks the length of each text object definition. It cannot know what text will actually be written (that is very hard, if not impossible), but the result is probably good enough for your case.

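import pymupdf

doc = pymupdf.open("subset.pdf")
page = doc[0]
cont = page.read_contents()  # concatenated, decompressed /Contents as bytes

p0 = 0  # current search position in the stream
i = 0  # running number of the text object
while True:
    p0 = cont.find(b"BT", p0)  # start of the next text object
    if p0 == -1:
        break
    p1 = cont.find(b"ET", p0)  # matching end of the text object
    if p1 == -1:  # should not occur ... but
        break
    # length of everything between "BT" and "ET"
    print(f"text object {i} length:", len(cont[p0 + 2 : p1]))
    p0 = p1 + 2  # continue searching after this "ET"
    i += 1
doc.close()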
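
For the narrower question of whether the page shows any text at all, the same idea can be pushed one step further by looking for text-showing operators (`Tj`, `TJ`, `'`, `"`) inside each BT/ET pair. This is only a rough sketch under stated assumptions: `count_text_objects_with_show_ops` is a hypothetical helper, the regular expressions are heuristics that can be fooled by "BT" or "ET" appearing inside string literals, and a hit still does not prove that the shown glyphs are non-whitespace.

```python
import re

# A text object is everything between the BT and ET operators.
# \b avoids matching "BT"/"ET" inside longer tokens (heuristic only).
TEXT_OBJECT = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)

# A text-showing operator follows its operand, which ends with
# ")" (literal string), ">" (hex string) or "]" (array for TJ).
SHOW_OP = re.compile(rb"[)\]>]\s*(?:Tj|TJ|'|\")")


def count_text_objects_with_show_ops(cont: bytes) -> int:
    """Count BT/ET blocks that contain at least one text-showing operator."""
    return sum(
        1 for m in TEXT_OBJECT.finditer(cont) if SHOW_OP.search(m.group(1))
    )
```

With the snippet above, this would be called as `count_text_objects_with_show_ops(page.read_contents())`; a result of zero suggests that the page's own content stream shows no text, although text inside referenced Form XObjects would not be seen by this check.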

@JorjMcKie JorjMcKie added the fix developed release schedule to be determined label Mar 26, 2025
@JorjMcKie
Collaborator

JorjMcKie commented Mar 26, 2025

The MuPDF team has rapidly developed a fix for the problem posed by such an "obscenely structured file".

Configuration info: Required MuPDF commit 0d48b70f9970c77d133d31a070f437cea70f067b.

@JorjMcKie
Collaborator

@C-Saunders I have tested PyMuPDF with the MuPDF fix included and can confirm that the issue will be resolved. Here is the output of a session:

>>> import pymupdf
>>> doc = pymupdf.open("subset.pdf")
>>> page = doc[0]
>>> text = page.get_text()
>>> print(text)
1. Do you view the clients behavior and/or usage as a problem?
2. Have you or any other family member attempted to intervene or address the
clients behavior and/or usage?
☐ YesNo
Why or Why Not?
3. Have you noticed any changes in the clients behavior?
4. Have there been any traumatic events in the family or in the client's life?
5. Are you willing to participate in the clients treatment?
APPEARANCE Disheveled/unkempt
AFFECT
Flat
MOOD
Appropriate
BEHAVIOR
Cooperative
ORIENTATION Person , Time , Place
INSIGHT
Fair
JUDGMENT
Mature
PROBLEMS
3Considerably (3)
MEDICAL
3Considerably (3)
EMPLOYMENT
1Slightly (1)
• Clinician attempted to reach collateral contact but was unsuccessful. Message was
left and clinician will attempt again at a later time.
SECTION 16: ASSESSMENT OF MENTAL STATUS DURING INTERVIEW
SECTION 17: LEVELS OF IMPAIRMENT / SEVERITY RATINGS
Rate Client's Level of Impairment & Severity using the following scale:
0Not at all
1Slightly
2Moderately
3Considerably
4Extremely

>>> pix = page.get_pixmap()
>>> pix.size
1454208
>>> pix.save("subset.png")

Despite this file's "craziness", text extraction runs with the usual swiftness and shows no noticeable degradation.

Pixmap creation, however, takes several seconds, which is expected given the long response times also experienced with PDF viewers.

@julian-smith-artifex-com
Collaborator

Fixed in PyMuPDF-1.25.5.
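
Code that has to run against several PyMuPDF versions could check whether the installed one is at least the release named above. The following is a minimal sketch: `version_tuple` and `has_nesting_fix` are hypothetical helpers, and the version string is assumed to start with dot-separated integers such as "1.25.5".

```python
import re

FIXED_IN = (1, 25, 5)  # release named in this thread as containing the fix


def version_tuple(version: str) -> tuple:
    """Turn a string like '1.25.5' into (1, 25, 5); extra parts are ignored."""
    return tuple(int(p) for p in re.findall(r"\d+", version)[:3])


def has_nesting_fix(version: str) -> bool:
    """True if the given PyMuPDF version is 1.25.5 or later."""
    return version_tuple(version) >= FIXED_IN
```

The version string itself would come from the installed package, for example `pymupdf.VersionBind` (assuming that constant is available in the installed release).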
