Unable to get_text() - layer/clip nesting too deep #4403

C-Saunders · 2025-03-25T13:03:41Z

Description of the bug

When page.get_text() is called on the first (and only) page of the attached PDF, this error is raised: RuntimeError: code=5: layer/clip nesting too deep

How to reproduce the bug

subset.pdf

import pymupdf

doc = pymupdf.open("./subset.pdf")
doc[0].get_text()

I was able to view and copy/paste text from this page using a couple of different PDF viewers, so it seems like it's not totally malformed?

I saw the same issue on v1.25.3, but did not test other versions. I got the same result on Mac and Linux.
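
A caller that has to process many files could treat this specific error as "text not extractable" instead of letting it abort the run. The following is only a sketch: the helper name `get_text_or_none` is made up for illustration, and it assumes the failure surfaces as a `RuntimeError` whose message contains the text quoted above.

```python
from typing import Optional

# Message fragment quoted in this report; assumed to be stable across builds.
NESTING_ERROR = "layer/clip nesting too deep"


def get_text_or_none(page) -> Optional[str]:
    """Return page.get_text(), or None if the nesting limit error is raised."""
    try:
        return page.get_text()
    except RuntimeError as exc:
        if NESTING_ERROR in str(exc):
            return None  # page could not be processed because of the limit
        raise  # unrelated failures should still surface
```

With the reproduction above, the call would be `get_text_or_none(doc[0])`; a `None` result then signals that the page hit the limit rather than that it is empty.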

PyMuPDF version

1.25.4

Operating system

MacOS

Python version

3.11


JorjMcKie · 2025-03-25T13:19:06Z

This is not a restriction set by PyMuPDF, but by the base library, MuPDF.
I will forward this to their issue system and see what they have to say.

In any case, most viewers will at least exhibit an extended response time before showing the page.
The page's /Contents stream has an uncompressed size of 3.3 Megabytes (!) ... which says a lot.

JorjMcKie · 2025-03-25T13:26:59Z

MuPDF issue filed under https://bugs.ghostscript.com/show_bug.cgi?id=708373

JorjMcKie · 2025-03-25T13:32:38Z

Just looked up the nesting level depth limit in MuPDF and found 1024 (!). Probably not a big deal to increase this. But for a one pager showing just a few words, this does look crazy ...

C-Saunders · 2025-03-25T14:25:47Z

Thanks for the quick response!

For my specific use-case, I'm mostly interested in whether there is any text (ideally, whether there are any non-whitespace characters), rather than trying to get all the text on the page. Do you have any ideas/suggestions for a work-around that might give me that info while avoiding this issue?

JorjMcKie · 2025-03-25T14:53:12Z

You can try this snippet. It reads the 3.3 MB page /Contents stream and looks up all the text objects. These are byte substrings wrapped between "BT" and "ET".
It then simply checks the length of each text object definition. It cannot know what is actually going to be written (that's very hard, if not impossible), but the result is probably convincing enough for your case.

import pymupdf

doc = pymupdf.open("subset.pdf")
page = doc[0]
cont = page.read_contents()

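p0 = 0  # current search position in the stream
i = 0  # running count of text objects found
while True:
    p0 = cont.find(b"BT", p0)  # start of the next text object
    if p0 == -1:
        break
    p1 = cont.find(b"ET", p0)  # end of that text object
    if p1 == -1:  # should not occur ... but
        break
    # length of everything between the "BT" and "ET" operators
    print(f"text object {i} length:", len(cont[p0 + 2 : p1]))
    p0 = p1 + 2  # continue searching after this "ET"
    i += 1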
doc.close()
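
A variation on the snippet above, in case a single number is more useful than a list of lengths. It is a heuristic sketch, not PyMuPDF functionality: the helper name `count_text_showing_objects` is hypothetical, it assumes that `BT` and `ET` are separated from their neighbours by whitespace, and it only looks for the `Tj` and `TJ` text-showing operators. It can still be fooled by string contents that happen to look like operators, and it cannot tell whether the shown glyphs are whitespace.

```python
import re

# Whole-token "BT ... ET" spans; re.S lets "." cross line breaks.
_TEXT_OBJECT = re.compile(rb"(?<!\S)BT(?!\S)(.*?)(?<!\S)ET(?!\S)", re.S)
# Tj / TJ directly after a string ")" or ">", an array "]", or whitespace.
_SHOW_OP = re.compile(rb"[)\]>\s](?:Tj|TJ)(?!\S)")


def count_text_showing_objects(cont: bytes) -> int:
    """Count BT/ET text objects that contain at least one Tj or TJ operator."""
    return sum(
        1 for m in _TEXT_OBJECT.finditer(cont) if _SHOW_OP.search(m.group(1))
    )
```

With the page from the snippet above, the call would be `count_text_showing_objects(page.read_contents())`; a result above zero suggests that the page draws at least some text.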

JorjMcKie · 2025-03-26T14:55:05Z

The MuPDF team has rapidly developed a fix for the problem posed by such an "obscenely structured file".

Configuration info: Required MuPDF commit 0d48b70f9970c77d133d31a070f437cea70f067b.

JorjMcKie · 2025-03-27T10:45:41Z

@C-Saunders I have tested PyMuPDF with the MuPDF fix included and can confirm that the issue will be resolved. Here is the output of a session:

import pymupdf
doc=pymupdf.open("subset.pdf")
page=doc[0]
text=page.get_text()
print(text)
1. Do you view the client’s behavior and/or usage as a problem?
2. Have you or any other family member attempted to intervene or address the
client’s behavior and/or usage?
☐ Yes ☐ No
Why or Why Not?
3. Have you noticed any changes in the client’s behavior?
4. Have there been any traumatic events in the family or in the client's life?
5. Are you willing to participate in the client’s treatment?
APPEARANCE Disheveled/unkempt
AFFECT
Flat
MOOD
Appropriate
BEHAVIOR
Cooperative
ORIENTATION Person , Time , Place
INSIGHT
Fair
JUDGMENT
Mature
PROBLEMS
3 – Considerably (3)
MEDICAL
3 – Considerably (3)
EMPLOYMENT
1 – Slightly (1)
• Clinician attempted to reach collateral contact but was unsuccessful. Message was
left and clinician will attempt again at a later time.
SECTION 16: ASSESSMENT OF MENTAL STATUS DURING INTERVIEW
SECTION 17: LEVELS OF IMPAIRMENT / SEVERITY RATINGS
Rate Client's Level of Impairment & Severity using the following scale:
0 – Not at all
1 – Slightly
2 – Moderately
3 – Considerably
4 – Extremely

pix=page.get_pixmap()
pix.size
1454208
pix.save("subset.png")

Despite this file's "craziness", text extraction runs with the usual swiftness and shows no noticeable degradation.

Pixmap creation, however, takes several seconds, which is expected given the long response times also experienced with PDF viewers.

julian-smith-artifex-com · 2025-03-31T23:27:10Z

Fixed in PyMuPDF-1.25.5.
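
For code that has to run against whatever PyMuPDF happens to be installed, a version check along these lines could decide whether `get_text()` can be expected to work on such files. The helper names are hypothetical and the parsing is deliberately naive; the only facts taken from this thread are the 1.25.5 release number and the failures seen on 1.25.3 and 1.25.4.

```python
import re
from importlib import metadata

FIXED_IN = (1, 25, 5)  # release named in this thread as containing the fix


def version_tuple(version: str) -> tuple[int, ...]:
    """Turn '1.25.4' into (1, 25, 4); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        m = re.match(r"\d+", piece)
        if m is None:
            break
        parts.append(int(m.group()))
    return tuple(parts)


def has_nesting_fix(version: str) -> bool:
    """True if the given PyMuPDF version is at least the fixed release."""
    return version_tuple(version) >= FIXED_IN


def installed_pymupdf_has_fix() -> bool:
    """Check the installed distribution; raises PackageNotFoundError if absent."""
    return has_nesting_fix(metadata.version("PyMuPDF"))
```

Pre-release suffixes are ignored by this parsing, so a string such as "1.25.5rc1" would be treated like 1.25.5.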

JorjMcKie added the enhancement-upstream to be implemented by MuPDF label Mar 25, 2025

JorjMcKie self-assigned this Mar 25, 2025

JorjMcKie added the fix developed release schedule to be determined label Mar 26, 2025

julian-smith-artifex-com added the Fixed in next release label Mar 31, 2025

julian-smith-artifex-com closed this as completed Mar 31, 2025
