Unable to get_text() - layer/clip nesting too deep #4403
Comments
This is not a restriction set by PyMuPDF, but by the base library, MuPDF. In any case, most viewers will at least exhibit an extended response time before showing the page.
MuPDF issue filed under https://bugs.ghostscript.com/show_bug.cgi?id=708373
Just looked up the nesting level depth limit in MuPDF and found 1024 (!). Probably not a big deal to increase this. But for a one pager showing just a few words, this does look crazy ...
Thanks for the quick response! For my specific use-case, I'm mostly interested in whether there is any text (ideally, whether there are any non-whitespace characters), rather than trying to get all the text on the page. Do you have any ideas/suggestions for a work-around that might give me that info while avoiding this issue?
You can try this snippet. It reads the 3.3 MB page contents and scans them for text objects, i.e. the spans between a BT and an ET operator:
import pymupdf

doc = pymupdf.open("subset.pdf")
page = doc[0]
cont = page.read_contents()  # the page's content stream(s) as bytes
p0 = 0
i = 0
while True:
    p0 = cont.find(b"BT", p0)  # start of the next text object
    if p0 == -1:
        break
    p1 = cont.find(b"ET", p0)  # end of that text object
    if p1 == -1:  # should not occur ... but
        break
    print(f"text object {i} length:", len(cont[p0 + 2 : p1]))
    p0 = p1 + 2
    i += 1
doc.close()
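If all you need is a yes/no answer, the same BT/ET scan can be turned into a small predicate. This is only a rough sketch, not part of PyMuPDF: the names `TEXT_OBJECT`, `SHOW_OP`, `text_objects` and `may_contain_text` are made up for illustration. It assumes the bytes come from something like `page.read_contents()`, it only looks for the `Tj` / `TJ` text-showing operators (the rarer `'` and `"` operators are ignored), and it does not decode the strings, so it cannot tell whitespace-only text from real text. Text drawn via Form XObjects is not seen either, and binary data in the stream (e.g. inline images) could produce false matches.

```python
import re

# A BT ... ET pair where both operators stand alone as tokens,
# i.e. are delimited by whitespace or the stream boundaries.
TEXT_OBJECT = re.compile(rb"(?:(?<=\s)|^)BT(?=\s)(.*?)(?<=\s)ET(?=\s|$)", re.DOTALL)

# Text-showing operators Tj and TJ (the ' and " operators are ignored here).
SHOW_OP = re.compile(rb"T[jJ](?=\s|$)")


def text_objects(cont: bytes):
    """Yield the raw bytes between each BT/ET operator pair."""
    for match in TEXT_OBJECT.finditer(cont):
        yield match.group(1)


def may_contain_text(cont: bytes) -> bool:
    """Heuristic: True if any text object contains a Tj or TJ operator."""
    return any(SHOW_OP.search(body) for body in text_objects(cont))


# Tiny hand-written content streams for demonstration
with_text = b"q\nBT\n/F1 12 Tf\n(Hello) Tj\nET\nQ\n"
only_clip = b"q\n0 0 10 10 re\nW n\nQ\n"
empty_text_object = b"BT\n/F1 12 Tf\nET\n"

print(may_contain_text(with_text))
print(may_contain_text(only_clip))
print(may_contain_text(empty_text_object))
```

With a real document you would pass the page's content bytes (for example `may_contain_text(page.read_contents())`) instead of the hand-written samples.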
The MuPDF team has rapidly developed a fix for the problem caused by such an "obscenely structured file". Configuration info: requires MuPDF commit 0d48b70f9970c77d133d31a070f437cea70f067b.
@C-Saunders I have tested PyMuPDF with the MuPDF fix included and can confirm that the issue is resolved. Here is the output of a session:
import pymupdf
doc=pymupdf.open("subset.pdf")
page=doc[0]
text=page.get_text()
print(text)
1. Do you view the client’s behavior and/or usage as a problem?
2. Have you or any other family member attempted to intervene or address the
client’s behavior and/or usage?
☐ Yes ☐ No
Why or Why Not?
3. Have you noticed any changes in the client’s behavior?
4. Have there been any traumatic events in the family or in the client's life?
5. Are you willing to participate in the client’s treatment?
APPEARANCE Disheveled/unkempt
AFFECT
Flat
MOOD
Appropriate
BEHAVIOR
Cooperative
ORIENTATION Person , Time , Place
INSIGHT
Fair
JUDGMENT
Mature
PROBLEMS
3 – Considerably (3)
MEDICAL
3 – Considerably (3)
EMPLOYMENT
1 – Slightly (1)
• Clinician attempted to reach collateral contact but was unsuccessful. Message was
left and clinician will attempt again at a later time.
SECTION 16: ASSESSMENT OF MENTAL STATUS DURING INTERVIEW
SECTION 17: LEVELS OF IMPAIRMENT / SEVERITY RATINGS
Rate Client's Level of Impairment & Severity using the following scale:
0 – Not at all
1 – Slightly
2 – Moderately
3 – Considerably
4 – Extremely
pix=page.get_pixmap()
pix.size
1454208
pix.save("subset.png")
Despite this file's "craziness", text extraction runs with the usual swiftness and shows no noticeable degradation. Pixmap creation, however, takes several seconds, which is expected given the long response times that PDF viewers also show for this page.
Fixed in PyMuPDF-1.25.5.
Description of the bug
When page.get_text() is called on the first (and only) page of the attached PDF, this error is raised:
RuntimeError: code=5: layer/clip nesting too deep
How to reproduce the bug
subset.pdf
I was able to view and copy/paste text from this page using a couple of different PDF viewers, so it seems like it's not totally malformed?
I saw the same issue on v1.25.3, but did not test other versions. I got the same result on Mac and Linux.
PyMuPDF version
1.25.4
Operating system
MacOS
Python version
3.11