[Bug]: deskew results in "empty" output file #1438

hatl opened this issue Nov 29, 2024 · 2 comments

hatl commented Nov 29, 2024

Describe the bug

When running ocrmypdf on some files (all cropped in PDF Arranger), the output is broken.
The issue only occurs when adding the deskew flag.

I ran into this issue using the latest version of paperless-ngx.
see also: paperless-ngx/paperless-ngx#8375

Steps to reproduce

1. Run `ocrmypdf --jobs 6 -l deu --output-type pdf --rotate-pages --rotate-pages-threshold 12.0 --skip-text --clean --deskew scan0003.pdf output.pdf`
2. The output file is broken.

Removing `--deskew` from the command produces a correct output file.

Files

scan0003.pdf
output.pdf

How did you download and install the software?

Ubuntu snap

OCRmyPDF version

16.6.3.dev8+gfe89be5d

Relevant log output

No response

jbarlow83 (Collaborator) commented

This appears to be related to a general issue with cropped page boxes in ocrmypdf.
The rather unusual mediabox for the input file has these dimensions
0 449.835616 249.069767 842

where the second number (the lower-left y coordinate) is typically 0.
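
To make the page-box observation concrete, here is a minimal Python sketch. It assumes the four numbers follow the usual PDF rectangle order (lower-left x, lower-left y, upper-right x, upper-right y); the helper name `describe_box` is hypothetical and not part of ocrmypdf.

```python
def describe_box(box):
    """Return the origin offset and size of a PDF page box [llx, lly, urx, ury]."""
    llx, lly, urx, ury = (float(v) for v in box)
    return {
        "origin": (llx, lly),
        "width": urx - llx,
        "height": ury - lly,
        # A box produced without cropping usually starts at (0, 0).
        "origin_is_zero": llx == 0 and lly == 0,
    }

# MediaBox values quoted in this comment.
info = describe_box("0 449.835616 249.069767 842".split())
print(info)
```

For this file the lower-left corner sits at y = 449.835616 rather than 0, so any code that treats the box as starting at the origin would look at the wrong region of the page.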

jbarlow83 added the bug label and removed the triage label (Issue needs triage) on Nov 29, 2024

dennyh commented Apr 13, 2025

Sharing my workaround:
The input PDF (problematic.pdf) was rotated with pdftk, cropped with BRISS, and joined with pdftk.

$ pdfinfo -box problematic.pdf

Creator:         pdftk-java 3.3.3
Custom Metadata: no
Metadata Stream: no
Tagged:          no
UserProperties:  no
Suspects:        no
Form:            none
JavaScript:      no
Pages:           41
Encrypted:       no
Page size:       624 x 419 pts
Page rot:        90
MediaBox:           19.00   584.00   643.00  1003.00
CropBox:            19.00   584.00   643.00  1003.00
BleedBox:           19.00   584.00   643.00  1003.00
TrimBox:            19.00   584.00   643.00  1003.00
ArtBox:             19.00   584.00   643.00  1003.00
File size:       8419541 bytes
Optimized:       no
PDF version:     1.4
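
The box lines in this output can be checked mechanically. Below is a minimal Python sketch, assuming the `pdfinfo -box` line format shown above (a label ending in `Box`, a colon, then four numbers); `parse_boxes` is a hypothetical helper, and the sample text is an excerpt of the output above.

```python
import re

SAMPLE = """\
Page size:       624 x 419 pts
Page rot:        90
MediaBox:           19.00   584.00   643.00  1003.00
CropBox:            19.00   584.00   643.00  1003.00
"""

# Label ending in "Box", a colon, then four numbers.
BOX_LINE = re.compile(
    r"^(\w+Box):\s+([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)\s*$"
)

def parse_boxes(text):
    """Map each box name to its (llx, lly, urx, ury) tuple."""
    boxes = {}
    for line in text.splitlines():
        m = BOX_LINE.match(line)
        if m:
            boxes[m.group(1)] = tuple(float(g) for g in m.groups()[1:])
    return boxes

boxes = parse_boxes(SAMPLE)
for name, (llx, lly, urx, ury) in boxes.items():
    offset = (llx, lly) != (0.0, 0.0)
    print(name, "size:", urx - llx, "x", ury - lly, "offset origin:", offset)
```

The computed size (624 x 419) matches the reported page size, and every box starts at (19, 584) instead of the origin.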

Deskewing it like so:
$ ocrmypdf -d problematic.pdf bad_deskewed.pdf
produces pages that are cut in half or sometimes empty. Where text is partially visible, it does look deskewed.

Solution: convert problematic.pdf to PDF/A with Ghostscript:
$ gs -dPDFA=1 -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=not_problematic.pdfa -dPDFACompatibilityPolicy=1 problematic.pdf

Then feed it to ocrmypdf:
$ ocrmypdf -d not_problematic.pdfa good_deskewed.pdf
This produces a good result.
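
The two steps can be scripted. This is a rough Python sketch that only reuses the exact flags from the commands above; the function names are hypothetical, and it assumes `gs` and `ocrmypdf` are on the PATH when `run_workaround` is actually called.

```python
import shlex
import subprocess

def pdfa_command(src, dst):
    """Ghostscript invocation copied from the workaround above."""
    return [
        "gs", "-dPDFA=1", "-dBATCH", "-dNOPAUSE", "-sDEVICE=pdfwrite",
        f"-sOutputFile={dst}", "-dPDFACompatibilityPolicy=1", src,
    ]

def deskew_command(src, dst):
    """ocrmypdf invocation with the deskew flag, as used above."""
    return ["ocrmypdf", "-d", src, dst]

def run_workaround(src, intermediate, dst):
    """Run both steps in order, stopping at the first failure."""
    for cmd in (pdfa_command(src, intermediate), deskew_command(intermediate, dst)):
        subprocess.run(cmd, check=True)

# Print the commands instead of running them.
print(shlex.join(pdfa_command("problematic.pdf", "not_problematic.pdfa")))
print(shlex.join(deskew_command("not_problematic.pdfa", "good_deskewed.pdf")))
```

Calling `run_workaround("problematic.pdf", "not_problematic.pdfa", "good_deskewed.pdf")` would execute both steps with the file names used in this comment.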
