[3rdparty]: Paperless-ngx fails on consuming a file #1495

Open
3 tasks done
GooRoo opened this issue Mar 17, 2025 · 3 comments
Labels: triage (Issue needs triage)

Comments

@GooRoo

GooRoo commented Mar 17, 2025

Simple sanity checks

  • This is an issue with an app that uses OCRmyPDF for OCR
  • I am using a recent version of the third party app
  • I will include a file that reproduces the issuse

Third party app name and version

Paperless-ngx 2.14.7

Describe the bug

Paperless can't consume a file.

Steps to reproduce

1. Import attached file into Paperless-ngx.
2. OCR is automatically triggered.
3. The process fails with the following errors in the log.

Files

o451229v21_160992A98S_202401.pdf

OCRmyPDF version

No response

Relevant log output

[2025-03-17 23:01:37,509] [ERROR] [paperless.consumer] Error occurred while consuming document o451229v21_160992A98S_202401.pdf: SubprocessOutputError: Ghostscript PDF/A rendering failed. See logs for more information.
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_exec/ghostscript.py", line 288, in generate_pdfa
    p = run_polling_stderr(
        ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/subprocess/__init__.py", line 114, in run_polling_stderr
    raise CalledProcessError(proc.returncode, args, output=None, stderr=stderr)
subprocess.CalledProcessError: Command '['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=RGB', '-dPDFSTOPONERROR', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '/tmp/ocrmypdf.io.p49cqgey/pdfa.pdf', '-sstdout=%stderr', '/tmp/ocrmypdf.io.p49cqgey/pdfa.ps', '/tmp/ocrmypdf.io.p49cqgey/fix_docinfo.pdf']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 382, in parse
    ocrmypdf.ocr(**args)
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/api.py", line 380, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 214, in run_pipeline
    return _run_pipeline(options, plugin_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 181, in _run_pipeline
    optimize_messages = exec_concurrent(context, executor)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 145, in exec_concurrent
    pdf, messages = postprocess(pdf, context, executor)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py", line 453, in postprocess
    pdf_out = convert_to_pdfa(pdf_out, ps_stub_out, context)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 912, in convert_to_pdfa
    context.plugin_manager.hook.generate_pdfa(
  File "/usr/local/lib/python3.12/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pluggy/_callers.py", line 139, in _multicall
    raise exception.with_traceback(exception.__traceback__)
  File "/usr/local/lib/python3.12/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/builtin_plugins/ghostscript.py", line 131, in generate_pdfa
    ghostscript.generate_pdfa(
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_exec/ghostscript.py", line 301, in generate_pdfa
    raise SubprocessOutputError('Ghostscript PDF/A rendering failed') from e
ocrmypdf.exceptions.SubprocessOutputError: Ghostscript PDF/A rendering failed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/asgiref/sync.py", line 327, in main_wrap
    raise exc_info[1]
  File "/usr/src/paperless/src/documents/consumer.py", line 477, in run
    document_parser.parse(self.working_copy, mime_type, self.filename)
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 405, in parse
    raise ParseError(
documents.parsers.ParseError: SubprocessOutputError: Ghostscript PDF/A rendering failed. See logs for more information.
[2025-03-17 23:01:37,560] [ERROR] [paperless.tasks] ConsumeTaskPlugin failed: o451229v21_160992A98S_202401.pdf: Error occurred while consuming document o451229v21_160992A98S_202401.pdf: SubprocessOutputError: Ghostscript PDF/A rendering failed. See logs for more information.
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_exec/ghostscript.py", line 288, in generate_pdfa
    p = run_polling_stderr(
        ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/subprocess/__init__.py", line 114, in run_polling_stderr
    raise CalledProcessError(proc.returncode, args, output=None, stderr=stderr)
subprocess.CalledProcessError: Command '['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=RGB', '-dPDFSTOPONERROR', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '/tmp/ocrmypdf.io.p49cqgey/pdfa.pdf', '-sstdout=%stderr', '/tmp/ocrmypdf.io.p49cqgey/pdfa.ps', '/tmp/ocrmypdf.io.p49cqgey/fix_docinfo.pdf']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 382, in parse
    ocrmypdf.ocr(**args)
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/api.py", line 380, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 214, in run_pipeline
    return _run_pipeline(options, plugin_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 181, in _run_pipeline
    optimize_messages = exec_concurrent(context, executor)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 145, in exec_concurrent
    pdf, messages = postprocess(pdf, context, executor)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py", line 453, in postprocess
    pdf_out = convert_to_pdfa(pdf_out, ps_stub_out, context)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 912, in convert_to_pdfa
    context.plugin_manager.hook.generate_pdfa(
  File "/usr/local/lib/python3.12/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pluggy/_callers.py", line 139, in _multicall
    raise exception.with_traceback(exception.__traceback__)
  File "/usr/local/lib/python3.12/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/builtin_plugins/ghostscript.py", line 131, in generate_pdfa
    ghostscript.generate_pdfa(
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_exec/ghostscript.py", line 301, in generate_pdfa
    raise SubprocessOutputError('Ghostscript PDF/A rendering failed') from e
ocrmypdf.exceptions.SubprocessOutputError: Ghostscript PDF/A rendering failed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/asgiref/sync.py", line 327, in main_wrap
    raise exc_info[1]
  File "/usr/src/paperless/src/documents/consumer.py", line 477, in run
    document_parser.parse(self.working_copy, mime_type, self.filename)
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 405, in parse
    raise ParseError(
documents.parsers.ParseError: SubprocessOutputError: Ghostscript PDF/A rendering failed. See logs for more information.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/paperless/src/documents/tasks.py", line 154, in consume_file
    msg = plugin.run()
          ^^^^^^^^^^^^
  File "/usr/src/paperless/src/documents/consumer.py", line 509, in run
    self._fail(
  File "/usr/src/paperless/src/documents/consumer.py", line 151, in _fail
    raise ConsumerError(f"{self.filename}: {log_message or message}") from exception
documents.consumer.ConsumerError: o451229v21_160992A98S_202401.pdf: Error occurred while consuming document o451229v21_160992A98S_202401.pdf: SubprocessOutputError: Ghostscript PDF/A rendering failed. See logs for more information.
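
The traceback above records the exact `gs` command that failed but not Ghostscript's own error output. One way to get at that output is to re-run the command by hand. The following is a minimal sketch, assuming the log line has the `CalledProcessError` format shown above. `extract_command` is a hypothetical helper, not part of OCRmyPDF or Paperless-ngx, and the sample line is shortened with placeholder paths.

```python
import ast
import re
import shlex


def extract_command(log_line: str) -> list[str]:
    """Return the argv list embedded in a 'CalledProcessError: Command ...' log line."""
    # The logged command is the repr() of a Python list, wrapped in single quotes.
    match = re.search(r"Command '(\[.*\])' returned non-zero exit status", log_line)
    if match is None:
        raise ValueError("no CalledProcessError command found in this line")
    # literal_eval parses the list safely without executing anything.
    return ast.literal_eval(match.group(1))


# Shortened sample line; the paths here are placeholders, not the ones from the report.
line = (
    "subprocess.CalledProcessError: Command '['gs', '-dBATCH', '-dNOPAUSE', "
    "'-dPDFA=2', '-o', '/tmp/out.pdf', '/tmp/in.pdf']' returned non-zero exit status 1."
)

argv = extract_command(line)
# shlex.join produces a string that can be pasted into a shell.
print(shlex.join(argv))
```

The temporary `/tmp/ocrmypdf.io.…` paths in the logged command have most likely been cleaned up by the time the log is read, so the input and output paths would need to be replaced with files that exist before re-running.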
GooRoo added the triage (Issue needs triage) label on Mar 17, 2025
@dsteinborn

I'm getting exactly the same error with an invoice file I tried to upload to Paperless-ngx.

@kernie

kernie commented Apr 4, 2025

@GooRoo Which OCR mode did you use (skip, redo, force)?

I used your file in my Paperless v2.14.7 instance in skip mode and the log was full of

[2025-04-04 12:54:33,868] [ERROR] [ocrmypdf.optimize] xref 7147: While extracting this image, an error occurred
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/optimize.py", line 334, in extract_images
    result = extract_fn(
             ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/optimize.py", line 224, in extract_image_generic
    elif not pim.indexed and pim.colorspace in pim.SIMPLE_COLORSPACES:
                             ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pikepdf/models/image.py", line 211, in colorspace
    raise NotImplementedError(
NotImplementedError: not sure how to get colorspace: ['/Separation', '/Black', '/DeviceRGB', pikepdf.Dictionary({
  "/C0": [ 1, 1, 1 ],
  "/C1": [ Decimal('0.136691'), Decimal('0.121947'), Decimal('0.125305') ],
  "/Domain": [ 0, 1 ],
  "/FunctionType": 2,
  "/N": 1,
  "/Range": [ 0, 1, 0, 1, 0, 1 ]
})]

but the document was finally consumed and usable.
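
For context on what that colorspace describes: the array in the error is a `/Separation` space for a colorant named `/Black`, with `/DeviceRGB` as the alternate space and a Type 2 (exponential interpolation) tint transform. The sketch below evaluates that transform using the `/C0`, `/C1` and `/N` values copied from the log. `tint_to_rgb` is a hypothetical name used only for illustration, and this is not how pikepdf or OCRmyPDF handle the image.

```python
def tint_to_rgb(
    tint: float,
    c0: tuple[float, ...] = (1.0, 1.0, 1.0),                 # /C0 from the log
    c1: tuple[float, ...] = (0.136691, 0.121947, 0.125305),  # /C1 from the log
    n: float = 1.0,                                          # /N from the log
) -> tuple[float, ...]:
    """Evaluate a PDF Type 2 function: y_j = C0_j + x**N * (C1_j - C0_j)."""
    # Clip the input to /Domain [0, 1].
    x = min(max(tint, 0.0), 1.0)
    values = (a + (x ** n) * (b - a) for a, b in zip(c0, c1))
    # Clip each output component to /Range [0, 1].
    return tuple(min(max(v, 0.0), 1.0) for v in values)


# Tint 0 means no ink (white); tint 1 means full ink (the near-black in /C1).
for tint in (0.0, 0.5, 1.0):
    print(tint, tint_to_rgb(tint))
```

With `/N` equal to 1 the transform is a straight linear blend from white to a near-black RGB value, so the images in question are effectively single-channel black artwork expressed as a spot color.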

@GooRoo
Author

GooRoo commented Apr 4, 2025

@kernie I haven't changed this setting, and I believe its default value is skip.
