ocrd-cis-ocropy-recognize: 'ascii' codec can't decode byte 0xa9 #41

Closed
jbarth-ubhd opened this issue Apr 2, 2020 · 3 comments · Fixed by #47

@jbarth-ubhd

models:

> find . -name *.pyrnn|xargs md5sum
bb90b17321987002afa6b94e650d16fa  ./venv/lib/python3.6/site-packages/ocrd_cis/ocropy/models/fraktur.pyrnn
ef3238cd60cb1c35ede74573c8d14766  ./venv/lib/python3.6/site-packages/ocrd_cis/ocropy/models/fraktur-jze.pyrnn

file: https://digi.ub.uni-heidelberg.de/diglitData/jb/ocropy-test.jpg

command:

> ocrd-make -f crop-anyocr-binarize-page-olena-sauvola-denoise-ocropy-deskew-page-ocropy-segment-tesseract-ocropy-dewarp-ocr-ocropy-tesseract.`mk 
make: Entering directory '/home/jb/workspace/ocrd/ocrd4dwork'
building OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP from OCR-D-SEG-LINE-tesseract-ocropy-DEWARP with pattern rule for ocrd-cis-ocropy-recognize
ocrd workspace remove-group -r OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP 2>/dev/null || true
ocrd-cis-ocropy-recognize -I OCR-D-SEG-LINE-tesseract-ocropy-DEWARP -O OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP -p OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP.json 2>&1 | tee OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP.log && touch -c OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP || { rm -fr OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP.json OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP; exit 1; }
16:39:06.634 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-SEG-LINE-tesseract-ocropy-DEWARP'] output_file_grp=['OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP']
Traceback (most recent call last):
  File "/home/jb/ocrd_all/venv/bin/ocrd-cis-ocropy-recognize", line 8, in <module>
    sys.exit(ocrd_cis_ocropy_recognize())
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd_cis/ocropy/cli.py", line 49, in ocrd_cis_ocropy_recognize
    return ocrd_cli_wrap_processor(OcropyRecognize, *args, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd/decorators.py", line 54, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd/processor/base.py", line 57, in run_processor
    processor.process()
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd_cis/ocropy/recognize.py", line 134, in process
    self.network = load_object(self.get_model(), verbose=1)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd_cis/ocropy/ocrolib/common.py", line 459, in load_object
    return unpickler.load()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa9 in position 0: ordinal not in range(128)
Makefile:304: recipe for target 'OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP' failed
make: *** [OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP] Error 1
make: Leaving directory '/home/jb/workspace/ocrd/ocrd4dwork'
@bertsky
Collaborator

bertsky commented Apr 2, 2020

Thanks for reporting!

I believe this is an artifact of incomplete Python 2 to 3 porting. You can avoid it by leaving the model file in gzip-compressed form (with .gz extension).

IMO, the uncompressed case needs to use the same latin1 encoding as the compressed one.
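
For illustration, here is a minimal sketch of that idea, not the actual ocrolib code. It assumes the model is a Python 2 pickle containing 8-bit strings; `load_legacy_pickle` and `default_load_fails` are hypothetical helper names, and `LEGACY_PICKLE` is a hand-built stand-in for a real model file. Python 3's unpickler decodes such strings as ASCII by default, which is what fails on byte 0xa9, whereas `encoding="latin1"` maps every byte to a character.

```python
import gzip
import pickle

# Hand-built protocol-2 pickle of the one-byte Python 2 str '\xa9'
# (PROTO 2, SHORT_BINSTRING of length 1, STOP). Illustrative data only.
LEGACY_PICKLE = b"\x80\x02U\x01\xa9."


def load_legacy_pickle(path, encoding="latin1"):
    """Load a Python 2 pickle from a plain or gzip-compressed file.

    Hypothetical helper: the same encoding is passed to the unpickler
    in both the compressed and the uncompressed case.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        return pickle.Unpickler(stream, encoding=encoding).load()


def default_load_fails(data):
    """Return True if unpickling with the default (ASCII) encoding fails."""
    try:
        pickle.loads(data)
    except UnicodeDecodeError:
        return True
    return False
```

With this sketch, `default_load_fails(LEGACY_PICKLE)` reproduces the kind of error in the traceback, while `load_legacy_pickle` reads the same bytes from either a `.pyrnn` or a `.pyrnn.gz` file.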

@jbarth-ubhd
Copy link
Author

Tried it, but then ocrd-cis-ocropy-recognize does not find the *.pyrnn.gz file.

@bertsky
Copy link
Collaborator

bertsky commented Apr 3, 2020

That's odd. Relative paths should be searched:

  1. in __file__'s directory, e.g. venv/lib/python3.6/site-packages/ocrd_cis/ocropy
  2. in __file__'s models subdirectory, e.g. venv/lib/python3.6/site-packages/ocrd_cis/ocropy/models
  3. in any of the directories mentioned in ocrolib.ocropus_find_file:
     Result of searching $fname is the first existing in:
    
         * $base/$fname
         * $base/$fname.gz
         * $base/model/$fname
         * $base/model/$fname.gz
         * $base/data/$fname
         * $base/data/$fname.gz
         * $base/gui/$fname
         * $base/gui/$fname.gz   # if gz
    
     $base can be any of these base paths:
         * `$OCROPUS_DATA` environment variable
         * current working directory
         * ../../../../share/ocropus from this file's install location
         * `/usr/local/share/ocropus`
         * `$PREFIX/share/ocropus` ($PREFIX being the Python installation
            prefix, usually `/usr`)
    

Option 3 probably won't help you, because in the processor's context the CWD is the OCR-D workspace directory, and you probably never installed ocropus itself.

So you should stick with 1 or 2, in the .gz form (until we have patched the uncompressed case).

Perhaps you forgot to also add the .gz suffix in the makefile/parameter file?
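
To check which path would actually be picked up, the search order listed above can be sketched roughly as follows. This is a simplified illustration, not the ocrd_cis or ocrolib code: `candidate_model_paths` and `find_model` are hypothetical names, the base directories are passed in by the caller, and it assumes that locations 1 and 2 use the model name exactly as given (so a compressed model there is only found when the name already ends in `.gz`).

```python
import os


def candidate_model_paths(fname, module_dir, bases, gz=True):
    """Yield candidate paths in the order described in the comment above.

    Hypothetical helper; module_dir stands for __file__'s directory and
    bases for the list of $base directories.
    """
    # 1. next to the module, 2. in its models subdirectory
    #    (assumption: the name is used as given, no .gz is appended)
    yield os.path.join(module_dir, fname)
    yield os.path.join(module_dir, "models", fname)
    # 3. under each base path, plain name first, then the .gz variant
    for base in bases:
        for sub in ("", "model", "data", "gui"):
            yield os.path.join(base, sub, fname)
            if gz:
                yield os.path.join(base, sub, fname + ".gz")


def find_model(fname, module_dir, bases=()):
    """Return the first existing candidate path, or None if none exists."""
    for path in candidate_model_paths(fname, module_dir, bases):
        if os.path.exists(path):
            return path
    return None
```

Under these assumptions, a model stored as `models/fraktur.pyrnn.gz` is found for the name `fraktur.pyrnn.gz` but not for `fraktur.pyrnn`, which is why the suffix in the parameter file matters.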
