Cannot get Tessdata with Tesseract-OCR 5 #3767

Closed
rezemika opened this issue Aug 10, 2024 · 5 comments
Labels
bug, fix developed, release schedule to be determined

Comments

@rezemika

rezemika commented Aug 10, 2024

Description of the bug

The pymupdf.get_tessdata() function raises an unexpected error when the installed version of Tesseract OCR is not 4.0 (tested on the latest Debian, with Tesseract 5).

>>> import pymupdf
>>> pymupdf.get_tessdata()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "<...>/venv/lib/python3.11/site-packages/pymupdf/__init__.py", line 18082, in get_tessdata
    for sub_response in response.iterdir():
                        ^^^^^^^^^^^^^^^^
AttributeError: 'list' object has no attribute 'iterdir'

>>> pymupdf.version
('1.24.9', '1.24.8', '20240724000001')

How to reproduce the bug

I haven't looked into the details yet, but I think the problem lies here:

PyMuPDF/src/__init__.py

Lines 18093 to 18099 in eca7066

# determine tessdata via iteration over subfolders
tessdata = None
for sub_response in response.iterdir():
    for sub_sub in sub_response.iterdir():
        if str(sub_sub).endswith("tessdata"):
            tessdata = sub_sub
            break

I have the latest Debian with Tesseract OCR 5.3.0, installed in /usr/share/tesseract-ocr/5/tessdata/.
The function get_tessdata() expects it in /usr/share/tesseract-ocr/4.00/tessdata; otherwise it searches for it with whereis tesseract-ocr.

However, it then calls iterdir() on the subprocess response, which is a list of bytes objects rather than a path, and that raises the error.

>>> import subprocess
>>> cp = subprocess.run('whereis tesseract-ocr', shell=1, capture_output=1, check=0)
>>> cp
CompletedProcess(args='whereis tesseract-ocr', returncode=0, stdout=b'tesseract-ocr: /usr/share/tesseract-ocr\n', stderr=b'')
>>> response = cp.stdout.strip().split()
>>> response
[b'tesseract-ocr:', b'/usr/share/tesseract-ocr']
>>> type(response), type(response[0])
(<class 'list'>, <class 'bytes'>)
>>> 
>>> response.iterdir()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'list' object has no attribute 'iterdir'

I don't quite know the inner workings of Tesseract or PyMuPDF, but it seems that this function is looking for a sub-sub-folder whose name ends with tessdata, and should find it under the path in the second element of response. So I guess something like this should work?

import pathlib
import subprocess

# whereis prints "tesseract-ocr: <path>", so the path is the second token
cp = subprocess.run('whereis tesseract-ocr', shell=1, capture_output=1, check=0)
response = cp.stdout.strip().split()
response_dir = pathlib.Path(response[1].decode("utf-8"))
# response_dir == PosixPath('/usr/share/tesseract-ocr')
tessdata = None
for sub_dir in response_dir.iterdir():
    for sub_sub_dir in sub_dir.iterdir():
        if sub_sub_dir.name.endswith("tessdata"):
            tessdata = str(sub_sub_dir)
            break
# tessdata == '/usr/share/tesseract-ocr/5/tessdata'
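
A slightly more defensive variant of the same idea, offered only as a sketch and not as the actual PyMuPDF fix. The helper names parse_whereis, find_tessdata and locate_tessdata are hypothetical. It assumes whereis prints a label followed by whitespace-separated paths, and that the data folder is named exactly tessdata one level below a version folder. It also covers the cases where whereis reports no path at all or is not installed.

```python
import pathlib
import subprocess
from typing import List, Optional


def parse_whereis(stdout: bytes) -> List[pathlib.Path]:
    """Return the paths listed after the "name:" label in whereis output."""
    tokens = stdout.decode("utf-8", errors="replace").split()
    # The first token is the label (e.g. "tesseract-ocr:"); the rest are paths.
    return [pathlib.Path(tok) for tok in tokens[1:]]


def find_tessdata(root: pathlib.Path) -> Optional[str]:
    """Look for <root>/<version>/tessdata, whatever <version> is called."""
    if not root.is_dir():
        return None
    # Sorted so that the result is deterministic if several versions exist.
    for sub_dir in sorted(root.iterdir()):
        candidate = sub_dir / "tessdata"
        if candidate.is_dir():
            return str(candidate)
    return None


def locate_tessdata() -> Optional[str]:
    """Run whereis and search every path it reports (hypothetical helper)."""
    try:
        cp = subprocess.run(
            ["whereis", "tesseract-ocr"], capture_output=True, check=False
        )
    except OSError:
        # whereis itself is not available on this system.
        return None
    for root in parse_whereis(cp.stdout):
        found = find_tessdata(root)
        if found is not None:
            return found
    return None
```

Passing the command as a list avoids going through the shell, and returning None instead of raising leaves it to the caller to decide how to report a missing installation.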

Yeah, I know I should set the TESSDATA_PREFIX environment variable anyway, but as the expected 4.0 version of Tesseract OCR is about six years old now and no longer seems to be in the Debian repos, I guess it wouldn't hurt to handle this case (unless 5.0 is unsupported)?

Thanks for developing PyMuPDF! :)

PyMuPDF version

1.24.9

Operating system

Linux

Python version

3.11

@JorjMcKie
Collaborator

MuPDF contains Tesseract 4.0 code to perform the OCR - it is an integral part of the MuPDF binary.

The MuPDF team has stated that the behavior of release 5.0 is far less stable and predictable than MuPDF's purposes require. The details of this assessment are best discussed with the team directly, e.g. on this Discord channel.

So all that PyMuPDF's OCR actually needs is the tessdata (language support) folder.
I cannot say whether a 5.0 tessdata has a format compatible with that of release 4.0.
But I would definitely suggest using either the environment variable or the tessdata parameter.
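
As a rough illustration of that suggestion, the following sketch picks the folder explicitly instead of relying on auto-detection. The function resolve_tessdata is a hypothetical helper, not part of PyMuPDF; the only name taken from this thread is the TESSDATA_PREFIX environment variable.

```python
import os
from typing import Mapping, Optional


def resolve_tessdata(
    explicit: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Pick the tessdata folder: explicit argument first, then TESSDATA_PREFIX."""
    if explicit:
        return explicit
    from_env = env.get("TESSDATA_PREFIX")
    if from_env:
        return from_env
    # Fail loudly instead of guessing at an installation path.
    raise RuntimeError("no tessdata folder given and TESSDATA_PREFIX is not set")
```

The returned string is what one would then hand to the tessdata parameter of the OCR call, so that the location of the language files never depends on which Tesseract version happens to be installed.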

Independently of the aforementioned, we should correct the behavior of the pymupdf function.

@JorjMcKie JorjMcKie added the labels bug, fix developed, release schedule to be determined on Aug 11, 2024
@rezemika
Author

Oh my bad, thanks for these details!

@JorjMcKie
Collaborator

No problem. I made the tesseract installation detector version-independent.
But as I said: the MuPDF code is Tesseract 4.00, and I don't know what happens if it is confronted with a version 5 tessdata.

@julian-smith-artifex-com
Collaborator

Fixed in 1.24.10.

@mstr11

mstr11 commented Nov 15, 2024

> No problem. I made the tesseract installation detector version-independent. But as I said: the MuPDF code is Tesseract 4.00, and I don't know what happens if it is confronted with a version 5 tessdata.

Unfortunately, errors occur when using custom training data that was created in jTessBoxEditor with Tesseract 5. In Tesseract 5 itself, however, everything is recognized perfectly. I get the following errors in mutool draw:
Error: LSTM requested, but not present!! Loading tesseract
no best word!!
no best word!!
no best word!!
no best word!!
....

I would appreciate some help. The words, symbols, etc. that I used when training Tesseract were added successfully according to the jTessBoxEditor logs. @JorjMcKie
In Tesseract 5, support for training Tesseract 4.0 dictionaries, in which two files (LSTM and LSTMF) are created, has officially been discontinued.
