Cannot get Tessdata with Tesseract-OCR 5 #3767
Comments
MuPDF contains Tesseract 4.0 code to perform the OCR; it is an integral part of the MuPDF binary. The MuPDF team has stated that the behavior of release 5.0 is far less stable and predictable than MuPDF's purposes require. Details of this assessment are best discussed with the team directly, e.g. on this Discord channel. So what PyMuPDF's OCR actually needs is exclusively the tessdata (language support) folder. Independently of the aforementioned, we should correct the behavior of the pymupdf function.
Oh my bad, thanks for these details!
No problem. I made the tesseract installation detector version-independent.
Fixed in 1.24.10.
[>] No problem. I made the tesseract installation detector version-independent. But as I said: the MuPDF code is Tesseract 4.00, and I don't know what happens if it is confronted with a version 5 tessdata.
Unfortunately, errors occur when using custom training data created in the jTessBoxEditor program with Tesseract 5, even though Tesseract 5 itself recognizes everything perfectly. I get the following errors in mutool draw:
I would appreciate any help. I added words, symbols, etc. when training Tesseract, and the jTessBoxEditor logs show that they were added successfully. @JorjMcKie
Description of the bug
The pymupdf.get_tessdata() function raises an unexpected error when the installed version of Tesseract OCR is not 4.0 (tested on the latest Debian, with Tesseract 5).
How to reproduce the bug
I haven't looked into the details yet, but I think the problem lies here:
PyMuPDF/src/__init__.py
Lines 18093 to 18099 in eca7066
I have the latest Debian with Tesseract OCR 5.3.0, with its tessdata installed in /usr/share/tesseract-ocr/5/tessdata/. The function get_tessdata() expects it in /usr/share/tesseract-ocr/4.00/tessdata; otherwise it searches for it with whereis tesseract-ocr.
However, it tries to call iterdir on the subprocess response, even though that response is a list of bytes, which raises the error.
I don't quite know the inner workings of Tesseract or PyMuPDF, but it seems that this function is looking for a sub-sub-folder whose name ends with tessdata, and should find it in the second part of response. So I guess something like this should work?
Yeah, I know I should set the TESSDATA_PREFIX environment variable anyway, but as the expected 4.0 version of Tesseract OCR is about six years old now and no longer seems to be in the Debian repos, I guess it wouldn't harm to handle this case (unless 5.0 is unsupported)?
Thanks for developing PyMuPDF! :)
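For illustration, the version-independent search described above (decode the whereis output, then look for a sub-sub-folder named tessdata) could be sketched roughly as follows. This is only a minimal sketch and not PyMuPDF's actual code: the helper name find_tessdata is hypothetical, and it assumes the whereis output has the form "tesseract-ocr: /usr/share/tesseract-ocr".

```python
from pathlib import Path
from typing import Optional


def find_tessdata(whereis_output: bytes) -> Optional[str]:
    """Hypothetical helper: locate a tessdata folder from `whereis` output.

    `whereis_output` is the raw bytes printed by `whereis tesseract-ocr`,
    assumed to look like b"tesseract-ocr: /usr/share/tesseract-ocr\n".
    """
    # Decode the bytes first, then drop the leading "tesseract-ocr:" label.
    tokens = whereis_output.decode().split()[1:]
    candidates = []
    for token in tokens:
        base = Path(token)
        if base.is_dir():
            # Look one level down (the version folder) for "tessdata".
            candidates.extend(p for p in base.glob("*/tessdata") if p.is_dir())
    if not candidates:
        return None
    # Plain sort: with several versions installed, this picks the last
    # path in sort order (an arbitrary choice for this sketch).
    return str(sorted(candidates)[-1])
```

The bytes passed in would typically come from subprocess.check_output(["whereis", "tesseract-ocr"]). On the Debian setup described here, the sketch should find /usr/share/tesseract-ocr/5/tessdata without hard-coding the 4.00 folder name.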
PyMuPDF version
1.24.9
Operating system
Linux
Python version
3.11