[improvement] .render() isn't that robust - wrong ordered results
#1586
Comments
Hi @kripper 👋, thanks for reporting :) The issue here is that pages 2 & 3 contain small rotations. Could you give it a try with passing
Predictor initiated with:
But the problem persists on page 1:
Also note that the OCR'ed page (page 1) is a clean PDF page.
From what I have also seen, sometimes the models predict lines in the wrong block, even though their coordinates are correct. This is why the
@kripper Have you already tried disabling block and/or line resolving?
It's now mixing blocks multiple times per line. What about taking a look at Tesseract's implementation?
Sure :)
No, but I will research tomorrow.
Have you tried existing tools to convert doctr's hOCR output to text? There are many. Tesseract is probably also using some of them.
Yeah, you can use doctr's XML/hOCR output to create PDF/A files, for example with OCRmyPDF.
Especially with rotations, all other open-source tools (PaddleOCR / Tesseract / EasyOCR) also fail. Possible way to go: https://arxiv.org/abs/2305.02577 --> investigate
Bug description
The default OCR model works very well, but the render() algorithm, which converts coordinates to text positions, is very buggy. This causes lines originally placed at the top to be positioned between other lines at the bottom, making the overall result unusable for LLM inference.
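To illustrate the kind of coordinate-to-text ordering meant here, below is a minimal sketch that rebuilds reading order purely from word boxes. It is not doctr's render() implementation and not Tesseract's algorithm; it assumes a straight, single-column page, and the `Word` tuple layout and the `words_to_text` helper are hypothetical names made up for this example.

```python
from typing import List, Tuple

# Hypothetical word record: (text, xmin, ymin, xmax, ymax), relative coordinates.
Word = Tuple[str, float, float, float, float]


def words_to_text(words: List[Word], overlap: float = 0.5) -> str:
    """Group words into lines by vertical overlap, then order left to right.

    Assumes a straight, single-column page (no rotation, no multi-column layout).
    """
    lines: List[List[Word]] = []
    # Visit words from top to bottom, using the vertical center of each box.
    for word in sorted(words, key=lambda w: (w[2] + w[4]) / 2):
        _, _, ymin, _, ymax = word
        if lines:
            # Compare against the word most recently added to the current line.
            _, _, prev_ymin, _, prev_ymax = lines[-1][-1]
            shared = min(prev_ymax, ymax) - max(prev_ymin, ymin)
            smallest_height = min(prev_ymax - prev_ymin, ymax - ymin)
            if shared >= overlap * smallest_height:
                lines[-1].append(word)
                continue
        # Not enough vertical overlap: start a new line.
        lines.append([word])
    # Within each line, order words by their left edge.
    return "\n".join(
        " ".join(text for text, *_ in sorted(line, key=lambda w: w[1]))
        for line in lines
    )


sample = [
    ("world", 0.50, 0.100, 0.70, 0.140),
    ("Hello", 0.10, 0.110, 0.30, 0.150),
    ("Second", 0.10, 0.200, 0.30, 0.240),
    ("line", 0.35, 0.205, 0.50, 0.245),
]
print(words_to_text(sample))
```

A sketch like this ignores block assignments entirely, so a line predicted in the wrong block would still land at its geometric position. It would likely break on rotated pages or multi-column layouts.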
I wonder if you have considered reusing the algorithm implemented in Tesseract. They probably solved the same problem many years ago.
And I also wonder why the Tesseract team is not integrating the doctr engine into Tesseract :-)
Good job! You are leading the OCR leaderboard.
I attached a sample .PDF file and a snippet to reproduce the problem.
I checked other similar inactive issues, so I'm afraid rendering to text is currently not a hot topic :-(
...but how are we supposed to feed our hungry LLMs?
Code snippet to reproduce the bug
Error traceback
No error
Environment
Linux, conda, python 3.9
Deep Learning backend
Default model.
test-ocr.pdf