[improvement] .render() isn't that robust - wrong ordered results #1586

Open
kripper opened this issue May 6, 2024 · 12 comments
Labels: help wanted (Extra attention is needed), module: models (Related to doctr.models), type: bug (Something isn't working), type: enhancement (Improvement)

@kripper

kripper commented May 6, 2024

Bug description

The default OCR model works very well, but the render() algorithm which converts coordinates to text positions is very buggy.
This causes lines originally placed at the top to be positioned between other lines at the bottom, making the overall result unusable for LLM inference.

I wonder if you have considered reusing the algorithm implemented in Tesseract. They probably solved the same problem many years ago.
And I also wonder why the Tesseract team is not integrating the doctr engine into Tesseract :-)

Good job! You are leading the OCR leaderboard.

I attached a sample .PDF file and a snippet to reproduce the problem.
I checked other similar inactive issues, so I'm afraid rendering to text is currently not a hot topic :-(
...but how are we supposed to feed our hungry LLMs?

Code snippet to reproduce the bug

import argparse
import os
import json

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

def convert_pdf_to_txt(input_pdf, output_txt):
  """
  Converts a PDF file to a text file using DocTR OCR.

  Args:
      input_pdf (str): Path to the input PDF file.
      output_txt (str): Path to the output text file.
  """

  print("Load pre-trained OCR model")
  model = ocr_predictor(pretrained=True)

  # Ensure input PDF exists
  if not os.path.exists(input_pdf):
    raise ValueError(f"Input PDF file '{input_pdf}' does not exist.")

  # Load the PDF document
  try:
    doc = DocumentFile.from_pdf(input_pdf)
  except Exception as e:
    raise ValueError(f"Error loading PDF '{input_pdf}': {e}")

  # Perform OCR and extract text
  try:
    result = model(doc)
    #exp = result.export()
    #text = json.dumps(exp)
    text = result.render()
  except Exception as e:
    raise ValueError(f"Error performing OCR on '{input_pdf}': {e}")

  # Write extracted text to output file
  with open(output_txt, 'w', encoding='utf-8') as f:
    f.write(text)

  print(f"PDF '{input_pdf}' converted to text file '{output_txt}'.")

if __name__ == "__main__":
  parser = argparse.ArgumentParser(description="Convert PDF to text using DocTR OCR")
  parser.add_argument("input_pdf", help="Path to the input PDF file")
  parser.add_argument("output_txt", help="Path to the output text file")
  args = parser.parse_args()

  convert_pdf_to_txt(args.input_pdf, args.output_txt)

Error traceback

No error

Environment

Linux, conda, python 3.9

Deep Learning backend

Default model.
test-ocr.pdf

kripper added the "type: bug" label May 6, 2024
@felixdittrich92
Contributor

Hi @kripper 👋,

Thanks for reporting :)

The issue here is that pages 2 and 3 contain small rotations. Could you try passing assume_straight_pages=False to the ocr_predictor instance? :)

@kripper
Author

kripper commented May 8, 2024

Predictor initiated with:

model = ocr_predictor(pretrained=True, assume_straight_pages=False)

But the problem persists on page 1:

Notario y Conservador de Bienes Raices Licanten Vilma Beatriz Navarro
<--- "Reyes" SHOULD GO HERE
Certifico que el presente documento electronico es copia fiel e integra de
CERTIFICADO otorgado el 26 de Abril de 2024 reproducido en las siguientes

Reyes <-------- BUT WAS PLACED HERE

paginas.

Also note that the OCR'ed page (page 1) is a clean PDF page.
The second page is an image and assume_straight_pages could help here.

@Cata400

Cata400 commented May 8, 2024

From what I have also seen, the models sometimes predict lines in the wrong block even though their coordinates are correct. This is why render() returns mixed-up text: it is just a set of nested for loops over all the pages, blocks, lines and words. To work around it I did the following. It somewhat messes up the line breaks, but it preserves the order:

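def sort_by_coordinates(element):
    # Sort key: first corner of the element, ordered by y first, then x
    return (element.geometry[0][1], element.geometry[0][0])

result = model(doc)
text = ""

for page in result.pages:
    # Collect the lines of every block, ignoring the predicted block structure
    line_list = []
    for block in page.blocks:
        line_list.extend(block.lines)

    # Re-order the lines top to bottom, then left to right
    sorted_lines = sorted(line_list, key=sort_by_coordinates)

    for line in sorted_lines:
        for word in line.words:
            # doctr Word elements expose their text as `value`
            text += word.value + " "
        text += "\n"

    # Blank line between pages
    text += "\n"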
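
One way to keep the ordering without losing line breaks is to ignore the predicted lines and blocks and rebuild rows from the word boxes. The following is only a minimal sketch, not doctr's algorithm. It assumes straight (unrotated) pages and words given as plain tuples of text plus two corner points. The names `group_words_into_lines`, `render_text`, `sample_words` and the `overlap` threshold are hypothetical and chosen for illustration.

```python
from typing import List, Tuple

# A word is (text, (xmin, ymin), (xmax, ymax)) in relative page coordinates.
# This layout is an assumption made for the sketch.
Word = Tuple[str, Tuple[float, float], Tuple[float, float]]


def group_words_into_lines(words: List[Word], overlap: float = 0.5) -> List[List[Word]]:
    """Greedy row clustering: a word joins the current row when its vertical
    centre lies within `overlap` * row height of the row's centre."""
    rows: List[List[Word]] = []
    # Visit words top to bottom, ordered by vertical centre.
    for word in sorted(words, key=lambda w: (w[1][1] + w[2][1]) / 2):
        _, (_, ymin), (_, ymax) = word
        centre = (ymin + ymax) / 2
        if rows:
            last = rows[-1]
            row_top = min(w[1][1] for w in last)
            row_bottom = max(w[2][1] for w in last)
            row_centre = (row_top + row_bottom) / 2
            row_height = row_bottom - row_top
            if abs(centre - row_centre) <= overlap * row_height:
                last.append(word)
                continue
        rows.append([word])
    # Within each row, order words left to right.
    return [sorted(row, key=lambda w: w[1][0]) for row in rows]


def render_text(words: List[Word]) -> str:
    """Join each reconstructed row with spaces and rows with newlines."""
    rows = group_words_into_lines(words)
    return "\n".join(" ".join(w[0] for w in row) for row in rows)


# Made-up boxes, deliberately out of reading order.
sample_words: List[Word] = [
    ("Certifico", (0.10, 0.160), (0.22, 0.180)),
    ("Navarro", (0.34, 0.101), (0.46, 0.121)),
    ("Reyes", (0.10, 0.130), (0.18, 0.150)),
    ("Vilma", (0.10, 0.100), (0.20, 0.120)),
    ("que", (0.24, 0.161), (0.28, 0.181)),
    ("Beatriz", (0.22, 0.099), (0.32, 0.119)),
]

print(render_text(sample_words))
```

With real doctr output, the tuples could presumably be built from each word's value and geometry. Rotated pages would need the rows to be measured along the text direction rather than the page's y axis, which this sketch does not attempt.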

@felixdittrich92
Contributor

@kripper Have you already tried to disable block and/or line resolving ?
https://mindee.github.io/doctr/using_doctr/using_models.html#two-stage-approaches

resolve_blocks=False
resolve_lines=False

@kripper
Author

kripper commented May 8, 2024

@kripper Have you already tried to disable block and/or line resolving ? https://mindee.github.io/doctr/using_doctr/using_models.html#two-stage-approaches

resolve_blocks=False resolve_lines=False

It's now mixing blocks multiple times per line.

What about taking a look at Tesseract's implementation?

@felixdittrich92
Contributor

@kripper Have you already tried to disable block and/or line resolving ? https://mindee.github.io/doctr/using_doctr/using_models.html#two-stage-approaches
resolve_blocks=False resolve_lines=False

It's now mixing blocks multiple times per line.

What about taking a look at Tesseract's implementation?

Sure :)
Do you have a direct reference to the code or algorithm ?

@kripper
Author

kripper commented May 8, 2024

Do you have a direct reference to the code or algorithm ?

No, but I will research tomorrow.

@kripper
Author

kripper commented May 8, 2024

Have you tried existing tools to convert doctr's HOCR output to text? There are many. Tesseract probably is also using some of them.

@felixdittrich92
Contributor

Have you tried existing tools to convert doctr's HOCR output to text? There are many. Tesseract probably is also using some of them.

Yeah, you can use doctr's XML/hOCR output to create PDF/A files, for example with OCRmyPDF.
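
For plain text rather than PDF/A, the hOCR word boxes can also be read back with the standard library alone. The following is a rough sketch under assumptions: the sample hOCR string is hand-written for illustration (not actual doctr output), and `hocr_words`, `hocr_to_text` and the `line_tol` pixel tolerance are hypothetical names. It relies only on the hOCR convention of `ocrx_word` elements carrying a `bbox x0 y0 x1 y1` entry in their `title` attribute.

```python
import re
import xml.etree.ElementTree as ET

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def hocr_words(hocr: str):
    """Return (text, (x0, y0, x1, y1)) for every ocrx_word element."""
    root = ET.fromstring(hocr)
    words = []
    for el in root.iter():
        if el.get("class") != "ocrx_word":
            continue
        match = BBOX_RE.search(el.get("title", ""))
        if match is None:
            continue
        x0, y0, x1, y1 = map(int, match.groups())
        words.append(("".join(el.itertext()).strip(), (x0, y0, x1, y1)))
    return words


def hocr_to_text(hocr: str, line_tol: int = 10) -> str:
    """Start a new text line whenever a word's top edge is more than
    `line_tol` pixels below the top of the current line's first word."""
    words = sorted(hocr_words(hocr), key=lambda w: (w[1][1], w[1][0]))
    lines = []
    current = []
    current_top = 0
    for text, (x0, y0, _x1, _y1) in words:
        if current and abs(y0 - current_top) > line_tol:
            lines.append(current)
            current = []
        if not current:
            current_top = y0
        current.append((x0, text))
    if current:
        lines.append(current)
    # Order each line left to right before joining.
    return "\n".join(" ".join(t for _, t in sorted(line)) for line in lines)


# Hand-written hOCR fragment with the words deliberately out of order.
sample_hocr = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="ocr_page" title="bbox 0 0 1000 1400">
<span class="ocrx_word" title="bbox 300 102 420 130; x_wconf 97">Navarro</span>
<span class="ocrx_word" title="bbox 100 100 180 128; x_wconf 99">Vilma</span>
<span class="ocrx_word" title="bbox 100 180 220 208; x_wconf 98">Certifico</span>
<span class="ocrx_word" title="bbox 100 140 170 168; x_wconf 96">Reyes</span>
</div></body></html>"""

print(hocr_to_text(sample_hocr))
```

A fixed pixel tolerance is the simplest possible rule and would likely need tuning per document. It also assumes unrotated pages.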

@kripper
Author

kripper commented May 8, 2024

sometimes the models predict lines in the wrong block

The synthesized page looks fine. Identifying lines shouldn't be that difficult IMO.

[image: out]

@felixdittrich92
Contributor

sometimes the models predict lines in the wrong block

The synthesized page looks fine. Identifying lines shouldn't be that difficult IMO.

[image: out]

Depends on the document's layout ^^ And there is a lot of variation (rotated, block text, etc.)

felixdittrich92 changed the title from "Wrong layout generated by render() to text" to "[improvement] .render() isn't that robust - wrong ordered results" on May 22, 2024
@felixdittrich92
Contributor

Especially with rotations, all other open source tools (paddleOCR / tesseract / easyOCR) also fail.

Possible way to go: https://arxiv.org/abs/2305.02577 --> investigate


No branches or pull requests

3 participants