Text Detection using the detection_predictor model is producing incorrect sequence of bounding boxes #1754

stiffmeister923 · 2024-10-17T16:07:21Z

stiffmeister923
Oct 17, 2024

Hello, I tried the detection_predictor model and it works well. However, after labelling each box with the position in which it was predicted (correct me if I am wrong or missing something), I noticed that the boxes are not returned in reading order, and this is the output. I was planning to build a pipeline around it with TrOCR as the recognition model. That worked fine until I parsed the results into a string and received the words out of sequence. Essentially, I just want the detections ordered top to bottom, left to right. Is there a built-in function for this, similar to results.export() when using the ocr_predictor? That is all, thank you.

This is the code I used, which was provided in a similar discussion:

```python
import cv2
import numpy as np
from doctr.io import DocumentFile
from doctr.models import detection_predictor
from doctr.utils.geometry import detach_scores


def _to_absolute(geom, img_shape: tuple[int, int]) -> list[list[int]]:
    h, w = img_shape
    if len(geom) == 2:  # Assume straight pages = True -> [[xmin, ymin], [xmax, ymax]]
        (xmin, ymin), (xmax, ymax) = geom
        xmin, xmax = int(round(w * xmin)), int(round(w * xmax))
        ymin, ymax = int(round(h * ymin)), int(round(h * ymax))
        return [[xmin, ymin], [xmax, ymin], [xmax, ymax], [xmin, ymax]]
    else:  # For polygons, convert each point to absolute coordinates
        return [[int(point[0] * w), int(point[1] * h)] for point in geom]


image_path = r"name_section.jpg"
image = cv2.imread(image_path)

det_predictor = detection_predictor(
    arch="fast_base",
    pretrained=True,
    assume_straight_pages=True,
    symmetric_pad=True,
    preserve_aspect_ratio=True,
)  # .cuda().half()  # Uncomment this line if you have a GPU

det_predictor.model.postprocessor.bin_thresh = 0.01
det_predictor.model.postprocessor.box_thresh = 0.01

docs = DocumentFile.from_images(image_path)

results = det_predictor(docs)

for doc, res in zip(docs, results):
    img_shape = (doc.shape[0], doc.shape[1])

    # Detach the probability scores from the results
    detached_coords, prob_scores = detach_scores([res.get("words")])

    for i, coords in enumerate(detached_coords[0]):
        coords = coords.reshape(2, 2).tolist() if coords.shape == (4,) else coords.tolist()

        # Convert relative to absolute pixel coordinates, shaped for cv2.polylines
        points = np.array(_to_absolute(coords, img_shape), dtype=np.int32).reshape((-1, 1, 2))

        # Draw the bounding box on the image
        cv2.polylines(image, [points], isClosed=True, color=(255, 0, 0), thickness=2)

        # Box center, taken from two opposite corners (points 0 and 2)
        x_center = int((points[0][0][0] + points[2][0][0]) / 2)
        y_center = int((points[0][0][1] + points[2][0][1]) / 2)

        # Label the box with its position in the detection order
        cv2.putText(image, str(i + 1), (x_center, y_center), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 2)

cv2.imwrite("output_with_labels.jpg", image)
```

Answered by felixdittrich92

felixdittrich92 · 2024-10-18T06:30:24Z

felixdittrich92
Oct 18, 2024
Maintainer

Hi @stiffmeister923 👋,

The sorting algorithm is not part of the detection_predictor; you can find it in the DocumentBuilder class (doctr/models/builder.py). In my experience, resolving the words into lines first is more accurate than sorting only the "words". _resolve_blocks can be ignored in that code.

Hope this helps 🤗
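
To illustrate the idea of resolving words into lines before ordering them, here is a minimal NumPy sketch. It is not docTR's implementation: the function name `group_into_lines`, the `tol` parameter and the sample boxes are assumptions made for this example, and the boxes are assumed to be straight `[xmin, ymin, xmax, ymax]` rows.

```python
import numpy as np


def group_into_lines(boxes: np.ndarray, tol: float = 0.5) -> list[list[int]]:
    """Group (N, 4) boxes [xmin, ymin, xmax, ymax] into reading-order lines.

    Returns a list of lines ordered top to bottom; each line is a list of
    indices into `boxes`, ordered left to right.
    """
    if len(boxes) == 0:
        return []
    y_center = (boxes[:, 1] + boxes[:, 3]) / 2
    # The median box height sets the scale for "same line" decisions
    y_med = float(np.median(boxes[:, 3] - boxes[:, 1]))

    lines: list[list[int]] = []
    for idx in np.argsort(y_center, kind="stable"):
        idx = int(idx)
        if lines:
            # Compare against the mean vertical center of the current line
            line_y = float(np.mean(y_center[lines[-1]]))
            if abs(y_center[idx] - line_y) < tol * y_med:
                lines[-1].append(idx)
                continue
        lines.append([idx])
    # Order the words of each line from left to right
    return [sorted(line, key=lambda i: boxes[i, 0]) for line in lines]


# Made-up relative coordinates, deliberately out of reading order
boxes = np.array([
    [0.60, 0.11, 0.80, 0.15],  # second word of line 1
    [0.10, 0.30, 0.30, 0.34],  # first word of line 2
    [0.10, 0.10, 0.30, 0.14],  # first word of line 1
    [0.40, 0.31, 0.55, 0.35],  # second word of line 2
])
lines = group_into_lines(boxes)
reading_order = [i for line in lines for i in line]
print(lines)          # [[2, 0], [1, 3]]
print(reading_order)  # [2, 0, 1, 3]
```

Grouping by vertical center first means that a small vertical offset between two words on the same line cannot push one of them onto the "next" line, which is the usual failure of a pure top-to-bottom sort.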

3 replies

stiffmeister923 Oct 18, 2024
Author

Thank you for the reply. I will look into it and research it further. Sorry for the late reply; I will post an update here if it works.

stiffmeister923 Oct 18, 2024
Author

@felixdittrich92 Hello Mr. Felix, I was exploring the codebase and its imports. How does one import the DocumentBuilder class? I have tried iterating through all the modules under doctr (.io, .transform, .models, etc.) but have yet to find the DocumentBuilder import, so I simply copied the code into another Python file. I am really sorry if I have not examined the code thoroughly or have made a misstep; I am trying my best to understand how everything works, but to no avail. Currently I have an alternative: using the ocr_predictor and parsing the JSON output. However, the recognition model is unnecessary for me, as I have my own fine-tuned TrOCR.

I tried it like this; it ran fine, but it did nothing.

```python
from doctr.models import detection_predictor

doc_builder = DocumentBuilder(resolve_lines=True, resolve_blocks=False)
for doc, res in zip(docs, results):
    img_shape = (doc.shape[0], doc.shape[1])

    # Detach the probability scores from the results
    detached_coords, prob_scores = detach_scores([res.get("words")])

    for i, coords in enumerate(detached_coords[0]):
        coords = coords.reshape(2, 2).tolist() if coords.shape == (4,) else coords.tolist()
        print(coords)
        # Convert relative to absolute pixel coordinates
        points = np.array(_to_absolute(coords, img_shape), dtype=np.int32).reshape((-1, 1, 2))
        doc_builder._resolve_lines(points)
        # Draw the bounding box on the image
        cv2.polylines(image, [points], isClosed=True, color=(255, 0, 0), thickness=2)

        # Get the position to place the text (center of the bounding box)
        x_center = int((points[0][0][0] + points[2][0][0]) / 2)
        y_center = int((points[0][0][1] + points[2][0][1]) / 2)

        # Add text label for the detection order
        cv2.putText(image, str(i + 1), (x_center, y_center), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 2)
```

stiffmeister923 Oct 20, 2024
Author

Hello again Mr. @felixdittrich92, I have now used the ocr_predictor and extracted the geometry of its bounding boxes, and that works fine. The output is perfectly structured, so I am simply using it as a text detection model even though it runs the whole pipeline. This is my workaround for the time being, as I could not get the detection output ordered properly myself due to my limited understanding.

agombert · 2024-10-22T13:46:40Z

agombert
Oct 22, 2024

Hello @felixdittrich92

I've got this error when I add the doc_builder line

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[7], line 14
     12 # Convert relative to absolute pixel coordinates
     13 points = np.array(_to_absolute(coords, img_shape), dtype=np.int32).reshape((-1, 1, 2))
---> 14 doc_builder._resolve_lines(points)
     15 # Draw the bounding box on the image
     16 cv2.polylines(single_img, [points], isClosed=True, color=(255, 0, 0), thickness=2)

File ~lib/python3.11/site-packages/doctr/models/builder.py:120, in DocumentBuilder._resolve_lines(self, boxes)
    109 """Order boxes to group them in lines
    110 
    111 Args:
   (...)
    117     nested list of box indices
    118 """
    119 # Sort boxes, and straighten the boxes if they are rotated
--> 120 idxs, boxes = self._sort_boxes(boxes)
    122 # Compute median for boxes heights
    123 y_med = np.median(boxes[:, 3] - boxes[:, 1])

File ~/lib/python3.11/site-packages/doctr/models/builder.py:61, in DocumentBuilder._sort_boxes(boxes)
     45 """Sort bounding boxes from top to bottom, left to right
     46 
     47 Args:
   (...)
     56         so that we fit the lines afterwards to the straigthened page
     57 """
     58 if boxes.ndim == 3:
     59     boxes = rotate_boxes(
     60         loc_preds=boxes,
---> 61         angle=-estimate_page_angle(boxes),
     62         orig_shape=(1024, 1024),
     63         min_angle=5.0,
     64     )
     65     boxes = np.concatenate((boxes.min(1), boxes.max(1)), -1)
     66 return (boxes[:, 0] + 2 * boxes[:, 3] / np.median(boxes[:, 3] - boxes[:, 1])).argsort(), boxes

File ~/lib/python3.11/site-packages/doctr/utils/geometry.py:380, in estimate_page_angle(polys)
    376 """Takes a batch of rotated previously ORIENTED polys (N, 4, 2) (rectified by the classifier) and return the
    377 estimated angle ccw in degrees
    378 """
    379 # Compute mean left points and mean right point with respect to the reading direction (oriented polygon)
--> 380 xleft = polys[:, 0, 0] + polys[:, 3, 0]
    381 yleft = polys[:, 0, 1] + polys[:, 3, 1]
    382 xright = polys[:, 1, 0] + polys[:, 2, 0]

IndexError: index 3 is out of bounds for axis 1 with size 1

Any idea where it can come from ?
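
Reading the traceback, the likely cause is a shape mismatch: `points` holds a single box reshaped to `(-1, 1, 2)` for `cv2.polylines`, while the docstring of `estimate_page_angle` asks for a batch of polygons shaped `(N, 4, 2)`. The NumPy-only sketch below reproduces the same indexing error with made-up pixel values; it does not call docTR, and the variable names are chosen for this example only.

```python
import numpy as np

# One word box as four absolute corner points (made-up pixel values)
quad = np.array([[10, 20], [110, 20], [110, 50], [10, 50]], dtype=np.int32)

# Shape used for cv2.polylines: four entries holding a single point each
cv2_points = quad.reshape((-1, 1, 2))
print(cv2_points.shape)  # (4, 1, 2)

try:
    # Same indexing as the failing line in estimate_page_angle
    cv2_points[:, 0, 0] + cv2_points[:, 3, 0]
except IndexError as err:
    print(err)  # index 3 is out of bounds for axis 1 with size 1

# A batch of N polygons keeps the four corners on axis 1
batch = np.stack([quad, quad + 100])
print(batch.shape)  # (2, 4, 2)
xleft = batch[:, 0, 0] + batch[:, 3, 0]
print(xleft)  # [ 20 220]
```

Note also that the call sits inside the per-word loop, so the method only ever sees one box at a time; an ordering step needs all boxes of the page at once.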

0 replies

felixdittrich92 · 2024-10-23T14:15:58Z

felixdittrich92
Oct 23, 2024
Maintainer

Hi @stiffmeister923 @agombert 👋,

If you really want to work directly with the DocumentBuilder to have all its functionality (box sorting, .show(), .export(), etc.), I think the easiest way would be to mock the recognition results:

For example:

```python
import requests

from doctr.io import DocumentFile
from doctr.models import detection_predictor
from doctr.utils.geometry import detach_scores
from doctr.models.builder import DocumentBuilder


url = "https://www.francetvinfo.fr/pictures/uGwaNE-aJq7zHLhZJdzdCd9nyjE/1200x900/2021/03/16/phpCDwGn0.jpg"

det_predictor = detection_predictor(
    arch="db_mobilenet_v3_large",
    pretrained=True,
    assume_straight_pages=True,
    symmetric_pad=True,
    preserve_aspect_ratio=True,
)  # .cuda().half()  # Uncomment this line if you have a GPU

det_predictor.model.postprocessor.bin_thresh = 0.3
det_predictor.model.postprocessor.box_thresh = 0.1

docs = DocumentFile.from_images([requests.get(url).content])
results = det_predictor(docs)

builder = DocumentBuilder(resolve_lines=True)

documents = []

for doc, res in zip(docs, results):
    img_shape = (doc.shape[0], doc.shape[1])
    # Detach the probability scores from the results
    detached_coords, prob_scores = detach_scores([res.get("words")])
    # Mock all recognition parts to create the Document
    document = builder(
        [doc],
        detached_coords,
        prob_scores,
        [[("None", 1)] * len(detached_coords[0])],
        [img_shape],
        [[{"value": 0, "confidence": None} for _ in detached_coords[0]]],
    )
    documents.append(document)

for result in documents:
    result.show()
```

2 replies

felixdittrich92 Oct 23, 2024
Maintainer

Option 2, without the DocumentBuilder, would be to implement your own sorting function, or to copy and paste the sorting logic from the DocumentBuilder into your own function :)
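
As a rough sketch of that second option, the sort key visible on the last line of `_sort_boxes` in the traceback earlier in this thread can be lifted into a standalone helper for straight boxes. The function name `sort_boxes` and the sample boxes are assumptions for this example, and the rotated-box branch of the original is left out.

```python
import numpy as np


def sort_boxes(boxes: np.ndarray) -> np.ndarray:
    """Return indices ordering (N, 4) boxes [xmin, ymin, xmax, ymax]
    from top to bottom, left to right.

    Uses a single combined key: xmin plus ymax scaled by the median height.
    """
    y_med = np.median(boxes[:, 3] - boxes[:, 1])
    return (boxes[:, 0] + 2 * boxes[:, 3] / y_med).argsort()


# Made-up relative coordinates, deliberately out of reading order
boxes = np.array([
    [0.60, 0.11, 0.80, 0.15],
    [0.10, 0.30, 0.30, 0.34],
    [0.10, 0.10, 0.30, 0.14],
    [0.40, 0.31, 0.55, 0.35],
])
order = sort_boxes(boxes)
print(order.tolist())  # [2, 0, 1, 3]
```

Because the vertical and horizontal positions are folded into one number, words on the same line whose bottoms differ slightly can end up with close or swapped keys, which may be why grouping into lines first tends to be more reliable.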

stiffmeister923 Oct 25, 2024
Author

Thank you for this, Mr. Felix. I adjusted it a bit further to get the data I need, and this is what worked for me:
```python
import cv2

from doctr.io import DocumentFile
from doctr.models import detection_predictor
from doctr.utils.geometry import detach_scores
from doctr.models.builder import DocumentBuilder

image_path = r"name_section.jpg"
image = cv2.imread(image_path)

det_predictor = detection_predictor(
    arch="fast_base",
    pretrained=True,
    assume_straight_pages=True,
    symmetric_pad=True,
    preserve_aspect_ratio=True,
)  # .cuda().half()  # Uncomment this line if you have a GPU

det_predictor.model.postprocessor.bin_thresh = 0.1
det_predictor.model.postprocessor.box_thresh = 0.1

docs = DocumentFile.from_images(image_path)
results = det_predictor(docs)

builder = DocumentBuilder(resolve_lines=True)

documents = []
placeholder = []
for doc, res in zip(docs, results):
    img_shape = (doc.shape[0], doc.shape[1])
    # Detach the probability scores from the results
    detached_coords, prob_scores = detach_scores([res.get("words")])
    # Mock all recognition parts to create the Document
    document = builder(
        [doc],
        detached_coords,
        prob_scores,
        [[("None", 1)] * len(detached_coords[0])],
        [img_shape],
        [[{"value": 0, "confidence": None} for _ in detached_coords[0]]],
    )
    documents.append(document)

for result in documents:
    placeholder.append(result.export())

print(placeholder[0]["pages"])
```
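
For feeding the ordered boxes to a separate recognition model, the exported dictionary can be flattened into a list of word geometries. The sketch below is only an illustration: the helper name `words_in_reading_order` is made up, the sample dictionary is hand-written rather than real output, and the key names (`pages`, `blocks`, `lines`, `words`, `geometry`) are assumed to match the export layout of your installed docTR version.

```python
from typing import Any


def words_in_reading_order(exported: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten an exported document into a list of word dicts, page by page."""
    words: list[dict[str, Any]] = []
    for page in exported.get("pages", []):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                words.extend(line.get("words", []))
    return words


# Hand-written stand-in for `result.export()`; the values are made up
exported = {
    "pages": [
        {
            "page_idx": 0,
            "blocks": [
                {
                    "lines": [
                        {"words": [
                            {"value": "None", "geometry": ((0.10, 0.10), (0.30, 0.14))},
                            {"value": "None", "geometry": ((0.60, 0.11), (0.80, 0.15))},
                        ]},
                        {"words": [
                            {"value": "None", "geometry": ((0.10, 0.30), (0.30, 0.34))},
                        ]},
                    ]
                }
            ],
        }
    ]
}

ordered_geometries = [word["geometry"] for word in words_in_reading_order(exported)]
print(ordered_geometries)
```

Each geometry can then be scaled by the page size and used to crop the image before passing the crop to the recognition model.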
