Block Detection Only #1759

agombert · 2024-10-24T13:07:07Z

Hey,

First, thank you very much for this library, which is really great and helpful.

My use case is handwritten data from archives, which is complex text. Before running OCR, I'd like to accurately detect the bounding boxes of the text.

I followed the instructions with different classes, and then used this code to run the model on my data:

from os.path import join

import cv2
import torch
from PIL import Image
from torchvision.transforms import Compose, Normalize, Resize, ToTensor

from doctr.models import db_resnet50

path_model = join(PATH_REPO, "doctr", "db_resnet50_20241024-133316.pt")

det_model = db_resnet50(pretrained=True, pretrained_backbone=False, class_names=["text", "margin"])
det_params = torch.load(path_model, map_location="cpu")
det_model.load_state_dict(det_params)

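def preprocess_image(image_path, input_size=(1024, 1024)):
    # OpenCV loads images as BGR; convert to RGB before handing off to PIL
    img = cv2.imread(image_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = Image.fromarray(img)

    transform = Compose([
        Resize(input_size),  # stretches to input_size; aspect ratio is not preserved
        ToTensor(),  # HWC uint8 in [0, 255] -> CHW float in [0, 1]
        Normalize(mean=[0.798, 0.785, 0.772], std=[0.264, 0.2749, 0.287]),
    ])

    # Add a batch dimension: (C, H, W) -> (1, C, H, W)
    return transform(img).unsqueeze(0)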

image_path = join(PATH_DATA, "layout", "training_set", "images", "0002_DAFCAOM04_DPPCEC_850116_0080.JPG")

preprocessed_image = preprocess_image(image_path)

det_model.eval()

with torch.no_grad():
    results = det_model(preprocessed_image)

And I get some predictions for text and margin. I used the --pretrained parameter because I just want to work at the block level and I have a small amount of data.
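For reading the output, here is a minimal sketch. It assumes (this is not confirmed in the thread) that `results["preds"]` is a list with one dict per image, mapping each class name to rows of `[xmin, ymin, xmax, ymax, score]` in relative coordinates. The helper name `to_pixel_boxes` and the `min_score` threshold are hypothetical, and the data below is synthetic.

```python
def to_pixel_boxes(class_boxes, width, height, min_score=0.5):
    """Scale relative boxes to pixel coordinates, dropping low-score rows.

    class_boxes: dict mapping class name -> iterable of
                 (xmin, ymin, xmax, ymax, score) rows, coordinates in [0, 1].
    """
    pixel_boxes = {}
    for name, rows in class_boxes.items():
        kept = []
        for xmin, ymin, xmax, ymax, score in rows:
            if float(score) < min_score:
                continue  # skip detections below the (hypothetical) threshold
            kept.append((
                round(float(xmin) * width),
                round(float(ymin) * height),
                round(float(xmax) * width),
                round(float(ymax) * height),
            ))
        pixel_boxes[name] = kept
    return pixel_boxes


# Synthetic stand-in for results["preds"][0] (assumed format, see above)
example = {
    "text": [(0.10, 0.20, 0.90, 0.80, 0.95), (0.0, 0.0, 0.1, 0.1, 0.2)],
    "margin": [(0.00, 0.20, 0.08, 0.80, 0.70)],
}
boxes = to_pixel_boxes(example, width=2000, height=3000)
print(boxes)
```

If the coordinates really are relative, the width and height passed in should be those of the original scan rather than the 1024x1024 resized input, since the plain resize stretches both axes proportionally.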

What would you recommend for this task with db_resnet50 regarding the amount of data needed for fine-tuning? Do you think block detection is feasible with your pipeline?

Best,

Arnault

felixdittrich92 · 2024-10-25T06:35:28Z · Maintainer

Hi @agombert 👋,

Sounds more like a layout detection problem to me :)
When you say "low amount" of training data, roughly how much are we talking about?

I think you would have more success using our contrib module (https://mindee.github.io/doctr/using_doctr/using_contrib_modules.html).
It is called ArtefactDetector, but you can plug in any YOLO-trained, ONNX-exported model and use it as a kind of pre-stage, for example for layout detection. (This should also work with the newest yolov11-trained, ONNX-exported models.)

The only limitation at the moment is that it works only with straight boxes; inference with oriented bounding boxes (OBB) is not supported yet.
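To illustrate what the straight-box limitation means in practice, here is a minimal sketch in plain Python, not part of docTR, that reduces an oriented box given as corner points to the axis-aligned box enclosing it. The function name `polygon_to_straight_box` is hypothetical.

```python
def polygon_to_straight_box(points):
    """Return the axis-aligned box (xmin, ymin, xmax, ymax) enclosing a polygon.

    points: iterable of (x, y) corner coordinates, e.g. the 4 corners of an
            oriented bounding box.
    """
    points = list(points)
    xs = [float(x) for x, _ in points]
    ys = [float(y) for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# A square rotated by 45 degrees (a diamond) with diagonals of length 20
rotated = [(10, 0), (20, 10), (10, 20), (0, 10)]
straight = polygon_to_straight_box(rotated)
print(straight)
```

The enclosing box here covers an area of 400 while the rotated region itself covers only 200, so on skewed pages straight boxes for neighbouring blocks may overlap.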

Best,

Felix

agombert · 2024-10-25T07:56:47Z · Author

Hey @felixdittrich92,

It is indeed. But I struggled to find code to fine-tune such a model (I'm generally more into NLP).

Low amount would be between 100 and 1k pages only. Maybe fine-tuning yolov11 and then applying it with your pipeline could be a solution. Or maybe using the yolov11 script directly would be a better way to go.

I need to investigate. When I tried to fine-tune the detector, it did not learn anything, so maybe it needs more data, or it is not suited to the task, or I am not doing it correctly.
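For the YOLO route, annotations usually need to be in the plain-text label format commonly used by YOLO training tools: one line per box, `class x_center y_center width height`, all normalized by the image size. Here is a minimal conversion sketch, assuming the existing block annotations are pixel boxes given as `(xmin, ymin, xmax, ymax)`. The names `to_yolo_line` and `CLASS_IDS` are hypothetical, and the class ids simply mirror the "text" and "margin" classes mentioned earlier.

```python
def to_yolo_line(class_id, box, img_w, img_h):
    """Format one pixel box (xmin, ymin, xmax, ymax) as a YOLO label line."""
    xmin, ymin, xmax, ymax = box
    cx = (xmin + xmax) / 2 / img_w  # normalized box center, x
    cy = (ymin + ymax) / 2 / img_h  # normalized box center, y
    w = (xmax - xmin) / img_w       # normalized box width
    h = (ymax - ymin) / img_h       # normalized box height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Hypothetical class-to-id mapping for the two block classes
CLASS_IDS = {"text": 0, "margin": 1}

line = to_yolo_line(CLASS_IDS["text"], (200, 600, 1800, 2400), img_w=2000, img_h=3000)
print(line)
```

Each image would then get a label file with one such line per annotated block.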

Best,

Arnault

1 reply

felixdittrich92 Oct 25, 2024
Maintainer

Yeah, I think going ahead with the linked repo is the right direction, and later on using docTR or OnnxTR (docTR optimized for production scenarios) for the raw OCR 👍
