
Multilingual support #1699

Open
decadance-dance opened this issue Aug 20, 2024 · 18 comments

@decadance-dance

🚀 The feature

Support for multiple languages (corresponding to VOCABS["multilingual"]) in the pretrained models.

Motivation, pitch

It would be great to use models which support multiple languages, because this would significantly improve the user experience in various cases.

Alternatives

No response

Additional context

No response

@felixdittrich92
Contributor

Hi @decadance-dance 👋,

Have you already tried:
docTR: https://huggingface.co/Felix92/doctr-torch-parseq-multilingual-v1
OnnxTR: https://huggingface.co/Felix92/onnxtr-parseq-multilingual-v1
? :)

It depends a bit on whether there is any data from Mindee we could use.
That question goes to @odulcy-mindee ^^
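
For reference, a minimal usage sketch for the docTR model above, assuming the from_hub helper described on the model card (the image path is a placeholder); OnnxTR exposes a very similar API:

```python
from doctr.io import DocumentFile
from doctr.models import ocr_predictor, from_hub

# Load a page image (placeholder path)
doc = DocumentFile.from_images(["path/to/page.jpg"])

# Pull the multilingual recognition model from the Hugging Face Hub
reco_model = from_hub("Felix92/doctr-torch-parseq-multilingual-v1")

# Combine it with a pretrained detection backbone into a full OCR predictor
predictor = ocr_predictor(det_arch="db_resnet50", reco_arch=reco_model, pretrained=True)

result = predictor(doc)
print(result.render())
```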

@decadance-dance
Author

Hi, @felixdittrich92
I have used docTR for more than half a year but have never come across this multilingual model, lol.
So I am going to try it, thanks.

@felixdittrich92
Contributor

Ah, let's keep this issue open, there is more to do I think :)

@felixdittrich92
Contributor

Hi, @felixdittrich92 I have used docTR for more than half a year but have never come across this multilingual model, lol. So I am going to try it, thanks.

Happy to hear any feedback on how it works for you :)
The model was fine-tuned only on synthetic data.

@odulcy-mindee
Collaborator

It depends a bit on whether there is any data from Mindee we could use.
That question goes to @odulcy-mindee ^^

Unfortunately, we don't have such data.

@felixdittrich92
Contributor

felixdittrich92 commented Aug 27, 2024

@decadance-dance
For training such recognition models I don't see a problem: we can generate synthetic training data and, in the best case, only need real validation samples.
But for detection we would need real data, and that's the main issue.

In general we would need the help of the community to collect documents (newspapers, receipt photos, etc.) in diverse languages (they can be unlabeled). This would require a license to sign so that we can freely use the data.
With enough diverse data we could then use Azure Document AI, for example, to pre-label it.
Later on I wouldn't see an issue with open-sourcing this dataset.

But I'm not sure how to trigger such an "event" 😅 @odulcy-mindee
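
As an illustration of the synthetic-recognition-data idea, here is a minimal sketch assuming docTR's WordGenerator dataset and the "multilingual" vocab mentioned in the feature request; the font list is a placeholder, and fonts covering the target scripts would have to be available locally:

```python
from doctr.datasets import VOCABS, WordGenerator

# Render random words drawn from the multilingual charset
train_set = WordGenerator(
    vocab=VOCABS["multilingual"],
    min_chars=1,
    max_chars=20,
    num_samples=10_000,
    font_family=["FreeSans.ttf"],  # placeholder: fonts must cover the target scripts
)

img, label = train_set[0]  # a rendered word crop and its ground-truth string
```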

@nikokks
Contributor

nikokks commented Sep 6, 2024

Hello =)
I found some public datasets for various tasks:
english documents
mathematics documents
latex ocr
latex ocr
chinese ocr
chinese ocr
chinese ocr

@nikokks
Contributor

nikokks commented Sep 6, 2024

Moreover, it could be interesting for Chinese detection models to place multiple recognition samples in the same image without intersection. This should help a Chinese detection model perform better without real detection data.
Is anyone interested in creating random multilingual data for detection models (Hindi, Chinese, etc.)?
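
A rough sketch of that idea: paste pre-rendered word crops onto a blank page at non-overlapping positions and keep their boxes as detection ground truth. All names here are illustrative, not an existing docTR API, and crops are assumed to fit on the page:

```python
import random
from PIL import Image

def compose_page(crops, page_size=(1024, 1448), max_tries=50):
    """Paste word crops at random non-overlapping positions; return the page and the boxes."""
    page = Image.new("RGB", page_size, "white")
    boxes = []
    for crop in crops:
        w, h = crop.size
        for _ in range(max_tries):
            x = random.randint(0, page_size[0] - w)
            y = random.randint(0, page_size[1] - h)
            box = (x, y, x + w, y + h)
            # keep the position only if it does not intersect any already placed crop
            if all(box[2] <= b[0] or box[0] >= b[2] or box[3] <= b[1] or box[1] >= b[3] for b in boxes):
                page.paste(crop, (x, y))
                boxes.append(box)
                break
    return page, boxes
```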

@felixdittrich92
Contributor

Hi @nikokks 😃
Recognition should not be such a big deal; I have already found a good way to generate such data for fine-tuning.

Collecting multilingual data for detection is troublesome because it has to be real data (or, if possible, really good generated samples, for example with a fine-tuned FLUX model maybe!?).
We need different kinds of layouts/documents (newspapers, invoices, receipts, cards, etc.), so the data should come close to real use cases (not only scans but also document photos, etc.)
:)

@decadance-dance
Author

Collecting multilingual data for detection is troublesome because it has to be real data

Can you estimate how much data we would need to provide multilingual capabilities on the same level as the English-only OCR?

@felixdittrich92
Contributor

felixdittrich92 commented Oct 10, 2024

Hi @decadance-dance 👋,

I think if we could collect ~100-150 different types of documents for each language we would have a good starting point (in the end the language doesn't matter, it's more about the different character sets / fonts / text sizes). For example, the attached sample (bild_design) is super useful because it captures a lot of different fonts / text sizes, as is something "in the wild" like the attached photo (img_03771).

In the end it's more critical to make sure that we can legally use such images.

The tricky part is the detection, because we need completely real data. If we have that, the recognition part should be much easier: we could create some synthetic data and evaluate on the already collected real data.

I think if we are able to collect the data by the end of January, I could provide pre-labeling via Azure's Document AI.

Currently missing parts are:

  • handwritten (for the detection model - recognition is another story)
  • Chinese (symbols)
  • Hindi
  • Bulgarian / Ukrainian / Russian / Serbian (Cyrillic)
  • special symbols (bullet points, etc.)
  • more Latin-based (Spanish, Czech, ...)
  • ...

CC @odulcy-mindee

Lang list: https://github.com/eymenefealtun/all-words-in-all-languages
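
Since docTR vocabularies are plain strings of characters, a custom charset covering several of the scripts above could be composed from per-language entries; the exact keys below are assumptions, check doctr.datasets.vocabs for what your installed version actually provides:

```python
from doctr.datasets import VOCABS

# Merge a few per-language vocabs and deduplicate while keeping order
custom_vocab = "".join(
    dict.fromkeys(VOCABS["english"] + VOCABS.get("ukrainian", "") + VOCABS.get("hindi", ""))
)
print(len(custom_vocab), "characters")
```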

@decadance-dance
Author

@felixdittrich92, thank you for a detailed answer.
I'd be happy to help collect data. It would be great if we could spread this initiative across the community; I think if everyone provides at least a couple of samples, a good amount of data can be collected.
BTW, is there any flow or established process for collecting and submitting data?

@felixdittrich92
Contributor

felixdittrich92 commented Oct 10, 2024

@decadance-dance Not yet... maybe the easiest would be to create a Hugging Face Space for this, because there you could also easily take pictures with your smartphone, and under the hood we would push the taken or uploaded images into an HF dataset.

In this case we could also add an agreement before any data can be uploaded, stating that the person who uploads has all rights to the image and uploads it knowing that the images will be provided openly to everyone who downloads the dataset.

Wdyt?

Again CC @odulcy-mindee :D
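
A minimal sketch of such a collector Space, assuming Gradio plus the huggingface_hub client; the dataset repo id and the consent wording are placeholders, and the Space would need a write token configured:

```python
import uuid

import gradio as gr
from huggingface_hub import HfApi

api = HfApi()
DATASET_REPO = "your-org/multilingual-raw-docs"  # placeholder dataset repo

def upload(image_path, consent):
    if not consent:
        return "Please confirm you hold the rights to this image."
    if image_path is None:
        return "No image provided."
    name = f"raw/{uuid.uuid4().hex}.jpg"
    # Push the raw image straight into the dataset repo
    api.upload_file(
        path_or_fileobj=image_path,
        path_in_repo=name,
        repo_id=DATASET_REPO,
        repo_type="dataset",
    )
    return f"Uploaded as {name}, thank you!"

demo = gr.Interface(
    fn=upload,
    inputs=[
        gr.Image(type="filepath", sources=["upload", "webcam"]),
        gr.Checkbox(label="I hold the rights to this image and agree to publish it openly"),
    ],
    outputs="text",
)
demo.launch()
```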

@ramSeraph

I found one possible dataset of printed documents for multiple languages: Wikisource. They have text and images at the page level, originally created using some existing OCR (Google Vision/Tesseract), and the data has then been corrected/proofread by people. They have annotations to differentiate what has been proofread from what has not. An example: https://te.wikisource.org/wiki/పుట%3AAandhrakavula-charitramu.pdf/439. The license would be CC-BY-SA, and I expect them to have only pulled books whose copyright has expired. Collecting fonts for various languages is a bigger problem though (because of licenses).

felixdittrich92 pinned this issue Oct 11, 2024
@felixdittrich92
Contributor

Thanks @ramSeraph for sharing, I will have a look 👍

@decadance-dance @nikokks

I created a Space which can be used to collect some data (only raw data to start with), wdyt?
https://huggingface.co/spaces/Felix92/docTR-multilingual-Datacollector

Later on, once we have collected enough raw data, we can filter it and pre-label with Azure Document AI.
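
For the pre-labeling step, a rough sketch with Azure's Document Intelligence read model, assuming the azure-ai-formrecognizer package; endpoint, key, and file name are placeholders:

```python
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

with open("sample.jpg", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-read", document=f)
result = poller.result()

# Word-level transcriptions and polygons could be converted into docTR-style labels
for page in result.pages:
    for word in page.words:
        print(word.content, word.confidence, [(p.x, p.y) for p in word.polygon])
```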

@decadance-dance
Author

decadance-dance commented Oct 21, 2024 via email

@felixdittrich92
Contributor

@decadance-dance @nikokks @ramSeraph @allOther

I created a request to the Mindee team to provide support for this task.
https://mindee-community.slack.com/archives/C02HGHMUJH0/p1730452486444309

It would be nice if you could write a comment in the thread about your need for this support 🙏

@felixdittrich92
Contributor

felixdittrich92 commented Nov 12, 2024

Short update: the ticket is in progress. I have already collected ~30k real samples (Arabic, Hindi, Cyrillic, Thai, Greek, Chinese, more Latin, including also ~15% handwritten).

The first stage would be to improve the detection models; for the second stage, the recognition part, we could generate additional synthetic data.
