improve line detection on skewed images #262

Closed
eroux opened this issue May 16, 2023 · 21 comments
Labels
Urgent Urgent fix or feature request

Comments

@eroux
Contributor

eroux commented May 16, 2023

for (lightly) skewed images like

https://iiif.bdrc.io/bdr:I1KG10195::I1KG101950044.jpg/full/max/0/default.jpg

the current line detection of the HOCR import gives an output of

ལེཞེས་
བཟོདཔས་
གསེར་གྱི་བལྟས་
དང་།།ཆེའི་
པོམེ་ཏོག་པའོ།དང་
བུ་མོ་
ཀྱི་ཚེ་མར་
མཛེས་། བྱ་
བདང་རིན་ལས་
། ཆེའི་རྣམས་དེར་
བ་གུས་
བཀོད་པ་བརྩིགས་
ཡིན་
པས་ན་བསྐལ་
ནུས་པ་སྤྲོ་
ཡོདདེནུས་སངས་བུ་
སྣཚོགས་སྤྱོད་
མཐའ་
། ཁྲི་
བདྲེས་པ་ལ་བར་མེ་ཏོག་
གཟུགས་
བདང༌།འབྲས་བུ་
ནདགའ་ལ་བརྒྱན་
དང༌པའི་
དེ་རྟ་
དགའ་བ་ཀྱང་རྒྱ་པོའི་སྣ་ཚོགས་
ཡTLUIE36

on the following file:

00000044.zip

This is not ideal... perhaps the line break detection could be a bit more lenient?

ngawangtrinley added the Urgent label Oct 3, 2023
@kaldan007
Contributor

kaldan007 commented Nov 26, 2023

@eroux Sorry for the late response. I have one strategy in mind. The post-processing step that orders the boxes returns a nested list in which all the boxes belonging to a given line are grouped together. Since both the Google Vision JSON and the HOCR HTML output already carry line (or segment) information, we could store the original box-to-line assignment in a variable. After post-processing, we compare the number of lines in the original line information with the number of post-processed lines. If the difference is large, our post-processing was most likely too strict and we use the original line information directly; otherwise we use the post-processed line order.
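
A minimal sketch of the fallback described above, assuming both groupings are available as nested lists of boxes. The function name `choose_line_grouping` and the `max_relative_diff` cutoff are hypothetical and not taken from the Toolkit; what counts as a "huge" difference would need tuning on real pages.

```python
from typing import List, TypeVar

Box = TypeVar("Box")


def choose_line_grouping(
    original_lines: List[List[Box]],
    postprocessed_lines: List[List[Box]],
    max_relative_diff: float = 0.5,  # hypothetical cutoff, not a Toolkit value
) -> List[List[Box]]:
    """Keep the post-processed grouping unless its line count drifts too far
    from the line count reported by the OCR engine."""
    n_orig = len(original_lines)
    n_post = len(postprocessed_lines)
    if n_orig == 0:
        # No original line information to compare against.
        return postprocessed_lines
    relative_diff = abs(n_post - n_orig) / n_orig
    if relative_diff > max_relative_diff:
        # Post-processing probably split or merged too many lines.
        return original_lines
    return postprocessed_lines
```

A relative difference is used here rather than an absolute one so that the same cutoff behaves similarly on short and long pages; that choice is also an assumption.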


@eroux
Contributor Author

eroux commented Nov 26, 2023

well, that's an option yes. I believe the post-processing algorithm is pretty sophisticated and flexible, I really think it can be fixed easily by tweaking a few parameters, or maybe there's a bug that could be easy to fix. Perhaps we could look at that first?

@kaldan007
Contributor

kaldan007 commented Nov 26, 2023

The post-processing relies on a hard-coded threshold; that's why we are having this issue.
It is this one: https://github.com/OpenPecha/Toolkit/blob/master/openpecha/formatters/ocr/ocr.py#L41

@kaldan007
Contributor

I think we need to find a way to calculate this threshold, or we can go with the option above.

@eroux
Contributor Author

eroux commented Nov 26, 2023

yes, let's tweak the threshold a bit. For some reason I think the value is more or less fine and it's probably a bug in the algorithm. Let's first fix what we have before implementing a more complex algorithm.

@kaldan007
Contributor

By tweaking the threshold, do you mean passing the threshold as a parameter?

@eroux
Contributor Author

eroux commented Nov 26, 2023

oh I just meant hardcoding a different value

@kaldan007
Contributor

I think that will be an issue in the future with different kinds of pecha.

@eroux
Contributor Author

eroux commented Nov 26, 2023

why? the threshold is proportionate to the average stack box

@kaldan007
Contributor

https://github.com/OpenPecha/Toolkit/blob/master/openpecha/formatters/ocr/ocr.py#L185C21-L185C21
Don't you think that a different ratio will affect y_threshold differently and might give different line orders? I just don't want us to have to find a custom ratio for each pecha.
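
To illustrate the concern, here is a simplified stand-in for y-threshold grouping. It is not the actual code in ocr.py; it only assumes, as discussed in this thread, that the threshold is a ratio multiplied by the average box height. The `Box` type, the function `group_by_y_threshold`, and the sample values are all hypothetical.

```python
from statistics import mean
from typing import List, NamedTuple


class Box(NamedTuple):
    text: str
    x: float       # left edge
    y: float       # vertical centre
    height: float


def group_by_y_threshold(boxes: List[Box], ratio: float = 0.5) -> List[List[Box]]:
    """Group boxes into lines: a box joins the current line when its vertical
    centre is within y_threshold of the first box of that line."""
    if not boxes:
        return []
    # Assumed form of the threshold: proportional to the average box height.
    y_threshold = ratio * mean(b.height for b in boxes)
    lines: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b.y):
        if lines and abs(box.y - lines[-1][0].y) <= y_threshold:
            lines[-1].append(box)
        else:
            lines.append([box])
    # Restore reading order inside each line.
    return [sorted(line, key=lambda b: b.x) for line in lines]


# One skewed line: each box sits 4 units lower than the previous one.
skewed = [Box(str(i), x=i * 10, y=i * 4, height=10) for i in range(6)]
strict = group_by_y_threshold(skewed, ratio=0.5)
lenient = group_by_y_threshold(skewed, ratio=2.5)
```

In this toy data the small ratio splits the single skewed line into several pieces, while the large ratio keeps it together. A large ratio would in turn risk merging neighbouring lines on a denser page, which is why a single hard-coded value is hard to pick.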

@eroux
Contributor Author

eroux commented Nov 26, 2023

well, I don't know the code by heart so I can't find the solution for you, sorry. If you feel this is too complicated just go for the other option, I just think it's a waste of time.

@eroux
Contributor Author

eroux commented Nov 26, 2023

please make your initial solution optional though, the reason why we developed the post-processing part is because there are serious issues in the original line information, especially for older Google OCR and I don't want to use that for things that go on BUDA

@kaldan007
Contributor

kaldan007 commented Dec 9, 2023

@eroux I have experimented with multiple thresholds, but none of them returned a satisfactory result. Hence I have sent a PR with my approach, which can be switched on or off via a parameter called check_postcorrection. Here is the PR: #264

Your review would be highly appreciated.
Sorry for the delay.

@eroux
Contributor Author

eroux commented Dec 9, 2023

no problem, thanks! I'll have a look

@eroux
Contributor Author

eroux commented Dec 9, 2023

@kaldan007 can you add a test with the image and the expected result given in the initial comment of this issue? it will be helpful to demonstrate how your change fixes it

@kaldan007
Contributor

ལེ36
TLUIE
ཡ
པའི་ཚལ་ཞེས་བྱ་འརིན་པོཆེའི་ཤིང་སྣ་ཚོར་དང་། གསེར་དངུལ་བཻཌཱུརྱ་དང་ཤེལ་ལས་བྱས་པའི་ཕ་གུས་བརྩིགས་ཤིང་ཐམསྐས་ཡོདཔའི་རོ་བྲོ་བའི་རྫིང་བུ་བཞི་དང༌། ལྷའི་གོས་དང༌།མེ་ཏོག་དང་། འབྲས་བུ་མང་པོས་བརྒྱན་ཅིང་ཤིངརྟ་བཟང་པོ་ལྷའི་བུ་མོས་ 
མཛེས་པས་དགའ་བར་བྱས་པས་འདྲེན་པའོ། །ལྷོན་རྩུབ་འགྱུར་ཞེས་བྱ་པའི་ཚིལ། ལྷ་རྣམས་དེར་ངན་ལྷ་མ་ཡིན་དང་གཡུལ་འགྱེད་པར་སྤྲོ་ཞིང་། ཤིང་ལས་རིན་པོ་ཆེའི་གོ་ཆ་སྲ་བདང་། ༢ཁོར་ལོ་དང་། མདའ་བོ་ཆེ་ལ་སོགས་པའི་མཚོན་ཆ་སྣ་ཚོགས་འབྱུང་ 
བ་ཡོད་དོ། །བན་འདྲེས་པའིཚལ་ཞེས་བྱ་བ་འདོད་དགུ་འདྲེསམར་འབྱུང་བདང་།རིན་པོ་ཆེའི་ཤིང་དང་།མེ་ཏོག་དང་། ལྷའི་བུ་མོ་ལ་སོགས་བའང་འདྲེས་ཤིང་། ལོངས་སྤྱོད་རྣམས་ཀྱང་འཆོལ་བར་འཁྱམ་མོ། །བྱང་ནདགའ་བའི་ཚལ་ཞེས་བྱབ།རྫིང་བུ་བཟང་པོ་དགཥ་ 
བ་ཞེས་བྱ་བ་དང༌། ཤིང་དང་།མེ་ཏོག་དང་།ལྷའི་བུ་མོ་དགའ་བས་བརྒྱན་པ་སྟེ།དེར་སྤྱད་པསདགའ་བ་འཕེལ་བའོ།།ཆལ་དེ་ན་བསྐལ་པ་བཟང་པོའི་སངས་རྒྱས་སྟོང་གི་གཟིམས་ཁྲི་རིན་པོཆེའི་ལྗོན་པའི་ནང་ན་ཡོད་པའི་འོད་ཟེར་གྱི་བྱིན་ལྷའི་དབང་པོའི་བསོད་རྣམ་ 
ཀྱིས་བཟོདཔས་བལྟས་པན།དེའི་ལོགས་ལ་ལྷ་རྣམས་ཀྱི་ཚེ་རབས་དང་། ལེགས་ཉེས་དང་། གཡུལདུ་འཇུག་པ་ལ་གནོད་པས་བརྫི་བར་ནུས་པ་དང་མི་ནུས་པའི་མཚན་མ་མཐའ་དག་མེ་ལོང་ལ་གཟུགས་བརྙན་བཞིན་དུགསལ་བ་ཡོད་དོ།།དེ་དག་ཀྱང་རྒྱ་ཆེ་ལ་དབྱིབས་
ཌ་པ།གསེར་གྱི་རབ་དང་། རིན་པོཆེའི་ལྕོག་གིས་ཀུན་ནས་མཛེས་པའོ།།དེའི་ཕྱི་རོལ་ན་མིང་དང་བཀོད་པ་མཐུན་བའི་ས་གཞི་བཟང་པོ་བཞི་ཡོདདེ།སྣཚོགས་ཞེས་བྱབདང༌། བདྲེས་པ་ཞེས་བྱ་བདང༌།རྩུབ་འགྱུར་ཞེས་བྱ་བ་དང༌། དགའ་བ་ཞེས་བྱ་བ་དག་

@kaldan007
Contributor

@eroux this is the output I am getting after the update.

@eroux
Contributor Author

eroux commented Dec 9, 2023

ah sorry I meant can you add the example as a test in the repo : https://github.com/OpenPecha/Toolkit/tree/master/tests/formatters/google_vision it will make it much easier for me to look at the PR

@kaldan007
Contributor

sure will do that

@kaldan007
Contributor

@eroux I have included the page in the HOCR test case.

@eroux
Contributor Author

eroux commented Dec 21, 2023

so, I've merged @kaldan007 's PR which basically removes the post-processing when it goes wrong. Ideally we should have a better post-processing that doesn't have problems with skewed lines so that we can remove the duplication and other errors from Google OCR. Kurt's take on it is:

I'm working with a startup that does cursive handwriting recognition and they simply start with the bounding boxes provided by Google OCR. Then they take the centroid of each bounding box. If you take various begin-to-end paths through the centroids, then you choose as a line the path with the least standard deviation. Hope this is clear enough, but perhaps I do not understand the problem. The centroids are good because they will be much closer to the center of the line even if stacks etc. dip into other lines.

Something like that would be ideal to implement but I don't think we have the skills / time for that yet in the organization so closing this until then.
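
For reference, the centroid idea quoted above can be approximated with a much simpler greedy chaining: walk the boxes left to right and attach each one to the line whose most recent centroid is vertically closest. This is only a sketch under stated assumptions. It is not the least-standard-deviation path search Kurt describes, nor anything in the Toolkit, and the names `centroid`, `group_by_centroid_path`, and `tolerance_ratio` are hypothetical.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def centroid(box: BBox) -> Tuple[float, float]:
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def group_by_centroid_path(
    boxes: List[BBox], tolerance_ratio: float = 0.5
) -> List[List[BBox]]:
    """Greedy nearest-centroid chaining.

    Each box is compared with the last box of every existing line rather than
    with a page-wide y value, so a line may drift up or down across the page.
    """
    if not boxes:
        return []
    avg_height = sum(b[3] - b[1] for b in boxes) / len(boxes)
    tolerance = tolerance_ratio * avg_height  # assumed, would need tuning
    lines: List[List[BBox]] = []
    for box in sorted(boxes, key=lambda b: centroid(b)[0]):
        cy = centroid(box)[1]
        best_line = None
        best_dist = tolerance
        for line in lines:
            dist = abs(cy - centroid(line[-1])[1])
            if dist <= best_dist:
                best_line, best_dist = line, dist
        if best_line is None:
            lines.append([box])
        else:
            best_line.append(box)
    # Order lines top to bottom by the centroid of their first box.
    lines.sort(key=lambda line: centroid(line[0])[1])
    return lines


# Two parallel skewed lines, each dropping 4 units per box.
line_a = [(i * 20, i * 4 - 5, i * 20 + 15, i * 4 + 5) for i in range(6)]
line_b = [(i * 20, 30 + i * 4 - 5, i * 20 + 15, 30 + i * 4 + 5) for i in range(6)]
lines = group_by_centroid_path(line_a + line_b)
```

Because the comparison is local, the two skewed lines in this toy example stay intact even though their vertical ranges are wider than the tolerance. A real implementation would still need to handle gaps, marginal notes, and boxes that belong to no line.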

eroux closed this as completed Dec 21, 2023