improve line detection on skewed images #262

eroux · 2023-05-16T11:03:11Z

For (slightly) skewed images like

https://iiif.bdrc.io/bdr:I1KG10195::I1KG101950044.jpg/full/max/0/default.jpg

the current line detection of the HOCR import gives an output of

ལེཞེས་
བཟོདཔས་
གསེར་གྱི་བལྟས་
དང་།།ཆེའི་
པོམེ་ཏོག་པའོ།དང་
བུ་མོ་
ཀྱི་ཚེ་མར་
མཛེས་། བྱ་
བདང་རིན་ལས་
། ཆེའི་རྣམས་དེར་
བ་གུས་
བཀོད་པ་བརྩིགས་
ཡིན་
པས་ན་བསྐལ་
ནུས་པ་སྤྲོ་
ཡོདདེནུས་སངས་བུ་
སྣཚོགས་སྤྱོད་
མཐའ་
། ཁྲི་
བདྲེས་པ་ལ་བར་མེ་ཏོག་
གཟུགས་
བདང༌།འབྲས་བུ་
ནདགའ་ལ་བརྒྱན་
དང༌པའི་
དེ་རྟ་
དགའ་བ་ཀྱང་རྒྱ་པོའི་སྣ་ཚོགས་
ཡTLUIE36

on the following file:

00000044.zip

This is not ideal... perhaps the line break detection could be a bit more lenient?


kaldan007 · 2023-11-26T11:50:30Z

@eroux I am sorry for being late in resolving the issue. I have one strategy in mind. The post-processing box ordering gives us a list in which all the boxes belonging to a particular line are grouped in a nested list. Since both the Google Vision JSON and the HOCR HTML output carry line or segment information, I thought we could keep the original information about which boxes belong to which line in a variable. After post-processing, we can compare the number of lines in the original line information with the number of post-processed lines. If the difference is large, our post-processing was most likely too strict and we can use the original line information directly; otherwise we can use the post-processed line order.
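A minimal sketch of the fallback described above, for illustration only. The function name `choose_line_grouping` and the parameter `max_line_diff` are hypothetical and do not come from the Toolkit; the sketch assumes both groupings are available as nested lists of boxes.

```python
from typing import List, Sequence, TypeVar

Box = TypeVar("Box")  # any box representation; the sketch never inspects it


def choose_line_grouping(
    original_lines: Sequence[List[Box]],
    postprocessed_lines: Sequence[List[Box]],
    max_line_diff: int = 2,  # hypothetical tolerance, not a Toolkit value
) -> Sequence[List[Box]]:
    """Keep the post-processed lines unless their count drifts too far
    from the line count reported by the OCR engine itself."""
    line_diff = abs(len(postprocessed_lines) - len(original_lines))
    if line_diff > max_line_diff:
        # Post-processing was probably too strict: fall back to the
        # line information from the OCR output.
        return original_lines
    return postprocessed_lines
```

What counts as a "large" difference is left open in the thread, so the tolerance here is only a placeholder.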

eroux · 2023-11-26T12:28:56Z

well, that's an option yes. I believe the post-processing algorithm is pretty sophisticated and flexible, I really think it can be fixed easily by tweaking a few parameters, or maybe there's a bug that could be easy to fix. Perhaps we could look at that first?

kaldan007 · 2023-11-26T12:31:18Z

Post-processing relies on a hard-coded threshold. That's why we are having the issue.
this one https://github.com/OpenPecha/Toolkit/blob/master/openpecha/formatters/ocr/ocr.py#L41

kaldan007 · 2023-11-26T12:33:36Z

I think we need to find a way to calculate this threshold, or we can go with the option above.

eroux · 2023-11-26T12:34:55Z

yes, let's tweak the threshold a bit, but for some reason I think the value is more or less fine, it's probably a bug in the algorithm, let's first fix what we have before implementing a more complex algorithm

kaldan007 · 2023-11-26T12:38:35Z

By tweaking the threshold, do you mean passing the threshold as a parameter?

eroux · 2023-11-26T12:46:37Z

oh I just meant hardcoding a different value

kaldan007 · 2023-11-26T12:47:53Z

I think that will be an issue in the future with different kinds of pecha.

eroux · 2023-11-26T12:53:35Z

why? the threshold is proportionate to the average stack box

kaldan007 · 2023-11-26T13:01:20Z

https://github.com/OpenPecha/Toolkit/blob/master/openpecha/formatters/ocr/ocr.py#L185C21-L185C21
Don't you think a different ratio will affect y_threshold differently and might give different line orders? I just don't want us to have to find a custom ratio for each pecha.
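To make the concern concrete, here is a rough sketch of threshold-based line grouping. It is not the code in `ocr.py`: the `Box` class, `group_into_lines`, and the `ratio` argument are assumptions for illustration, and the only idea taken from the thread is that `y_threshold` is proportional to an average box size.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Box:
    """Hypothetical box: left edge, top edge, and height in pixels."""
    x: float
    y: float
    height: float

    @property
    def y_center(self) -> float:
        return self.y + self.height / 2


def group_into_lines(boxes: List[Box], ratio: float) -> List[List[Box]]:
    """Start a new line whenever a box centre lies more than y_threshold
    below the centre of the first box of the current line."""
    if not boxes:
        return []
    # Assumed form of the threshold: a ratio of the average box height.
    y_threshold = ratio * mean(b.height for b in boxes)
    lines: List[List[Box]] = []
    line_start = 0.0
    for box in sorted(boxes, key=lambda b: b.y_center):
        if not lines or box.y_center - line_start > y_threshold:
            lines.append([])
            line_start = box.y_center
        lines[-1].append(box)
    # Restore reading order inside each line.
    return [sorted(line, key=lambda b: b.x) for line in lines]


# One physical line whose boxes drift downwards by 2 px each (skew).
skewed_line = [Box(x=i * 20, y=i * 2, height=10) for i in range(5)]
print(len(group_into_lines(skewed_line, ratio=0.5)))  # 2: the line is split
print(len(group_into_lines(skewed_line, ratio=1.0)))  # 1: kept together
```

In this toy case a stricter ratio splits a single skewed line in two, which is the kind of over-splitting reported in the first comment. A looser ratio keeps it together here but could merge neighbouring lines on pages with tighter line spacing.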

eroux · 2023-11-26T13:09:53Z

well, I don't know the code by heart so I can't find the solution for you, sorry. If you feel this is too complicated just go for the other option, I just think it's a waste of time.

eroux · 2023-11-26T13:14:02Z

please make your initial solution optional though, the reason why we developed the post-processing part is because there are serious issues in the original line information, especially for older Google OCR and I don't want to use that for things that go on BUDA

kaldan007 · 2023-12-09T12:04:53Z

@eroux I have experimented with multiple thresholds, but none of them returns a satisfactory result. Hence I have sent a PR with my approach, which can be switched on or off via a parameter called check_postcorrection. Here is the PR: #264

Your review would be highly appreciated.
Sorry for the delay.

eroux · 2023-12-09T12:09:11Z

no problem, thanks! I'll have a look

eroux · 2023-12-09T12:17:15Z

@kaldan007 can you add a test with the image and the expected result given in the initial comment of this issue? it will be helpful to demonstrate how your change fixes it

kaldan007 · 2023-12-09T12:18:55Z

ལེ36
TLUIE
ཡ
པའི་ཚལ་ཞེས་བྱ་འརིན་པོཆེའི་ཤིང་སྣ་ཚོར་དང་། གསེར་དངུལ་བཻཌཱུརྱ་དང་ཤེལ་ལས་བྱས་པའི་ཕ་གུས་བརྩིགས་ཤིང་ཐམསྐས་ཡོདཔའི་རོ་བྲོ་བའི་རྫིང་བུ་བཞི་དང༌། ལྷའི་གོས་དང༌།མེ་ཏོག་དང་། འབྲས་བུ་མང་པོས་བརྒྱན་ཅིང་ཤིངརྟ་བཟང་པོ་ལྷའི་བུ་མོས་ 
མཛེས་པས་དགའ་བར་བྱས་པས་འདྲེན་པའོ། །ལྷོན་རྩུབ་འགྱུར་ཞེས་བྱ་པའི་ཚིལ། ལྷ་རྣམས་དེར་ངན་ལྷ་མ་ཡིན་དང་གཡུལ་འགྱེད་པར་སྤྲོ་ཞིང་། ཤིང་ལས་རིན་པོ་ཆེའི་གོ་ཆ་སྲ་བདང་། ༢ཁོར་ལོ་དང་། མདའ་བོ་ཆེ་ལ་སོགས་པའི་མཚོན་ཆ་སྣ་ཚོགས་འབྱུང་ 
བ་ཡོད་དོ། །བན་འདྲེས་པའིཚལ་ཞེས་བྱ་བ་འདོད་དགུ་འདྲེསམར་འབྱུང་བདང་།རིན་པོ་ཆེའི་ཤིང་དང་།མེ་ཏོག་དང་། ལྷའི་བུ་མོ་ལ་སོགས་བའང་འདྲེས་ཤིང་། ལོངས་སྤྱོད་རྣམས་ཀྱང་འཆོལ་བར་འཁྱམ་མོ། །བྱང་ནདགའ་བའི་ཚལ་ཞེས་བྱབ།རྫིང་བུ་བཟང་པོ་དགཥ་ 
བ་ཞེས་བྱ་བ་དང༌། ཤིང་དང་།མེ་ཏོག་དང་།ལྷའི་བུ་མོ་དགའ་བས་བརྒྱན་པ་སྟེ།དེར་སྤྱད་པསདགའ་བ་འཕེལ་བའོ།།ཆལ་དེ་ན་བསྐལ་པ་བཟང་པོའི་སངས་རྒྱས་སྟོང་གི་གཟིམས་ཁྲི་རིན་པོཆེའི་ལྗོན་པའི་ནང་ན་ཡོད་པའི་འོད་ཟེར་གྱི་བྱིན་ལྷའི་དབང་པོའི་བསོད་རྣམ་ 
ཀྱིས་བཟོདཔས་བལྟས་པན།དེའི་ལོགས་ལ་ལྷ་རྣམས་ཀྱི་ཚེ་རབས་དང་། ལེགས་ཉེས་དང་། གཡུལདུ་འཇུག་པ་ལ་གནོད་པས་བརྫི་བར་ནུས་པ་དང་མི་ནུས་པའི་མཚན་མ་མཐའ་དག་མེ་ལོང་ལ་གཟུགས་བརྙན་བཞིན་དུགསལ་བ་ཡོད་དོ།།དེ་དག་ཀྱང་རྒྱ་ཆེ་ལ་དབྱིབས་
ཌ་པ།གསེར་གྱི་རབ་དང་། རིན་པོཆེའི་ལྕོག་གིས་ཀུན་ནས་མཛེས་པའོ།།དེའི་ཕྱི་རོལ་ན་མིང་དང་བཀོད་པ་མཐུན་བའི་ས་གཞི་བཟང་པོ་བཞི་ཡོདདེ།སྣཚོགས་ཞེས་བྱབདང༌། བདྲེས་པ་ཞེས་བྱ་བདང༌།རྩུབ་འགྱུར་ཞེས་བྱ་བ་དང༌། དགའ་བ་ཞེས་བྱ་བ་དག་

kaldan007 · 2023-12-09T12:19:47Z

@eroux this is the output I am getting after the update.

eroux · 2023-12-09T12:51:18Z

ah sorry I meant can you add the example as a test in the repo : https://github.com/OpenPecha/Toolkit/tree/master/tests/formatters/google_vision it will make it much easier for me to look at the PR

kaldan007 · 2023-12-09T14:26:27Z

sure will do that

kaldan007 · 2023-12-19T05:40:41Z

@eroux I have included the page in the HOCR test case.

eroux · 2023-12-21T09:42:41Z

so, I've merged @kaldan007 's PR which basically removes the post-processing when it goes wrong. Ideally we should have a better post-processing that doesn't have problems with skewed lines so that we can remove the duplication and other errors from Google OCR. Kurt's take on it is:

I'm working with a startup that does cursive handwriting recognition and they simply start with the bounding boxes provided by google ocr. Then they take the centroid of each bounding box. If you take various begin-to-end paths through the centroids then you choose as a line the path with the least standard deviation. Hope this is clear enough - but perhaps I do not understand the problem. The centroids are good because they will be much closer to the center of the line even if stacks etc dip into other lines.

Something like that would be ideal to implement but I don't think we have the skills / time for that yet in the organization so closing this until then.
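For whoever picks this up later, here is a rough sketch in the spirit of the centroid idea. It is a greedy simplification, not the method described in the quote: it does not enumerate begin-to-end paths or compute a standard deviation, it just chains each centroid to the previous one on the same path. The function names, the `(left, top, right, bottom)` box layout, and the `max_step` tolerance are all assumptions for illustration.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # assumed (left, top, right, bottom)


def centroid(box: BBox) -> Tuple[float, float]:
    left, top, right, bottom = box
    return ((left + right) / 2, (top + bottom) / 2)


def group_by_centroid_paths(boxes: List[BBox], max_step: float) -> List[List[BBox]]:
    """Greedily chain boxes left to right into lines.

    A box joins the current path when its centroid is vertically within
    max_step of the centroid of the last box on that path, so a line may
    drift gradually across the page, as it does on a skewed image.
    """
    remaining = sorted(boxes, key=lambda b: centroid(b)[0])
    lines: List[List[BBox]] = []
    while remaining:
        path = [remaining[0]]
        leftover: List[BBox] = []
        for box in remaining[1:]:
            if abs(centroid(box)[1] - centroid(path[-1])[1]) <= max_step:
                path.append(box)
            else:
                leftover.append(box)
        lines.append(path)
        remaining = leftover
    # Order the lines top to bottom by the centroid of their first box.
    lines.sort(key=lambda line: centroid(line[0])[1])
    return lines
```

Because each box is compared with its neighbour on the path rather than with a page-wide y value, two lines can be separated even when the end of one sits at the same height as the start of the next. The sketch assumes the drift between adjacent boxes stays below `max_step` and the line spacing stays above it.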

ngawangtrinley added the Urgent label Oct 3, 2023

eroux closed this as completed Dec 21, 2023
