Commit
Showing 2 changed files with 55 additions and 47 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,62 +1,21 @@ | ||
OCR work at GiellaLT | ||
===================== | ||
|
||
This page will at some point document our OCR work. | ||
This page contains notes from our OCR work. | ||
|
||
|
||
We experimented with OCR in 2016 and earlier | ||
(cf. the meeting memo from 2016 below). With recent | ||
advances in OCR techniques we will have to start | ||
this work again, with new programs. This page sketches how. | ||
|
||
|
||
# Prerequisites for OCR work | ||
|
||
- Letter repertoire and corrected text. Candidate: the 1896 Bible. | ||
|
||
# Experimenting with Tesseract | ||
|
||
## Fetching the program | ||
The open-source program [Tesseract](https://github.com/tesseract-ocr) can be fetched from GitHub: | ||
|
||
``` | ||
git clone git@github.com:tesseract-ocr/tesseract.git | ||
git clone git@github.com:tesseract-ocr/tessdata.git | ||
... | ||
``` | ||
# OCR of modern texts | ||
|
||
## Development | ||
|
||
Tesseract comes with a set of languages (see **tessdata**). Most GiellaLT languages are not included, though. TODO: Document how to add them. | ||
|
||
## OCR reading | ||
|
||
A PDF document that consists of page images (scans) should be | ||
|
||
1. split into one pdf per page | ||
2. converted to png | ||
3. run through Tesseract with the relevant language(s) specified | ||
|
||
### One pdf per page | ||
|
||
In Preview, switch the document to Thumbnails view and drag one page at a time to the desktop. TODO: Find a way to do this on the command line. | ||
|
||
### converted to png | ||
|
||
Let us say the document contains 8 pages, named *1.pdf, 2.pdf, ...* after the split. Then do the following: | ||
|
||
``` | ||
for i in 1 2 3 4 5 6 7 8 ; do sips -s format png "$i.pdf" --out "$i.png" ; done | ||
``` | ||
|
||
|
||
### run through Tesseract | ||
|
||
Let us say the document contains Norwegian and Finnish. From the tesseract-ocr/tesseract directory, run the 8 png files through Tesseract: | ||
|
||
``` | ||
for i in 1 2 3 4 5 6 7 8 ; do tesseract --tessdata-dir ../tessdata/ "$i.png" "$i" -l fin+nor ; done | ||
``` | ||
|
||
The resulting files may then be collected into one text file. | ||
Experimenting with OCR: | ||
- [2022 testing with Tesseract](tesseract.md) | ||
|
||
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
Experimenting with Tesseract | ||
============================ | ||
|
||
Note: this page documents experiments done in 2022 (?). | ||
|
||
## Fetching the program | ||
The open-source program [Tesseract](https://github.com/tesseract-ocr) can be fetched from GitHub: | ||
|
||
``` | ||
git clone git@github.com:tesseract-ocr/tesseract.git | ||
git clone git@github.com:tesseract-ocr/tessdata.git | ||
... | ||
``` | ||
|
||
## Development | ||
|
||
Tesseract comes with a set of languages (see **tessdata**). Most GiellaLT languages are not included, though. TODO: Document how to add them. | ||
|
||
## OCR reading | ||
|
||
A PDF document that consists of page images (scans) should be | ||
|
||
1. split into one pdf per page | ||
2. converted to png | ||
3. run through Tesseract with the relevant language(s) specified | ||
|
||
### One pdf per page | ||
|
||
In Preview, switch the document to Thumbnails view and drag one page at a time to the desktop. TODO: Find a way to do this on the command line. | ||
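
One possible command-line answer to the TODO above, offered as a sketch and not as a tested part of this workflow: the `pdfseparate` tool from Poppler writes each page of a PDF to its own file. Poppler is an assumption here (these notes do not mention it, and it must be installed separately), and the function name `split_pdf` and the file name `document.pdf` are hypothetical.

```shell
#!/bin/sh
# Split a PDF into one file per page: 1.pdf, 2.pdf, ...
# Assumes Poppler's `pdfseparate` is installed (not part of the original notes).
split_pdf() {
    input=$1          # the multi-page PDF
    outdir=${2:-.}    # where to write the pages (default: current directory)
    mkdir -p "$outdir" || return 1
    # %d in the output pattern is replaced by the page number.
    pdfseparate "$input" "$outdir/%d.pdf"
}

# Usage (hypothetical file name):
#   split_pdf document.pdf pages
```

The resulting names match the *1.pdf, 2.pdf, ...* convention used in the next step.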
|
||
### converted to png | ||
|
||
Let us say the document contains 8 pages, named *1.pdf, 2.pdf, ...* after the split. Then do the following: | ||
|
||
``` | ||
for i in 1 2 3 4 5 6 7 8 ; do sips -s format png "$i.pdf" --out "$i.png" ; done | ||
``` | ||
|
||
|
||
### run through Tesseract | ||
|
||
Let us say the document contains Norwegian and Finnish. From the tesseract-ocr/tesseract directory, run the 8 png files through Tesseract: | ||
|
||
``` | ||
for i in 1 2 3 4 5 6 7 8 ; do tesseract --tessdata-dir ../tessdata/ "$i.png" "$i" -l fin+nor ; done | ||
``` | ||
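
Hard-coding the page numbers becomes unwieldy for longer documents. A minimal sketch that loops over whatever png files are present in the current directory might look like this; the function name `ocr_pages` is hypothetical, and the tessdata path and language codes are passed in instead of being fixed. Tesseract appends `.txt` to the output base name itself, so the base is given without an extension.

```shell
#!/bin/sh
# Run every png file in the current directory through Tesseract.
# Sketch only: assumes `tesseract` is on the PATH.
ocr_pages() {
    tessdata=$1   # path to the tessdata directory, e.g. ../tessdata/
    langs=$2      # language codes joined with +, e.g. fin+nor
    for f in *.png; do
        [ -e "$f" ] || continue   # no png files: the glob stays unexpanded
        # Output base is the file name without .png; Tesseract adds .txt.
        tesseract --tessdata-dir "$tessdata" "$f" "${f%.png}" -l "$langs"
    done
}

# Usage:
#   ocr_pages ../tessdata/ fin+nor
```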
|
||
The resulting files may then be collected into one text file. | ||
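
Collecting the per-page files can be sketched as follows, assuming the text files are named *1.txt, 2.txt, ...* in page order. The function name `collect_pages` and the output name `all.txt` are hypothetical. An explicit counter is used because a plain `cat *.txt` sorts names alphabetically and would place 10.txt before 2.txt.

```shell
#!/bin/sh
# Concatenate 1.txt, 2.txt, ... into one file, in numeric page order.
collect_pages() {
    pages=$1      # number of pages
    out=$2        # name of the combined output file
    : > "$out"    # create or empty the output file
    i=1
    while [ "$i" -le "$pages" ]; do
        cat "$i.txt" >> "$out"
        i=$((i + 1))
    done
}

# Usage:
#   collect_pages 8 all.txt
```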
|