Commit

update
Trondtr committed Feb 15, 2024
1 parent 81501a9 commit ff90c96
Showing 2 changed files with 55 additions and 47 deletions.
53 changes: 6 additions & 47 deletions tools/ocr.md
@@ -1,62 +1,21 @@
OCR work at GiellaLT
=====================

This page will at some point document our OCR work.
This page contains notes from our OCR work.


We experimented with OCR in 2016 and earlier
(cf. the meeting memo from 2016 below). With recent
advances in OCR techniques we will have to start
this work again, with new programs. This page sketches how.


# Prerequisites for OCR work

- Prerequisites: letter repertoire and corrected text. Candidate: the 1896 Bible.

# Experimenting with Tesseract

## Fetching the program
The open-source program [Tesseract](https://github.com/tesseract-ocr) can be fetched from GitHub:

```
git clone git@github.com:tesseract-ocr/tesseract.git
git clone git@github.com:tesseract-ocr/tessdata.git
...
```
# OCR of modern texts

## Development

Tesseract comes with a set of languages (see **tessdata**). Most GiellaLT languages are not included, though. TODO: Document how to add them.

## OCR reading

A pdf document that consists of page images (for example a scan) should be

1. split into one pdf per page
2. converted to png
3. run through Tesseract with the relevant language(s) as setting

### One pdf per page

In Preview, show the document in Thumbnails view and drag one page at a time to the desktop. TODO: Find a way to do this on the command line.

### Converted to png

Say the document contains 8 pages, named *1.pdf, 2.pdf, ...* after the split. Then do the following:

```
for i in 1 2 3 4 5 6 7 8 ; do sips -s format png $i.pdf --out $i.png ; done
```


### Run through Tesseract

Say the document contains Norwegian and Finnish. Standing in tesseract-ocr/tesseract, run the 8 png files through Tesseract:

```
# Tesseract appends .txt to the output name itself, so pass $i rather than $i.txt.
for i in 1 2 3 4 5 6 7 8 ; do tesseract --tessdata-dir ../tessdata/ $i.png $i -l fin+nor ; done
```

The resulting files may then be collected into one text file.
Experimenting with OCR:
- [2022 testing with Tesseract](tesseract.md)



49 changes: 49 additions & 0 deletions tools/tesseract.md
@@ -0,0 +1,49 @@
Experimenting with Tesseract
============================

Note: this page documents experiments done in 2022 (?).

## Fetching the program
The open-source program [Tesseract](https://github.com/tesseract-ocr) can be fetched from GitHub:

```
git clone git@github.com:tesseract-ocr/tesseract.git
git clone git@github.com:tesseract-ocr/tessdata.git
...
```

## Development

Tesseract comes with a set of languages (see **tessdata**). Most GiellaLT languages are not included, though. TODO: Document how to add them.
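
To check whether a given language is present in a data directory, Tesseract can list what it finds there. A minimal sketch, assuming the **tessdata** checkout lies in `../tessdata/`; `has_lang` is a hypothetical helper name, not part of Tesseract:

```shell
# has_lang CODE [TESSDATA_DIR]
# Succeeds if the language code (e.g. fin or nor) is among the languages
# Tesseract reports for the data directory (default: ../tessdata/).
# has_lang is a hypothetical helper; --list-langs and --tessdata-dir are Tesseract options.
has_lang() {
    code=$1
    dir=${2:-../tessdata/}
    # --list-langs prints a header line followed by one language code per line.
    # grep -x matches whole lines only, so a short code does not match a longer one.
    tesseract --list-langs --tessdata-dir "$dir" 2>&1 | grep -qx "$code"
}
```

For example, `has_lang fin && echo "Finnish is available"`.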

## OCR reading

A pdf document that consists of page images (for example a scan) should be

1. split into one pdf per page
2. converted to png
3. run through Tesseract with the relevant language(s) as setting

### One pdf per page

In Preview, show the document in Thumbnails view and drag one page at a time to the desktop. TODO: Find a way to do this on the command line.
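
One possible command-line route, not tried in these notes, is `pdfseparate` from the Poppler utilities. This is an assumption: Poppler is not part of Tesseract and must be installed separately. The sketch wraps it in a hypothetical helper `split_pdf`, which writes *1.pdf, 2.pdf, ...* into the current directory:

```shell
# split_pdf FILE
# Writes one pdf per page (1.pdf, 2.pdf, ...) into the current directory.
# Assumes pdfseparate (Poppler) is installed; split_pdf is a hypothetical helper.
split_pdf() {
    if ! command -v pdfseparate >/dev/null 2>&1; then
        echo "pdfseparate not found" >&2
        return 1
    fi
    # %d in the output pattern is replaced by the page number.
    pdfseparate "$1" %d.pdf
}
```

Usage would be `split_pdf document.pdf`, run in an otherwise empty working directory so that the page files are easy to pick out.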

### Converted to png

Say the document contains 8 pages, named *1.pdf, 2.pdf, ...* after the split. Then do the following:

```
for i in 1 2 3 4 5 6 7 8 ; do sips -s format png $i.pdf --out $i.png ; done
```
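
The page numbers need not be typed by hand. A sketch that converts every pdf in the current directory, assuming the directory holds only the single-page files from the split; `pdfs_to_png` is a hypothetical helper name:

```shell
# pdfs_to_png
# Converts every *.pdf in the current directory to a png with the same base name.
# Assumes macOS sips, as in the loop above; pdfs_to_png is a hypothetical helper.
pdfs_to_png() {
    for f in *.pdf; do
        # Skip the unexpanded pattern when there are no pdf files.
        [ -e "$f" ] || continue
        # ${f%.pdf} strips the extension, so 3.pdf becomes 3.png.
        sips -s format png "$f" --out "${f%.pdf}.png"
    done
}
```

The glob expands in lexical order (*10.pdf* before *2.pdf*), which does not matter here because each page is converted independently.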


### Run through Tesseract

Say the document contains Norwegian and Finnish. Standing in tesseract-ocr/tesseract, run the 8 png files through Tesseract:

```
# Tesseract appends .txt to the output name itself, so pass $i rather than $i.txt.
for i in 1 2 3 4 5 6 7 8 ; do tesseract --tessdata-dir ../tessdata/ $i.png $i -l fin+nor ; done
```

The resulting files may then be collected into one text file.
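
Collecting the pages could look like the sketch below. It assumes the page texts are named *1.txt, 2.txt, ...* and takes the page count as an argument, because a plain `cat *.txt` would place *10.txt* before *2.txt*. The helper name `collect_pages` and the output file name are hypothetical:

```shell
# collect_pages N OUTFILE
# Concatenates 1.txt ... N.txt into OUTFILE in numeric page order.
# collect_pages is a hypothetical helper; OUTFILE must not be one of the page files.
collect_pages() {
    n=$1
    out=$2
    : > "$out"                    # start from an empty output file
    i=1
    while [ "$i" -le "$n" ]; do
        cat "$i.txt" >> "$out"    # append the next page
        i=$((i + 1))
    done
}
```

For the 8-page example this would be `collect_pages 8 all.txt`.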
