update

giellalt · Feb 15, 2024 · ff90c96
1 parent 81501a9
commit ff90c96

Showing 2 changed files with 55 additions and 47 deletions.
diff --git a/tools/ocr.md b/tools/ocr.md
@@ -1,62 +1,21 @@
 OCR  work at GiellaLT
 =====================
 
-This page will at some point document our OCR work.
+This page contains notes from our OCR work.
 
 
-We have been experimenting with OCR in 2016 and earlier
-(cf. the meeting memo from 2016 below). With recent
-advances in OCR techniques we will have to start
-this work again, with new programs. This page sketches how.
 
 
+# Prerequisites for OCR work
 
+- Prerequisites: Letter repertoire and corrected text. Candidate: 1896 bible.
 
-# Experimenting with Tesseract
 
-## Fetching the program
-The open source program [Tesseract](https://github.com/tesseract-ocr) can be fetched from Github:
 
-```
-git clone git@github.com:tesseract-ocr/tesseract.git
-git clone git@github.com:tesseract-ocr/tessdata.git
-...
-```
+# OCR of modern texts
 
-## Development
-
-Tesseract comes with a set of languages (see **tessdata**). Most GiellaLT languages are not included, though. TODO: Document how to add them.
-
-## OCR reading
-
-A pdf document as a picture should be
-
-1. split into one pdf per page
-2. converted to png
-3. run through Tesseract with the relevant language(s) as setting
-
-### One pdf per page
-
-In Preview, set the document in Thumbs view and drag one page at a time to the desktop. TODO: Find a way to do this on the command line. 
-
-### converted to png
-
-Let us say the document contained 8 pages, after the split named *1.pdf, 2.pdf, ...* Then do the following:
-
-```
-for i in 1 2 3 4 5 6 7 8 9 10 ; do sips -s format png $i.pdf --out $i.png ; done
-```
-
-
-### run through Tesseract
-
-Let us say the document contains Norwegian and Finnish. Standing in tesseract-ocr/tesseract, run the 8 pdf files through tesseract:
-
-```
-for i in 1 2 3 4 5 6 7 8 ; do tesseract --tessdata-dir ../tessdata/ $i.png $i.txt -l fin+nor ; done
-```
-
-The resulting files may then be collected into one text file.
+Experimenting with OCR:
+- [2022 testing with Tesseract](tesseract.md)
 
 
 

diff --git a/tools/tesseract.md b/tools/tesseract.md
@@ -0,0 +1,49 @@
+Experimenting with Tesseract
+============================
+
+Note! This documents experiments done in 2022 (?).
+
+## Fetching the program
+The open source program [Tesseract](https://github.com/tesseract-ocr) can be fetched from Github:
+
+```
+git clone git@github.com:tesseract-ocr/tesseract.git
+git clone git@github.com:tesseract-ocr/tessdata.git
+...
+```
+
+## Development
+
+Tesseract comes with a set of languages (see **tessdata**). Most GiellaLT languages are not included, though. TODO: Document how to add them.
+
+## OCR reading
+
+A PDF document that consists of page images (for example, a scan) should be
+
+1. split into one pdf per page
+2. converted to png
+3. run through Tesseract with the relevant language(s) as setting
+
+### One pdf per page
+
+In Preview, set the document to Thumbnails view and drag one page at a time to the desktop. TODO: Find a way to do this on the command line.
+
+### converted to png
+
+Let us say the document contained 8 pages, after the split named *1.pdf, 2.pdf, ...* Then do the following:
+
+```
+for i in 1 2 3 4 5 6 7 8 ; do sips -s format png $i.pdf --out $i.png ; done
+```
+
+
+### run through Tesseract
+
+Let us say the document contains Norwegian and Finnish. Standing in tesseract-ocr/tesseract, run the 8 png files through tesseract:
+
+```
+for i in 1 2 3 4 5 6 7 8 ; do tesseract --tessdata-dir ../tessdata/ $i.png $i -l fin+nor ; done
+```
+
+The resulting files may then be collected into one text file.
+
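
Note on the new `tools/tesseract.md`: the three steps it describes (split, convert, OCR) and the final "collect into one text file" step could be scripted roughly as follows. This is only a sketch under assumptions not found in the commit: the helper names `split_pdf`, `ocr_pages` and `collect_pages` are hypothetical, `pdfseparate` (from poppler) is assumed to be installed as one possible answer to the command-line TODO, and the `../tessdata/` path and `fin+nor` languages are copied from the example in the page.

```shell
#!/bin/sh
# Sketch only; helper names are hypothetical and not part of the commit.

# Split a multi-page PDF into 1.pdf, 2.pdf, ...
# Assumption: poppler's pdfseparate is installed (it is not mentioned in the page).
split_pdf() {
    pdfseparate "$1" "%d.pdf"
}

# Convert pages 1..N to png with sips (macOS) and OCR each one.
# tesseract appends ".txt" to the output base, so "$i" yields "$i.txt".
ocr_pages() {
    n=$1
    i=1
    while [ "$i" -le "$n" ]; do
        sips -s format png "$i.pdf" --out "$i.png"
        tesseract --tessdata-dir ../tessdata/ "$i.png" "$i" -l fin+nor
        i=$((i + 1))
    done
}

# Concatenate 1.txt .. N.txt, in page order, into a single file.
collect_pages() {
    n=$1
    out=$2
    : > "$out"
    i=1
    while [ "$i" -le "$n" ]; do
        cat "$i.txt" >> "$out"
        i=$((i + 1))
    done
}

# Possible usage for an 8-page document (file names are placeholders):
#   split_pdf document.pdf && ocr_pages 8 && collect_pages 8 document.txt
```

Counting pages in a `while` loop rather than listing them by hand avoids a mismatch between the page count and the loop range, and keeps the pages in numeric order when they are concatenated.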