README ocr-tesseract-textract

Goal

This code shows how easy it is to use AWS Textract to synchronously OCR a single page. By "synchonous", we mean we submit a page to Textract and wait for the result.

Operation

Textract expects an image format, JPG or PNG. Scanned PDF docs have the image embedded in JPG format, so we first extract that binary data then send it to Textract. We parse the BLOCKs and LINEs and build the text output. This is then printed.

Creating a "scanned" image

To get a fake "scan" PDF, I created a one-page doc of text, converted it to a JPG, then converted that to a PDF.

Since our original text is something we know, we should be able to compare the accuracy of our extracted text with the original to determine accuracy. That's for later.

Invoke

Invoke like:

./textract.py palefire.jpg.pdf

Developer environment setup

I used pyenv to select python version 3.7.2 then created and activated a virtual environment, then installed boto3:

pyenv local 3.7.2
python -mvenv .venv
pip install -r requirements.txt

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.gitignore		.gitignore
README.rst		README.rst
palefire.jpg.pdf		palefire.jpg.pdf
palefire.txt		palefire.txt
pdf2jpg.py		pdf2jpg.py
requirements.txt		requirements.txt
textract.py		textract.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

README ocr-tesseract-textract

Goal

Operation

Creating a "scanned" image

Invoke

Developer environment setup

About

Releases

Packages

Languages

v-studios/ocr-textract-tesseract

Folders and files

Latest commit

History

Repository files navigation

README ocr-tesseract-textract

Goal

Operation

Creating a "scanned" image

Invoke

Developer environment setup

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages