How to properly extract text from scanned pdfs with Spark OCR? #328

JustHeroo · 2021-08-26T07:11:37Z

JustHeroo
Aug 26, 2021
Collaborator

I am trying to extract text from some scanned PDFs. Some of the text is extracted incorrectly, which I think is caused by noise in the PDFs. To address this, I used the code from the "Image processing after reading a pdf" section of this notebook (5.Spark_OCR.ipynb). When I run the code blocks below from this notebook, they keep running and never produce any output. Any feedback on that? How can I resolve this issue?


JustHeroo · 2021-08-26T11:50:22Z

JustHeroo
Aug 26, 2021
Collaborator Author

It seems you're missing some dependencies needed to handle a specific JPEG format.
Could you try with different input data, so we can at least make sure the environment works for regular data?
If that works, we can then set up the dependencies in Colab. The error you're getting occurs because some pages contain jpeg2 images.
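
To check whether a given PDF is likely affected before running the OCR pipeline, a rough sketch like the one below may help. It assumes that "jpeg2" here means JPEG 2000, which PDFs mark with the `/JPXDecode` stream filter (baseline JPEG uses `/DCTDecode`). The function names and the file name are hypothetical, and this is a plain byte scan using only the Python standard library, not part of Spark OCR. It can miss filters declared inside compressed object streams, so treat the result as a hint rather than proof.

```python
from pathlib import Path

# PDF stream filter names: /DCTDecode marks baseline JPEG data,
# /JPXDecode marks JPEG 2000 data.
# Assumption: "jpeg2" in the answer above refers to JPEG 2000.
IMAGE_FILTERS = {
    "jpeg": b"/DCTDecode",
    "jpeg2000": b"/JPXDecode",
}


def count_image_filters(pdf_bytes: bytes) -> dict:
    """Count how often each image filter name appears in the raw PDF bytes."""
    return {name: pdf_bytes.count(marker) for name, marker in IMAGE_FILTERS.items()}


def scan_pdf(path: str) -> dict:
    """Read a PDF from disk and count its image filter names (hypothetical helper)."""
    return count_image_filters(Path(path).read_bytes())


if __name__ == "__main__":
    # "scanned.pdf" is a placeholder path; replace it with your own file.
    pdf_path = Path("scanned.pdf")
    if pdf_path.exists():
        print(scan_pdf(str(pdf_path)))
```

A non-zero `jpeg2000` count would be consistent with the explanation above, while a zero count is inconclusive. Running the same notebook cells on a PDF that reports only `jpeg` entries is one way to try the "different input data" suggestion.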
