How to properly extract text from scanned pdfs with Spark OCR? #328
I am trying to extract text from some scanned PDFs, but some of the text is extracted incorrectly. I think this is caused by noise in the PDFs. To fix it, I used the code from the "Image processing after reading a pdf" section of this notebook (5.Spark_OCR.ipynb). When I run those blocks of code from the notebook, they keep running and never produce any output. Any feedback on that? How can I resolve this issue?
Replies: 1 comment
It seems you're missing some dependencies needed to handle a specific JPEG format.
Could you try with different input data, so that we can at least confirm the environment works for regular data?
If that works, we can then set up the dependencies in Colab. The error you're getting occurs because some pages contain jpeg2 images.
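To pick input data for that check, it may help to see which image encodings each PDF actually embeds. The snippet below is a rough sketch using only the Python standard library, not part of Spark OCR. It assumes that "jpeg2" above refers to JPEG 2000, which PDF files mark with the standard `/JPXDecode` filter name (ordinary JPEG uses `/DCTDecode`). The helper names `count_image_filters` and `uses_jpeg2000` are made up for illustration, and scanning raw bytes is only a heuristic: it can also count filter names that appear outside image objects.

```python
import re
from collections import Counter

# Standard PDF stream filter names commonly used for embedded images.
IMAGE_FILTERS = {
    b"DCTDecode": "JPEG",
    b"JPXDecode": "JPEG 2000",  # assumption: this is what "jpeg2" refers to
    b"CCITTFaxDecode": "CCITT fax",
    b"JBIG2Decode": "JBIG2",
}

FILTER_PATTERN = re.compile(rb"/(DCTDecode|JPXDecode|CCITTFaxDecode|JBIG2Decode)\b")


def count_image_filters(pdf_bytes: bytes) -> Counter:
    """Count how often each image filter name appears in the raw PDF bytes."""
    counts = Counter()
    for match in FILTER_PATTERN.finditer(pdf_bytes):
        counts[IMAGE_FILTERS[match.group(1)]] += 1
    return counts


def uses_jpeg2000(pdf_bytes: bytes) -> bool:
    """Return True if the PDF appears to contain at least one JPEG 2000 stream."""
    return count_image_filters(pdf_bytes)["JPEG 2000"] > 0


# Example usage (the path is a placeholder):
# with open("scanned.pdf", "rb") as f:
#     data = f.read()
# print(count_image_filters(data), uses_jpeg2000(data))
```

A file for which `uses_jpeg2000` returns `False` would be a reasonable candidate for the "regular data" test, while files that return `True` are the ones likely to need the extra dependencies.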