How to properly extract text from scanned pdfs with Spark OCR? #328

It seems you're missing some dependencies needed to handle a specific JPEG format: the error you're getting occurs because some pages contain jpeg2 images.
Could you try with different input data, so we can at least confirm that the environment works for regular data?
Once that works, we can set up the missing dependencies in Colab.
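To pick "regular" input for that test, it may help to see which PDFs embed this image format. The following is a rough sketch only, and it rests on an assumption: that "jpeg2" here means JPEG 2000, which PDFs store in streams marked with the `/JPXDecode` filter. The helper names `uses_jpx` and `split_by_jpx` are hypothetical and not part of Spark OCR; the check is a plain byte search, so it is a heuristic rather than a real PDF parse.

```python
from pathlib import Path
from typing import List, Tuple

# Assumption: "jpeg2" means JPEG 2000, which PDF marks with this stream filter name.
JPX_FILTER = b"/JPXDecode"


def uses_jpx(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw PDF bytes mention the JPXDecode filter."""
    return JPX_FILTER in pdf_bytes


def split_by_jpx(folder: str) -> Tuple[List[str], List[str]]:
    """Return (plain, jpx) lists of PDF file names found directly in `folder`."""
    plain: List[str] = []
    jpx: List[str] = []
    for path in sorted(Path(folder).glob("*.pdf")):
        # Files that mention the filter go to `jpx`, all others to `plain`.
        target = jpx if uses_jpx(path.read_bytes()) else plain
        target.append(path.name)
    return plain, jpx
```

Running `split_by_jpx` on the input folder gives a list of files that should be safe for a first environment check (`plain`) and a list that likely triggers the error (`jpx`).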

Replies: 1 comment

JustHeroo
Aug 26, 2021
Collaborator Author

Answer selected by JustHeroo