This sample shows how to extract text from PDF documents when regular methods produce garbled or unexpected text.
Some searchable PDF documents look just fine, yet it is not possible to copy or extract text from them properly, even with Adobe tools.
This happens when the document does not contain mappings from glyphs to Unicode characters, or contains incorrect mappings.
Docotic.Pdf provides the PdfTextExtractionOptions.UnmappedCharacterHandler property for such cases. This sample shows how to perform OCR for unmapped characters and then replace them with correct Unicode values.
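To make the idea concrete, here is a minimal, library-agnostic sketch in Python. It does not use the Docotic.Pdf or Tesseract APIs. The names extract_text, on_unmapped, and fake_ocr are hypothetical, glyph codes are modeled as plain integers, and the "OCR" step is a lookup table standing in for a real recognition engine. It only illustrates why a missing glyph-to-Unicode mapping yields broken text and how a fallback handler can repair it.

```python
from typing import Callable, Dict, List, Optional

REPLACEMENT = "\ufffd"  # Unicode replacement character, used when nothing is known


def extract_text(
    glyph_codes: List[int],
    to_unicode: Dict[int, str],
    on_unmapped: Optional[Callable[[int], str]] = None,
) -> str:
    """Toy text extraction: translate glyph codes through a font's Unicode map.

    Codes missing from the map are passed to on_unmapped (hypothetical hook),
    or replaced with U+FFFD when no handler is supplied.
    """
    chars = []
    for code in glyph_codes:
        ch = to_unicode.get(code)
        if ch is None:
            ch = on_unmapped(code) if on_unmapped else REPLACEMENT
        chars.append(ch)
    return "".join(chars)


def fake_ocr(code: int) -> str:
    """Stand-in for OCR: pretend the rendered glyph image was recognized."""
    recognized = {3: "l"}  # assumption: glyph 3 looks like the letter "l"
    return recognized.get(code, REPLACEMENT)


glyphs = [1, 2, 3, 3, 4]             # the page draws "Hello" with these glyphs
partial_map = {1: "H", 2: "e", 4: "o"}  # the font has no mapping for glyph 3

garbled = extract_text(glyphs, partial_map)
fixed = extract_text(glyphs, partial_map, fake_ocr)
```

In this sketch the page still renders correctly because the glyph shapes are intact. Only the translation to Unicode is broken, which is why recognizing the rendered glyph can recover the missing characters.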
This sample uses the Docotic.Pdf library and the Tesseract OCR engine. You also need to have the Visual Studio 2015-2019 x86 and x64 runtimes installed.