Fix garbled text when extracting from PDF documents in C# and VB.NET

This sample shows how to extract text from PDF documents when the regular methods produce garbled or unexpected text.

Some searchable PDF documents look fine on screen, yet text cannot be copied or extracted from them properly, even with Adobe tools.

This happens when the document does not contain mappings of glyphs to Unicode characters, or when the mappings it contains are incorrect.

Docotic.Pdf provides the PdfTextExtractionOptions.UnmappedCharacterHandler property for such cases. This sample shows how to perform OCR on unmapped characters and then replace them with the correct Unicode values.
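The idea behind such a handler can be sketched without the library. The following is a minimal, self-contained model and not Docotic.Pdf code: `GlyphDecoder`, its members, and the integer glyph codes are hypothetical names chosen for illustration, and the `unmappedHandler` delegate merely stands in for the OCR step that the sample performs with Tesseract.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical model of "glyph code -> Unicode" decoding with a fallback.
// None of these types belong to Docotic.Pdf; they only illustrate the idea.
public sealed class GlyphDecoder
{
    // Mappings found in the document (may be incomplete).
    private readonly IReadOnlyDictionary<int, string> _toUnicode;

    // Fallback invoked for glyph codes that have no mapping.
    // In the real sample this role is played by OCR.
    private readonly Func<int, string> _unmappedHandler;

    // Results of the fallback, cached so each glyph is recognized only once.
    private readonly Dictionary<int, string> _recovered = new Dictionary<int, string>();

    public GlyphDecoder(
        IReadOnlyDictionary<int, string> toUnicode,
        Func<int, string> unmappedHandler)
    {
        _toUnicode = toUnicode ?? throw new ArgumentNullException(nameof(toUnicode));
        _unmappedHandler = unmappedHandler ?? throw new ArgumentNullException(nameof(unmappedHandler));
    }

    public string Decode(IEnumerable<int> glyphCodes)
    {
        var result = new StringBuilder();
        foreach (var code in glyphCodes)
        {
            // Prefer the mapping stored in the document.
            if (_toUnicode.TryGetValue(code, out var text))
            {
                result.Append(text);
                continue;
            }

            // Otherwise ask the fallback once and remember its answer.
            if (!_recovered.TryGetValue(code, out text))
            {
                text = _unmappedHandler(code);
                _recovered[code] = text;
            }

            result.Append(text);
        }

        return result.ToString();
    }
}
```

For example, a decoder built with mappings for codes 1 and 2 only will call the handler a single time when it decodes the sequence 1, 2, 3, 3. Caching the handler result matters in practice because OCR is slow compared to a dictionary lookup, and the same glyph usually occurs many times in a document.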

This sample uses the Docotic.Pdf library and the Tesseract OCR engine. You also need the Visual C++ Redistributable for Visual Studio 2015-2019 (both x86 and x64) installed.

See also