Pdf text recognition phase1. basic text import and layout #135

Open
wants to merge 14 commits into
base: master

olivetthered

Basic implementation of importing text from PDF files.
No fonts or styling yet, and no second pass over the text areas to group them together; just some basic text area grouping and layout, but enough for additional features to be implemented fairly independently of each other. Currently only a single page is supported. To import the text from a PDF file as text, select the "import text as text" option (as opposed to the default "import text as vector") in the GUI when importing a vector file of PDF type.

Import text from a PDF document, with some fuzzy matching to put lines of text that appear to belong together in the same text frame. Layout is good, but there's no font or styling support as of yet, and rotated text isn't supported either. Creates lots of text boxes if the PDF file reports lots of text regions; these also need joining up in a second pass to merge text regions that should be together regardless of what the PDF file is reporting.
UI for selecting text import as either vectors (default) or as text. There will need to be some more variables for text import so the user can configure how loose or strict the text block matching is, as I doubt that even good guesses will be a one-size-fits-all solution.
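
The fuzzy matching and the user-configurable strictness described above could look roughly like the following. This is only a minimal sketch, not the code in this pull request: `TextLine`, `TextFrame`, `MatchTolerance`, `belongsTogether` and `groupLines` are hypothetical names, and the default tolerance values are assumptions chosen for illustration.

```cpp
#include <cmath>
#include <string>
#include <vector>

// Hypothetical stand-in for one line of text as reported by the PDF.
struct TextLine {
    double x;         // left edge of the line
    double baseline;  // baseline position, increasing down the page
    double height;    // line height
    std::string text;
};

// Hypothetical stand-in for a text frame holding consecutive lines.
struct TextFrame {
    std::vector<TextLine> lines;
};

// Tolerances as fractions of the line height (assumed defaults).
// These are the kind of variables a user might want to tune.
struct MatchTolerance {
    double indent = 0.5;   // allowed drift of the left edge
    double leading = 1.6;  // maximum baseline-to-baseline distance
};

// True when 'next' sits just below 'prev' and is roughly left-aligned with it.
bool belongsTogether(const TextLine& prev, const TextLine& next,
                     const MatchTolerance& tol)
{
    const double gap = next.baseline - prev.baseline;
    const bool below = gap > 0.0 && gap <= tol.leading * prev.height;
    const bool aligned = std::fabs(next.x - prev.x) <= tol.indent * prev.height;
    return below && aligned;
}

// Single pass: append each line to the current frame if it matches the
// frame's last line, otherwise start a new frame.
std::vector<TextFrame> groupLines(const std::vector<TextLine>& lines,
                                  const MatchTolerance& tol = MatchTolerance())
{
    std::vector<TextFrame> frames;
    for (const TextLine& line : lines) {
        if (!frames.empty() &&
            belongsTogether(frames.back().lines.back(), line, tol)) {
            frames.back().lines.push_back(line);
        } else {
            TextFrame frame;
            frame.lines.push_back(line);
            frames.push_back(frame);
        }
    }
    return frames;
}
```

Calling `groupLines(lines)` with the defaults gives loose matching; passing a `MatchTolerance` with a smaller `indent` or `leading` makes the matching stricter and yields more, smaller frames. A second pass that merges neighbouring frames would sit on top of this and is not shown.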
@olivetthered
Author

Pending file review by ale

@olivetthered
Author

I raised the following bug to have this pull request reviewed and integrated:
https://bugs.scribus.net/view.php?id=16142

Implement text import as a new output device inheriting SlaOutputDev, making the appropriate private members of SlaOutputDev protected.
Tidy up so we make minimal changes from master.
Fixed some spacing differences with master.
Override Type 3 font output, as we don't want to get confused and try to render Type 3 fonts as vectors when vector rendering is only partially functional due to the overrides from SlaOutputDev. Hopefully they can be implemented in the same way as addChar, but if that turns out to be infeasible the overrides can be removed and they can be rendered as vectors in the finished implementation.
…taken

Change the name of TextOutputDev to PdfTextOutputDev, as the old name is already taken.
The PdfTextOutputDev naming matches the naming of PdfTextRecognition.
…variables

To make the classes and members uniform across the PdfTextRecognition implementation, rename all the classes, member variables and functions so they start with PdfText unless it's not appropriate.
Moved the output dev into the pdftextrecognition files, meaning the slaoutput dev files no longer have any dependencies on pdftextrecognition. This keeps things neat and tidy and all together.
fix z-order/grouping. I don't know why I did this in the first place
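
The structure described in the commits above (a text output device that inherits from the vector one, reuses members that were changed from private to protected, and overrides Type 3 glyph output so it is not rendered as vectors) can be illustrated with the sketch below. It is an assumption-laden illustration only: `SketchVectorDev` and `SketchTextDev` are hypothetical stand-ins and do not reproduce the real SlaOutputDev or PdfTextOutputDev interfaces.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the vector import output device.
class SketchVectorDev {
public:
    virtual ~SketchVectorDev() = default;

    // Default behaviour: every glyph becomes a vector outline.
    virtual void drawChar(char c) {
        m_items.push_back(std::string("outline:") + c);
    }
    virtual void drawType3Char(char c) {
        m_items.push_back(std::string("type3-outline:") + c);
    }

    const std::vector<std::string>& items() const { return m_items; }

protected:
    // Protected rather than private, so the text subclass can add its
    // own items to the same list.
    std::vector<std::string> m_items;
};

// Hypothetical stand-in for the text import output device.
class SketchTextDev : public SketchVectorDev {
public:
    // Collect characters as text instead of emitting outlines.
    void drawChar(char c) override { m_text += c; }

    // Deliberately ignore Type 3 glyphs so they are not half-rendered
    // as vectors; removing this override restores the base behaviour.
    void drawType3Char(char) override {}

    // Emit the collected characters as one text item.
    void flush() {
        if (!m_text.empty()) {
            m_items.push_back("text:" + m_text);
            m_text.clear();
        }
    }

private:
    std::string m_text;
};
```

With this shape, drawing 'H', a Type 3 glyph, and 'i' followed by `flush()` leaves a single `text:Hi` item and no outlines. Dropping the `drawType3Char` override would fall back to vector rendering of those glyphs, which mirrors the fallback mentioned in the commit message.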
1 participant