related to the closed issue of annotation/drawings #164
Comments
This was useful, thanks!
Is this the approach you are proposing? Images are set to None in the list once they have been handled, a check for None values is added, and at the end all None values are removed from the list.
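For reference, the placeholder approach described above could look roughly like the sketch below. The names `process_images`, `handle` and `compact` are hypothetical and not taken from the library; the list is assumed to hold arbitrary image records.

```python
def process_images(images, handle):
    """Handle every remaining entry once, leaving None in its slot."""
    for i, img in enumerate(images):
        if img is None:
            # This slot was already dealt with earlier.
            continue
        handle(img)
        # Keep the slot so the positions of the other entries do not shift.
        images[i] = None
    return images


def compact(images):
    """Return a new list without the None placeholders."""
    return [img for img in images if img is not None]
```

Because `compact` builds a new list, it can be called only after every index-based lookup is finished, so the original positions stay valid until then.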
Yes, thanks. However, I’m uncertain about removing the None entries entirely, as this would alter the list’s size and potentially disrupt the indexing for
I'm sorry to bring this up again, but I think I've found a solution that is easy to apply.
The problem arises when a page contains many drawings: these are captured with page.get_drawings() and consolidated into a larger rectangle with
Consider these slides.
slide12.pdf
In the current version, 0.17, the text boxes with a yellow background are not extracted. The reason is that pymupdf identifies many small drawings in the text and then consolidates all of them into a larger rectangle that encompasses the whole yellow text box.
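To illustrate the consolidation step, here is a minimal sketch of merging several small rectangles into one enclosing rectangle. It is not the library's actual clustering code: `union_rect` is a hypothetical helper, and each rectangle is assumed to be a plain (x0, y0, x1, y1) tuple rather than a pymupdf object.

```python
def union_rect(rects):
    """Return the smallest (x0, y0, x1, y1) rectangle covering all rects."""
    rects = list(rects)
    if not rects:
        return None
    x0 = min(r[0] for r in rects)
    y0 = min(r[1] for r in rects)
    x1 = max(r[2] for r in rects)
    y1 = max(r[3] for r in rects)
    return (x0, y0, x1, y1)


# Two small, overlapping drawings collapse into one larger rectangle.
small_drawings = [(10, 10, 20, 20), (15, 5, 40, 18)]
merged = union_rect(small_drawings)
```

Any text that lies inside `merged` ends up inside the consolidated drawing area, which is why the yellow text boxes are affected.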
At first I wished it could extract the text anyway, but then I thought: can we just extract this consolidated drawing as if it were an image? In the current version, the function output_images only extracts the identified images, not the drawings.
vg_cluster0 contains both drawings and images:
In a similar way, we can add all the drawings to the list of images:
In this way all the drawings are extracted as images and, if you set force_text=True, the text inside the drawing is also extracted (previously it was not).
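The idea can be sketched independently of the library as follows. This is only an illustration under assumptions: `regions_to_extract` is a hypothetical name, rectangles are plain (x0, y0, x1, y1) tuples, and skipping exact duplicates is my own addition, since the cluster list may already contain the image rectangles.

```python
def regions_to_extract(image_rects, drawing_rects):
    """Combine image and drawing rectangles into one list of regions.

    Image rectangles come first; a drawing rectangle is appended only
    if an identical rectangle is not already in the list.
    """
    regions = list(image_rects)
    for rect in drawing_rects:
        if rect not in regions:
            regions.append(rect)
    return regions


images = [(0, 0, 10, 10)]
drawings = [(0, 0, 10, 10), (20, 20, 50, 40)]
regions = regions_to_extract(images, drawings)
```

Each rectangle in `regions` would then be treated like an image area when saving the page content.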
Hope this may be useful