related to the closed issue of annotation/drawings #164

Open
kingennio opened this issue Oct 7, 2024 · 3 comments

Comments

@kingennio

I'm sorry to bring this up again, but I think I've found a solution that is easy to apply.
The problem concerns pages that contain many drawings. These are captured with page.get_drawings() and consolidated into larger rectangles with:

for bbox in refine_boxes(page.cluster_drawings(drawings=paths)):
    if is_significant(bbox, paths):
        vg_clusters0.append(bbox)

Consider these slides.
slide12.pdf
In the current version (0.17) the text boxes with a yellow background are not extracted. The reason is that pymupdf identifies many small drawings in the text and then consolidates all of these drawings into one larger rectangle that encompasses the whole yellow text box.
At first I wished it would extract the text anyway, but then I thought: can we just extract this consolidated drawing as if it were an image? In the current version, the function output_images only outputs the identified images, not the drawings.
vg_clusters0 contains both drawings and images:

# also add image rectangles to the list
vg_clusters0.extend(parms.img_rects)

In a similar way, we can add all the drawings to the list of images:

# these may no longer be pairwise disjoint:
# remove area overlaps by joining into larger rects
parms.vg_clusters0 = refine_boxes(vg_clusters0)
parms.img_rects = parms.vg_clusters0[:]

This way all the drawings are extracted as images and, if you set force_text=True, the text inside each drawing is extracted as well (previously it was not).
I hope this is useful.
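To illustrate what "joining into larger rects" means for the combined list of drawing and image rectangles, here is a minimal sketch. It is not the library's refine_boxes: the helpers intersects, union and join_overlapping are hypothetical stand-ins, and rectangles are plain (x0, y0, x1, y1) tuples instead of pymupdf Rect objects.

```python
def intersects(a, b):
    """Return True if rectangles a and b overlap with non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def union(a, b):
    """Return the smallest rectangle that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def join_overlapping(boxes):
    """Merge overlapping rectangles until the result is pairwise disjoint.

    Hypothetical stand-in for the idea behind refine_boxes, not its code.
    """
    boxes = list(boxes)
    merged = True
    while merged:  # repeat, because a merge can create new overlaps
        merged = False
        result = []
        for box in boxes:
            for k, other in enumerate(result):
                if intersects(box, other):
                    result[k] = union(box, other)
                    merged = True
                    break
            else:  # no overlap found: keep the box as it is
                result.append(box)
        boxes = result
    return boxes


# Two overlapping drawing clusters plus one separate image rectangle.
drawings = [(0, 0, 10, 10), (5, 5, 20, 20)]
images = [(100, 100, 120, 120)]
all_rects = join_overlapping(drawings + images)
```

With the change proposed above, a list like all_rects would be used as the image list, so the merged drawing area is written out the same way as a real image.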

@mennafateen

This was useful, thanks!
After this modification I was getting duplicate images in the output. What fixed it was removing the del parms.img_rects[i] that runs while iterating in the output_images function, setting the entry at that index to None instead, and skipping None entries.
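A small, library-independent sketch of why deleting from a list while enumerating it is fragile, and how marking entries as None avoids the problem. The function names are made up for illustration, and whether this is exactly how the duplicates arose in output_images is an assumption, not something verified against the pymupdf4llm source.

```python
def consume_with_del(items):
    """Delete each visited entry while iterating (the fragile pattern)."""
    items = list(items)
    seen = []
    for i, item in enumerate(items):
        seen.append((i, item))
        del items[i]  # shifts the remaining entries left by one
    return seen, items


def consume_with_none(items):
    """Mark each visited entry as None instead of deleting it."""
    items = list(items)
    seen = []
    for i, item in enumerate(items):
        if item is None:  # already handled earlier
            continue
        seen.append((i, item))
        items[i] = None  # list length and positions stay unchanged
    return seen, items


seen_del, left_del = consume_with_del(["a", "b", "c", "d"])
seen_none, left_none = consume_with_none(["a", "b", "c", "d"])
```

With del, every second entry is skipped and the leftover entries move to lower positions, so a later pass visits them under indices that were already used. With None, every entry is visited exactly once under its original index.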

@kingennio
Author

kingennio commented Nov 1, 2024

Is this the approach you are proposing? Each image is set to None in the list after it has been handled, a check for None values is added, and at the end all None values are removed from the list:

def output_images(parms, text_rect):
    """Output images and graphics above text rectangle."""
    if not parms.img_rects:
        return ""
    this_md = ""  # markdown string

    if text_rect is not None:  # select images above the text block
        for i, img_rect in enumerate(parms.img_rects):
            if img_rect is None:  # Skip already processed images
                continue
            if not img_rect.y1 <= text_rect.y0:
                continue
            pathname = save_image(parms.page, img_rect, i)
            if pathname:
                this_md += GRAPHICS_TEXT % pathname
            if force_text:
                img_txt = write_text(
                    parms,
                    img_rect,
                    tabs=None,
                    tab_rects={},  # we have no tables here
                    img_rects=[],  # we have no other images here
                    force_text=True,
                )
                if not is_white(img_txt):  # was there text at all?
                    this_md += img_txt
            parms.img_rects[i] = None  # Mark as processed instead of deleting

    else:  # output all remaining images
        for i, img_rect in enumerate(parms.img_rects):
            if img_rect is None:  # Skip already processed images
                continue
            pathname = save_image(parms.page, img_rect, i)
            if pathname:
                this_md += GRAPHICS_TEXT % pathname
            if force_text:
                img_txt = write_text(
                    parms,
                    img_rect,
                    tabs=None,
                    tab_rects={},  # we have no tables here
                    img_rects=[],  # we have no other images here
                    force_text=True,
                )
                if not is_white(img_txt):
                    this_md += img_txt
            parms.img_rects[i] = None  # Mark as processed instead of deleting
    
    # Remove None entries from parms.img_rects
    parms.img_rects = [rect for rect in parms.img_rects if rect is not None]
    
    return this_md

@mennafateen

Yes, thanks. However, I'm unsure about removing the None entries entirely: doing so changes the list's length and shifts the remaining items, so the index i passed in pathname = save_image(parms.page, img_rect, i) would no longer match the original positions. It might be safer to keep the None values to preserve the original indices.
