related to the closed issue of annotation/drawings #164

kingennio · 2024-10-07T09:05:17Z

I'm sorry to bring this up again, but I think I've found a solution that is easy to apply.
The problem occurs when a page contains many drawings: these are captured with page.get_drawings() and consolidated into larger rectangles with

for bbox in refine_boxes(page.cluster_drawings(drawings=paths)):
    if is_significant(bbox, paths):
        vg_clusters0.append(bbox)

Consider these slides.
slide12.pdf
In the current version 0.17 the text boxes with a yellow background are not extracted. The reason is that pymupdf identifies many small drawings in the text and then consolidates all of them into a larger rectangle that encompasses the whole yellow text box.
At first I wished it could extract the text anyway, but then I thought: can we just extract this consolidated drawing as if it were an image? In the current version the function output_images only extracts the identified images, not the drawings.
vg_clusters0 contains both drawings and images:

# also add image rectangles to the list
vg_clusters0.extend(parms.img_rects)

In a similar way, we can add all the drawings to the list of images:

# these may no longer be pairwise disjoint:
# remove area overlaps by joining into larger rects
parms.vg_clusters0 = refine_boxes(vg_clusters0)
parms.img_rects = parms.vg_clusters0[:]

This way all the drawings are extracted as images and, if you set force_text=True, the text inside the drawings is extracted as well (previously it was not).
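To illustrate the idea of joining overlapping rectangles and then treating every resulting cluster as an image rectangle, here is a minimal standalone sketch. It does not use pymupdf: rectangles are plain (x0, y0, x1, y1) tuples, and merge_overlapping is a hypothetical stand-in for refine_boxes, whose actual behavior may differ.

```python
def intersects(a, b):
    """True if two (x0, y0, x1, y1) rectangles share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def union(a, b):
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_overlapping(rects):
    """Hypothetical stand-in for refine_boxes: join overlapping rects."""
    merged = []
    for r in rects:
        changed = True
        while changed:  # repeat until r overlaps nothing already merged
            changed = False
            for m in merged:
                if intersects(r, m):
                    merged.remove(m)
                    r = union(r, m)
                    changed = True
                    break
        merged.append(r)
    return merged


drawing_rects = [(0, 0, 10, 10), (5, 5, 20, 20)]  # two overlapping drawings
image_rects = [(100, 100, 120, 120)]              # one separate image

vg_clusters0 = merge_overlapping(drawing_rects + image_rects)
img_rects = vg_clusters0[:]  # copy, so later edits do not touch vg_clusters0
print(img_rects)
```

The slice copy matters if img_rects is later modified in place while vg_clusters0 is still needed unchanged.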
I hope this is useful.


mennafateen · 2024-11-01T05:32:54Z

This was useful, thanks!
After this modification I was getting duplicate images in the output. What fixed it was removing the del parms.img_rects[i] inside the loop in the output_images function, setting the entry at that index to None instead, and skipping None entries.
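A small standalone sketch of why deleting from a list while enumerating it is fragile. The function names and the string placeholders are hypothetical and stand in for the image rectangles. One plausible mechanism for the duplicates (an assumption, not verified against the library source) is that entries skipped by the loop are handled on a later call under an index that was already used.

```python
def consume_with_del(rects):
    """Delete each entry right after handling it (the fragile pattern)."""
    handled = []
    for i, rect in enumerate(rects):
        handled.append((i, rect))
        del rects[i]  # later entries shift left while the loop index advances
    return handled


def consume_with_none(rects):
    """Mark each entry as None after handling it; the list length is stable."""
    handled = []
    for i, rect in enumerate(rects):
        if rect is None:  # already handled earlier
            continue
        handled.append((i, rect))
        rects[i] = None
    return handled


with_del = ["A", "B", "C", "D"]
del_result = consume_with_del(with_del)
print(del_result, with_del)    # every second entry is skipped and left behind

with_none = ["A", "B", "C", "D"]
none_result = consume_with_none(with_none)
print(none_result, with_none)  # every entry is handled exactly once
```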

kingennio · 2024-11-01T13:40:22Z

Is this the approach you are proposing? Images are set to None in the list after being handled, a check for None values is added, and at the end all None values are removed from the list:

def output_images(parms, text_rect):
    """Output images and graphics above text rectangle."""
    if not parms.img_rects:
        return ""
    this_md = ""  # markdown string

    if text_rect is not None:  # select images above the text block
        for i, img_rect in enumerate(parms.img_rects):
            if img_rect is None:  # Skip already processed images
                continue
            if not img_rect.y1 <= text_rect.y0:
                continue
            pathname = save_image(parms.page, img_rect, i)
            if pathname:
                this_md += GRAPHICS_TEXT % pathname
            if force_text:
                img_txt = write_text(
                    parms,
                    img_rect,
                    tabs=None,
                    tab_rects={},  # we have no tables here
                    img_rects=[],  # we have no other images here
                    force_text=True,
                )
                if not is_white(img_txt):  # was there text at all?
                    this_md += img_txt
            parms.img_rects[i] = None  # Mark as processed instead of deleting

    else:  # output all remaining images
        for i, img_rect in enumerate(parms.img_rects):
            if img_rect is None:  # Skip already processed images
                continue
            pathname = save_image(parms.page, img_rect, i)
            if pathname:
                this_md += GRAPHICS_TEXT % pathname
            if force_text:
                img_txt = write_text(
                    parms,
                    img_rect,
                    tabs=None,
                    tab_rects={},  # we have no tables here
                    img_rects=[],  # we have no other images here
                    force_text=True,
                )
                if not is_white(img_txt):
                    this_md += img_txt
            parms.img_rects[i] = None  # Mark as processed instead of deleting
    
    # Remove None entries from parms.img_rects
    parms.img_rects = [rect for rect in parms.img_rects if rect is not None]
    
    return this_md

mennafateen · 2024-11-01T14:56:54Z

Yes, thanks. However, I’m uncertain about removing the None entries entirely, as this would change the list’s size and could disrupt the indexing used in pathname = save_image(parms.page, img_rect, i). It might be safer to keep the None values to preserve the original index positions.
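The indexing concern can be shown with a small standalone sketch. Here drain is a hypothetical stand-in for output_images, the integers stand in for rectangles, and the assumption is that the index i ends up in the saved file name: if the list is compacted at the end of each call, the next call starts counting from 0 again and reuses indices.

```python
def drain(rects, limit, compact):
    """Handle entries <= limit; return the (index, entry) pairs emitted."""
    emitted = []
    for i, rect in enumerate(rects):
        if rect is None or rect > limit:
            continue
        emitted.append((i, rect))
        rects[i] = None  # mark as handled
    if compact:
        # Dropping None entries renumbers everything that remains.
        rects[:] = [r for r in rects if r is not None]
    return emitted


compacted = [10, 20, 30, 40]
first_c = drain(compacted, 20, compact=True)
second_c = drain(compacted, 40, compact=True)
print(first_c, second_c)  # indices 0 and 1 appear in both calls

kept = [10, 20, 30, 40]
first_k = drain(kept, 20, compact=False)
second_k = drain(kept, 40, compact=False)
print(first_k, second_k)  # indices stay unique across calls
```

If the list must be compacted anyway, a name derived from something other than the list position (for example a running counter kept outside the list) would avoid the collision.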
