image extraction broken in 0.17, worked on 0.16 #163

kingennio · 2024-10-06T15:22:20Z

I think the new version introduced a bug in the output_images function: several images are no longer extracted.

It happens consistently, but for demonstration consider this slide:
slide.pdf
With v0.0.16 two images are extracted, the photo and the logo. With v0.0.17 only the logo is extracted, not the main photo.
I stepped through the code. I think the problem is that the loop deletes images from the list as they are extracted, while it is still iterating over that same list.

In 0.16, the loop iterated over a sorted copy of the entries:
    for i, img_rect in sorted(
        [j for j in img_rects.items() if j[1].y1 <= text_rect.y0],
        key=lambda j: (j[1].y1, j[1].x0),
    ):

whereas 0.17 works directly on the original list:
    for i, img_rect in enumerate(parms.img_rects):
        if not img_rect.y1 <= text_rect.y0:
            continue

So when the image is deleted
    del parms.img_rects[i]  # do not touch this image twice

the list shrinks while the loop is still iterating over it, and the loop ends early. In the example there are 2 images. The first is the logo, and it is extracted. Because it is then deleted from the list, the photo moves to index 0. On the next iteration the loop asks for index 1, but the list now has only one item, so the loop finishes without ever visiting the photo.
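
The effect can be reproduced in isolation. This is only a minimal sketch: plain strings stand in for the image rectangles, and the function names are made up for the demo (they are not part of pymupdf4llm). The second function iterates over a copy, which is roughly what the 0.16 loop achieved by building a sorted list first.

```python
def extract_deleting_in_loop(rects):
    """Mimic the 0.17 pattern: delete each item while enumerating the same list."""
    seen = []
    for i, rect in enumerate(rects):
        seen.append(rect)
        del rects[i]  # remaining items shift left under the running iterator
    return seen


def extract_over_snapshot(rects):
    """Iterate over a copy of the list, so mutating the original is safe."""
    seen = []
    for rect in list(rects):
        seen.append(rect)
        rects.remove(rect)
    return seen


buggy_input = ["logo", "photo"]
buggy_seen = extract_deleting_in_loop(buggy_input)

safe_input = ["logo", "photo"]
safe_seen = extract_over_snapshot(safe_input)

print(buggy_seen, buggy_input)  # ['logo'] ['photo']
print(safe_seen, safe_input)    # ['logo', 'photo'] []
```

With two items, the first variant visits only the logo and leaves the photo in the list, which matches what happens with slide.pdf.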

kingennio · 2024-10-07T08:15:01Z

I think I fixed the code by keeping track of the indices to remove and deleting them after the loop. The modifications are the processed_images list, the commented-out del lines, and the removal loop at the end:

def output_images(parms, text_rect):
    """Output images and graphics above text rectangle."""
    if not parms.img_rects:
        return ""
    this_md = ""  # markdown string

    processed_images = []  # List to keep track of processed images

    if text_rect is not None:  # select images above the text block
        for i, img_rect in enumerate(parms.img_rects):
            if not img_rect.y1 <= text_rect.y0:
                continue
            pathname = save_image(parms.page, img_rect, i)
            if pathname:
                this_md += GRAPHICS_TEXT % pathname
            if force_text:
                img_txt = write_text(
                    parms,
                    img_rect,
                    tabs=None,
                    tab_rects={},  # we have no tables here
                    img_rects=[],  # we have no other images here
                    force_text=True,
                )
                if not is_white(img_txt):  # was there text at all?
                    this_md += img_txt

            #del parms.img_rects[i]  # do not touch this image twice
            processed_images.append(i)

    else:  # output all remaining images
        for i, img_rect in enumerate(parms.img_rects):
            pathname = save_image(parms.page, img_rect, i)
            if pathname:
                this_md += GRAPHICS_TEXT % pathname
            if force_text:
                img_txt = write_text(
                    parms,
                    img_rect,
                    tabs=None,
                    tab_rects={},  # we have no tables here
                    img_rects=[],  # we have no other images here
                    force_text=True,
                )
                if not is_white(img_txt):
                    this_md += img_txt

            #del parms.img_rects[i]  # do not touch this image twice
            processed_images.append(i)

    # Remove processed images from parms.img_rects after the loop
    for i in sorted(processed_images, reverse=True):
        del parms.img_rects[i]

    return this_md
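
A note on the reverse=True in the final loop: deleting the collected indices from highest to lowest keeps the lower ones valid, whereas deleting in ascending order shifts the list and removes the wrong items. A small standalone sketch of the difference, with hypothetical helper names and strings in place of rectangles:

```python
def delete_indices_descending(items, indices):
    """Delete positions from items in place, highest index first."""
    for i in sorted(indices, reverse=True):
        del items[i]
    return items


def delete_indices_ascending(items, indices):
    """Same, but lowest index first: later indices no longer point at the intended items."""
    for i in sorted(indices):
        del items[i]
    return items


correct = delete_indices_descending(["logo", "photo", "chart", "icon"], [0, 2])
shifted = delete_indices_ascending(["logo", "photo", "chart", "icon"], [0, 2])

print(correct)  # ['photo', 'icon']
print(shifted)  # ['photo', 'chart']
```

The intent was to remove "logo" and "chart"; only the descending variant does that.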

luc42ei · 2024-10-14T16:29:36Z

yep, I have the same issue

PedroFCM · 2024-10-18T14:09:43Z

I ran into the same problem with two images on the same PDF page.
@kingennio's code solved it for me.

greengeek · 2024-10-30T06:04:54Z

I am seeing this same issue as well in pymupdf4llm 0.0.17 using Python 3.12.4

JorjMcKie · 2024-11-02T11:17:11Z

@kingennio Thank you for your contribution, I will include the idea in the next version.
