image extraction broken in 0.17, worked on 0.16 #163
Comments
I think I fixed the code by keeping track of the indices to remove and then deleting them at the end (the modification is marked with >).
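The patch itself is not reproduced here. As a rough sketch of the idea only, with plain `(name, y1)` tuples standing in for the real image rectangles and a hypothetical helper name `extract_above` (neither is taken from pymupdf4llm), deferred deletion could look like this:

```python
def extract_above(img_rects, text_y0):
    """Collect images that end above text_y0, then delete them afterwards.

    img_rects is a list of (name, y1) tuples, a stand-in for real rectangles.
    """
    extracted = []
    to_remove = []  # indices to delete once the loop has finished
    for i, (name, y1) in enumerate(img_rects):
        if not y1 <= text_y0:
            continue
        extracted.append(name)
        to_remove.append(i)
    # Delete from the highest index down so the earlier indices stay valid.
    for i in reversed(to_remove):
        del img_rects[i]
    return extracted


rects = [("logo", 50), ("photo", 300), ("footer", 900)]
print(extract_above(rects, 400))  # ['logo', 'photo']
print(rects)                      # [('footer', 900)]
```

Because nothing is removed while the `for` loop is running, every image is visited exactly once, and the "do not touch this image twice" guarantee still holds after the loop.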
yep, I have the same issue
I got the same problem when having two images on the same PDF page.
I am seeing this same issue as well in pymupdf4llm 0.0.17 using Python 3.12.4
@kingennio Thank you for your contribution, I will include the idea in the next version.
I think the new version has introduced a glitch in the output_images function, because several images are not extracted.
It's consistent throughout, but for demonstration consider this slide
slide.pdf
With v0.16 two images are extracted, the photo and the logo. In v0.17 only the logo is extracted and not the main photo.
I stepped through the code. I guess the problem is that the loop deletes images from the list as they are extracted, while it is still iterating over that same list.
In 0.16, the loop iterated over a sorted copy of the image references:
for i, img_rect in sorted(
    [j for j in img_rects.items() if j[1].y1 <= text_rect.y0],
    key=lambda j: (j[1].y1, j[1].x0),
):
whereas 0.17 works directly on the original list:
for i, img_rect in enumerate(parms.img_rects):
    if not img_rect.y1 <= text_rect.y0:
        continue
so when the image is deleted
del parms.img_rects[i] # do not touch this image twice
the loop skips the next item and exits early. In this case there are 2 images. The first is the logo, and it is extracted. But because it is then deleted from the list, the photo moves down to index 0 while the loop has already advanced to index 1. The list now has only one item, so the loop ends without ever reaching the photo.
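The effect can be reproduced without PyMuPDF. Below is a minimal sketch that uses plain strings as stand-ins for the image rectangles; the function names `extract_while_deleting` and `extract_from_snapshot` are made up for illustration and are not part of pymupdf4llm:

```python
def extract_while_deleting(img_rects):
    """Mimic the 0.17 loop: delete from the list that is being iterated."""
    extracted = []
    for i, img_rect in enumerate(img_rects):
        extracted.append(img_rect)
        del img_rects[i]  # shifts the remaining items one slot to the left
    return extracted


def extract_from_snapshot(img_rects):
    """Mimic the 0.16 idea: iterate over a copy, so deletion is harmless."""
    extracted = []
    for img_rect in list(img_rects):  # list(...) takes a snapshot
        extracted.append(img_rect)
        img_rects.remove(img_rect)
    return extracted


print(extract_while_deleting(["logo", "photo"]))  # ['logo']
print(extract_from_snapshot(["logo", "photo"]))   # ['logo', 'photo']
```

The first function loses the photo for the same reason as described above: after the delete, the list has length 1 and the loop's internal position is already 1.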