Possible regression in pdf cleaning during save. #4034

wz93672 · 2024-11-09T12:56:07Z

Description of the bug

Recently, among the documents I process, I received one scanned on a Lexmark machine that becomes blank after saving with the “clean” option set to True. This behavior started with version 1.24.0; 1.23.26 works fine.

How to reproduce the bug

Here is a public document scanned on a Lexmark machine, found on the internet:
https://www.feb.unesp.br/Home/Administracao110/DTAd/Compras/empenhos---19_06.pdf

import pymupdf

# Saving with clean=True produces blank pages on affected versions
with pymupdf.open('empenhos---19_06.pdf') as doc:
    doc.save('out.pdf', clean=True)

PyMuPDF version

1.24.13

Operating system

Windows

Python version

3.12


JorjMcKie · 2024-11-09T14:02:45Z

What are you trying to achieve with this parameter?
It is equivalent to executing page.clean_contents() for every page of the document. Used on a PDF the way you did here, it has no positive effect in any case. It never reduces file size unless it is combined with garbage collection and compression.
Of course what you experienced shouldn't happen either.
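
To make the equivalence above concrete, here is a minimal sketch, not a definitive implementation. The helper names `clean_all_pages` and `save_cleaned_and_compacted` are hypothetical, and the `garbage=4` / `deflate=True` values are only one possible choice for "garbage collection and compression". On affected versions the cleaning step would presumably still hit the regression reported here.

```python
def clean_all_pages(doc):
    """Call clean_contents() on every page of an open PDF document.

    According to the comment above, save(..., clean=True) has the same
    effect. Returns the number of pages processed.
    """
    count = 0
    for page in doc:
        page.clean_contents()
        count += 1
    return count


def save_cleaned_and_compacted(doc, out_path):
    """Save with cleaning plus garbage collection and compression.

    The specific values are an assumption: garbage=4 is the most
    thorough garbage-collection level, deflate=True compresses streams.
    """
    doc.save(out_path, clean=True, garbage=4, deflate=True)
```

Usage would look like `with pymupdf.open("in.pdf") as doc: save_cleaned_and_compacted(doc, "out.pdf")`, where both file names are placeholders.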

JorjMcKie · 2024-11-09T14:09:12Z

Here is the link to the corresponding MuPDF bug report: https://bugs.ghostscript.com/show_bug.cgi?id=708128

wz93672 · 2024-11-09T20:09:49Z

It was only a minimal example to reproduce the bug.
I’m using pymupdf to preprocess a mix of PDF documents from different sources, quite a lot of files, including merging, splitting, and adding blank pages for parity, to streamline printing. Some were generated by older software, or scanned on older machines with old firmware, and have problems. Among them are PDFs with broken blank pages, which I wrote about a few months ago, or with a broken coordinate system (unclosed transformations, as I understand it). Flawed files come and go: disappearing characters, nulls after EOF. Typically such a file works when opened directly but breaks on save, or within the whole setup it causes hard-to-address problems. Cleaning helps and currently I have to use it; I wish cleaning would purge them all :). And I’m using the garbage option too.
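
Until the upstream fix lands, a batch pipeline like the one described above could guard against this regression by checking whether cleaning turned any page blank and, if so, re-saving without it. The following is only a sketch under stated assumptions: the function names are hypothetical, "blank" is approximated as "renders to a single color", and the save options are illustrative rather than recommended.

```python
def pages_blanked_by_clean(original, cleaned):
    """Return 0-based numbers of pages that render blank only after cleaning.

    A page counts as blank here when its rendered pixmap is a single
    color (Pixmap.is_unicolor); this is a heuristic, not an exact test.
    """
    broken = []
    for pno, (before, after) in enumerate(zip(original, cleaned)):
        if not before.get_pixmap().is_unicolor and after.get_pixmap().is_unicolor:
            broken.append(pno)
    return broken


def save_with_clean_fallback(src_path, dst_path):
    """Save with clean=True, falling back to clean=False if pages go blank."""
    import pymupdf  # third-party; imported here so the helper above stays standalone

    with pymupdf.open(src_path) as doc:
        doc.save(dst_path, garbage=4, deflate=True, clean=True)

    # Reopen both files so the comparison does not depend on whether the
    # first save touched the in-memory document.
    with pymupdf.open(src_path) as original, pymupdf.open(dst_path) as cleaned:
        broken = pages_blanked_by_clean(original, cleaned)

    if broken:
        with pymupdf.open(src_path) as doc:
            doc.save(dst_path, garbage=4, deflate=True, clean=False)
    return broken
```

Rendering every page twice is slow for large batches, so in practice one might sample only a few pages per file or restrict the check to scanned documents.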

JorjMcKie added the "upstream bug" (bug outside this package) label Nov 9, 2024

JorjMcKie mentioned this issue Nov 11, 2024

Another issue with destorying PDF when inserting html #3886

Open
