Possible regression in pdf cleaning during save. #4034

Open
wz93672 opened this issue Nov 9, 2024 · 3 comments
Labels
upstream bug bug outside this package

Comments

@wz93672

wz93672 commented Nov 9, 2024

Description of the bug

Recently, among the documents I process, I received one scanned on a Lexmark machine that becomes blank after saving with the “clean” option set to True. This behavior starts with version 1.24.0; 1.23.26 works fine.

How to reproduce the bug

Here is a public document scanned with a Lexmark machine, found on the internet:
https://www.feb.unesp.br/Home/Administracao110/DTAd/Compras/empenhos---19_06.pdf

import pymupdf

with pymupdf.open('empenhos---19_06.pdf') as doc:
    doc.save('out.pdf', clean=True)

PyMuPDF version

1.24.13

Operating system

Windows

Python version

3.12
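
The version boundary reported above can be turned into a small guard. This is only a sketch: the function names `clean_may_blank_pages` and `installed_pymupdf_version` are hypothetical, and the check encodes nothing more than the data points in this report (1.23.26 works, 1.24.0 and later misbehave, observed up to 1.24.13). Whether a later release fixes the problem cannot be told from this thread, so the guard treats every version from 1.24.0 on as affected.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def clean_may_blank_pages(pymupdf_version: str) -> bool:
    """Return True for versions this report describes as affected (>= 1.24.0)."""
    # Keep the first three numeric components, e.g. "1.24.13" -> (1, 24, 13).
    parts = tuple(int(n) for n in re.findall(r"\d+", pymupdf_version)[:3])
    # Pad short versions such as "1.24" so tuple comparison behaves as expected.
    parts = parts + (0,) * (3 - len(parts))
    return parts >= (1, 24, 0)


def installed_pymupdf_version() -> Optional[str]:
    """Look up the installed PyMuPDF version, or None if it is not installed."""
    try:
        return version("PyMuPDF")
    except PackageNotFoundError:
        return None
```

A caller could skip `clean=True` when `clean_may_blank_pages(installed_pymupdf_version() or "0")` returns True.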

@JorjMcKie
Collaborator

What are you trying to achieve with this parameter?
It is equivalent to executing page.clean_contents() on every page. For a PDF saved the way you did, it has no positive effect in any case. It never reduces file size unless it is used together with garbage collection and compression.
Of course what you experienced shouldn't happen either.
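
As a rough sketch of the combination described above: size reduction comes from garbage collection and compression, with `clean` as an optional extra. The helper names `save_options` and `save_compacted` are hypothetical; `garbage`, `deflate` and `clean` are parameters of `Document.save()`, and `garbage=4` is just one possible choice of level.

```python
def save_options(clean: bool = False) -> dict:
    """Build keyword arguments for Document.save() (hypothetical helper)."""
    # Garbage collection plus stream compression is what can shrink the file.
    opts = {"garbage": 4, "deflate": True}
    if clean:
        # Per the comment above, comparable to page.clean_contents() on every page.
        opts["clean"] = True
    return opts


def save_compacted(src: str, dst: str, clean: bool = False) -> None:
    """Open src and save it to dst with the options built above."""
    import pymupdf  # imported lazily so save_options() works without PyMuPDF

    with pymupdf.open(src) as doc:
        doc.save(dst, **save_options(clean))
```

For example, `save_compacted('empenhos---19_06.pdf', 'out.pdf')` would save with garbage collection and compression but without cleaning.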

@JorjMcKie
Collaborator

Here is the link to the corresponding MuPDF bug report: https://bugs.ghostscript.com/show_bug.cgi?id=708128

JorjMcKie added the "upstream bug" (bug outside this package) label Nov 9, 2024
@wz93672
Author

wz93672 commented Nov 9, 2024

It was only a minimal example to reproduce the bug.
I’m using pymupdf to preprocess a mix of PDF documents from different sources, quite a lot of files, including merging, splitting, and adding blank pages for parity, to streamline printing. Some are generated by older software, or scanned on older machines with old firmware, and have problems. Among them are PDFs with broken blank pages, which I wrote about a few months ago, or with a broken coordinate system (unclosed transformations, as I understand it). Flawed files come and go: disappearing characters, nulls after EOF. Typically a file works when opened directly but breaks on save, or in the whole setup produces hard-to-address problems. Cleaning helps and currently I have to use it; I wish cleaning would purge them all :). And I’m using the garbage option too.
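
For a batch workflow like the one described above, one possible stopgap is to check whether cleaning blanked any page and, if so, save again without it. This is a heuristic sketch, not a fix: all three function names are hypothetical, `garbage=3` and the 36 dpi rendering are arbitrary choices, and it assumes the bug shows up as a page that renders in a single color.

```python
def newly_blank_pages(before: list, after: list) -> list:
    """0-based page numbers that were not single-colored before but are afterwards."""
    return [i for i, (b, a) in enumerate(zip(before, after)) if a and not b]


def unicolor_flags(path: str, dpi: int = 36) -> list:
    """Render each page at low resolution and record whether it is one color."""
    import pymupdf  # imported lazily so newly_blank_pages() works without PyMuPDF

    with pymupdf.open(path) as doc:
        return [page.get_pixmap(dpi=dpi).is_unicolor for page in doc]


def save_clean_with_fallback(src: str, dst: str) -> bool:
    """Save with clean=True; redo the save without it if any page went blank.

    Returns True if the cleaned output was kept, False if the fallback was used.
    """
    import pymupdf

    before = unicolor_flags(src)
    with pymupdf.open(src) as doc:
        doc.save(dst, garbage=3, clean=True)
    if not newly_blank_pages(before, unicolor_flags(dst)):
        return True
    # At least one page lost its content: overwrite dst with an uncleaned save.
    with pymupdf.open(src) as doc:
        doc.save(dst, garbage=3)
    return False
```

Rendering every page twice adds cost, so in a large batch it may be reasonable to apply the check only to files from sources known to cause trouble.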
