Linearize PDFs for better metadata removal #111

Open
bhadaway opened this issue Dec 23, 2020 · 4 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@bhadaway

I read that you wish to keep this app minimalistic, and as someone who shares the same philosophy, I can appreciate that.

I'm wondering whether adding linearization of PDF files (so that metadata is actually removed) would be within that scope, or overkill?

Here, someone is using QPDF to complement ExifTool to accomplish that:

https://blog.joshlemon.com.au/protecting-your-pdf-files-and-metadata/
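
For illustration, here is a minimal sketch of that two-step idea in Python. It assumes `exiftool` and `qpdf` are both on the PATH, and that ExifTool runs first so that qpdf's rewrite can leave behind whatever ExifTool's edit orphaned. The function names and the intermediate file name are hypothetical, not anything ExifCleaner ships today.

```python
import subprocess
from pathlib import Path


def build_commands(src: Path, dst: Path) -> list[list[str]]:
    """Return the two commands: strip tags with ExifTool, then rewrite with qpdf."""
    # Hypothetical intermediate file name, placed next to the destination.
    stripped = dst.with_suffix(".stripped.pdf")
    return [
        # -all:all= clears every writable tag; -o writes a copy instead of editing in place.
        ["exiftool", "-all:all=", "-o", str(stripped), str(src)],
        # --linearize makes qpdf write a fresh file rather than append to the old one.
        ["qpdf", "--linearize", str(stripped), str(dst)],
    ]


def clean_pdf(src: Path, dst: Path) -> None:
    """Run both steps in order and stop at the first failure."""
    for cmd in build_commands(src, dst):
        subprocess.run(cmd, check=True)
    # Remove the intermediate copy, which still contains the old metadata.
    dst.with_suffix(".stripped.pdf").unlink()
```

Calling `clean_pdf(Path("in.pdf"), Path("out.pdf"))` would leave only `out.pdf` behind. Keeping command construction separate from execution makes the order of the steps easy to inspect or change.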

@szTheory
Owner

Thanks for bringing this to my attention! I have to look into this more but it sounds promising. I suppose we'd just have to include the latest 64-bit qpdf binary for each platform with the distribution, then for PDFs run qpdf before exiftool during the processing phase.

While I do want to keep the number of settings and buttons to a minimum, I also want the main feature of the app, removing metadata, to be comprehensive. For this reason I'm also exploring removing extended filesystem attributes. So better PDF handling is something I'd like to add if it can be done well.

@szTheory szTheory added enhancement New feature or request help wanted Extra attention is needed labels Dec 23, 2020
@szTheory szTheory pinned this issue Dec 23, 2020
@bhadaway
Author

It would be amazing because currently, the only other options for secure PDF cleanup are:

  1. If you have a copy of Adobe Acrobat (expensive and bloated), then you can sanitize documents.
  2. Uploading your documents to an online tool that scrubs them (I can't think of a more dangerous and counterintuitive option for privacy and security, which is the entire point).
  3. Combining multiple command line recipes to get it done right (which, even if you're comfortable using command line, is still a pain).

There is actually one other option that's easy and straightforward, and that most operating systems support natively: printing to PDF, which apparently flattens the document and removes all the metadata. But I'm not confident it's 100% foolproof. It would be nicer to actually see the before and after (which is what your app does) to verify that the file has been cleaned.
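
As a rough way to check a "before and after" by hand, one could scan the raw bytes of the output for values known to be in the original metadata. This is only a sketch: the function name is hypothetical, and a clean result does not prove anything, because strings inside compressed streams or in other encodings would not be found this way.

```python
def residual_strings(pdf_bytes: bytes, known_values: list[str]) -> list[str]:
    """Return the known metadata values that still appear verbatim in the raw file."""
    found = []
    for value in known_values:
        # PDF text strings are commonly stored as single-byte text or as UTF-16BE.
        variants = [
            value.encode("latin-1", errors="ignore"),
            value.encode("utf-16-be"),
        ]
        # Skip empty variants, since an empty byte string matches everything.
        if any(v and v in pdf_bytes for v in variants):
            found.append(value)
    return found
```

For example, passing `Path("cleaned.pdf").read_bytes()` together with the author name from the original file would show whether that name still exists as plain bytes in the cleaned copy.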

@szTheory szTheory changed the title Would linearizing PDFs be a worthwhile feature to consider? Linearize PDFs for better metadata removal May 2, 2021
@szTheory szTheory unpinned this issue Dec 8, 2021
@WeAreLegion999

Why does QPDF produce a different file every time? I ran the same source file through QPDF on two separate occasions, and a binary comparison of the two resulting PDFs shows differences, even though the input was identical.
The differences are in a UUID that appears near the top and the bottom of the file.
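
A possible explanation, offered as an assumption rather than something confirmed in this thread: the differing bytes may be the document `/ID` in the trailer, which a linearized file carries in two places and which qpdf by default does not derive from the file contents alone. qpdf has a `--deterministic-id` option for this case. A small sketch for trying it and comparing two outputs (the helper names are hypothetical):

```python
import hashlib
from pathlib import Path


def qpdf_command(src: Path, dst: Path) -> list[str]:
    """Build a qpdf call that asks for a content-derived /ID instead of the default one."""
    return ["qpdf", "--linearize", "--deterministic-id", str(src), str(dst)]


def sha256_hex(data: bytes) -> str:
    """Hex digest used to compare two outputs byte for byte."""
    return hashlib.sha256(data).hexdigest()


def files_match(a: Path, b: Path) -> bool:
    """True when both files have identical contents."""
    return sha256_hex(a.read_bytes()) == sha256_hex(b.read_bytes())
```

Running the command twice on the same input (for example with `subprocess.run(qpdf_command(src, dst), check=True)`) and then calling `files_match` on the two outputs would show whether the ID was the only source of difference.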

@Robiktron

I think that until you update ExifCleaner to make it permanently remove PDF metadata, it would be best to remove all claims of PDF support from github.com/szTheory/exifcleaner and exifcleaner.com.

The former starts by saying "Desktop app to clean metadata from images, videos, PDFs, and other files." without any warning. The "Benefits" section warns that support is "partial" and links to this discussion, but that's not good enough either, as NOTHING is truly removed. It only appears to be removed, on the surface. That's a serious problem, as users expect this tool to protect their privacy by permanently removing harmful metadata. The "Supported File Types" section also mentions PDF without any warning. Only the "File writer limitations" section states properly that "The original metadata is never actually removed."

It's important to update these pages so that people will not get the wrong impression.

4 participants