Recently I've been working for a website that has a lot of scanned PDFs. They need to make sure that they don't have the same problem as the redacted PDFs of the FBI/CIA. You know, where you could copy-paste the words out from behind black boxes?
Yeah, we had that exact same problem. Except I'm not working for the FBI.
Google didn't help one bit: "flatten pdf" means something entirely different to most people, so I couldn't find anything useful. Anyway, I chose to use the awesome tool ImageMagick.
I tried converting straight from PDF to PDF, but the file size blows up if I want any sort of quality. Instead, I convert each page to PNG and then back to PDF. Rasterizing the pages bakes the black boxes into the pixels and throws away the text layer hiding underneath them.
So below is the shell script, with the only dependency being ImageMagick:
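    LINES=$(cat files.txt)
    COUNT=0
    for LINE in $LINES
    do
        COUNT=$((COUNT + 1))
        echo -n "Working on $LINE.."
        # Skip anything a previous run already converted
        if [ -f "fixed/$LINE" ]; then
            echo ". already done"
            continue
        fi
        mkdir -p "fixed/imgs-$LINE"
        # Rasterize each page to its own PNG; the zero-padded page number
        # keeps the glob below in page order past page 9
        convert -density 400 "broken/$LINE" "fixed/imgs-$LINE/out-%04d.png"
        echo -n ".."
        # Stitch the page images back into a single PDF
        convert -density 400 "fixed/imgs-$LINE"/out-*.png "fixed/$LINE"
        rm -r "fixed/imgs-$LINE"
        echo ". done"
    done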
It takes the filenames listed in files.txt, reads each one from the broken/ folder, and writes the fixed version to fixed/.
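If you want to preview which files a run would still touch, something like the sketch below might help. It reads files.txt one line at a time, so a name with a space in it stays in one piece. The helper name `list_pending` and its two arguments are my own invention for illustration, not part of the script.

```shell
#!/bin/bash
# list_pending LIST_FILE FIXED_DIR
# Prints every name from LIST_FILE that has no matching file in FIXED_DIR.
# (Hypothetical helper; the name and arguments are assumptions.)
list_pending() {
    local list_file=$1 fixed_dir=$2 name
    # IFS= and -r keep spaces and backslashes in names intact; the || clause
    # also picks up a last line that has no trailing newline.
    while IFS= read -r name || [ -n "$name" ]; do
        [ -z "$name" ] && continue              # skip blank lines
        [ -f "$fixed_dir/$name" ] && continue   # already converted
        printf '%s\n' "$name"
    done < "$list_file"
}
```

Calling `list_pending files.txt fixed` from the working directory would print the names that still need converting, one per line.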
I did it this way so I could check that the files were good and upload them later. The broken/ folder is an sshfs mount of the original server, mounted read-only.
Once they were all fixed, I tested a bunch of them and they were no longer copy-paste vulnerable: the info was completely boxed out, and the resulting PDFs were about the same size as the originals.
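Checking by hand gets old fast, so here is a rough sketch of how the copy-paste test could be automated. It assumes `pdftotext` from poppler-utils is installed, which is an extra dependency beyond ImageMagick, and the function names `has_text` and `report_text_layers` are made up for this example. It only tells you whether any text can still be extracted; it does not look at the images themselves.

```shell
#!/bin/bash
# has_text PDF
# Succeeds (exit 0) if pdftotext extracts any non-whitespace text from PDF.
# Assumes pdftotext (poppler-utils) is available; hypothetical helper name.
has_text() {
    # "-" sends the extracted text to stdout. Image-only PDFs still produce
    # form feeds between pages, which [:space:] covers, so they do not match.
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# report_text_layers DIR
# Prints one line per PDF in DIR saying whether text can still be extracted.
report_text_layers() {
    local pdf
    # Bail out early, otherwise a missing tool would look like "no text"
    if ! command -v pdftotext >/dev/null; then
        echo "pdftotext not found" >&2
        return 2
    fi
    for pdf in "$1"/*.pdf; do
        [ -e "$pdf" ] || continue   # directory had no PDFs
        if has_text "$pdf"; then
            echo "STILL HAS TEXT: $pdf"
        else
            echo "ok: $pdf"
        fi
    done
}
```

Running `report_text_layers fixed` after a conversion run would flag any file that still has a text layer to copy from.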