extract - Extracting background images from a PDF file?

Tuesday, December 27, 2016

I have a PDF file containing maps of the building I work in, here:

http://www.libsys.und.edu/dev/FloorPlans_All.pdf

The original source files have been lost, and I've been asked to extract the map images, preferably without the text and icons that have been overlaid on top of them. This has proven annoyingly difficult.

So far, I have tried the following GUI programs:

Adobe Reader: lets me select text, but not the background images

FoxIt PDF Viewer: lets me select text, but not the background images

XPDF on Ubuntu 10.10: lets me select text, but not the background images

And also the following command-line programs:

pdfimages: extracts the icons indicating bathrooms just fine, but not the background images

pdftohtml: same as pdfimages, plus it makes a poorly marked up HTML document

pdfextract: same as pdfimages

convert: successfully saved images, but with the text burned into them

I've even tried opening the PDF manually in a text editor and extracting the stream objects by pasting them into a new file and saving it with a .jpg, .png, or .bmp extension (each in turn). Considering how little I know about the internal structure of PDF files, it's no surprise that this didn't work.
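One likely reason the copy-and-paste approach fails is that most PDF streams are stored compressed (for example with the FlateDecode filter), so the raw bytes are not a valid image file. Only a stream whose filter is DCTDecode holds data that is already JPEG-encoded. It is also possible that the maps are drawn as vector paths rather than stored as images at all, though that is only a guess about this particular file. The Python sketch below is a rough heuristic for checking which filters a PDF's streams use; the function name summarize_streams is made up for illustration, and the scan will miss objects packed inside compressed object streams.

```python
import re


def summarize_streams(pdf_bytes):
    """Return one (is_image, filters) tuple per stream found in pdf_bytes.

    Heuristic only: the file is split on the 'endobj' keyword and the text
    before each 'stream' keyword is treated as that stream's dictionary.
    Binary data that happens to contain 'endobj' can confuse it.
    """
    summary = []
    for chunk in pdf_bytes.split(b"endobj"):
        head, keyword, _ = chunk.partition(b"stream")
        if not keyword:
            continue  # this object carries no stream
        # Image XObjects are marked with /Subtype /Image in their dictionary.
        is_image = re.search(rb"/Subtype\s*/Image", head) is not None
        # Filter names end in "Decode", e.g. FlateDecode or DCTDecode.
        filters = [name.decode("ascii")
                   for name in re.findall(rb"/(\w+Decode)", head)]
        summary.append((is_image, filters))
    return summary


# Example use (the file name is whatever the PDF was saved as locally):
# with open("FloorPlans_All.pdf", "rb") as f:
#     for is_image, filters in summarize_streams(f.read()):
#         print(is_image, filters)
```

If no stream is reported as an image with DCTDecode, saving stream bytes under a .jpg extension cannot work, because the data would first have to be decompressed and then re-encoded.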

So ... is there any way I can retrieve the map images from this thing without also getting the text and icons?

Answer

You can download the Xpdf command-line tools for Linux and Windows from http://www.foolabs.com/xpdf/download.html. Then run pdfimages -j input.pdf output, which should produce output-000.jpg, output-001.jpg, and so on. See http://linuxcommand.org/man_pages/pdfimages1.html for more usage options.
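If the extraction needs to be repeated or scripted, the pdfimages call can be wrapped in a few lines of Python. This is a minimal sketch that assumes pdfimages is installed and on the PATH; the helper names build_pdfimages_command and extract_images are hypothetical, and only the -j flag mentioned above is used.

```python
import glob
import shutil
import subprocess


def build_pdfimages_command(pdf_path, prefix):
    """Build the argument list for: pdfimages -j <pdf_path> <prefix>."""
    return ["pdfimages", "-j", pdf_path, prefix]


def extract_images(pdf_path, prefix):
    """Run pdfimages and return the sorted list of files it wrote."""
    if shutil.which("pdfimages") is None:
        raise FileNotFoundError("pdfimages was not found on PATH")
    # check=True raises CalledProcessError if pdfimages exits with an error.
    subprocess.run(build_pdfimages_command(pdf_path, prefix), check=True)
    # Output files are named <prefix>-NNN.<ext>, e.g. output-000.jpg.
    return sorted(glob.glob(glob.escape(prefix) + "-*"))


# Example use:
# for path in extract_images("input.pdf", "output"):
#     print(path)
```

With -j, only images stored in JPEG (DCT) form are written as .jpg files; other images come out in PBM or PPM format, so the returned list can contain a mix of extensions.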
