Tuesday, December 27, 2016

extract - Extracting background images from a PDF file?



I have a PDF file containing maps of the building I work in, here:



http://www.libsys.und.edu/dev/FloorPlans_All.pdf



The original source files have been lost, and I've been asked to extract the map images, preferably without the text and icons that have been overlaid on top of them. This has proven annoyingly difficult.




So far, I have tried the following GUI programs:




  • Adobe Reader: lets me select text, but not the background images

  • FoxIt PDF Viewer: lets me select text, but not the background images

  • XPDF on Ubuntu 10.10: lets me select text, but not the background images



And also the following command-line programs:





  • pdfimages: extracts the icons indicating bathrooms just fine, but not the background images

  • pdftohtml: same as pdfimages, plus it makes a poorly marked up HTML document

  • pdfextract: same as pdfimages

  • convert: successfully saved images, but with the text burned into them



I've even tried opening the PDF manually in a text editor and extracting the stream objects by pasting them into a new file and saving it with a .jpg, .png, or .bmp extension (each in turn). Considering how little I know about the internal structure of PDF files, it's no surprise that this didn't work.
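One likely reason the copy-and-paste attempt fails is that a text editor tends to corrupt binary data, and most PDF streams are compressed rather than stored as ready-made image files. The one case where a stream can be saved as-is is a JPEG (/DCTDecode) stream, whose bytes are a complete JPEG file. A minimal sketch of that idea follows, assuming the file is read in binary mode; the function names are hypothetical, and nothing here confirms that the map backgrounds in this particular PDF are stored as JPEG streams (they may be vector drawings or use another filter, in which case this finds nothing).

```python
import re

# Raw data between the "stream" and "endstream" keywords of a PDF object.
# The line break before "endstream" is optional in practice, so it is matched loosely.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)(?:\r?\n)?endstream", re.DOTALL)

# Every JPEG file starts with these three bytes.
JPEG_MAGIC = b"\xff\xd8\xff"


def extract_jpeg_streams(pdf_bytes):
    """Return the raw bytes of each stream that begins with the JPEG signature."""
    found = []
    for match in STREAM_RE.finditer(pdf_bytes):
        data = match.group(1)
        if data.startswith(JPEG_MAGIC):
            found.append(data)
    return found


def save_jpeg_streams(pdf_path, prefix):
    """Write each JPEG stream to prefix-000.jpg, prefix-001.jpg, ... and return the names."""
    with open(pdf_path, "rb") as f:  # binary mode, so no bytes are altered
        streams = extract_jpeg_streams(f.read())
    names = []
    for index, data in enumerate(streams):
        name = f"{prefix}-{index:03d}.jpg"
        with open(name, "wb") as out:
            out.write(data)
        names.append(name)
    return names
```

Calling save_jpeg_streams("FloorPlans_All.pdf", "map") would write any JPEG streams it finds next to the script. This is a raw byte scan rather than a real PDF parser, so it ignores encrypted files and streams packed inside compressed object streams.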




So ... is there any way I can retrieve the map images from this thing without also getting the text and icons?


Answer



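You can download Xpdf, which includes the pdfimages command-line tool, from http://www.foolabs.com/xpdf/download.html for Linux and Windows. Then run pdfimages -j input.pdf output. With -j, images stored in the PDF as JPEG (DCT) data are written as output-000.jpg, output-001.jpg, etc.; images in other formats are written as .ppm or .pbm files. Also, check out http://linuxcommand.org/man_pages/pdfimages1.html for more usage options.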
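If the extraction needs to be repeated or scripted, the pdfimages command above can be wrapped in a few lines of Python. This is only a sketch: it assumes pdfimages is installed and on the PATH, and the helper names build_pdfimages_command and run_pdfimages are made up for illustration.

```python
import shutil
import subprocess


def build_pdfimages_command(pdf_path, prefix, keep_jpeg=True):
    """Build the argument list for: pdfimages [-j] <pdf_path> <prefix>."""
    cmd = ["pdfimages"]
    if keep_jpeg:
        cmd.append("-j")  # write JPEG (DCT) images as .jpg files
    cmd += [pdf_path, prefix]
    return cmd


def run_pdfimages(pdf_path, prefix):
    """Run pdfimages, raising an error if the tool is missing or exits non-zero."""
    if shutil.which("pdfimages") is None:
        raise FileNotFoundError("pdfimages was not found on PATH")
    subprocess.run(build_pdfimages_command(pdf_path, prefix), check=True)
```

For example, run_pdfimages("input.pdf", "output") performs the same extraction as the command line in the answer.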

