Problem
You have a PDF consisting of scanned images (e.g. from an old book). You would like to make it into a searchable PDF, so the text can be searched and copied-and-pasted into other documents.
Solution
Use WatchOCR, a free LiveCD that converts a scanned PDF into a searchable PDF.
Details
This was going to be a footnote to an upcoming blog post, but I was so impressed with this tool that I wanted to give it a moment in the spotlight. (Scotch Tape & Duct Whisky is more like a Maglite. The AA size. But still.)
I have an interesting old book that I wanted to scan and make available. You’ll see what it is in the next few days. I scanned the pages on my flatbed scanner with the GIMP under Windows 7, producing PNG files at 300x300 ppi. In FreeBSD, I converted those PNG files to JPEG files at the same resolution, and assembled them into a PDF file using Luigi Rizzo’s jpg2pdf utility.
The resulting PDF was quite nice looking, with clear text and nicely reproduced images. But like any PDF made from scanned images, there was no “text” behind the text. It was just pictures of words. You couldn’t search for a phrase in the book, or copy and paste text into another application. I looked around for software that could perform optical character recognition (OCR) on the text, and convert the PDF to a searchable PDF (which still presents the scanned image to the user, but also contains the machine-readable text in the “background”).
Most commercial solutions cost $100 to $500. There are free OCR programs, such as the aptly-named FreeOCR, but it doesn’t have the magic to produce a searchable PDF. Luckily, I came across WatchOCR.
WatchOCR combines a number of open-source tools, running under Linux, to accomplish exactly what I needed: conversion of a scanned PDF to a searchable PDF. It comes on a LiveCD image, which you could burn to a CD-ROM, and use to boot a spare computer to perform the conversion. But since I already run VirtualBox, I found it more convenient to boot the ISO in a virtual machine.
After downloading the image, it was just a couple of minutes’ work to set up a new virtual machine. I attached the ISO to the VM’s CD-ROM drive, and set the network interface to “bridged” mode, so that the VM would be visible on my LAN. No virtual hard drive is required. The VM booted, and automatically started a graphical interface that allowed me to start the WatchOCR software and monitor its status.
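The same setup can also be scripted with VirtualBox's VBoxManage command-line tool. What follows is a minimal sketch, not a record of the exact steps described above. The VM name, ISO filename, memory size, and host adapter name ("eth0") are all placeholders, and the helper function names are made up for illustration. By default it only prints the commands so you can look them over first.

```python
import subprocess


def watchocr_vm_commands(vm_name, iso_path, host_adapter, memory_mb=1024):
    """Build VBoxManage commands for a diskless VM that boots from an ISO.

    Every argument value is a placeholder; substitute your own.
    """
    return [
        # Create and register an empty VM.
        ["VBoxManage", "createvm", "--name", vm_name, "--register"],
        # Set memory, boot from the optical drive first, and put the first
        # network adapter in bridged mode so the VM is visible on the LAN.
        ["VBoxManage", "modifyvm", vm_name,
         "--memory", str(memory_mb),
         "--boot1", "dvd",
         "--nic1", "bridged",
         "--bridgeadapter1", host_adapter],
        # Add an IDE controller to hold the virtual CD-ROM drive.
        ["VBoxManage", "storagectl", vm_name, "--name", "IDE", "--add", "ide"],
        # Attach the LiveCD ISO. No virtual hard drive is attached.
        ["VBoxManage", "storageattach", vm_name,
         "--storagectl", "IDE", "--port", "0", "--device", "0",
         "--type", "dvddrive", "--medium", iso_path],
        # Boot the VM.
        ["VBoxManage", "startvm", vm_name],
    ]


def run_commands(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: prints the five commands without touching VirtualBox.
    run_commands(watchocr_vm_commands("WatchOCR", "watchocr.iso", "eth0"))
```

Pass dry_run=False to run_commands once the printed commands look right for your machine. The bridged adapter name in particular differs from host to host.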
The software creates a network share, \\WatchOCRserver\WatchOCR, which has subdirectories for input and output. You simply copy your scanned PDF to the input directory, and a few minutes later, a searchable PDF appears with the same name in the output directory, like magic! (Note: Since the LiveCD has no permanent storage, these directories should simply be used as working space during the conversion. Your files will not persist there when the WatchOCR machine is shut down or rebooted.)
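If you have more than a file or two to convert, the copy-and-wait routine is easy to script. Here is a rough sketch in Python, under a few assumptions of mine: the input and output folders are passed in as parameters because the exact subdirectory names aren't given here, the function name is invented, and "finished" is guessed at by waiting until the output file's size stops changing between polls. It also assumes there is no stale file of the same name already sitting in the output folder.

```python
import shutil
import time
from pathlib import Path


def convert_via_share(pdf_path, input_dir, output_dir, timeout_s=1800, poll_s=5):
    """Copy a scanned PDF into the input folder and wait for the result.

    Returns the path of the converted PDF in output_dir, or raises
    TimeoutError if nothing stable appears within timeout_s seconds.
    """
    pdf_path = Path(pdf_path)
    # The searchable PDF appears under the same name as the input file.
    result = Path(output_dir) / pdf_path.name
    shutil.copy(pdf_path, Path(input_dir) / pdf_path.name)

    deadline = time.monotonic() + timeout_s
    last_size = -1
    while time.monotonic() < deadline:
        if result.exists():
            size = result.stat().st_size
            # Heuristic: treat the file as complete once it is non-empty
            # and its size is unchanged since the previous poll.
            if size > 0 and size == last_size:
                return result
            last_size = size
        time.sleep(poll_s)
    raise TimeoutError(f"no output for {pdf_path.name} after {timeout_s} s")
```

On Windows you might call it as convert_via_share("book.pdf", r"\\WatchOCRserver\WatchOCR\input", r"\\WatchOCRserver\WatchOCR\output"), where "input" and "output" are stand-ins for whatever the subdirectories are actually called. Since the share doesn't survive a reboot, copy the returned file somewhere permanent (shutil.copy again) as soon as the function returns.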
I verified that the output PDF was indeed searchable, and I was able to copy-and-paste a paragraph from it into a text editor. I’ve only given the OCR quality a cursory glance, but I didn’t see any errors in the paragraph I looked at (bearing in mind that my input PDF had very high quality scans).
WatchOCR is an incredibly useful tool that can save you hundreds of dollars. I made a donation to the author, and I encourage anyone who needs to create searchable scanned PDFs to try it out.