This is one of those tips that is hard to describe or visualize. But when we demonstrate it to someone new, we usually hear “Wow, that’s cool.”
Adobe Acrobat Pro includes a special OCR (optical character recognition) feature you can use to convert scanned documents into searchable PDFs. The twist with Acrobat Pro is that instead of converting your scan to a plain text file, you can leave the visual appearance of your scanned document untouched while the text itself becomes selectable and searchable in Acrobat.
Since Creativetechs is part of the Seattle graphic design community, we’ll demonstrate this tip using something more fun than a typical office document: scans of type specimen samples from antique letterpress typography books.
Step 1: Scan a printed document.
Scan a sample document with enough resolution to clearly read all the letters you want Acrobat to interpret. The best way to get a better OCR translation is to start with a clean scan at a resolution of at least 600 pixels per inch.
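To get a feel for what a 600 pixels per inch scan means in practice, here is a rough Python sketch of the arithmetic. The function names, the US Letter page size, and the one-byte-per-pixel grayscale assumption are ours for illustration; they are not anything Acrobat or your scanner software reports.

```python
def scan_dimensions(width_in, height_in, ppi=600):
    """Return the pixel width and height of a page scanned at the given resolution."""
    return round(width_in * ppi), round(height_in * ppi)


def uncompressed_megabytes(width_px, height_px, bytes_per_pixel=1):
    """Estimate the uncompressed image size in megabytes.

    bytes_per_pixel=1 assumes an 8-bit grayscale scan (our assumption);
    use 3 for 24-bit color.
    """
    return width_px * height_px * bytes_per_pixel / 1_000_000


# Example: a US Letter page (8.5 x 11 inches) at 600 pixels per inch.
width_px, height_px = scan_dimensions(8.5, 11)
print(width_px, height_px)                          # 5100 6600
print(uncompressed_megabytes(width_px, height_px))  # 33.66
```

In other words, a single grayscale letter-size page at this resolution is roughly 34 MB before compression, so expect large files if you scan in color or scan many pages.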
For the purposes of this quick tutorial, we’ve provided a scanned text specimen page from The Superior Copper Mixed Type Book from the late 1800s. (One of the perks of having a letterpress classroom next door to our office.)
Download this TIFF and try the technique yourself:
Letterpress Sample.tif (zipped)
Step 2: Create a PDF from your scanned file.
In Acrobat Professional (version 7 shown) choose File > Create PDF > From File…
Acrobat will churn for a moment or two and open a new image-only PDF built from your scanned image.
Step 3: Choose Document > Recognize Text Using OCR > Start…
Acrobat will display the Recognize Text dialog box. Click the Edit button to change the default settings.
For this tutorial, use the Recognize Text settings shown here. The key option is to set the PDF Output Style to “Searchable Image (Exact),” which maintains the look of your scan while adding a hidden text layer.
Once you change your settings and click OK in both dialog boxes, Acrobat will process your image. This can take a little while; watch the bottom left of your Acrobat window for a small status bar.
Step 4: Check the results.
Select Acrobat’s type select tool and check out the results. If everything worked properly, you should be able to select, copy, or search type directly from your scanned image. This is the fun payoff that made this tip-worthy for us.
One complaint about this feature is the lack of feedback showing exactly how Acrobat interpreted your scanned text. As a workaround, use the type select tool to select all the text in your document, then switch over to a text editor and paste the results.
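If you want something more systematic than eyeballing the pasted text, one option is to hand-type a short passage from the page and compare it word by word against what Acrobat recognized. The following is only a minimal sketch using Python's standard difflib module; the function name and the sample strings are hypothetical and are not actual output from our scan.

```python
import difflib


def ocr_differences(expected, recognized):
    """List word-level differences between a reference text and pasted OCR text.

    Returns (operation, expected_words, recognized_words) tuples, where
    operation is 'replace', 'delete', or 'insert'.
    """
    exp_words = expected.split()
    rec_words = recognized.split()
    matcher = difflib.SequenceMatcher(None, exp_words, rec_words, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (tag, " ".join(exp_words[i1:i2]), " ".join(rec_words[j1:j2]))
            )
    return differences


# Hypothetical sample: a passage typed by hand vs. text pasted from Acrobat.
typed_by_hand = "Superior Copper Mixed Type"
pasted_from_acrobat = "Superior Copper Mixcd Type"
print(ocr_differences(typed_by_hand, pasted_from_acrobat))
# [('replace', 'Mixed', 'Mixcd')]
```

Comparing whole words rather than single characters keeps the report short and makes misread words easy to spot, at the cost of not pinpointing which letter went wrong.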
In our example, Acrobat made a few mistakes interpreting some of the non-text elements. However, the results are pretty good in this case. Unfortunately, if you find significant errors in Acrobat’s OCR interpretations, there is no easy method for making corrections to the PDF.
In all fairness, we offer this tip more as an example of a lesser-known feature than because it has proven terribly useful to us on a specific project.
Who would want to create a searchable PDF? A classic example would be an attorney who needs to scan in many types of printed documents. The original appearance and signatures can’t be altered, but they need to be able to search documents later for certain phrases or words.
This technique might also be a way to incorporate older marketing materials into a company’s library of searchable PDFs.
Finally, for designers who don’t need dedicated OCR software, this might be a quick way to scan in and convert documents to text rather than retyping everything from scratch. For example, a client might provide a print-out of needed copy but no Word file, or a project might require extensive quotes from printed magazine or newspaper articles.