Sometimes, especially when you are doing a big OCR project, you might want to find all the PDFs that are not searchable. That is to say, you want to find the PDFs that have not been OCR-ed.
It turns out that this is not as easy as you might think. Here are a few ways to “sort of” do it. As much as possible I wanted to limit this to search capabilities built into the operating system, or to applications that you might already have.
Mac OS X Spotlight
It occurred to me that, chances are, almost any PDF that has been made searchable will have at least one space in its text. So, why not use Spotlight to find all PDFs that don’t contain a space? Fire up Spotlight by pressing Command-Space and type the following:
kind:pdf NOT intext:" "
Is this a perfect test? No, but hopefully it will get you most of the way there.
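If you would rather script the same idea, here is a rough Python sketch of the space test. It assumes some other tool has already pulled the text out of each PDF. The `extracted` dictionary, the file names, and the function name are all made up for illustration, and the sketch assumes an image-only PDF comes back as an empty or space-free string.

```python
def pdfs_without_spaces(extracted):
    """Return the names of PDFs whose extracted text contains no space.

    `extracted` is a hypothetical mapping of file name -> text that some
    other tool has already pulled out of each PDF.
    """
    return sorted(name for name, text in extracted.items() if " " not in text)


if __name__ == "__main__":
    # Made-up sample data standing in for real extraction results.
    sample = {
        "scanned-receipt.pdf": "",                 # image only, no text layer
        "ocr-invoice.pdf": "Invoice total 42.00",  # has a text layer
    }
    for name in pdfs_without_spaces(sample):
        print(name)
```

Just like the Spotlight query, this only flags candidates. A searchable PDF whose text happens to contain no spaces would be flagged too.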
Microsoft Windows
I had hoped to do the same thing with Windows Search in Windows 7, but it didn’t work. It doesn’t seem to let you search for just a space. The closest I could come was to search for the word “the”. Obviously this works for English only, so hopefully your language has an equivalent word that appears in almost every document.
Start up Windows Search by pressing Windows Key-F and type the following:
ext:pdf NOT contents:the
That is not as likely to succeed as just searching for a space, but should get you most of the way there.
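For other languages, the common-word test is easy to adapt if you script it. Below is a minimal Python sketch, again assuming the text has already been extracted from the PDF by some other tool. The function name and the `words` parameter are my own invention, and matching on whole words is an assumption about how you would want the test to behave, not a description of what Windows Search does internally.

```python
import re


def lacks_common_word(text, words=("the",)):
    """Return True if none of `words` appears in `text` as a whole word."""
    for word in words:
        # \b keeps "the" from matching inside "other" or "theme".
        pattern = r"\b" + re.escape(word) + r"\b"
        if re.search(pattern, text, re.IGNORECASE):
            return False
    return True


if __name__ == "__main__":
    # Empty text (no OCR layer) has no common word, so it gets flagged.
    print(lacks_common_word(""))
    # A German document could be tested with German articles instead.
    print(lacks_common_word("Der Hund schläft", words=("der", "die", "das")))
```

Passing several words makes the test a little more forgiving, since a searchable document only needs to contain one of them to pass.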
Adobe Acrobat
Adobe Acrobat has some features that may help. You can use Acrobat Pro’s Preflight feature, or even do a Batch Process Accessibility Report.
None of these searches are 100% guaranteed to succeed, but hopefully they will help you down the path. Thanks to DocumentSnap reader Matt for the idea for this post.
Do you have any tricks for finding non-OCR’ed PDFs? Share in the comments.
(Photo by Dirigentens)