How To Find PDFs That Are Not Searchable

Sometimes, especially when you are doing a big OCR project, you might want to find all the PDFs that are not searchable. That is to say, you want to find the PDFs that have not been OCR-ed.

It turns out that this is not as easy as you might think. Here are a few ways to “sort of” do it. As much as possible I wanted to limit this to search capabilities built into the operating system, or to applications that you might already have.

Mac OS X Spotlight

It occurred to me that, chances are, almost any PDF that has been made searchable will have at least one space in it. So, why not use Spotlight to find all PDFs that don’t have a space? Fire up Spotlight by pressing Command-Space and type the following:

kind:pdf NOT intext:" "

Is this a perfect test? No, but hopefully it will get you most of the way there.
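If you would rather script the same idea than rely on Spotlight, here is a minimal Python sketch of it. It is not a definitive tool, and it makes a few assumptions that are not part of the Spotlight approach above: that Poppler's pdftotext command is installed and on your PATH, and that "has a space" can be approximated as "the extracted text contains at least two words separated by whitespace". The function names are made up for illustration.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterator


def pdftotext_extract(pdf: Path) -> str:
    """Return a PDF's text layer via Poppler's pdftotext (assumed to be installed)."""
    result = subprocess.run(
        ["pdftotext", str(pdf), "-"],  # "-" sends the extracted text to stdout
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # tolerate odd bytes instead of raising
        check=False,       # unreadable PDFs simply yield no text
    )
    return result.stdout


def looks_searchable(text: str) -> bool:
    """Space heuristic: at least two words separated by whitespace."""
    return len(text.split()) >= 2


def find_unsearchable(
    root: Path,
    extract: Callable[[Path], str] = pdftotext_extract,
) -> Iterator[Path]:
    """Yield every PDF under root whose extracted text fails the heuristic."""
    for path in sorted(root.rglob("*")):
        # Compare the suffix case-insensitively so "SCAN.PDF" is included too.
        if path.is_file() and path.suffix.lower() == ".pdf":
            if not looks_searchable(extract(path)):
                yield path


if __name__ == "__main__":
    # Hypothetical usage: list the suspects under the current folder.
    for pdf in find_unsearchable(Path(".")):
        print(pdf)
```

Like the Spotlight query, this is only a heuristic. A scanned PDF that happens to carry a stray text annotation may slip through, and a damaged PDF that pdftotext cannot read will be reported as not searchable.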

Microsoft Windows

I had hoped to do the same thing with Windows Search in Windows 7, but it didn’t work. It doesn’t seem to let you search for just a space. The closest I could come was to search for the word “the”. Obviously this is English only, so hopefully your language has an equivalent word that appears in almost every document.

Start up Windows Search by pressing Windows Key-F and type the following:

ext:pdf NOT contents:the

That is not as likely to succeed as just searching for a space, but should get you most of the way there.
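For other languages, the same "common word" test is easy to express in a few lines of Python once you have a document's text in hand. This is only a sketch of the idea, not how Windows Search works internally: the function name is hypothetical, the whole-word, case-insensitive matching is my assumption, and the German word list in the usage lines is just an example.

```python
import re
from typing import Iterable


def contains_common_word(text: str, words: Iterable[str] = ("the",)) -> bool:
    """Return True if any of the given words appears in text as a whole word."""
    word_list = [w for w in words if w]
    if not word_list:
        return False  # nothing to look for
    # \b anchors keep "the" from matching inside "other" or "weather".
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in word_list) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Example: pick words that show up in nearly every document in your language.
print(contains_common_word("The invoice is attached."))
print(contains_common_word("Der Hund und die Katze", words=("der", "die", "das")))
```

A PDF whose extracted text fails this check is a candidate for OCR, with the same caveat as the search above: a short searchable document may simply not contain the word.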

Adobe Acrobat

Adobe Acrobat has some features that may help. You can use Acrobat Pro’s Preflight feature, or even do a Batch Process Accessibility Report.

None of these searches are 100% guaranteed to succeed, but hopefully they will help you down the path. Thanks to DocumentSnap reader Matt for the idea for this post.

Do you have any tricks for finding non-OCR’ed PDFs? Share in the comments.

(Photo by Dirigentens)

About the Author

Brooks Duncan helps individuals and small businesses go paperless. He's been an accountant, a software developer, a manager in a very large corporation, and has run DocumentSnap since 2008. You can find Brooks on Twitter at @documentsnap or @brooksduncan. Thanks for stopping by.

3 comments

Tom - January 18, 2011

I have found that the autotag feature in Yep works for this. There is a built-in smart group for untagged PDFs. I select them all, autotag, and then the ones that don't get any tags probably aren't searchable.

    Sascha - May 29, 2011

    I doubt that. Tags have nothing to do with the OCR text layer on a PDF. If a PDF doesn't have tags, it doesn't mean it doesn't have OCR.
