Should You OCR That Document?

A few months ago, Ernest Svenson from PDF For Lawyers wrote a very interesting piece: When should you OCR documents? A quick primer.

His premise is that you should be very selective about the documents that you make searchable by applying Optical Character Recognition.

Many people know that OCR stands for ‘optical character recognition,’ or if they don’t know that then they know that OCR is what you do to a scanned document to make it text-searchable. When you buy a new scanner like the Fujitsu ScanSnap it’ll come with OCR software, and most people get the idea that they should OCR all documents that they scan. I don’t recommend this, and don’t know many “paperless experts” who recommend it.

He makes an interesting case: performing OCR takes too long, the resulting files take up more space, and most of the time we don’t need to search the stuff that we scan anyway, so why waste the time?

I have to admit, I take a different approach. I do tend to OCR almost everything. I have a few reasons why:

I tend to “pre-organize” my documents so that I can scan like types together. This way, I don’t usually have the problem of having to wait for the OCR to finish before scanning the next document. I just put the stack in and hit go.
I don’t personally find that it takes all that long to OCR.
Storage is getting cheaper and cheaper, and if my PDFs are a little larger, that is not something I personally worry about too much.
I find that you never know what you will need to find until you need to find it. I prefer to err on the side of making documents more findable and not less.
The biggest reason is: I don’t like having to make decisions about this sort of stuff. Every decision you have to make while scanning is one more opportunity for things to go off the rails. I prefer to “set it and forget it”.

As Ernest points out, some of this can be mitigated by doing batch OCR in ABBYY FineReader or in Acrobat.

I am not a lawyer, so it could be that the paper volume and time associated with a legal office makes selective OCR more important. Having OCR mostly on works well for me, but each situation is of course different.

How about you? Do you pick and choose what you OCR?

(Photo by orangeacid)





About the Author

Brooks Duncan helps individuals and small businesses go paperless. He's been an accountant, a software developer, a manager in a very large corporation, and has run DocumentSnap since 2008. You can find Brooks on Twitter at @documentsnap or @brooksduncan. Thanks for stopping by.

7 comments

Eric Lorenz - August 27, 2016

Since all my documents end up in Evernote anyway…one way or another they get OCR’d. 🙂

Fred - May 2, 2012

I agree with Lenny: OCR enables full-text searching, and that is quite valuable. Another thing I've found helpful is naming scanned documents so that you don't have to open a document to see what it is about; instead, the name tells you what is in it. While this doesn't solve the "where did I file it" issue, it does help in a directory where you've placed a large number of files, such as credit card receipts. We use a standardized syntax like this:
YYYY-MM-DD, followed by the amount charged, followed by the credit card that was charged, and then where the charge occurred. For example: 2012-05-01_10.00_AMEX at Costco for Gas.
This seems to help us a lot.
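
For anyone who wants to automate that naming pattern, here is a minimal sketch in Python. The function name `receipt_filename`, its parameter names, and the `.pdf` extension are assumptions made for illustration; only the date_amount_card layout and the example name come from the comment above.

```python
from datetime import date
from decimal import Decimal


def receipt_filename(charged_on: date, amount: Decimal, card: str, where: str) -> str:
    """Build a receipt name such as '2012-05-01_10.00_AMEX at Costco for Gas.pdf'.

    Hypothetical helper: the '.pdf' suffix is an assumption, not part of
    the convention as described.
    """
    # ISO date (YYYY-MM-DD) first, so files sort chronologically by name.
    date_part = charged_on.isoformat()
    # Always show two decimal places, so 10 becomes 10.00.
    amount_part = f"{amount:.2f}"
    # Card and free-text location are separated by a space, as in the example.
    return f"{date_part}_{amount_part}_{card} {where}.pdf"


name = receipt_filename(date(2012, 5, 1), Decimal("10"), "AMEX", "at Costco for Gas")
print(name)
```

Putting the date first in YYYY-MM-DD form means a plain alphabetical sort of the folder is also a chronological sort.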

Brooks Duncan - May 2, 2012

Great point, thanks Fred. I am a huge believer in a descriptive, consistent naming convention. Thanks for bringing that up.

Lenny - May 2, 2012

I don't see the point of scanning documents if you are not going to OCR them. For me, being able to find any document quickly is important, and it's just too hard to find a document if the only way to locate it is by knowing the filename and directory.

For example, what if I need to locate a receipt from a restaurant? Do I look for it under the "meals" directory or the "meetings" directory? I don't know, because sometimes I have business meetings in restaurants, and sometimes I have meals. They need to be separated because they are taxed differently. Am I just supposed to start opening PDFs at random until I find it?

Or what if I want to find all invoices from a particular contractor? My invoices are organized by property, and he may work on multiple properties. If I don't OCR, how can I find them? Or what if I want to know how many times the toilet on a particular property got clogged so I can decide whether or not to replace the toilet or have the pipes worked on? I work with many plumbers–you never know who will be able to get to the property first, and plumbing issues generally need to be resolved ASAP–how can I pull up invoices for a property that mention "toilet"?

For me, there's no point in scanning with no OCR. I might as well shove paper into a filing cabinet if I don't want to ever find it again.