Making Acrobat OCR'ed PDFs Smaller With Formatted Text & Graphics


One complaint people have about the PDFs that Acrobat produces when doing OCR, whether you run it manually or via an Acrobat OCR AppleScript, is that the files can get really big.

There are a few solutions to this, but one of them is to change the PDF Output Style.

The default that Acrobat uses is called Searchable Image. What that does is place all the OCR’ed text etc. “behind” the image, so that when you view the PDF you are looking at the original image, but you can copy and search on the text.

However, there’s another setting. If you choose the PDF Output Style of Formatted Text & Graphics, Acrobat will actually convert the image of the text into text itself, formatted to match whatever style was there before.

I did a simple test this morning and here is what I found:

  • Scanned Document before OCR: 312K
  • OCR with Acrobat Searchable Image: 940K
  • OCR with Acrobat Formatted Text & Graphics: 60K (!)
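To put those numbers in perspective, here is a quick Python sketch that expresses each result as a percentage of the original scan. The sizes are the ones from my test above; the `relative_size` helper is just a name I made up for the illustration.

```python
# File sizes in KB, taken from the test results listed above.
sizes_kb = {
    "Scanned document before OCR": 312,
    "Searchable Image": 940,
    "Formatted Text & Graphics": 60,
}


def relative_size(size_kb, baseline_kb):
    """Return size_kb as a percentage of baseline_kb."""
    return 100.0 * size_kb / baseline_kb


baseline = sizes_kb["Scanned document before OCR"]
for label, size in sizes_kb.items():
    percent = relative_size(size, baseline)
    print(f"{label}: {size}K ({percent:.0f}% of the original scan)")
```

In other words, Searchable Image came out at roughly three times the size of the original scan, while Formatted Text & Graphics came out at roughly a fifth of it.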

To change Acrobat to FT&G, here is what you do:

  • Go to Document -> OCR Text Recognition -> Recognize Text Using OCR…
  • Click the Edit button


  • In PDF Output Style, change to Formatted Text & Graphics
  • Hit OK

Acrobat will now use Formatted Text & Graphics, and should keep that setting for your future scans too.

What’s The Catch?

As with anything, there is a downside. Acrobat does its best to make the text look like what was there before, but it is not perfect. Also, anything that is mis-OCR’ed will actually show up in the document, because you are now looking at the recognized text rather than the original image.

It depends on what your objectives are. If you want an exact replica of what you are scanning, you’ll probably want to use Searchable Image.

However, if size is your main concern and you just want to have a fairly-faithful representation, Formatted Text & Graphics may be the way to go.
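If you want to check the trade-off on your own scans, one option is to save a copy of the same document with each output style and compare the sizes. Here is a minimal Python sketch for that. It is not part of Acrobat; the script name and the PDF file names in the usage comment are placeholders, and `size_kb` and `report` are helper names invented for this example.

```python
import sys
from pathlib import Path


def size_kb(path):
    """Return the size of the file at `path` in kilobytes (1K = 1024 bytes)."""
    return Path(path).stat().st_size / 1024


def report(paths):
    """Return one line per file with its size, smallest file first."""
    sized = sorted((size_kb(p), str(p)) for p in paths)
    return [f"{kb:8.0f}K  {name}" for kb, name in sized]


if __name__ == "__main__":
    # Usage (file names are placeholders):
    #   python pdfsizes.py scan-searchable-image.pdf scan-formatted-text.pdf
    for line in report(sys.argv[1:]):
        print(line)
```

Run it with the two PDFs as arguments and it prints them in order of size, which makes it easy to see how much each style saves for your kind of document.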

Do you have any other tricks for making PDFs smaller?

About the Author

Brooks Duncan helps individuals and small businesses go paperless. He's been an accountant, a software developer, a manager in a very large corporation, and has run DocumentSnap since 2008. You can find Brooks on Twitter at @documentsnap or @brooksduncan. Thanks for stopping by.

11 comments

bill - April 10, 2013

When I do OCR with the default settings (Searchable Image), my PDF file size always drops by about half. So I don't seem to need to follow this tip.

I was trying to figure out why, and it could be the "downsample to 600 dpi" option that is helping me. Maybe my default scanner option is much higher.

Regardless, thanks for the info.

nodis - February 12, 2010

This is a great tip, but somewhat out of date. Acrobat 9 has a new technology for OCR'd PDFs called ClearScan that results in dramatically smaller file sizes and crisper PDFs. More about it here: http://blogs.adobe.com/acrolaw/2009/05/better_pdf

(In case that link isn't visible, just Google 'acrobat clearscan ocr' and see the post in the Acrobat for Legal Professionals blog.)

Sarah - August 4, 2009

@brooksd
I have unsearchable PDFs from the ScanSnap AND the searchable ones in DEVONthink, because I had DEVONthink do the OCR. I now want to run OCR on the unsearchable ones that are not in DEVONthink. Does that make sense?

    Rob - September 5, 2009

    Use one of the AppleScripts available here and do the OCR with Acrobat or PDFpen. I think it was MacSparky that recently had scripts for PDFpen.

      Brooks Duncan - September 5, 2009

      Thanks for that, Rob. I have been meaning to point to that PDFpen script for a while but haven't had a chance. This reminded me. 🙂

Sarah - August 3, 2009

I have a zillion scanned PDFs from my ScanSnap but I haven't run OCR on them yet. I have been having DEVONthink do that, but now I am thinking that having all of my PDFs OCR'd would be helpful. ABBYY FineReader seems to make a mess of the PDFs, taking forever and giving them super long file names (in addition to keeping the original PDF, which I no longer want). How do you all handle this? Help?

    Brooks Duncan - August 4, 2009

    @Sarah Do you mean you have PDFs that are currently searchable in DEVONthink, and you want to take them out of DEVONthink but still have them searchable?

pendolino - July 26, 2009

@brooks – Thanks for this tip. I just used it on a one-page typed letter I received, and the file went from 740 KB to 40 KB. That is pretty much in line with what you got. But I have a mysterious problem with Adobe Acrobat Standard 7.0 after OCRing: it crashes a few seconds after the process is complete. I have to save the doc fast or else the OCR won't be saved. Has anyone seen this before?

Is there a way to set this behavior as the default for OCR scans in Acrobat? It's a bit annoying having to dig down into the options every time.

    Brooks Duncan - July 27, 2009

    @pendolino – In the version of Acrobat that I have, 8.0 Mac, it remembers which OCR setting you used last time, so you don't have to go in and set it every time. So if 7.0 doesn't do that, then you may need to wait for 8.0?

      pendolino - July 27, 2009

      @brooksd – I just discovered the same with 7.0 moments before getting your response. Thanks. Looks like it sticks.
