Making Acrobat OCR’ed PDFs Smaller With Formatted Text & Graphics

One complaint people have about the PDFs that Acrobat kicks out when doing OCR, whether you run it manually or via an Acrobat OCR AppleScript, is that the files can get really big.

There are a few solutions to this, but one of them is to change the PDF Output Style.

The default that Acrobat uses is called Searchable Image. It places the OCR’ed text “behind” the image, so that when you view the PDF you are looking at the original image, but you can still copy and search the text.

However, there’s another setting. If you choose the PDF Output Style of Formatted Text & Graphics, Acrobat actually converts the image of the text into real text, formatted to match whatever style was there before.

I did a simple test this morning and here is what I found:

  • Scanned Document before OCR: 312K
  • OCR with Acrobat Searchable Image: 940K
  • OCR with Acrobat Formatted Text & Graphics: 60K (!)
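To put those three numbers in perspective, here is a small sketch that works out the percentage change relative to the original scan. The sizes are the ones from my test above; the function name `percent_change` is just something I made up for illustration.

```python
# File sizes in kilobytes, as reported in the test above.
sizes_kb = {
    "Scanned, before OCR": 312,
    "Searchable Image": 940,
    "Formatted Text & Graphics": 60,
}


def percent_change(before_kb, after_kb):
    """Return the size change from before to after as a percentage."""
    return (after_kb - before_kb) / before_kb * 100


baseline = sizes_kb["Scanned, before OCR"]
for label, size in sizes_kb.items():
    change = percent_change(baseline, size)
    print(f"{label}: {size}K ({change:+.0f}% vs. original scan)")
```

In other words, Searchable Image roughly tripled the file, while Formatted Text & Graphics came out about 81% smaller than the original scan.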

To change Acrobat to Formatted Text & Graphics, here is what you do:

  • Go to Document -> OCR Text Recognition -> Recognize Text Using OCR…
  • Click the Edit button

  • In PDF Output Style, change to Formatted Text & Graphics
  • Hit OK

Acrobat will now use Formatted Text & Graphics, and should keep that setting for your future scans too.

What’s The Catch?

As with anything, there is a downside. Acrobat does its best to make the text look like what was there before, but it is not perfect. Also, anything that is mis-OCR’ed will actually show up in the document, because the recognized text replaces the image of the text instead of sitting hidden behind it.

It depends on what your objectives are. If you want to have the exact replica of what you are scanning, you’ll probably want to use Searchable Image.

However, if size is your main concern and you just want to have a fairly-faithful representation, Formatted Text & Graphics may be the way to go.
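If you want to find out which of your scans are worth the trade-off, a rough sketch like the one below lists the PDFs in a folder from largest to smallest. It has nothing to do with Acrobat itself; it only reads file sizes from disk. The function name `pdf_sizes_kb` and the `"."` folder are placeholders I picked for the example.

```python
from pathlib import Path


def pdf_sizes_kb(folder):
    """Return (file name, size in KB) for each PDF in folder, largest first."""
    pdfs = [
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    ]
    sizes = [(p.name, p.stat().st_size / 1024) for p in pdfs]
    return sorted(sizes, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # "." is a placeholder; point this at the folder holding your scans.
    for name, kb in pdf_sizes_kb("."):
        print(f"{kb:8.0f}K  {name}")
```

Running it before and after re-OCR’ing a few files is an easy way to check whether you get savings similar to mine.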

Do you have any other tricks for making PDFs smaller?



10 Responses to “Making Acrobat OCR’ed PDFs Smaller With Formatted Text & Graphics”

  1. pendolino 26. Jul, 2009 at 7:16 am #

    @brooks – thanks for this tip. i just used it on a one page typed letter i received and the difference was going from 740 to 40 KB. that pretty much is in line with what you got but i have an interesting mysterious problem with adobe standard 7.0 after OCRing where it just crashes a few seconds after the process is complete. i have to save the doc fast or else the OCR wont be saved. has anyone seen this before?

    is there a way to set this behavior in Adobe as default for OCR scans? it's a bit annoying having to dig down into the options every time.

    • BrooksD 27. Jul, 2009 at 12:37 pm #

      @pendolino – In the version of Acrobat that I have, 8.0 Mac, it remembers which OCR setting you used last time, so you don't have to go in and set it every time. So if 7.0 doesn't do that, then you may need to wait for 8.0?

      • pendolino 27. Jul, 2009 at 12:47 pm #

        @brooksd – i just discovered the same with 7.0 moments before getting your response. thanks. looks like it sticks.

  2. Sarah 03. Aug, 2009 at 4:22 pm #

    I have a zillion scanned PDFs from my ScanSnap but I haven't run OCR on them yet. I have been having DEVONthink do that, but now I am thinking that having all of my PDFs OCR'ed would be helpful. ABBYY FineReader seems to make a mess of the PDFs, taking forever and making a super long file name out of them (in addition to keeping the original PDF, which I no longer want). How do you all handle this? Help?

    • BrooksD 04. Aug, 2009 at 1:00 pm #

      @Sarah Do you mean you have PDFs that are currently searchable in Devonthink, but you want to take them out of Devonthink but still have them searchable?

  3. Sarah 04. Aug, 2009 at 7:30 am #

    @brooksd
    I have unsearchable PDFs from the scansnap AND the searchable ones in devonthink bc I had devonthink do the OCR. I now want to run OCR on the unsearchable ones not in dt. Does that make sense?

    • Rob 05. Sep, 2009 at 7:51 pm #

      Use one of the Applescripts available here and do the OCR with Acrobat or PDF Pen. I think it was MacSparky that recently had scripts for PDF Pen.

      • BrooksD 05. Sep, 2009 at 11:54 pm #

        Thanks for that Rob. I have been meaning to point to that PDFPen script for a while but haven't had a chance. This reminded me. :)

  4. nodis 12. Feb, 2010 at 5:16 pm #

    This is a great tip, but somewhat out of date. Acrobat 9 has a new technology for OCR'ed PDFs called ClearScan that results in dramatically smaller file sizes and crisper PDFs. More about it here http://blogs.adobe.com/acrolaw/2009/05/better_pdf...

    (In case that link isn't visible, just Google 'acrobat clearscan ocr' and see the post in the Acrobat for Legal Professionals blog.)

    • BrooksD 12. Feb, 2010 at 5:20 pm #

      Thanks for the tip, nodis!
