Tag Archives: acrobat

OCR AppleScript For Adobe Acrobat X

Once upon a time, Mac author Joe Kissell wrote an AppleScript that would allow you to automatically OCR PDF documents using Adobe Acrobat 8.

Back in 2008, I turned that Acrobat OCR AppleScript into a Droplet and posted it to the site.

Since the release of Acrobat X, I have received many many (have I mentioned many?) requests for an Acrobat X version of the script.

Unfortunately, this has been difficult for two reasons:

  1. Acrobat X is notoriously challenging to script
  2. I don’t own Acrobat X

Fortunately, the aforementioned Joe Kissell comes to the rescue yet again. As part of his excellent Take Control Of Your Paperless Office, Joe provided his readers with an update that includes OCR AppleScripts for Acrobat X.

Joe was kind enough to share this update with DocumentSnap readers, so if these scripts help you out please do me a favor and go buy Joe’s Ebook. It really is great.

(Photo by Aurelian Săndulescu)

Comments ( 2 )

How To Set PDF Keywords In Microsoft Windows

The ScanSnap Organizer program that comes with Windows versions of the Fujitsu ScanSnap is pretty good, but it does have one big limitation that DocumentSnap reader Katherine from Austin, Texas ran into: you can’t OCR or set keywords on PDF files that were not created by the ScanSnap scanner.

I couldn’t find a built-in way to set PDF keywords in Windows 7 like there is on the Mac (if anyone knows one, please leave it in the comments), but here are a few options for doing it.

A-PDF Info Changer

A-PDF Info Changer is a handy Windows utility that lets you open up a PDF and set all the associated metadata such as Author, Title, Subject and Keywords.

It is freeware, but they do request a donation so if you find it useful kick them a few bucks. It has a command-line version for $35 that lets you manipulate a bunch of PDFs all at once.

For the free version, just fire it up and set your keywords separated by commas. Then hit Save File and you are done.

By the way, A-PDF has a huge number of little PDF utilities, many with freeware versions, that are worth checking out. If you need to do something with PDFs on Windows, chances are they have a utility to do it.

Adobe Acrobat

It would be overkill to buy Acrobat just for this purpose, but if you have a ScanSnap S1500 or ScanSnap S1500M you already own it.

Open up the PDF in Acrobat and go to File > Properties and you can enter the keywords in the Keywords box.

Any other tricks to set keywords and PDF metadata on Windows? Leave a comment and let us know.

Comments ( 5 )

How To Find PDFs That Are Not Searchable

Sometimes, especially when you are doing a big OCR project, you might want to find all the PDFs that are not searchable. That is to say, you want to find the PDFs that have not been OCR-ed.

It turns out that this is not as easy as you might think. Here are a few ways to “sort of” do it. As much as possible I wanted to limit this to search capabilities built into the operating system, or to applications that you might already have.

Mac OS X Spotlight

It occurred to me that, chances are, almost any PDF that has been made searchable will have at least one space in it. So, why not use Spotlight to find all PDFs that don’t have a space? Fire up Spotlight by going Command-Space and type the following:

kind:pdf NOT intext:" "

Is this a perfect test? No, but hopefully it will get you most of the way there.

Microsoft Windows

I had hoped to do the same thing with Windows Search in Windows 7, but it didn’t work. It doesn’t seem to let you search for just a space. The closest I could come was to search for the word “the”. Obviously this is English only, so hopefully your language has an equivalent word that appears in almost every document.

Start up Windows Search by pressing Windows Key-F and type the following:

ext:pdf NOT contents:the

That is not as likely to succeed as just searching for a space, but should get you most of the way there.

Adobe Acrobat

Adobe Acrobat has some features that may help. You can use Acrobat Pro’s Preflight feature, or even do a Batch Process Accessibility Report.

None of these searches are 100% guaranteed to succeed, but hopefully they will help you down the path. Thanks to DocumentSnap reader Matt for the idea for this post.
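If you are comfortable running a script, another rough check is to look inside each PDF for font resources. A scan that has never been OCR-ed is usually just page images, while a searchable PDF normally references at least one font for its text layer. The Python sketch below works under that assumption and is a heuristic, not a definitive test. The function names `looks_searchable` and `find_unsearchable` are my own, and PDFs that keep their objects in compressed object streams may be wrongly reported as not searchable.

```python
from pathlib import Path


def looks_searchable(pdf_bytes: bytes) -> bool:
    """Heuristic: a PDF with a text layer normally references a /Font resource."""
    return b"/Font" in pdf_bytes


def find_unsearchable(folder: str) -> list[Path]:
    """Return the PDFs under folder (recursively) that show no sign of a text layer."""
    return sorted(
        path
        for path in Path(folder).rglob("*.pdf")
        if not looks_searchable(path.read_bytes())
    )
```

Calling `find_unsearchable("Scans")` (the folder name is just an example) gives you a list of candidates to run through OCR. Like the searches above, treat the result as a starting point and spot-check a few files.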

Do you have any tricks for finding non-OCR’ed PDFs? Share in the comments.

(Photo by Dirigentens)

Comments ( 3 )

OCR Smackdown: ABBYY FineReader vs. Adobe Acrobat

A very common request that I get here at DocumentSnap is to compare the Optical Character Recognition (OCR) capabilities of ABBYY FineReader with Adobe Acrobat. Why? Well, for starters, both of them come included with models of the Fujitsu ScanSnap as well as other scanners.

I decided to do a quick test comparing the OCR of the two packages using the following criteria:

  • OCR Speed
  • Resulting File Size
  • Accuracy

The Hardware

For a scanner I used my ScanSnap S1300.

I used two computers for the test:

  • Windows: A new cheap Acer laptop with a Core i3 2.40 GHz processor and 4 GB RAM running Windows 7
  • Mac: An old 2.5 GHz Intel Core 2 Duo MacBook Pro with 4 GB RAM running Mac OS X Snow Leopard

The Software

Here are the packages I used:

  • Windows: ABBYY FineReader For ScanSnap 4.1 (called from ScanSnap Manager) vs. Adobe Acrobat 9 Pro
  • Mac: ABBYY FineReader For ScanSnap 4.1 (run standalone) vs. Adobe Acrobat 8 Pro

Yes, I realize that Adobe Acrobat X is out, but since I am not aware of any scanners that come bundled with it yet, I decided to stick with the versions that ship with the ScanSnap. I’ll cover Acrobat X in a later post.

The Document

I scanned a magazine article for this test. It probably would have been better to do this with a bunch of different documents to compare, but hey.

In all cases except one, I scanned without OCR so that I could run it standalone later. Here’s some info on the document that I used:

  • Pages: 2
  • Scan Quality: 300dpi, Color
  • Resulting File Size: 1.5 MB
  • Columns: 2, with some images

Maybe I am blind, but I couldn’t figure out a way to run ABBYY FineReader for ScanSnap on Windows standalone. If you know how, please leave a message in the comments. In that test, I re-scanned with “Create Searchable PDF” checked in the ScanSnap Manager settings.

The Settings

I tried not to do too many fancy settings to keep things as “real-life” as possible. There were essentially three configurations:

ABBYY FineReader

I set Save Mode to “Text under page image” and Quality to High. These were the settings for ABBYY on the Mac, and I believe they are what ScanSnap Manager on Windows uses as well.

Adobe Acrobat (Normal)

I set the output style to “Searchable Image (Exact)” because leaving it just as Searchable Image in my experience has caused some weird things to happen with the resulting PDF. I used these settings on both Windows and Mac.

Adobe Acrobat (With ClearScan)

In Acrobat 9 there is a setting called ClearScan. I used that as an additional test to see what the difference is.

Speed

Windows

  • ABBYY Windows: 20.5 seconds
  • Acrobat 9: 13.9 seconds
  • Acrobat 9 With ClearScan: 17.6 seconds

Mac

  • ABBYY Mac: 44.7 seconds
  • Acrobat 8: 20.2 seconds

Winner: Acrobat!

Since they are different machines, you can’t directly compare the Windows and Mac times, but clearly in both cases Acrobat is faster.

File Size

The non-OCR’ed PDF was 1.5 MB.

Windows

  • ABBYY Windows: 1.7 MB (+.2 MB)
  • Acrobat 9: 1.5 MB (same)
  • Acrobat 9 With ClearScan: 315 KB (-1.16 MB)

Mac

  • ABBYY Mac: 1.4 MB (-.1 MB)
  • Acrobat 8: 1.5 MB (same)

Winner: Acrobat 9 with ClearScan!

With an astonishing 1.16 MB reduction in file size after OCR, Acrobat 9 with ClearScan is the winner. Wow.
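To put those sizes in relative terms, here is a small Python sketch that turns each result into a percent change from the 1.5 MB original. It uses the rounded figures listed above and assumes 1 MB = 1000 KB, so the output is approximate; the `percent_change` helper is just an illustrative name.

```python
BASELINE_MB = 1.5  # size of the non-OCR'ed PDF

# Sizes after OCR, in MB, as listed above (315 KB taken as 0.315 MB).
sizes_mb = {
    "ABBYY Windows": 1.7,
    "Acrobat 9": 1.5,
    "Acrobat 9 With ClearScan": 0.315,
    "ABBYY Mac": 1.4,
    "Acrobat 8": 1.5,
}


def percent_change(new_mb: float, old_mb: float = BASELINE_MB) -> float:
    """Percent change in size relative to the original file (negative = smaller)."""
    return (new_mb - old_mb) / old_mb * 100


for name, size in sizes_mb.items():
    print(f"{name}: {percent_change(size):+.0f}%")
```

On these numbers the ClearScan file comes out roughly 79% smaller than the original, while the other results stay within about 13% of it either way.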

Accuracy

Here is a passage from the article:

Article Text Before OCR

Let’s see how each of the packages did:

ABBYY Windows

The spreadsheet has become the virtual “slide rule” for CMAs. It’s used for everything from preliminary strategic plans to financial statements. As with any familiar method, it finds its way into numerous situations where better alternatives are available, mostsignificantly in itswidespread use as a de facto reporting tool.
The appeal of the spreadsheet as the quickest way to get a report out is not hard to appreciate. “Excel is probably the most comfortable environment for a lot of financial professionals,” Alok Ajmera, vice-president, professional services withMississauga, Ont.-basedProphixSoftware, says. “There’s a very little learning curve, you can effectively do whatever you want with the data, and it works fairly well in smaller organizations.”
Periodic and complex reporting in processes like revenue management or cost management, however, is where the spreadsheet model really starts to break down.

Acrobat 9 Windows

T he spreadsheet has become the virtual “slide rule” for CMAs. It’s used for everything from preliminary su·ategic plans to financial statements. As with any farniliar method, it finds its way into numerous situations where better alternatives are available, most significantly in its widespread use as a de facto reporting tool.
The appeal of tlle spreadsheet as the quickest way to get a report out is not hard to appreciate. “Excel is probably tlle most comfortable environment for a lot of financial professionals,” AJok Ajmera, vice-president, professional services with Mississauga, Ont.-based Prophix Software, says. “There’s a very little learning curve, you can effectively do whatever you want witll tlle data, and it works fairly well in smaller organizations.”
Periodic and complex reporting in processes like revenue management or cost management, however, is where the spreadsheet model really starts to break down.

Acrobat 9 With ClearScan

The spreadsheet has become the virtual “slide rule” for CMAs. It’s used for everything from preliminary su·ategic plans to financial statements. As with any farniliar method, it finds its way into numerous situations where better alternatives are available, most significantly in its widespread use as a de facto reporting tool.
The appeal of tlle spreadsheet as the quickest way to get a report out is not hard to appreciate. “Excel is probably tlle most comfortable environment for a lot of financial professionals,” AJok Ajmera, vice-president, professional services with Mississauga, Ont.-based Prophix Software, says. “There’s a very little learning curve, you can effectively do whatever you want witll tlle data, and it works fairly well in smaller organizations.”
Periodic and complex reporting in processes like revenue management or cost management, however, is where the spreadsheet model really starts to break down.

ABBYY Mac

The spreadsheet has become the virtual “slide rule” for CiMAs. It’s used for everything from preliminary strategic plans to financial statements. As with any familiar method, it finds its way into numerous situations where better alternatives are available, most significantly in its widespread use as a de facto reporting tool.
The appeal of die spreadsheet as the quickest way to get a report out is not hard to appreciate. “Excel is probably the most comfortable environment for a lot of financial professionals,” Alok Ajmera, vice-president, professional sendees with Mississauga, Ont.-based Prophix Software, says. “There’s a very little learning curve, you can effectively do whatever you want with the data, and it works fairly well in smaller organizations.”
Periodic and complex reporting in processes like revenue management or cost management, however, is where the spreadsheet model really starts to break down.

Acrobat 8 Mac

T he spreadsheet has become the virtual “slide rule” for CMAs. It’s used for everything frorn preliminary strategic plans to financial statements. Aswith any familiar method, it finds its way into numerous situations where better alterna tives are available, most significantly in its widespread use as a de facto reporting tool.
T he appeal of the spreadsheet as the quickest
way to get a report out is not hard to appreciate.
“Excel is probably the most comfortable
environment for a lot of financial professionals,” avaJlaun:.:,JIIU:::’l;)It;IIIULauuy1111l::>WIUC::>PU:C1U uocd::>
a de facto reporting tool. T he appeal of the spreadsheet as the quickest
way to get a report out is not hard to appreciate. “Excel is probably me most comfortable environment for a lot of financial professionals,” AJok Ajmera, vice-president, professional services with Mississauga, Ont.-based Prophix Software, says. “T here’s a very little learning curve, you can effectively do whatever you want with the data, and it works fairly well in smaller organiza tions.”
Periodic and complex reporting in processes like revenue management or cost management, however, is where the spreadsheet model really starts to break down.

Winner: ABBYY FineReader for Mac looks the best to me. Acrobat 8 on the Mac is pretty terrible (in this example anyways).
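I judged accuracy by eye, but if you want a number for your own tests, one option is to score each OCR result against a hand-typed copy of the original text. The Python sketch below uses the standard library’s `difflib` for that. The reference sentence is my assumption of what the printed article says, the `similarity` helper is an illustrative name, and the score is a rough similarity ratio rather than a formal OCR error rate.

```python
from difflib import SequenceMatcher


def similarity(reference: str, ocr_text: str) -> float:
    """Return a similarity ratio from 0.0 to 1.0 (1.0 means identical text)."""
    return SequenceMatcher(None, reference, ocr_text, autojunk=False).ratio()


# Hand-typed reference (assumed) and the matching fragment from three results.
reference = "The appeal of the spreadsheet as the quickest way to get a report out"
samples = {
    "ABBYY Windows": "The appeal of the spreadsheet as the quickest way to get a report out",
    "Acrobat 9 Windows": "The appeal of tlle spreadsheet as the quickest way to get a report out",
    "ABBYY Mac": "The appeal of die spreadsheet as the quickest way to get a report out",
}

for name, text in samples.items():
    print(f"{name}: {similarity(reference, text):.3f}")
```

A single sentence is far too small a sample to rank the packages, so run it over the full page text if you want a meaningful comparison.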

Conclusion

Is there a “best” choice? It seems that in this example anyways, Adobe Acrobat 9 with ClearScan turned on gives fast results with good OCR while dramatically reducing the file size.

If you don’t really care about speed so much, FineReader produces good OCR results and for ScanSnap users, has the additional benefit of being integrated with ScanSnap Manager.

As with most things, the best software is the one that works the best for you. Have you found similar results? Any other tests of your own to share? Leave a note in the comments.

(Photo by Polina Sergeeva)

Comments ( 15 )

Use Adobe Acrobat To Create A Digital Signature Stamp

Once you’ve started down the road to going paperless, one of the last reasons you find yourself printing is that you need to sign something.

You’ve probably been there: you receive a PDF form that you need to sign, you then print it out, sign it, scan it back in, and email it back. Annoying.

Ernie the Attorney over at PDF For Lawyers has a very thorough tutorial for creating a digital signature stamp. He does it with a ScanSnap S1500M, Adobe Acrobat, and an image manipulator.

You can, of course, do this with any scanner, and if you are on the Mac you can use PDFpen as well.

Check out his blog post for a really helpful video that takes you through the entire process.

Do you sign documents with a digital signature stamp? If you have any tips and tricks, leave them in the comments.

(Photo by Dominic’s pics)

Comments ( 1 )

Use Adobe Acrobat To Add Pages To An Existing Document

Sometimes rather than creating new PDF files every time, you want to scan to an existing document.

Over on the ScanSnap Community, they’ve posted a helpful video showing how to use Adobe Acrobat (which comes with the ScanSnap S1500) to scan to Acrobat and then add the pages to an existing document.

The video is below, but head on over to the Community for some restrictions and things to keep in mind.

If nothing else, watch the video for the swingin’ music.

Comments ( 1 )

PDF Files Can Be Dangerous

A large percentage of posts here on DocumentSnap relate to creating or processing PDF files in some way. However, there’s a dark side to PDFs or, more specifically, to using outdated PDF reading software on your computer.

The Lawyerist blog has a writeup on the topic, and points to a scary ZDNet article that reports that a huge percentage of exploits in 2009 were related to malicious PDF files.

Make sure your internet browser, and PDF reader (Adobe Acrobat or Adobe Reader) are updated to the most current version to limit attacks from malicious PDF files. A recent report indicates that more and more hackers are exploiting security issues in PDF readers.

Aside from the predictable Mac vs. Windows vs. Linux wars in the ZDNet comments, both are a good read.

Make sure your Adobe Acrobat Reader is always updated!

Comments ( 1 )

Updated: Acrobat AppleScript for ScanSnap OCR

As many of you know, in 2008 I posted an AppleScript that will use Adobe Acrobat to make PDFs searchable using Acrobat’s OCR capabilities.

In the comments to that post, user nodis pointed out that adding 2 words to one of the lines can make the PDFs quite a bit smaller.

In my testing, I ran a 1.3 MB PDF through the script. Before nodis’ change, the resulting PDF was 1.7 MB. After the change, it was 424 KB!

Here is the updated script:

OCRIt-Acrobat – Droplet to batch OCR PDFs in Adobe Acrobat

To use it:

  • Download and uncompress the file and save it to your Desktop, Dock or wherever
  • Drag one or more PDFs onto the icon
  • Enjoy

Let me know how it works out for you and if you see similar reductions in file size.

Update: If you use Acrobat X, please see this post about OCR AppleScript for Acrobat X.

Comments ( 17 )

Cool Paperless Setup Video

As much of a paperless geek as I am, I normally wouldn’t sit and watch a video of someone scanning and shredding paper.

However, I just wanted to point you to this YouTube video by user allenday. His really cool setup includes a ScanSnap S300M, Adobe Acrobat, a Mac Mini, a wall-mounted Sharp Aquos, and a Royal PX1000MX for shredding, and he uploads everything to Evernote.

To do the OCRing, he uses the Acrobat OCR AppleScript Droplet that I hacked/posted about earlier.


Very cool setup, thanks for sharing, allenday! Do any of you have a cool paperless setup? Feel free to share pics or videos in the comments.

Comments ( 4 )

How To Create Searchable PDFs With The ScanSnap S300M

So you read all this great stuff about how the Fujitsu ScanSnap is awesome and creates searchable PDFs, and you’re on a Mac and want a portable scanner, so you drop the cash on a ScanSnap S300M.

Then you get it home and find out – wait a minute – the S300M doesn’t come with OCR software! If you’ve been there (and I have), hopefully this post will help you out, as I get a lot of questions about this.

Mail-In Rebate

Your local Fujitsu website may provide a mail-in rebate for OCR software if you purchase the S300M. At the time of this writing, the US Fujitsu website has a mail-in rebate for a free copy of ReadIris OCR software.

The rebate is at http://www.fujitsu.com/us/services/computing/peripherals/scanners/rebates.html . Check if your country has something similar.

Acrobat

While the S300M doesn’t come with Adobe Acrobat, if you have a copy of it lying around, or have access to it, you can use the ScanSnap with it. Here is an example of how I use the S300M with Acrobat 8.

Evernote

Evernote Premium allows users to upload PDFs and they will be automatically OCR’ed and made searchable.

DEVONthink

If you use a program like DEVONthink Pro Office to manage your documents, it will make them searchable.

NeatWorks

NeatWorks is software that is bundled with the NeatDesk scanner, but it can be purchased on its own. See this post for how to use NeatWorks with the Fujitsu ScanSnap.

These are some ideas for how to make searchable PDFs with the ScanSnap S300M. Do you have any others? Leave a message in the comments.

Comments ( 3 )