Hazel Rule To OCR Documents Using PDFPen

The other day I posted an AppleScript to OCR documents using PDFpen.

In the comments, awesome DocumentSnap reader Josh requested that it be done as a Hazel rule instead. Given that my love for Hazel is well documented, I am happy to oblige.

I created a folder and then created the following Hazel rule to run against it:

  • Extension is PDF
  • Date Last Modified is after Date Last Matched (to stop it from trying to re-OCR documents)

Then I asked it to run the following AppleScript:

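tell application "PDFpen"
    -- theFile is supplied by Hazel for each matched file
    open theFile as alias
    tell document 1
        ocr
        -- poll until PDFpen reports that OCR has finished
        repeat while performing ocr
            delay 1
        end repeat
        delay 1
        close with saving
    end tell
end tell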

Of course, if you are using PDFpenPro, change "PDFpen" in the first line to "PDFpenPro".
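If you would rather keep the script in an external .scpt file than embed it in the rule, Hazel needs the code wrapped in a handler that it can call with the matched file. Below is a minimal sketch of the same script in that form. It assumes the hazelProcessFile handler described in the Hazel help; newer Hazel versions may pass a second parameter (inputAttributes) to this handler, so check the help for your version before relying on it.

```applescript
-- Sketch for an external script file; an embedded script does not need the handler.
-- Assumption: Hazel calls hazelProcessFile with the file that matched the rule.
-- Newer Hazel versions may expect (theFile, inputAttributes) instead.
on hazelProcessFile(theFile)
	tell application "PDFpen"
		open theFile as alias
		tell document 1
			ocr
			-- poll until PDFpen reports that OCR has finished
			repeat while performing ocr
				delay 1
			end repeat
			delay 1
			close with saving
		end tell
	end tell
end hazelProcessFile
```

Without a handler like this, an external script has nothing for Hazel to call, and the rule will match the file but the script will fail.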

Here’s a screenshot (unfortunately the bottom of the script is cut off):

[Screenshot: Hazel PDFPen Rule]

Hope this helps you Hazel and PDFPen fans out there. Enjoy.

About the Author

Brooks Duncan helps individuals and small businesses go paperless. He's been an accountant, a software developer, a manager in a very large corporation, and has run DocumentSnap since 2008. You can find Brooks on Twitter at @documentsnap or @brooksduncan. Thanks for stopping by.

29 comments

Perry Karipidis - June 26, 2018

Thanks for this Eric. I copied and pasted your script but when I compile the script, I get an error “Expected end of line but found identifier” and the word “ocr” is highlighted at the end of the line that reads “repeat while performing ocr”.

Is something missing from the script?

Thanks in advance.

Danny - June 8, 2016

I have a problem with PDFPenPro not playing nice with my scanned documents: unless I SAVE first, the OCR image does not line up with the actual text image. I spoke with PDFPen and they said it is a bug they will work on.

But for using AppleScript and Hazel, the solution should just be to get the script to run a SAVE command, then go forward with the same script as posted.

I do not code, or know AppleScript specifically... Can someone tell me what that would look like?

Robin - August 25, 2015

Hello,
I’m trying to use Hazel as suggested, but I always get an error in the embedded AppleScript on the words ‘performing ocr’ (they are always highlighted in yellow when I press the hammer button).
I have no knowledge of AppleScript…
Do I have to compile the script? What’s the use of the little hammer button?

    Brooks Duncan - August 27, 2015

    Dumb question, but do you have PDFpen installed? If so which version?

Dan - June 5, 2015

I use the “Searchable PDF Converter” that came with my Fujitsu ix500 scanner. How do I modify the script to use this software to OCR my PDF files?

Thanks

Dan

Thomas Gough - December 23, 2014

We were having trouble getting it to work. Our solution was to create a droplet and have a shell script launch the droplet.

droplet (name must match whatever is used in the shell script below…):
on open theFile
    tell application "PDFpenPro"
        open theFile as alias
        tell document 1
            ocr
            -- wait until PDFpenPro reports that OCR has finished
            repeat while performing ocr
                delay 1
            end repeat
            delay 1
            close with saving
        end tell
        quit
    end tell
end open

shell script:
#!/bin/bash
open -a ~/Library/Application\ Scripts/com.smileonmymac.PDFpenPro/PDF\ Pen\ Automator.app "$@"

Klaus Gruber - March 3, 2014

Hi,

Has anybody got a fix for the already-scanned documents issue?
I use the tag “ocrd” for already OCR’d documents, but I ran it on my Documents folder and documents were scanned even if they were perfectly printed with OCR info.

cheers.kg

    Brooks Duncan - March 3, 2014

    So what is it that you are wanting Klaus? You want the Hazel rule to ignore the document if a document is tagged with “ocrd”?

      Klaus Gruber - March 3, 2014

      Hi,

      I use Hazel in combination with Katie's rule: http://katiefloyd.me/blog/automatically-ocr-documents-with-hazel-and-pdfpen

      The thing I'd like to add is an option so that, if the document is already OCR'd (because the downloaded document was already OCR'd by the author), the OCR script is skipped. I thought about a condition like:
      contents of the document contain “a”, because in every document there is the letter “a” or “e”.

      cheers.kg

John O. - March 27, 2013

Wow, thanks a lot. This really helped me.

Adam - March 20, 2013

This works great with the new PDFpenPro 6. Thanks very much! I don't know if this is new in 6, but every document comes up with a language selection dialog (english, spanish, etc.). This doesn't seem to impede the progress of the OCR. I think.

Gareth - January 31, 2013

This used to work perfectly, but now I've just tried it and it saves the document with no extension. I've not changed anything in the rules. Can anyone help explain what's changed?
Thanks

Victoria - December 6, 2012

I took the script and tried to use it with Hazel, unsuccessfully. I ran a test where the docs from the folder were to be OCR'd and then pumped out to another folder. It didn't work, so I changed the test to color-label the item if OCR'd instead. That still didn't work. The message I get is:

2012-12-06 17:29:27.035 hazelworker[76951] Screen Shot 2012-12-06 at 3.22.11 PM.pdf: Rule OCR documents matched.
2012-12-06 17:29:27.057 hazelworker[76951] [Error] AppleScript failed: Error executing AppleScript /Users/VictoriaMacPro2012/Library/Scripts/Applications/OCR scripts/Hazel and PDFPen rule.scpt on file /Users/VictoriaMacPro2012/Documents/ITEMS TO OCR/Screen Shot 2012-12-06 at 3.22.11 PM.pdf.
2012-12-06 17:29:27.057 hazelworker[76951] AppleScript error: {
OSAScriptErrorNumberKey = "-1708";
}

    Brooks Duncan - December 6, 2012

    It looks like you're executing an external script. I'd try embedding it in the rule, or if you don't want to do that for whatever reason, check the Hazel help on AppleScript. You need to have a special handler in your script that you may not have.

Mike - September 1, 2012

Riley above makes an excellent point.
Also, not sure if it's related or not, but I find that the measure taken to prevent repeat scans of docs already scanned isn’t working for me.

Any suggestions?

Riley - June 19, 2012

I want to OCR downloaded PDFs using PDFpen, automated with Hazel 3.0. I noted that the posted AppleScript does not handle the need to hold down the Option-Command keys when selecting Edit so that the OCR menu is available. I do not know how to write AppleScripts; what needs to be added to the posted script?
Thanks,
RRW

mals11 - December 23, 2011

Hi, I tried this with PdfPen Pro and Hazel. Unfortunately something appears to have failed.

@Hayle – did you use the "PdfPen Pro with Hazel" method?

    Brooks Duncan - December 23, 2011

    What's happening? What message are you getting? Lion, Snow Leopard, …?

      mals11 - December 25, 2011

      I will describe what I have done as soon as I can get to it. Thanks for offering to help again!
      Hoping to get an easy enough system which does OCR in the background without taking control of the screen.
      The Adobe Acrobat method I used from a post (on this site itself) takes control of the screen…

Hayle - December 21, 2011

That hits the target perfectly. Thanks!

sims - February 17, 2011

Fair enough. I guess I will either accept the size increase or simply do without OCR.
Been managing alright without OCR.

Thank you!

sims - February 17, 2011

Hi BrooksD, thanks for running such a useful site on a topic where little information exists.
i successfully used the script that you have linked to – using folder action as advised on that page.
came around to it today.

one issue – the resultant file size is still about 50%+ higher than the original. does adding text to a pdf really have to increase the size so much? or are there more efficient (as measured by file size) approaches to OCR?

    Brooks Duncan - February 17, 2011

    Hi sims, OCR will typically add some size to PDFs. I know that some people run their PDF through Acrobat after OCRing to reduce the size. There is an Optimize PDF command or something. I also know that PDFPen has a Resample Image command under the Edit menu that will reduce the size, but I haven't played around with that. If size is a big concern, maybe you could play around with it and then find out from Smile whether there is some way to work that into an AppleScript (I'm not sure).

sims - October 1, 2010

thank you. that explains it.
i will update my pdfpenpro and report.

sims - October 1, 2010

hey thanks for responding so quickly!
i tried the new script.
that does not seem to work either

here is an image to Hazel's message. 🙁 http://screencast.com/t/NDAyMGRhYjI

    Brooks Duncan - October 1, 2010

    Hmm, that link that I posted earlier that I stole the second script from has this in the blog post:

    "Updated July 1, 2009 – I’ve updated the script to require PDFpen 4.1.4, which has some new OCR functionality in the AppleScript dictionary. No more GUI scripting!"

    So, from that, I have a feeling that your old-ish version of PDFPenPro (4.0.4) doesn't have the new OCR functionality. That probably explains what is going on.

Brooks Duncan - October 1, 2010

Also, with respect to the window being in front: with the Acrobat script it would definitely not be possible, because it is GUI scripting. With the PDFPen script, it might be possible. I am not enough of an AppleScript guru to know off the top of my head, but if I get some time I can try to dig into it.

Brooks Duncan - October 1, 2010

Hm, I am not sure. I found another script out there (http://carpeaqua.com/2009/01/08/automatic-ocr-conversion-with-pdfpen-and-folder-actions/) and ripped out the part that should apply to Hazel. I haven't tried this myself, but maybe give it a shot embedded?

tell application "PDFpenPro"
    open theFile
    set theDoc to document 1
    ocr theDoc
    repeat until performing ocr of theDoc is false
        delay 1
    end repeat
    save theDoc
    close theDoc
end tell

    Eric - March 29, 2015

    This concept does work, but the syntax in the post was incorrect.

    Try this:

    tell application "PDFpen"
        -- wait a little for PDFpen to stabilize
        delay 3
        -- open the file
        open theFile as alias
        -- set a reference to the file in case there are other PDFpen processes running.
        set theDoc to document 1
        -- Process theDoc
        tell theDoc
            ocr
            repeat while performing ocr
                delay 1
            end repeat
            delay 1
            close with saving
        end tell
    end tell
