Tag Archives: ocr

OCR Comparison By Marco Arment

If there were some way to find out what the most-used non-Apple app on my iPhone and iPad is, it would almost certainly be Instapaper by Marco Arment. I am either saving or reading articles in there every single day.

Between my Instapaper use and listening to the Build and Analyze podcast, I spend a possibly disturbing amount of time each week either listening to Marco’s voice or interacting with his work.

It was for this reason that I was thrilled when he linked to my PDFPen OCR AppleScript post on his blog, Marco.org.

Besides the DocumentSnap link, his post Mac software to add searchable text to scanned PDFs is a great rundown of the different options for performing Optical Character Recognition on the Mac.

As part of my workflow, which isn’t very interesting, I’d like OCR software to recognize the text in scanned documents and embed it under the page images in their PDF files. With the text embedded, I can search the documents with Spotlight and attempt to organize them more easily.

If you are looking for OCR software for your Mac, it is worth taking a look at Marco’s post for his conclusion.

Myself, I am going to take a closer look at PDF OCR X for a future post. DocumentSnap reader Drew has had good results as he reported in this forum thread.

Do you have any OCR software to add? Let us know in the comments.

(Photo by Tom T)


How To Find PDFs That Are Not Searchable

Sometimes, especially when you are doing a big OCR project, you might want to find all the PDFs that are not searchable. That is to say, you want to find the PDFs that have not been OCR-ed.

It turns out that this is not as easy as you might think. Here are a few ways to “sort of” do it. As much as possible I wanted to limit this to search capabilities built into the operating system, or to applications that you might already have.

Mac OS X Spotlight

It occurred to me that, chances are, almost any PDF that has been made searchable will have at least one space in it. So, why not use Spotlight to find all PDFs that don’t have a space? Fire up Spotlight by pressing Command-Space and type the following:

kind:pdf NOT intext:" "

Is this a perfect test? No, but hopefully it will get you most of the way there.

Microsoft Windows

I had hoped to do the same thing with Windows Search in Windows 7, but it didn’t work. It doesn’t seem to let you search for just a space. The closest I could come was to search for the word “the”. Obviously this is English only, so in your language hopefully there is an equivalent word that is in almost every document.

Start up Windows Search by pressing Windows Key-F and type the following:

ext:pdf NOT contents:the

That is not as likely to succeed as just searching for a space, but should get you most of the way there.

Adobe Acrobat

Adobe Acrobat has some features that may help. You can use Acrobat Pro’s Preflight feature, or even do a Batch Process Accessibility Report.

None of these searches are 100% guaranteed to succeed, but hopefully they will help you down the path. Thanks to DocumentSnap reader Matt for the idea for this post.
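If you are comfortable running a script, here is a minimal Python sketch of one more rough check. It uses a different heuristic from the searches above: a PDF with a text layer normally references a font, so a file whose raw bytes never mention /Font is probably image-only. The function names and the folder argument are made up for illustration, and PDFs that store their objects in compressed streams can hide the font entry, so treat the result as a list of candidates and not a verdict.

```python
from pathlib import Path


def probably_has_text_layer(pdf_bytes: bytes) -> bool:
    """Crude check: text needs a font, so look for a /Font entry in the raw bytes."""
    return b"/Font" in pdf_bytes


def find_unsearchable(folder: str) -> list[Path]:
    """Return the PDFs under `folder` whose raw bytes never mention /Font."""
    return sorted(
        path
        for path in Path(folder).rglob("*.pdf")  # on case-sensitive file systems this skips ".PDF"
        if not probably_has_text_layer(path.read_bytes())
    )


if __name__ == "__main__":
    # "." is a placeholder; point it at your own scan folder.
    for pdf in find_unsearchable("."):
        print(pdf)
```

Like the searches above, this is not a perfect test, but it does not depend on the language of the document.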

Do you have any tricks for finding non-OCR’ed PDFs? Share in the comments.

(Photo by Dirigentens)


Free Online OCR With RICOH Innovations

Most scanners these days come with an Optical Character Recognition, or OCR, program of some sort to make PDFs searchable. However, what if you don’t have an OCR program or you just want to do a quick and dirty file conversion without messing around with an application?

A few DocumentSnap commenters have pointed out that RICOH Innovations has created a number of Beta applications, one of which is an online Document Conversion tool.

As the site says:

The document conversion widget provides free OCR to convert your images into editable and searchable pdf, MsWord, HTML and text documents, providing capabilities such as pdf to doc conversion.

I thought I would put the tool to the test using the same parameters as in my ABBYY Finereader vs. Adobe Acrobat OCR comparison.

  • Speed: Once I uploaded the file and hit Convert, it took 27.5 seconds to complete the PDF conversion
  • File Size: The original was 1.5 MB, the converted copy was 160 KB
  • Accuracy: Here is a screenshot from the original:

Article text

Here is the OCR’ed version:

The spreadsheet has become the virtual “slide rule” for CMAs. It’s used for everything from preliminary strategic plans to financial statements. As with any familiar method, it finds its way into numerous situations where better alternatives are available, most significantly in its widespread use as a de facto reporting tool.
The appeal of die spreadsheet as the quickest way to get a report out is not hard to appreciate. “Excel is probably the most comfortable environment for a lot of financial professionals,” Alok Ajmera, vice-president, professional services with Mississauga, Ont.-based Prophix Software, says. “There’s a very little learning curve, you can effectively do whatever you want with the data, and it works fairly well in smaller organizations.”
Periodic and complex reporting in processes like revenue management or cost management, however, is where the spreadsheet model really starts to break down.

Pretty good, I’d say

  • Quality: Here is where it gets dicey. Unlike the other tools reviewed, RICOH’s online tool doesn’t put a text layer behind the image; it actually converts the image to text. The results are pretty good actually, but it is not for you if you want to keep your original PDF’s exact look. Here is a screenshot:

Converted PDF

Given that this is a free tool in beta, it is not surprising that there are some limits. The maximum file size is 20 MB and you can only request 20 conversions per hour.

You can choose to download the files immediately or have them emailed to you when they are ready.

Also, you may or may not want to use this tool to OCR sensitive documents. As they say, “In short, we will use submitted data to improve the service.”

Privacy issues aside, this looks to be a tool that can come in handy when you really need it and is worth playing around with.

If you know of other good online OCR products, leave a note in the comments.


OCR Smackdown: ABBYY FineReader vs. Adobe Acrobat

A very common request that I get here at DocumentSnap is to compare the Optical Character Recognition (OCR) capabilities of ABBYY FineReader with Adobe Acrobat. Why? Well, for starters, both of them come included with models of the Fujitsu ScanSnap as well as other scanners.

I decided to do a quick test comparing the OCR of the two packages using the following criteria:

  • OCR Speed
  • Resulting File Size
  • Accuracy

The Hardware

For a scanner I used my ScanSnap S1300.

I used two computers for the test:

  • Windows: A new cheap Acer laptop with a Core i3 2.40 GHz processor and 4 GB RAM running Windows 7
  • Mac: An old 2.5 GHz Intel Core 2 Duo MacBook Pro with 4 GB RAM running Mac OS X Snow Leopard

The Software

Here are the packages I used:

  • Windows: ABBYY FineReader For ScanSnap 4.1 (called from ScanSnap Manager) vs. Adobe Acrobat 9 Pro
  • Mac: ABBYY FineReader For ScanSnap 4.1 (run standalone) vs. Adobe Acrobat 8 Pro

Yes, I realize that Adobe Acrobat X is out, but since I am not aware of any scanners that come bundled with it yet, I decided to stick with the versions that ship with the ScanSnap. I’ll cover Acrobat X in a later post.

The Document

I scanned a magazine article for this test. It probably would have been better to do this with a bunch of different documents to compare, but hey.

In all cases except one, I scanned without OCR so that I could run it standalone later. Here’s some info on the document that I used:

  • Pages: 2
  • Scan Quality: 300dpi, Color
  • Resulting File Size: 1.5 MB
  • Columns: 2, with some images

Maybe I am blind, but I couldn’t figure out a way to run ABBYY FineReader for ScanSnap on Windows standalone. If you know how, please leave a message in the comments. In that test, I re-scanned with “Create Searchable PDF” checked in the ScanSnap Manager settings.

The Settings

I tried not to do too many fancy settings to keep things as “real-life” as possible. There were essentially three configurations:

ABBYY FineReader

ABBYY FineReader OCR Settings

I set Save Mode to “Text under page image” and Quality to High. These were the settings for the Mac ABBYY, and I believe it is what ScanSnap Manager on Windows uses as well.

Adobe Acrobat (Normal)

Adobe Acrobat OCR Settings

I set the output style to “Searchable Image (Exact)” because leaving it just as Searchable Image in my experience has caused some weird things to happen with the resulting PDF. I used these settings on both Windows and Mac.

Adobe Acrobat (With ClearScan)

Adobe Acrobat ClearScan

In Acrobat 9 there is a setting called ClearScan. I used that as an additional test to see what the difference is.

Speed

Windows

  • ABBYY Windows: 20.5 seconds
  • Acrobat 9: 13.9 seconds
  • Acrobat 9 With ClearScan: 17.6 seconds

Mac

  • ABBYY Mac: 44.7 seconds
  • Acrobat 8: 20.2 seconds

Winner: Acrobat!

Since they are different machines, you can’t directly compare the Windows and Mac times, but clearly in both cases Acrobat is faster.

File Size

The non-OCR’ed PDF was 1.5 MB.

Windows

  • ABBYY Windows: 1.7 MB (+.2 MB)
  • Acrobat 9: 1.5 MB (same)
  • Acrobat 9 With ClearScan: 315 KB (-1.16 MB)

Mac

  • ABBYY Mac: 1.4 MB (-.1 MB)
  • Acrobat 8: 1.5 MB (same)

Winner: Acrobat 9 with ClearScan!

With an astonishing 1.16 MB reduction in file size after OCR, Acrobat 9 with ClearScan is the winner. Wow.
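To put those sizes in relative terms, here is a quick Python sketch that turns the figures above into percentage changes. It assumes 1 MB = 1000 KB, and the sizes listed above are rounded, so treat the output as approximate. The variable and function names are only for illustration.

```python
# Sizes from the results above, in KB (assumption: 1 MB = 1000 KB).
ORIGINAL_KB = 1500

results_kb = {
    "ABBYY Windows": 1700,
    "Acrobat 9": 1500,
    "Acrobat 9 With ClearScan": 315,
    "ABBYY Mac": 1400,
    "Acrobat 8": 1500,
}


def percent_change(original_kb: float, new_kb: float) -> float:
    """Return the size change as a percentage of the original size."""
    return (new_kb - original_kb) / original_kb * 100


for name, size_kb in results_kb.items():
    print(f"{name}: {percent_change(ORIGINAL_KB, size_kb):+.1f}%")
```

On these rounded numbers, the ClearScan file comes out roughly 79% smaller than the original scan.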

Accuracy

Here is a passage from the article:

Article Text Before OCR

Let’s see how each of the packages did:

ABBYY Windows

The spreadsheet has become the virtual “slide rule” for CMAs. It’s used for everything from preliminary strategic plans to financial statements. As with any familiar method, it finds its way into numerous situations where better alternatives are available, mostsignificantly in itswidespread use as a de facto reporting tool.
The appeal of the spreadsheet as the quickest way to get a report out is not hard to appreciate. “Excel is probably the most comfortable environment for a lot of financial professionals,” Alok Ajmera, vice-president, professional services withMississauga, Ont.-basedProphixSoftware, says. “There’s a very little learning curve, you can effectively do whatever you want with the data, and it works fairly well in smaller organizations.”
Periodic and complex reporting in processes like revenue management or cost management, however, is where the spreadsheet model really starts to break down.

Acrobat 9 Windows

T he spreadsheet has become the virtual “slide rule” for CMAs. It’s used for everything from preliminary su·ategic plans to financial statements. As with any farniliar method, it finds its way into numerous situations where better alternatives are available, most significantly in its widespread use as a de facto reporting tool.
The appeal of tlle spreadsheet as the quickest way to get a report out is not hard to appreciate. “Excel is probably tlle most comfortable environment for a lot of financial professionals,” AJok Ajmera, vice-president, professional services with Mississauga, Ont.-based Prophix Software, says. “There’s a very little learning curve, you can effectively do whatever you want witll tlle data, and it works fairly well in smaller organizations.”
Periodic and complex reporting in processes like revenue management or cost management, however, is where the spreadsheet model really starts to break down.

Acrobat 9 With ClearScan

The spreadsheet has become the virtual “slide rule” for CMAs. It’s used for everything from preliminary su·ategic plans to financial statements. As with any farniliar method, it finds its way into numerous situations where better alternatives are available, most significantly in its widespread use as a de facto reporting tool.
The appeal of tlle spreadsheet as the quickest way to get a report out is not hard to appreciate. “Excel is probably tlle most comfortable environment for a lot of financial professionals,” AJok Ajmera, vice-president, professional services with Mississauga, Ont.-based Prophix Software, says. “There’s a very little learning curve, you can effectively do whatever you want witll tlle data, and it works fairly well in smaller organizations.”
Periodic and complex reporting in processes like revenue management or cost management, however, is where the spreadsheet model really starts to break down.

ABBYY Mac

The spreadsheet has become the virtual “slide rule” for CiMAs. It’s used for everything from preliminary strategic plans to financial statements. As with any familiar method, it finds its way into numerous situations where better alternatives are available, most significantly in its widespread use as a de facto reporting tool.
The appeal of die spreadsheet as the quickest way to get a report out is not hard to appreciate. “Excel is probably the most comfortable environment for a lot of financial professionals,” Alok Ajmera, vice-president, professional sendees with Mississauga, Ont.-based Prophix Software, says. “There’s a very little learning curve, you can effectively do whatever you want with the data, and it works fairly well in smaller organizations.”
Periodic and complex reporting in processes like revenue management or cost management, however, is where the spreadsheet model really starts to break down.

Acrobat 8 Mac

T he spreadsheet has become the virtual “slide rule” for CMAs. It’s used for everything frorn preliminary strategic plans to financial statements. Aswith any familiar method, it finds its way into numerous situations where better alterna tives are available, most significantly in its widespread use as a de facto reporting tool.
T he appeal of the spreadsheet as the quickest
way to get a report out is not hard to appreciate.
“Excel is probably the most comfortable
environment for a lot of financial professionals,” avaJlaun:.:,JIIU:::’l;)It;IIIULauuy1111l::>WIUC::>PU:C1U uocd::>
a de facto reporting tool. T he appeal of the spreadsheet as the quickest
way to get a report out is not hard to appreciate. “Excel is probably me most comfortable environment for a lot of financial professionals,” AJok Ajmera, vice-president, professional services with Mississauga, Ont.-based Prophix Software, says. “T here’s a very little learning curve, you can effectively do whatever you want with the data, and it works fairly well in smaller organiza tions.”
Periodic and complex reporting in processes like revenue management or cost management, however, is where the spreadsheet model really starts to break down.

Winner: ABBYY FineReader for Mac looks the best to me. Acrobat 8 on the Mac is pretty terrible (in this example anyways).
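Eyeballing passages works for a quick test, but if you want a rough number, here is a small Python sketch that counts how many words of a reference sentence each OCR result got wrong. The reference sentence is an assumption: I typed it by hand based on what the printed article appears to say. The three samples are copied from the results above, and the function name is made up for this example.

```python
from difflib import SequenceMatcher


def word_errors(reference: str, ocr_output: str) -> int:
    """Count the reference words that the OCR output changed or dropped."""
    ref_words = reference.split()
    out_words = ocr_output.split()
    matcher = SequenceMatcher(a=ref_words, b=out_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return len(ref_words) - matched


# Assumed ground truth, typed by hand.
reference = "The appeal of the spreadsheet as the quickest way to get a report out"

# Snippets copied from the OCR results above.
samples = {
    "ABBYY Windows": "The appeal of the spreadsheet as the quickest way to get a report out",
    "Acrobat 9 Windows": "The appeal of tlle spreadsheet as the quickest way to get a report out",
    "ABBYY Mac": "The appeal of die spreadsheet as the quickest way to get a report out",
}

for name, text in samples.items():
    print(f"{name}: {word_errors(reference, text)} word error(s)")
```

One sentence is far too small a sample to crown a winner, but the same function works on whole pages if you have a trusted copy of the text to compare against.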

Conclusion

Is there a “best” choice? It seems that in this example anyways, Adobe Acrobat 9 with ClearScan turned on gives fast results with good OCR while dramatically reducing the file size.

If you don’t really care about speed so much, FineReader produces good OCR results and for ScanSnap users, has the additional benefit of being integrated with ScanSnap Manager.

As with most things, the best software is the one that works the best for you. Have you found similar results? Any other tests of your own to share? Leave a note in the comments.

(Photo by Polina Sergeeva)


OCR And Orphan Works

As I have written about before, I always find it fascinating to read about different scanning projects, especially when it comes to scanning old stuff.

Over at the GalleyCat blog, Jason Boog writes about using Optical Character Recognition software to dig through orphan works.

What the heck are “orphan works”? I didn’t know either. According to Wikipedia:

An orphan work is a copyrighted work for which the copyright owner cannot be identified and contacted.

Here’s the project that the GalleyCat editor was working on:

While researching an essay about New York City poets and the Great Depression last year, this GalleyCat editor read through hundreds of pages from 1930s novels, periodicals, and self-published materials that couldn’t leave the New York Public Library.

He used his digital camera to take pictures and then ABBYY FineReader Express to OCR the text.

The results were impressive. Check out the GalleyCat post to see more.

(Photo by p0psicle)


Hazel Rule To OCR Documents Using PDFPen

The other day I posted an Applescript to OCR documents using PDFPen.

In the comments, awesome DocumentSnap reader Josh requested that it be done as a Hazel rule instead. Given that my love for Hazel is well documented, I am happy to oblige.

I created a folder and then created the following Hazel rule to run against it:

  • Extension is PDF
  • Date Last Modified is after Date Last Matched (to stop it from trying to re-OCR documents)

Then I asked it to run the following Applescript:

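tell application "PDFpen"
    -- theFile is the file that matched the rule; Hazel supplies it to the script
    open theFile as alias
    tell document 1
        ocr
        -- wait until PDFpen reports that OCR has finished
        repeat while performing ocr
            delay 1
        end repeat
        delay 1
        close with saving
    end tell
end tell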

Of course, if you are using PDFpenPro, change “PDFpen” in the first line to “PDFpenPro”.

Here’s a screenshot (unfortunately the bottom of the script is cut off):

Hazel PDFPen Rule

Hope this helps out you Hazel and PDFPen fans out there. Enjoy.


PDFPen OCR Applescript To Automatically Make PDFs Searchable

I don’t know if it is because I have been glued to a computer since I was six years old, but my handwriting and printing are terrible. Really terrible. I think my 5-year-old son and I have pretty similar handwriting skills.

Normally this is not a problem, except when I have to fill out a form. It’s a little embarrassing filling out some official form with my chicken scratch, which is one of the many reasons why I love PDFPen. Among many other things, it lets you fill out and edit any PDF document on your computer and then print it out.

However, that ability is not what this post is about. PDFPen will also OCR PDFs to make them searchable, and I wanted a way to OCR a bunch of documents automatically with an Applescript, similar to what has been done with Adobe Acrobat and with ABBYY FineReader.

I found two scripts out there. One from David Sparks at MacSparky, which some users reported problems with in newer PDFPen versions, and one from Michael Tsai at C-Command Software which will OCR a document with PDFPen and send it to EagleFiler.

Since both of these scripts were almost what I wanted, I decided to stand on the shoulders of giants and merge them together into this Applescript.

Here is the script:
-- Downloaded From: http://www.documentsnap.com
-- Last Modified: 2010-09-28
-- Includes code from MacSparky http://www.macsparky.com/blog/2009/5/24/pdfpen-ocr-folder-action-script.html
-- Includes code from C-Command Software http://c-command.com/scripts/eaglefiler/ocr-with-pdfpen

-- Folder action: runs whenever files are added to the attached folder
on adding folder items to this_folder after receiving added_items
    try
        repeat with added_item in added_items
            my ocr(added_item)
        end repeat
    on error errText
        display dialog "Error: " & errText
    end try
end adding folder items to


-- Open one file in PDFpen, OCR it, and save it in place
on ocr(added_item)
    tell application "PDFpen"
        open added_item as alias
        tell document 1
            ocr
            -- wait until PDFpen reports that OCR has finished
            repeat while performing ocr
                delay 1
            end repeat
            delay 1
            close with saving
        end tell
    end tell
end ocr

PDFpen Users: Download The Text Script Here (Right-click and Save-As)
PDFpen Pro Users: Download The Text Script Here (Right-click and Save-As)

To implement, follow MacSparky’s excellent instructions.

I hope this is of use to someone, and thanks to David and Michael for their excellent Applescripts.


ScanSnap and Hazel Is A Match Made In Paperless Heaven

Highlighter

There are a lot of tricks out there for keeping your documents organized based on their location or filename, but the holy grail is to be able to keep them organized based on the actual contents of the documents themselves.

I have written before about how the Fujitsu ScanSnap S1500, the S1500M and the S1300 allow you to use a highlighter pen to automatically assign keywords to a PDF.

However, once you have those keywords assigned, how does that help you?

If you’re on Windows, you can use the “Distribute By Keyword” feature of the included ScanSnap Organizer to move the files to a cabinet, but Mac users are out of luck there.

I humbly submit that using a highlighter, OCR, and the awesomeness that is Hazel, Mac users can one-up even the mighty ScanSnap Organizer.

What Is Hazel?

As my clients have been learning lately, I have been engaged in a torrid love affair with a Mac application known as Hazel from Noodlesoft. At a very high level, it lets you create rules to automatically keep your files organized.

I have written about how you can use Hazel with Evernote, and David Sparks at Macsparky has a great guide for moving PDFs based on filename.

I wanted to do something that would marry the searchable goodness of the ScanSnap with the ninja skills of Hazel.

Set Up The ScanSnap For Keyword Highlighting

The first thing you’ll need to do is set up a ScanSnap Manager profile to read highlighted text and make keywords out of it.

First, on the Scanning tab, I have had the best luck setting the Image quality to “Best” (300dpi). At anything lower, the ScanSnap wasn’t picking up the keywords consistently.

Image quality

Then on the File Option tab, make sure that “Set the marked text as a keyword for the PDF file” is checked. That will tell it to look for any highlighted text and turn it into a keyword in the PDF.

Set marked

You will, of course, want to choose a folder to save the PDF to. Make a note of this folder because we will need it when we switch to Hazel. In my case it is called ToMove.

Get Out Your Highlighter

Is it Hi-liter or Highlighter? I never know. Anyways, now take your pen and highlight the word or phrase that you want to move the file based on.

Essentially what we will be doing is saying “if the PDF contains this keyword, do something with it”.

All I have handy are grocery receipts, so you can see I highlighted “EXTRA FOODS”.

Grocery receipt highlighted

Scan And Check Keywords

Now scan your document using your shiny new ScanSnap Manager profile. When it is done, open up your new PDF in Preview, go to Tools > Inspector (or hit Cmd-I), and click on the magnifying glass. If everything worked properly, you should see the text that you highlighted.

PDF with keywords

Set Hazel To Move Based On Keyword

Let’s say we want to move any PDF with the keyword “EXTRA FOODS” to a folder called Filed Documents (we’d probably want to move it to a grocery-specific folder, but let’s just pretend).

Open up Hazel and on the left side, click the Plus to add a new folder. Add your ToMove folder that you used as a scan destination in ScanSnap Manager.

Hazel To Move

Now in the right pane, click the plus to add a new rule. Give it a name.

You can set a number of criteria and rules here, but to keep it simple we will leave it as “all conditions”, then set:

  • Kind is PDF
  • Keywords contain EXTRA FOODS

Next, set it to Move the file to folder Filed Documents

Hazel move based on keyword

Hit OK to save it. If you want to see what your rule will catch, you can click on the little Gear icon near the bottom and choose “Preview Rule Matches”. If everything is set up properly, your newly-scanned document should show there.

If it doesn’t show, check the PDF to make sure that it really has keywords and re-check your rule setup.

If your document shows in the preview, either wait for Hazel to do its thing, or click on the Hazel icon in the Menu bar, choose Run Rules, and choose the rule that you just created.

Set Hazel To Rename Based On Keyword

Let’s say that instead of moving a file based on a certain keyword, we want to give our files a name based on the highlighted text. Is this possible? Why yes, yes it is. Let’s use our new Hazel Ninja powers and do it.

Create a new Hazel rule as we did before, but this time for the criteria, set this:

  • Kind is PDF
  • Keywords is not blank

Next, in the “Do the following” section, choose “Move file” to folder “Filed Documents” (if you choose), and then set up the following:

  • Choose Rename file
  • In the with pattern section it will say “name” and then “extension”. Click on “name” and hit the delete key. We want to get rid of that.
  • Let’s give the filename a date. Drag “date created” up before extension. If you prefer, click the little down arrow in “date created” and choose Edit Date Pattern and change to whatever pattern you choose.
  • Drag “other” up between “date created” and “extension”. It will ask you to select a Spotlight Attribute. Scroll down to find Keywords and hit Select.
  • If you prefer, click on the little down arrow in “keywords” and change which keywords are selected and how they are formatted.
  • You might want to click between “date created” and “keywords” and put a dash, but that is up to you.

Your final rule should look something like this:

Hazel move rename keywords

Now when we scan that same Extra Foods receipt, our Hazel rule will move the file to Filed Documents and rename it like this.

Renamed PDF
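Hazel builds the new name for you, so you don’t need any code for this. Purely to show what the pattern works out to, here is a small Python sketch of the same “date created - keywords.extension” logic. The function name, the sample filename, the sample date, and the ISO date format are all assumptions made for the example; your Hazel date pattern may look different.

```python
from datetime import date
from pathlib import PurePath


def renamed(original: str, created: date, keywords: list[str]) -> str:
    """Build '<date created> - <keywords><extension>' from the original filename."""
    extension = PurePath(original).suffix  # includes the dot, e.g. ".pdf"
    return f"{created.isoformat()} - {' '.join(keywords)}{extension}"


# Hypothetical scan filename and date, with the keyword highlighted earlier.
print(renamed("scan0001.pdf", date(2010, 10, 5), ["EXTRA FOODS"]))
```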

Forget Keywords, Use Hazel To Move Based On Searchable Text

Let’s say you want to forget about this whole highlighter/keyword thing. You already have scanned and searchable PDFs. Can’t you just move based on the OCR’ed text in the documents? Let’s find out.

So you really, really like the vegetable kale and you want to move any scanned receipt that has the word Kale in it (can you tell all I had around for this demo is grocery receipts?).

First, here is our receipt:

Kale receipt

Next, we obviously need to be using a ScanSnap Manager profile that has “Convert to searchable PDF” checked on the File Options tab. Again you will have better results if you use 300dpi for Image quality.

Now we set up another Hazel rule, this time using the following criteria:

  • Kind is PDF
  • Contents contain Kale

Then do something with it such as move it to Filed Documents.

Hazel OCR Rule

Now when you scan a document that has the word “Kale” in it, Hazel will move it.
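If it helps to see that rule spelled out, here is a rough Python sketch of the same two conditions and the move. It assumes you already have the document’s OCR text as a string (getting that text is Hazel’s job in the real rule), and it assumes the match ignores upper and lower case. The function name and its parameters are hypothetical.

```python
import shutil
from pathlib import Path


def move_if_contains(pdf: Path, extracted_text: str, word: str, dest: Path) -> bool:
    """Move `pdf` into `dest` when it is a PDF and its text contains `word`."""
    is_pdf = pdf.suffix.lower() == ".pdf"
    has_word = word.lower() in extracted_text.lower()  # assumption: case-insensitive match
    if not (is_pdf and has_word):
        return False
    dest.mkdir(parents=True, exist_ok=True)
    shutil.move(str(pdf), str(dest / pdf.name))
    return True
```

A call such as move_if_contains(Path("ToMove/receipt.pdf"), text, "Kale", Path("Filed Documents")) would file the receipt only when the word shows up in its text.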

These were a few examples of things you can do in Hazel to be a document management ninja. Hopefully it will give you some ideas.


Lifehacker OCR Call For Votes

The folks over at Lifehacker are running one of their famous High Five calls for submissions, this time about readers’ favorite OCR tools.

OCR tools have been around for decades, but only recently have they been affordable (in many instances free) and accessible to people outside of government and corporate offices. This week we want to hear about your favorite OCR tool and what features make it so good at converting hard-copy print into machine-readable and editable text.

So, if you have a favorite program (or want to see what others are suggesting), head on over and have your say.

(Photo by: Laineys Repetoire)


Using Microsoft Office Document Imaging To OCR For Free

If you are a Windows user and already have Microsoft Office XP through 2007, chances are you already have the ability to OCR documents to get the text out of them.

It’s called Microsoft Office Document Imaging (MODI). I’m not going to lie, what I am about to show you is not exactly the best way to OCR documents. If you have software that came with your scanner, I’d stick to that.

However, if you don’t already have OCR software and all you want to do is get some text out of an image, the software you already have is better than nothing at all.

Finding Microsoft Office Document Imaging

First, you want to check to see if you already have it installed. In Office 2007, go to Start > Programs > Microsoft Office > Microsoft Office Tools, and you should see Microsoft Office Document Imaging.

If you don’t see it there, never fear. It’s an optional part of the Office install. In Control Panel, go to Add/Remove Programs, select Microsoft Office, click Change, and then select add features. You will find MODI under Microsoft Office Tools. Install it and you should be good to go.

Ah Microsoft, I Love You

It probably won’t surprise you to learn that Microsoft Office Document Imaging will not import PDFs (why would they support an Adobe product?!). It will only import TIFFs and Microsoft’s own Microsoft Document Imaging format (.MDI).

In this example, I’m going to assume that we want to get the text out of a PDF that has not been OCR’ed already. Sure you could use MODI to scan a document in, but I figure if you have the hardcopy document and a scanner, you’d probably just use the scanner’s software anyways.

Copying A PDF In

Since we can’t actually import a PDF, we’re going to do some copy & paste magic.

Open up your PDF in Acrobat Reader or whatever PDF reader you are using and either Select All or Select just the portion you want to OCR. Then hit Copy.

Select Info In PDF

(By the way, that’s my picture of a Fung Wah bus that made it into New York Magazine. Aren’t you proud of me?).

Then switch to MODI, and you would think you would go Edit > Paste right? Of course not! This is Microsoft!

Instead go to Page and then Paste Page. Voila, the image you just copied is now in Microsoft Office Document Imaging.

Saving The Text

So now that you have the image in MODI, what do you do with it? To OCR the text, go to Tools and then Recognize Text Using OCR.

You can then save it as a TIF (though I understand that only MODI can read that TIF), or MDI. Since that is more than a little useless, I’m going to cover sending the text to Word.

Send Text To Word

To send the text (and graphics, if you’d like) go up to Tools and then Send Text to Word. The OCR’ed text will then appear in a Word document with all the images at the bottom, if you checked the “Maintain Pictures in Output” box.

So, again, this is not the greatest OCR process in the whole world, but hey. If you’re a Windows user you probably already have Office, so it’s good to know what is available if you ever need it.

Photo: Naufragio
