
Wednesday, June 27, 2012

Spaces wrongly added in OCR text of PDF generated by Abbyy FineReader - Solution

In my search for the holy grail of document scanning and management software, I’ve been trialing Abbyy FineReader 11 and Adobe Acrobat X to see if either offers any benefits over Lucion FileCenter on the scanning front.  FileCenter is a nice piece of software for managing scans, but in my opinion it falls short on the OCR front.

What I’m hoping to find in a product is something that offers a high level of OCR accuracy, and the ability to easily correct mistakes with a quick and simple interface (preferably all controlled by keyboard commands for improved processing time).

I’m after a product that could learn / be trained to detect document types and only OCR the data of actual interest.

I’m after a product that easily and automatically recognizes document paper size and allows simple and effective trim/deskew/cleanup etc. as necessary.

Anyway, these requirements culminated in my testing of FineReader and Acrobat X.

Acrobat X has the potential to be a useful piece of software, but it performed surprisingly poorly on the OCR front with the documents that I threw at it; calling the results average is being kind.  What astonished me, however, was the product’s inability to provide a simple mechanism to correct the OCR mistakes.  I had to search Google for a way to do it, and the results returned were more workaround/hack than intended out-of-the-box functionality.  What it did do nicely was the simple one-click scan to PDF.  It was just a pity the OCR let it down.

Next up was FineReader 11.  This software appeared to excel on the OCR front, and it also offers a simple verification mechanism to correct the issues it detects.  It was not, however, without its warts.  Features that could turn this very good software into excellent software:

  1. The ability to perform all document verification/correction steps completely using keyboard with absolutely no mouse interaction required.
  2. The ability to save scanning profiles for specific document types.  For example, what types of fonts to expect, what paper size, what spelling mistakes/words to ignore.
  3. The ability for the spell checker / verifier to ignore single letters that appear as part of a word grouping.  For example, John Smith is fine, but it sees an issue with the “J” in J Smith.  I should be able to add “J Smith” to my dictionary ignore list.
  4. The ability to completely skip OCR on specific pages of a document without silly workarounds.  Currently, you have to add a scan region on the page, and read the page.
  5. The ability to better recognize font families and the like.

One major flaw I found in the software fortunately appears to have a cure.  I was hoping others would have run into it and provided a solution in some type of forum, but alas that was not the case.  Abbyy, if you are reading this post by any chance, I would strongly suggest you consider making a public support forum for people to discuss your software and offer tips etc.  It should definitely lead to more sales if people can find solutions to problems they encounter with your software.

Anyway, the issue I ran into was the following:

When converting a scan to PDF, words had spurious spacing/padding added that was not present in the OCR text in Abbyy, or for that matter in other output formats such as HTML.

The cause:

The Windows machine likely does not have the matching font installed, and Abbyy is likely not configured to use the font.

The solution:

First, attempt to identify the font used by the document.  If only a scan is available, extract some words containing a sufficient sample of characters, head on over to http://www.myfonts.com/WhatTheFont/ and attempt to determine the font.

If you can get hold of an electronic PDF of the same file (or a similar one from the same provider), you should be able to go to Acrobat > File > Properties dialog > Fonts tab.  There you will likely see that the font has been embedded, but only as a subset (just the characters used by the document, and nothing else).
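If you don't have Acrobat handy, a rough version of the same check can be scripted.  The sketch below is my own illustration, not something Acrobat or Abbyy provides: it scans the raw bytes of a PDF for /BaseFont entries and flags subset fonts, which PDF writers name with a tag of six uppercase letters followed by a plus sign (for example ABCDEF+Frutiger-Roman).  It assumes the font dictionaries are not stored inside compressed object streams, so an empty result does not prove the file has no fonts.  The function name and the sample bytes are hypothetical.

```python
import re

# Matches "/BaseFont /SomeName" in uncompressed PDF objects.
BASE_FONT_RE = re.compile(rb"/BaseFont\s*/([^\s/<>\[\]()]+)")
# Subset fonts carry a tag of six uppercase letters followed by "+".
SUBSET_TAG_RE = re.compile(r"^[A-Z]{6}\+")


def find_base_fonts(pdf_bytes: bytes) -> dict:
    """Map each font name found to True if its name looks like a subset."""
    fonts = {}
    for match in BASE_FONT_RE.finditer(pdf_bytes):
        name = match.group(1).decode("latin-1")
        fonts[name] = bool(SUBSET_TAG_RE.match(name))
    return fonts


# Made-up fragment standing in for the contents of a real PDF file.
sample = b"<< /Type /Font /BaseFont /ABCDEF+Frutiger-Roman >> << /BaseFont /Arial >>"
print(find_base_fonts(sample))
```

For a real file, read it in binary mode and pass the bytes to find_base_fonts.  The part of a flagged name after the plus sign is the font to go looking for.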

Using tools such as FontForge and mupdf-1.0-tools-windows.zip, you can potentially extract the embedded font subsets and try to convert these to TTF.  But the resulting TTF is unlikely to be sufficient, since it contains only the subset of characters.  Your best bet is to try to find the original TTF on the net.  You may have to purchase it.

One other issue is that even if you can find the TTF, chances are the licensing flags on the TTF will prevent it from being embedded in a PDF document.
To circumvent this, you can use ttfpatch to set the fsType flag to 0, meaning: installable embedding allowed; fonts may be embedded in documents and permanently installed on the remote system.
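To see what the flag currently says, before or after patching, you can read it straight out of the font file.  The following is a minimal sketch of my own, not part of ttfpatch: it walks the table directory of a single-font TTF/OTF file, finds the OS/2 table, and reads the 16-bit fsType field at byte offset 8 of that table.  It only reads the value and does not modify the file.  The function names are hypothetical, and font collections (.ttc) are not handled.

```python
import struct

# Embedding permission bits of the OS/2 table's fsType field.
FS_TYPE_BITS = {
    0x0002: "Restricted License embedding",
    0x0004: "Preview & Print embedding",
    0x0008: "Editable embedding",
    0x0100: "No subsetting",
    0x0200: "Bitmap embedding only",
}


def read_fs_type(font_bytes: bytes) -> int:
    """Return the fsType value from a single-font TTF/OTF file."""
    num_tables = struct.unpack_from(">H", font_bytes, 4)[0]
    for index in range(num_tables):
        # Table records follow the 12-byte header, 16 bytes each.
        tag, _checksum, offset, _length = struct.unpack_from(
            ">4sIII", font_bytes, 12 + 16 * index
        )
        if tag == b"OS/2":
            # fsType is the uint16 at byte offset 8 of the OS/2 table.
            return struct.unpack_from(">H", font_bytes, offset + 8)[0]
    raise ValueError("no OS/2 table found")


def describe_fs_type(fs_type: int) -> list:
    """Translate fsType into human-readable embedding permissions."""
    labels = [text for bit, text in FS_TYPE_BITS.items() if fs_type & bit]
    if fs_type & 0x000E == 0:
        # None of the restricting bits are set.
        labels.insert(0, "Installable embedding")
    return labels
```

Read the font file in binary mode and pass its bytes to read_fs_type, then feed the result to describe_fs_type.  A font that embeds cleanly should report only "Installable embedding".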

Then install the font in the <windows>\fonts directory, and open Abbyy.
Under Tools > Options > Read tab > Fonts, ensure the font is selected (and thus available for OCR).

If you are saving the file as PDF/A (which embeds fonts) and an error message appears stating that the font is restricted from being embedded, first ensure the fsType flag has been reset to 0, and also try clearing out the Abbyy font cache by deleting this directory:
C:\Documents and Settings\All Users\Application Data\ABBYY\FineReader\11.00\FontCache
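If you end up clearing the cache repeatedly while testing fonts, the deletion can be scripted.  This is only a sketch under assumptions: the default path is the one quoted above (a Windows XP layout) and will probably differ on other Windows versions, the function name is made up, and I would assume FineReader should be closed before running it.

```python
import shutil
from pathlib import Path

# Path quoted in this post (Windows XP layout); adjust for your system.
DEFAULT_CACHE = Path(
    r"C:\Documents and Settings\All Users\Application Data"
    r"\ABBYY\FineReader\11.00\FontCache"
)


def clear_font_cache(cache_dir: Path = DEFAULT_CACHE) -> bool:
    """Delete the cache directory; return True if something was removed."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return False  # nothing to delete
    shutil.rmtree(cache_dir)
    return True
```

Calling clear_font_cache() with no arguments targets the default path; pass a different Path if your installation keeps the cache elsewhere.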

Also, it is worthwhile installing the Microsoft Typography Font Properties Extension:
http://www.microsoft.com/typography/FreeToolsOverview.mspx
This will allow you to right click > Properties on a TTF file and see the embedding restrictions on the file.

Good luck!

3 comments:

  1. I just called ABBYY and they were friendly with a quick fix!
    Go to TOOLS, choose OPTIONS, click on PDF/A tab, then check off ENABLE TAGGED PDF...

  2. There is another solution for the additional spaces in text (not sure it will be suitable in all cases) : In FineReader, open the text pane, select all text, and change the font to Courier New ...

  3. My solution in FineReader 12 is to use the following Save Options:

    - Save mode: Text under the page image
    - Font settings: Use Windows fonts
