Optimal Optical Character Recognition

My research relies heavily on getting the most seamless transfer I can from hard-copy textbooks, set in antiquated fonts and stored in a dusty library somewhere, to a state-of-the-art machine-readable format that I can use for my analysis. For that, I need the best optical character recognition (OCR) tool I can get my hands on.

And by far, the best on the market is ABBYY Cloud OCR SDK:

see https://cloud.ocrsdk.com.

In conjunction with abbyyR, a package developed for R, one can submit single images or multiple documents for OCRing at ABBYY Cloud and obtain OCR’d documents with close to 99% accuracy. The software recognizes a wide range of fonts, corrects the orientation and skew of documents, and reads tabular formats extremely accurately, exporting them to Excel, text or searchable PDF formats. It also lets users test the product for free, so they can get to know the software in depth and fine-tune their scripts for better results.

Free software alternatives to ABBYY include Tesseract, an open-source OCR engine (usable from Python through wrappers such as pytesseract) that does a decent job but requires training sets for font recognition in the most troublesome cases. That places it behind ABBYY when it comes to OCRing old documents in Gothic fonts, for example.

ABBYY’s pricing structure is also an attractive feature. At $100 for 1,000 pages (about $0.10 per page), it is affordable to most people, including undergraduate and graduate students who often work on a limited budget. In addition, its high accuracy in document conversion saves a lot of time otherwise spent manually checking for errors in the recognized text or in particular features within the documents (such as tables). In most projects, this checking would be carried out by research assistants, who often do not come cheap. ABBYY allows researchers to bypass these constraints at a modest price point.
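
To put the pricing in concrete terms, here is a minimal budgeting sketch in Python. It uses only the $100 per 1,000 pages figure quoted above. The function name and the `whole_blocks` option are my own, and the idea that pages might be sold only in whole 1,000-page blocks is an assumption, so check the current price list before relying on it.

```python
import math

PRICE_PER_BLOCK_USD = 100.0  # figure quoted in this post
PAGES_PER_BLOCK = 1000       # figure quoted in this post


def estimate_cost(pages: int, whole_blocks: bool = False) -> float:
    """Estimate the OCR cost in USD for a given number of pages.

    whole_blocks=True assumes (hypothetically) that pages can only be
    bought in complete 1,000-page blocks, so the count is rounded up.
    """
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if whole_blocks:
        blocks = math.ceil(pages / PAGES_PER_BLOCK)
        return blocks * PRICE_PER_BLOCK_USD
    # Pro-rata cost at the quoted rate.
    return pages * PRICE_PER_BLOCK_USD / PAGES_PER_BLOCK


if __name__ == "__main__":
    for n in (350, 1000, 1200):
        print(n, "pages:", estimate_cost(n), "USD pro rata,",
              estimate_cost(n, whole_blocks=True), "USD in whole blocks")
```

At the quoted rate a 350-page textbook works out to about $35, which is the kind of number worth comparing against a few hours of research-assistant time.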

When pairing ABBYY Cloud with R, there are tutorials available on how to script requests to ABBYY’s server. The code structure is straightforward and can be adapted to loops, facilitating batch processing of documents. An overview of the inner workings of the R package can be found
here: https://cran.r-project.org/web/packages/abbyyR/vignettes/overview.html
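
The batch-processing loop mentioned above has roughly the following shape. This is a language-neutral sketch written in Python rather than R, and it does not call abbyyR or ABBYY at all: `submit` is a hypothetical stand-in for whatever function actually uploads one file and returns a task identifier, and the list of file extensions is my own assumption.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Tuple

# Extensions treated as scans; an assumption, adjust to your own files.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".pdf"}


def is_scan(path: Path) -> bool:
    """Return True if the file extension looks like a scanned page."""
    return path.suffix.lower() in IMAGE_SUFFIXES


def find_scans(folder: str) -> List[Path]:
    """List the scan files in a folder, sorted by name."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and is_scan(p))


def process_batch(
    scans: Iterable[Path],
    submit: Callable[[Path], str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Submit each scan in turn; keep going if one of them fails.

    `submit` is a hypothetical callable supplied by the caller.
    Returns (done, failed): file name -> task id, file name -> error text.
    """
    done: Dict[str, str] = {}
    failed: Dict[str, str] = {}
    for scan in scans:
        try:
            done[scan.name] = submit(scan)
        except Exception as err:  # record the failure and continue
            failed[scan.name] = str(err)
    return done, failed
```

Collecting failures instead of stopping at the first one matters for long runs: a single unreadable page should not cost you the rest of the book, and the `failed` dictionary tells you exactly which files to resubmit.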

Also visit ABBYY Cloud’s website for a description of the customizable fields, such as font type, document type and desired output. For those, see ABBYY’s documentation here: http://ocrsdk.com/documentation
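
As an illustration of how those customizable fields end up in a request, here is a small Python sketch that only builds a query string and sends nothing over the network. The endpoint path and the parameter names and values (`language`, `textType`, `exportFormat`, `profile`, `gothic`, `pdfSearchable`, `documentConversion`) are written from my recollection of the processImage method and should be treated as assumptions to verify against ABBYY’s documentation.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against ABBYY's documentation.
BASE_URL = "https://cloud.ocrsdk.com/processImage"


def build_request_url(
    language: str = "English",
    text_type: str = "normal",
    export_format: str = "txt",
    profile: str = "documentConversion",
) -> str:
    """Build a request URL from the customizable fields.

    All parameter names and defaults here are assumptions for
    illustration, not a verified description of the service.
    """
    params = {
        "language": language,           # recognition language
        "textType": text_type,          # font type, e.g. gothic
        "exportFormat": export_format,  # desired output
        "profile": profile,             # document type / processing profile
    }
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Example: an old German text in a Gothic font, exported as searchable PDF.
    print(build_request_url("German", "gothic", "pdfSearchable"))
```

Keeping the fields in one function like this makes it easy to experiment with one setting at a time during the free trial and see which combination works best for a given set of scans.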

Overall, I really like the style of the service and I’m stoked that they’re offering it with a somewhat lower cost of entry to people who are doing academic research, such as myself. 