OCR API - From Character Recognition to Information Extraction

Built on OCR technology, OCR APIs are trained to detect and extract information from documents. OCR APIs allow developers programmatic access to OCR technology and make integrating text recognition into various products easy.

Some popular applications of OCR APIs are converting images into text, recognizing handwritten text, detecting tables from PDFs, and processing documents in multiple languages.


Need a smart solution for online OCR: image-to-text, image-to-tables, PDF-to-Excel, PDF-to-table, PDF-to-text, or splitting PDFs into pages?

Check out Nanonets' pre-trained OCR for invoices, ID cards, purchase orders, bank statements, passports, and 300+ such documents!


The OCR API landscape

Many believe that OCR is the solution to all data extraction challenges. In reality, however, the products available to us as open-source tools or from legacy tech giants are far from perfect: they are too rigid, often inaccurate, and fail in the real world.

Most APIs are restricted to solving a very limited set of use cases and are averse to customization. Often, a business planning to use OCR needs an in-house team to build on top of the available OCR API before the technology can be applied to its use case.

The OCR technology available today is mostly a partial solution to the problem.

Why do most current OCR APIs fail?

Difficulty in processing custom data

The biggest roadblock to adopting OCR is that most APIs don't let you work with custom data. Every use case has its nuances and requires algorithms that can deal with different kinds of data. To get good results for any use case, the model must be trainable on the data you'll be dealing with most.

This is not a possibility with most of the available OCR APIs. 

OCR on vertical text on a shipping container

Consider any task involving OCR in the wild, such as reading traffic signs or shipping container numbers. Current OCR APIs do not support vertical reading, which makes the detection task in the image above much harder. Such use cases require character-level bounding boxes for the kinds of images you deal with most.
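If you need character-level boxes today, a common workaround (outside of any particular API) is to run an open-source engine such as Tesseract over rotated copies of the image. The sketch below uses pytesseract and Pillow; the input file name is hypothetical.

from PIL import Image
import pytesseract

def char_boxes(path):
    # Collect per-character bounding boxes at several rotations so that
    # vertically printed text (e.g. on a container) has a chance of being read.
    image = Image.open(path)
    results = {}
    for angle in (0, 90, 270):
        rotated = image.rotate(angle, expand=True)
        # image_to_boxes returns one "char left bottom right top page" line per character
        lines = pytesseract.image_to_boxes(rotated).splitlines()
        results[angle] = [line.split()[:5] for line in lines]
    return results

print(char_boxes("container_number.jpg"))  # hypothetical image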

Need for extreme post-processing

All current OCR APIs extract text from the given images, but it is up to us to build on top of that output to make it useful for a particular use case.


Getting text out of an invoice is not good enough. To actually use it, you have to build a layer of OCR software on top that extracts fields such as dates, company names, amounts, product details, etc.

While this may sound easy, the path to such an end-to-end product is filled with roadblocks due to inconsistencies in the input images and a lack of organization in the extracted text. 

The text extracted from the OCR models must be intelligently structured and loaded in a usable format to get meaningful results. This could mean you need an in-house team of developers to use existing OCR APIs to build software your organization can use.
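To make the gap concrete, here is a rough sketch of the kind of post-processing layer you end up writing yourself: a few regular expressions that pull a date, a total, and an invoice number out of raw OCR text. The patterns and field names are illustrative, not production-ready.

import re

def structure_invoice(raw_text):
    # Naive field extraction on top of raw OCR output
    date = re.search(r"\b\d{2}[/-]\d{2}[/-]\d{4}\b", raw_text)
    total = re.search(r"(?i)total\s*:?\s*\$?\s*([\d,]+\.\d{2})", raw_text)
    invoice_no = re.search(r"(?i)invoice\s*(?:no\.?|#)\s*(\w+)", raw_text)
    return {
        "date": date.group(0) if date else None,
        "total": total.group(1) if total else None,
        "invoice_number": invoice_no.group(1) if invoice_no else None,
    }

print(structure_invoice("Invoice #A1042  Date: 12/03/2024  Total: $1,250.00"))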

Read more: How to intelligently process documents?

Lack of flexibility

Current OCR methods perform well on scanned documents with digital text. Handwritten documents, images containing text in multiple languages at once, low-resolution images, unfamiliar fonts and varying font sizes, shadowy text, and the like, on the other hand, can cause your OCR model to make many errors and leave you with poor accuracy.

Amazon Textract performs poorly on handwritten text

Rigid models that resist customization limit the technology to the few applications where they can perform with at least reasonable effectiveness.

Technological barriers


Tilted text in images

Current research suggests that object detection models can handle rotated images when trained on augmented data, yet surprisingly few available OCR tools adopt object detection in their pipelines.

This has several drawbacks, including that your OCR model won’t pick up the tilted characters and words. 

Take, for example, reading number plates.

A camera attached to a street light will capture a moving car at a different angle, depending on the distance and direction of the car. In such cases, the text will appear to be tilted. Better accuracy might mean stronger traffic law enforcement and a decreased rate of accidents.
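One classical pre-processing step for tilted text is deskewing. The sketch below (plain OpenCV, not any vendor's pipeline) estimates the dominant text angle with minAreaRect and rotates it away; note that the angle convention differs between OpenCV versions, so treat the correction logic as a starting point.

import cv2
import numpy as np

def deskew(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Make text pixels non-zero, then fit a rotated rectangle around them
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:          # older OpenCV reports angles in (-90, 0]
        angle = -(90 + angle)
    else:
        angle = -angle
    h, w = gray.shape
    matrix = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    return cv2.warpAffine(gray, matrix, (w, h),
                          flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)

cv2.imwrite("plate_deskewed.jpg", deskew("plate.jpg"))  # hypothetical files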

OCR in nature

OCR has historically evolved to handle documents. Though much of our documentation and paperwork happens with computers these days, several use cases still require us to be able to process images taken in various settings. 

One such example is reading shipping container numbers.

Classical approaches tend to find the first character and go in a horizontal line, looking for characters that follow. This approach is useless when trying to run OCR on images in the wild. These images can be blurry and noisy. The text in them can be at various locations; the font might be something your OCR model hasn’t seen before, the text can be tilted, etc.

Deep Learning Based OCR for Text in the Wild
Learn how to apply deep learning based OCR to recognize and extract unstructured text information from images using Tesseract and the OpenCV EAST engine.

Handwritten text, cursive fonts, font sizes

The OCR annotation process requires you to identify each character as a separate bounding box, and models trained to work on such data get thrown off when faced with handwritten text or cursive fonts. 

This is because a gap between any two characters makes it easy to separate one from another. These gaps don’t exist for cursive fonts. Without these gaps, the OCR model thinks that all the connected characters are one single pattern that doesn’t fit into any of the character descriptions in its vocabulary. These issues can be addressed by powering your OCR engine with deep learning.
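If you are working with an open-source engine, the relevant knob is the recognition engine itself: Tesseract 4+ ships an LSTM engine that copes with connected strokes far better than the legacy per-character engine. A minimal sketch with pytesseract follows; the input file is hypothetical, and this is not a complete handwriting solution.

from PIL import Image
import pytesseract

image = Image.open("handwritten_note.png")
# --oem 1 selects Tesseract's neural-net (LSTM) engine,
# --psm 6 assumes a single uniform block of text
text = pytesseract.image_to_string(image, config="--oem 1 --psm 6")
print(text)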

Top 9 handwriting to text converters in 2024
Discover the top 9 handwriting to text conversion software in 2024. Learn how to quickly and easily convert handwriting to text.

Text in non-English languages

The OCR models provided by Google (Google Drive OCR) and Microsoft work well in English but do not perform well in other languages. 

This is mostly due to insufficient training data and varying syntactical rules for different languages. Any platform or company that intends to use OCR for data in their native languages will have to struggle with bad models and inaccurate results. 

You might want to analyze documents containing multiple languages simultaneously, like forms for government processes. Working with such cases is not possible with the available OCR APIs.

Difficult-to-read images

Noisy images can easily throw off your classifier and produce wrong results. A blurry image can make your OCR model confuse '8' with 'B' or 'A' with '4'.

💡
De-noising images is an active area of research and is being actively studied in deep learning and computer vision.

Making models robust to noise can help create a generalized approach to character recognition and image classification. Understanding de-noising and applying it to character recognition tasks can greatly improve accuracy.
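As a small illustration, a denoise-then-binarize pass with OpenCV before running OCR often removes exactly the speckle that causes '8'/'B' confusions. The filter strength and file names below are assumptions to tune for your images.

import cv2

def clean_for_ocr(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    denoised = cv2.fastNlMeansDenoising(gray, h=10)     # non-local means denoising
    _, binary = cv2.threshold(denoised, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

cv2.imwrite("cleaned.png", clean_for_ocr("noisy_scan.png"))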

Should I even consider using OCR?

The short answer is yes.

OCR technology can enable data extraction and process automation anywhere that currently involves a lot of paperwork or manual effort.

Digitizing information accurately can help business processes become smoother, easier, and more reliable and reduce the manpower required for execution. 

For big organizations that deal with many forms, invoices, receipts, etc., digitizing all the information, storing and structuring the data, and making it searchable and editable is a step closer to a paper-free world.

Use cases for OCR APIs

Number plates: Number plate detection can help enforce traffic rules, track cars in your taxi service's parking lot, and enhance security in public spaces, corporate buildings, malls, etc.

Legal documents: Digitize different documents - legal claim forms, affidavits, judgments, filings, etc. - store them in databases, and make them searchable.

Table extraction: Automatically extract tables from a document, including the text in each cell and the column headings, for research, data entry, data collection, etc.

Banking: Analyze cheques, read bank statements and financial statements, ensure KYC compliance, and analyze applications for loans, accounts, and other services.

Restaurants: Extract information from menus of different restaurants and put it into a homogeneous template for food delivery apps.

Healthcare: Digitize and maintain patients' medical records, including histories of illnesses, diagnoses, medication, etc., and make them searchable for doctors' convenience.

Invoices: Automate reading bills, invoices, and receipts, extracting products, prices, date-time data, and company/service names for the retail and logistics industry.

Need for automation

Automating business processes has proven to be a boon for organizations. It has helped them become more transparent and made communication and coordination between different teams easier. 

It has also increased business throughput, employee retention rates, customer service and delivery, productivity, and performance. 

Intelligent automation has helped speed up business processes while simultaneously cutting costs. It has made processes less chaotic and more reliable and helped increase employee morale. Moving towards digitization is a must to stay competitive in today’s world.

What do OCR APIs need?

OCR has much potential, but most products available today do not make it easier for businesses to adopt the technology. 

OCR converts images with text or scanned documents into machine-readable text. What to do with the text is left up to the people using these OCR technologies, which might initially seem like a good thing. This allows people to customize the text they are working with as they want, given they’re ready to spend the resources required to make it happen. 

However, beyond a few use cases, such as reading scanned documents and analyzing invoices and receipts, these technologies fail to make their case for widespread adoption.

If you are using OCR in any form, it’s time to ask some difficult, important questions: 

  • How does your OCR deal with the input images?
  • Does it minimize the pre-processing required?
  • Can the annotation process be made easier?
  • Which image formats does it accept?
  • Do you lose information while pre-processing?
  • How does it perform in real-world problems?
  • What is the accuracy?
  • Does it perform well in any language?
  • What about difficult cases like tilted text, OCR in the wild, and handwritten text?
  • Is it possible to constantly improve your models?
  • How does it fare against other OCR tools and APIs?
  • How does it use the machine-readable text?
  • Does it allow us to give it a structure?
  • Does it make iterating over the structure easier?
  • Can I choose the information I want to keep and discard the rest?
  • Does it make storage, editing, and searching easier?
  • Does it make data analysis easier?

Top 5 OCR APIs

Today, with OCR APIs integrating seamlessly with other applications, data extraction from images, PDFs, and documents has become easier.

Check out these top 5 OCR APIs for data extraction and image text recognition:

  1. Nanonets
  2. Google Cloud Vision API
  3. ABBYY
  4. AWS Textract
  5. Microsoft Azure AI Vision

Why Nanonets OCR API?

We at Nanonets have worked to build a product that solves these problems. We have productized a pipeline for OCR by treating it not just as character recognition but also as an object detection and classification task.

AI Invoice processing

ROI is too high to even quantify!

"Our business grew 5x in last 4 years, to process invoices manually would mean a 5x increase in staff, this was neither cost-effective nor a scalable way to grow. Nanonets helped us avoid such an increase in staff. Our previous process used to take six hours a day to run. With Nanonets, it now takes 10 minutes to run everything. I found Nanonets very easy to integrate, the APIs are very easy to use." ~ David Giovanni, CEO at Ascend Properties.

Want to see the difference intelligent automation can make for your team? Claim your personalized demo session now.

However, the benefits of using Nanonets over other OCR APIs (like Power Automate) go beyond better accuracy. Here are a few reasons you should consider using the Nanonets OCR API.

Automated intelligent structured field extraction

Assume you want to analyze receipts in your organization to automate reimbursements. 

You have an app where people can upload their receipts. This app needs to read the text in those receipts, extract data, and put them into a table with columns like transaction ID, date, time, service availed, the price paid, etc. This information is updated constantly in a database that calculates the total reimbursement for each employee at the end of each month. 

Nanonets makes extracting text easy. It structures the relevant data into the required fields and discards the irrelevant data extracted from the image.
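As a rough sketch of what that receipt flow can look like in code, the snippet below posts an image to a trained Nanonets model and flattens the predictions into one row. The endpoint path and response keys follow the public Nanonets OCR API documentation as we understand it, and the exact field labels (transaction ID, date, amount, and so on) depend on how your own model was annotated, so treat them as assumptions.

import os
import requests

API_KEY = os.environ["NANONETS_API_KEY"]
MODEL_ID = os.environ["NANONETS_MODEL_ID"]
URL = f"https://app.nanonets.com/api/v2/OCR/Model/{MODEL_ID}/LabelFile/"

def extract_receipt_fields(image_path):
    with open(image_path, "rb") as f:
        response = requests.post(URL, auth=(API_KEY, ""), files={"file": f})
    response.raise_for_status()
    row = {}
    for result in response.json().get("result", []):
        for prediction in result.get("prediction", []):
            row[prediction["label"]] = prediction["ocr_text"]
    return row   # e.g. {"transaction_id": "...", "date": "...", "amount": "..."}

print(extract_receipt_fields("receipt.jpg"))  # hypothetical image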

Supports 40+ global languages

If your company deals with data that isn't in English, you have probably already wasted time looking for OCR APIs (a Japanese OCR, for example) that actually deliver what they promise. We can provide an automated end-to-end pipeline specific to your use case by allowing custom training and varying the vocabulary of our models to suit your needs.

Recognizes text from any image

Reading street signs to help with navigation in remote areas, reading shipping container numbers to keep track of your materials, and reading number plates for traffic safety are just some of the use cases that involve images in the wild. 

The Nanonets image text recognition API uses object detection methods to find and classify text even in images with varying contrast levels, font sizes, and angles.

Use your own data to train the model

Get rid of the rigidity your previous OCR services forced your workflow into. You no longer have to wonder what is possible with this technology.

With Nanonets, you can focus on making the most of it for your business. Using your own data for training broadens the scope of applications, like working with multiple languages simultaneously. It also enhances your model performance because test data is much more similar to training data.

Continuous learning 

Imagine you are expanding your transportation service to a new state.

You are faced with the risk of your model becoming obsolete due to the new language your truck number plates are in. Or maybe you have a video platform that needs to moderate explicit text in videos and images. With new content, you are faced with more edge cases where the model’s predictions are not very confident or, in some cases, false. 

To overcome such roadblocks, Nanonets OCR API allows you to easily retrain your models with new data and automate your operations anywhere faster.

No development effort

No need to worry about hiring developers and acquiring talent to personalize the technology for your business requirements. 

Extract tabular data from PDFs and send it to different business apps

Seamless data flow is just a step away.

Connect with over 5,000 apps via Zapier, APIs, and webhooks and automatically route extracted document data to your business apps, eliminating manual data entry—no coding required.
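On the receiving end, a webhook integration is just an HTTP endpoint in your own stack. Here is a minimal sketch using Flask; the payload shape (a "fields" key holding the extracted values) is an assumption, so match it to whatever your export or webhook configuration actually sends.

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/document-extracted", methods=["POST"])
def document_extracted():
    payload = request.get_json(force=True)
    fields = payload.get("fields", {})   # assumed key; adjust to your payload
    push_to_business_app(fields)
    return jsonify({"status": "received"}), 200

def push_to_business_app(fields):
    # Stand-in for your own integration logic (ERP, accounting tool, database, ...)
    print("Routing extracted data:", fields)

if __name__ == "__main__":
    app.run(port=8000)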

Nanonets takes care of your requirements, from the business logic to an end-to-end product that can be integrated easily into your organization's workflow without worrying about infrastructure requirements.

How to use Nanonets OCR APIs

Nanonets OCR API allows you to build OCR models with ease. You can upload your data, annotate it, set the model to train, and wait to get predictions through a browser-based UI.

Below, we will give you a step-by-step guide to training your own model using the Nanonets OCR API in 9 simple steps.

Step 1: Clone the repo

git clone https://github.com/NanoNets/nanonets-ocr-sample-python
cd nanonets-ocr-sample-python
sudo pip install requests
sudo pip install tqdm

Step 2: Get your free API key

Get your free API Key from https://app.nanonets.com/#/keys

Step 3: Set the API key as an environment variable

export NANONETS_API_KEY=YOUR_API_KEY_GOES_HERE

Step 4: Create a new model

python ./code/create-model.py
Note: This generates a MODEL_ID that you need for the next step

Step 5: Add model ID as an environment variable

export NANONETS_MODEL_ID=YOUR_MODEL_ID

Step 6: Upload the training data

Collect the images of the object you want to detect. Once you have the dataset ready in the images folder (image files), start uploading it.

python ./code/upload-training.py

Step 7: Train Model

Once the images have been uploaded, begin training the model:

python ./code/train-model.py

Step 8: Get the model state

The model takes ~30 minutes to train. You will get an email once the model is trained. In the meantime, you can check the state of the model:

watch -n 100 python ./code/model-state.py

Step 9: Make the prediction

Once the model is trained, you can make predictions using it:

python ./code/prediction.py PATH_TO_YOUR_IMAGE.jpg
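If you have a whole folder of documents, you can loop the same script over every file. The snippet below is our own small addition, not part of the sample repo; it assumes your images sit in a folder called images_to_predict.

import pathlib
import subprocess

for image in sorted(pathlib.Path("images_to_predict").glob("*.jpg")):
    # Reuses the repo's prediction script for each image in the folder
    subprocess.run(["python", "./code/prediction.py", str(image)], check=True)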

Conclusion

While many OCR products are available today, only a few have progressed to applying deep learning-based approaches. There is a dearth of such products that make the OCR process easier for users and organizations.

The right AI-powered OCR API can ensure secure capture, categorization, and extraction of unseen, unstructured documents or forms into structured data within seconds.