Achieve error-free AI-driven data capture from documents like invoices, receipts, driver's licenses, passports & more. Try Nanonets intelligent document processing for free and automate document data capture.

A computer would deserve to be called intelligent if it could deceive a human into believing that it was human.

— Alan Turing

What is document understanding?

A popular view of AI research is that sentient and intelligent machines are just around the corner. Machines already understand verbal commands (think Siri, Google Assistant, etc.), classify pictures, drive cars, and play some games better than we do. One important goal of AI is to understand the meaning of the letters and figures written in documents. Intelligent document understanding aims at capturing, extracting, and processing data from various types of documents. This article explores the technicalities and features of document understanding and processing.

The evolution of document understanding

The world is driven by data and information. Since ancient times, humans have composed documents in various formats to record and preserve information, both for immediate use and for posterity.

Image: The oldest “document” available, a clay tablet written c. 1750 BC.

Modern-day documents take the form of business forms, scholarly and news articles, invoices, letters, and text-based emails, and they all convey information through language, visual content, and layout structure. Understanding the content of a document involves reading, interpreting, and extracting information from the written text. It has hitherto been a hallmark of human intelligence, engaging the human eye (or touch, in the case of Braille) and the intricate neural machinery of the human brain. A scholarly definition of document understanding is that it is “the process to transform data into information by applying knowledge.”

Document Understanding - Breakdown
Image reproduced without modification from here.

With rapid advancements in technology, machines are now being developed to mimic the complex process of document understanding. Machine-based document understanding has evolved from handcrafted rule-based algorithms to current methods based on deep learning, computer vision, and Natural Language Processing (NLP), which come much closer to human performance in understanding the content of documents.

Today, automated document understanding is associated with the logical and semantic analysis of documents to extract useful information for specific purposes. The information extracted may include objective textual items such as dates, names, identification numbers, and costs, as well as content-driven relationships. Advanced document understanding tools can thus automatically extract meaningful data from the combination of the written text and its presentation, the logical structure of the document, and its context.

Challenges to machine-based document understanding

The obvious challenge to machine-based document understanding is the variety of formats in which documents are created.

Documents could be highly-structured, semi-structured, or unstructured. Structured documents have well-defined regions of interest from which information can be easily extracted by a machine. Some examples include questionnaires, registration forms, and claim forms.

Semi-structured and unstructured documents have poorly defined regions of interest, which makes data extraction challenging for a non-specialized tool. Examples of semi-structured documents are invoices, bills, and purchase orders; examples of unstructured documents are letters, memos, emails, videos, and images.

Around 95% of businesses reportedly handle unstructured data. In almost all cases, document understanding involves extraction of data from Visually Rich Documents (VRDs), in which the layout and visual representation of information are critical to understanding the whole document.

Image: Types of documents that must be “understood”

The layout structure can vary even within the same type of document, with logical objects such as names or dates appearing in different locations. Invoices, for example, follow different formats at different companies. Understanding the format of a document and identifying the appropriate fields to extract requires complex object detection and image segmentation techniques that mimic, or come close to, the human ability of visual discernment.

Image: Various invoice templates recommended by Microsoft Word

Components of document understanding

Document understanding comes under the larger activity of data digitization.  The following steps are usually followed in the document understanding process.

1. File entry into the digital platform: It is logical that the understanding of a document starts with looking at the document and defining it. The documents may be in the form of PDFs, word processing files, accounting spreadsheets, digitized images, or hard copies. The first step to the digitization of data is uploading the document into a digital platform for subsequent categorization, data extraction, and storage.
2. Categorization of uploaded documents: Document understanding tools use algorithmic approaches to identify and categorize the type of document that has been uploaded, so that the right data or information can be picked out in the following step. Different departments and personnel in an organization work with different kinds of documents, and document categorization automatically determines the class to which each document belongs. For this, AI and ML tools have predefined classes into which documents can be automatically categorized. Sub-categorization can further use a topic group taxonomy to define a finite set of relevant information items. In addition to predefined taxonomies, many AI-based document understanding tools allow users to describe new, custom classes and topic groups for categorizing documents.
3. Extraction of relevant data: Understanding the document involves the extraction of relevant data from the documents that have been uploaded into the proprietary platform. Data extraction involves the acquisition of raw data from documents for further processing. Once the documents are imported into the digital platform of choice, data extraction software scans and captures the required data. Data extraction may either be non-discerning or discerning.

  • Non-discerning data extraction: In full extraction of data, all relevant data are extracted at the same time, directly from the source document, without the need for additional logical/technological information. It is used when data must be extracted and loaded for the first time. This extraction reflects the current data available in the source system. Two technologies are typically used: ICR for unstructured and semi-structured documents, and OCR for structured documents.
    Intelligent Character Recognition (ICR): ICR converts handwritten text and paper-based formats into computer-readable information. It is used for unstructured documents such as letters and other handwritten business correspondence. Examples of ICR software include A2iA Mitek, Parascript FormXtra.AI, and LEADTOOLS ICR SDK.
    Optical Character Recognition (OCR): OCR converts structured documents into machine-readable files to capture data. Advanced OCR tools also perform data pre-processing activities such as noise removal, binarization, and line segmentation.
    Non-discerning data extraction requires further human intervention to understand the document and process the data as required.

  • Discerning or smart data extraction: While rudimentary versions of OCR software simply convert the entire content of a document into a machine-readable format, more advanced OCR tools, such as zonal OCR, can selectively extract the desired information rather than the entire content. The selection may be based on layout features, textual features, search patterns, format features, special indicators, and tabular features. Zonal OCR software identifies the structure and hierarchy of a document through code or an API. The OCR engine then splits the document into zones, each of which can correspond to a particular field. These zones are defined by designing appropriate OCR templates. They are usually location-based, as shown in the following figure, in which the user simply draws a rectangle around the data that must be extracted. Then, instead of reading the entire page as a single entity, the engine identifies and extracts only the text in the zones specified in the template. Zonal OCR can also be programmed to ignore graphical elements that need not be read, which reduces the amount of information that must be parsed to pick out the needed data. This improves the speed and accuracy of the OCR engine.

Smart data extraction
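The zonal approach described above can be illustrated with a minimal sketch. It assumes that a full-page OCR pass has already produced words with pixel bounding boxes; the `Box` and `Word` types, the template coordinates, and the sample words are all hypothetical and do not come from any particular OCR product. The template maps each field name to a rectangle, and a word is assigned to a zone when its center falls inside that rectangle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page pixels."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class Word:
    """One recognized word and its bounding box (hypothetical OCR output)."""
    text: str
    box: Box


def extract_zones(words: list[Word], template: dict[str, Box]) -> dict[str, str]:
    """Return the text found inside each named zone of the template."""
    fields: dict[str, list[Word]] = {name: [] for name in template}
    for word in words:
        # Test the word's center point against every zone.
        cx = (word.box.left + word.box.right) / 2
        cy = (word.box.top + word.box.bottom) / 2
        for name, zone in template.items():
            if zone.contains(cx, cy):
                fields[name].append(word)
    # Simple reading order (top, then left); adequate for single-line zones.
    return {
        name: " ".join(
            w.text for w in sorted(ws, key=lambda w: (w.box.top, w.box.left))
        )
        for name, ws in fields.items()
    }


# Hypothetical template: two zones drawn on a made-up invoice layout.
INVOICE_TEMPLATE = {
    "invoice_number": Box(400, 40, 600, 70),
    "total": Box(400, 700, 600, 740),
}

# Hypothetical OCR output for one page.
page_words = [
    Word("ACME", Box(40, 40, 120, 65)),
    Word("INV-0042", Box(420, 45, 520, 65)),
    Word("Total:", Box(330, 705, 390, 730)),
    Word("1,250.00", Box(430, 705, 520, 730)),
]

result = extract_zones(page_words, INVOICE_TEMPLATE)
print(result)
```

Words outside every zone (here "ACME" and "Total:") are simply ignored, which is the point of the technique. The same dependence on fixed coordinates is also why a template of this kind breaks when a vendor changes its layout.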

Many OCR tools use AI and ML approaches so that the understanding process can be learned. Some commonly used AI-based OCR tools are Nanonets OCR API, Tesseract, OCRopus, Ocular, SwiftOCR, and Calamari. The Nanonets OCR API uses state-of-the-art AI algorithms that allow the design of custom OCR models. Data can be uploaded and annotated, and the model can be trained easily and integrated seamlessly with existing systems. Training AI models requires a certain amount of human validation: a small sample of the model’s output is checked for accuracy, and the results are used to correct the model for more accurate data understanding.

In most AI-based document understanding software, the data extraction component is integrated with data quality checks and data preparation software to clean and organize data after scraping. It also incorporates data integration tools to combine multiple data types and sources and aggregate them in one place.  Good AI-data extraction tools can extract structured, poorly structured, and unstructured data, pull data from multiple sources, and export extracted data in multiple readable formats.
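As a rough illustration of the clean-up and export stage, the sketch below normalizes a few raw extracted fields and writes them out as JSON and CSV using only the Python standard library. The field names, the accepted date formats, and the assumption that a comma is a thousands separator are illustrative choices, not a description of how any specific product works.

```python
import csv
import io
import json
from datetime import datetime
from decimal import Decimal, InvalidOperation
from typing import Optional

# Assumed input date formats; extend to match the documents at hand.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")
FIELDNAMES = ["vendor", "invoice_date", "total"]


def clean_amount(raw: str) -> Optional[Decimal]:
    """Drop currency symbols and thousands separators; None if not numeric."""
    cleaned = raw.strip().replace(",", "").lstrip("$€£ ")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def clean_date(raw: str) -> Optional[str]:
    """Return the date in ISO 8601 form, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_record(raw: dict[str, str]) -> dict[str, Optional[str]]:
    """Normalize whitespace, dates, and amounts in one extracted record."""
    amount = clean_amount(raw.get("total", ""))
    return {
        "vendor": " ".join(raw.get("vendor", "").split()),
        "invoice_date": clean_date(raw.get("invoice_date", "")),
        "total": str(amount) if amount is not None else None,
    }


def to_json(records: list[dict]) -> str:
    return json.dumps(records, indent=2)


def to_csv(records: list[dict]) -> str:
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(records)  # None values are written as empty cells
    return buffer.getvalue()


# Hypothetical raw output from an extraction step.
raw_records = [
    {"vendor": "  ACME   Corp ", "invoice_date": "03/02/2021", "total": "$1,250.00"},
    {"vendor": "Globex", "invoice_date": "2021-02-30", "total": "n/a"},
]

cleaned = [clean_record(r) for r in raw_records]
print(to_json(cleaned))
print(to_csv(cleaned))
```

Values that cannot be parsed, such as the impossible date and the non-numeric total in the second record, become empty rather than being guessed, so they can be flagged for review downstream.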

4. Final validation and export: After information has been extracted from the document, some human intervention is required to check the accuracy of the extracted data. Non-discerning data extraction requires more manual work, because the relevant data must still be picked out of the entire content of the digitized document. In discerning data extraction, the level of human intervention required for validation depends on the sophistication of the document understanding tool employed. AI and ML-based document understanding tools aim to keep manual effort to a minimum by capturing data accurately through a more human-like understanding of the document.

Once the data extracted from the document has been validated, it is entered into an ERP or other data system or placed in a repository for further processing or storage.
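One common way to limit the human effort in the validation step is to review only the fields the extraction step is unsure about. The sketch below assumes that each extracted field comes with a confidence score between 0 and 1. The `ExtractedField` type, the threshold value, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical extractor


REVIEW_THRESHOLD = 0.90  # assumed cut-off; tune per field and use case


def route_fields(
    fields: list[ExtractedField], threshold: float = REVIEW_THRESHOLD
) -> tuple[list[ExtractedField], list[ExtractedField]]:
    """Split fields into (auto_approved, needs_review)."""
    approved, review = [], []
    for field in fields:
        # Empty values always go to a human, whatever the score says.
        if field.confidence >= threshold and field.value.strip():
            approved.append(field)
        else:
            review.append(field)
    return approved, review


def export_payload(
    approved: list[ExtractedField], corrections: dict[str, str]
) -> dict[str, str]:
    """Merge auto-approved values with reviewer corrections into one record."""
    record = {field.name: field.value for field in approved}
    record.update(corrections)
    return record


# Hypothetical extraction result; the total contains a letter O misread for a zero.
fields = [
    ExtractedField("invoice_number", "INV-0042", 0.99),
    ExtractedField("vendor", "ACME Corp", 0.95),
    ExtractedField("total", "1,25O.00", 0.72),
]

approved, review = route_fields(fields)
print([f.name for f in review])

# A reviewer supplies the corrected value before the record is exported.
record = export_payload(approved, {"total": "1,250.00"})
print(record)
```

The resulting record is what would then be handed to the ERP or repository. The threshold trades reviewer workload against the risk of exporting a wrong value, so it is usually set per field rather than globally.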

Document Understanding 

The output of an intelligent document understanding solution includes not only the relevant data, isolated and classified, but also useful, actionable metadata about the analyzed documents. A good document understanding tool also provides a view of the document’s larger context and relevance.
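To make the categorization step (step 2 above) and the idea of document-level metadata more concrete, here is a deliberately simple keyword-scoring classifier. Real tools rely on trained models rather than keyword lists. The class names, keywords, and output keys below are hypothetical and serve only to show predefined classes, a user-defined class, and a result that carries metadata alongside the label.

```python
import re
from collections import Counter

# Hypothetical predefined classes, each described by indicative keywords.
CLASS_KEYWORDS = {
    "invoice": {"invoice", "due", "total", "tax"},
    "purchase_order": {"purchase", "order", "ship", "quantity"},
    "receipt": {"receipt", "paid", "change", "cashier"},
}


def categorize(text: str, classes: dict[str, set[str]] = CLASS_KEYWORDS) -> dict:
    """Score each class by keyword hits and return the label plus metadata."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        label: sum(tokens[keyword] for keyword in keywords)
        for label, keywords in classes.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        best = "unknown"  # nothing matched; leave the document for a human
    return {
        "document_class": best,
        "scores": scores,
        "token_count": sum(tokens.values()),
    }


invoice_result = categorize("INVOICE No. 42. Total due: 1,250.00 including tax.")
print(invoice_result)

# A custom class can be added next to the predefined ones.
custom_classes = {**CLASS_KEYWORDS, "claim_form": {"claim", "policy", "insured"}}
claim_result = categorize(
    "Claim form for policy 7781, insured party: J. Doe", custom_classes
)
print(claim_result["document_class"])
```

Returning the per-class scores with the label gives downstream steps something to act on, for example sending documents with low or tied scores to manual triage.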

AI-driven document understanding

The AI and ML tools for document understanding use statistical methods, neural networks, decision trees, and rule learning techniques. A full end-to-end AI system for document understanding typically employs the following tools:

  • Computer-vision-based document layout analysis tool: This partitions the document page into regions with distinct content to differentiate between relevant and irrelevant regions and categorize the type of content recognized. The zonal OCR tool could also be used to locate and transcribe the selected text.
  • Information extraction tool: This uses the OCR output or document layout to recognize the information embodied in the extracted data. The AI and ML algorithms look for specific types of information, such as reference numbers, names, addresses, and costs.

Information extraction - AI-driven document understanding
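The information extraction stage can be approximated, for illustration only, with a rule-based stand-in that runs over OCR text. AI-based tools learn such patterns from annotated data instead of hard-coding them. The patterns and field names below are assumptions made for this example and would need adapting to real documents.

```python
import re

# Hypothetical patterns for three kinds of information.
PATTERNS = {
    "reference_number": re.compile(r"\b(?:INV|PO|REF)-\d{3,}\b"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "amount": re.compile(r"(?:USD|\$)\s?(\d{1,3}(?:,\d{3})*\.\d{2})"),
}


def extract_fields(ocr_text: str) -> dict[str, str]:
    """Return the first match for each pattern found in the OCR text."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            # Use the capturing group if the pattern has one, else the whole match.
            found[name] = match.group(match.lastindex or 0)
    return found


# Hypothetical OCR output for a small invoice.
ocr_text = "ACME Corp\nInvoice INV-0042 dated 2021-02-03\nTotal due: $1,250.00"

info = extract_fields(ocr_text)
print(info)
```

Fixed patterns like these work only while documents follow the expected conventions. Coping with the layout and wording variations described earlier is what motivates learned extraction models.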

AI-based OCR with Nanonets

Nanonets is OCR software that uses AI and ML to automatically extract structured and unstructured data from PDF documents, images, and scanned files. Unlike traditional OCR solutions, Nanonets does not require separate rules and templates for each new document type.

Relying on AI-driven cognitive intelligence, Nanonets can handle semi-structured and even unseen document types while improving over time. The Nanonets algorithms and OCR models learn continuously. They can be trained or retrained multiple times and are customizable. You can also customize the output to extract only the specific tables or data entries of interest.

While offering a great API & documentation for developers, the software is also ideal for organizations with no in-house team of developers.

The benefits of using Nanonets over other automated OCR software go far beyond cost savings, accuracy, and scale. Nanonets additionally provides unique benefits that place it far ahead of the competition:

  • A truly no-code tool
  • No post-processing is needed
  • Works with custom data
  • Easily handles data constraints
  • Works with multiple languages
  • Continuous learning
  • Infinite customization

Take Away

Document understanding is not a trivial activity, although it appears to come naturally to human beings. The “naturalness” of the understanding process is deceptive; it takes years of schooling and training even for humans to understand documents. The difficulty in document understanding arises from the need to combine the different dimensions of data in order to filter out irrelevant semantics. Thus, developing a tool for document understanding necessitates interweaving complementary techniques to extract information from the language, logical structure, and context. The development and use of AI and other technologically advanced tools for document understanding are necessary because they can save a business time and money in document management, in addition to integrating all activities of the company under a single digital platform.