
As organizations move from physical to digital documents, extracting data from scanned documents with optical character recognition (OCR) and machine learning has become essential. To make that extraction accurate, research labs and corporations have pushed forward both computer vision and natural language processing (NLP).

Deep learning now makes it possible to extract far more than plain text from scans: tables, key-value pairs, and more. Many OCR products build on these advances to meet the document data extraction needs of both individuals and businesses.

This article explores the current technology for extracting data from scanned documents. We'll walk through a Python tutorial for extracting data from scans, then look at popular solutions on the market that extract data from scanned documents using OCR and machine learning.

Turn scanned PDFs into editable documents with ease

What is Data Extraction?

Data extraction is the process of converting unstructured data into structured information that programs can interpret, so that it can be processed further by software or by people.

Here, we list several of the most common types of data to be extracted from scanned documents.

Text Data

The most common and most important task in data extraction from scanned documents is extracting text. This seems straightforward but is in fact difficult, because scanned documents are usually stored as images. In addition, the right extraction method depends heavily on the type of text.

Most text appears in dense, printed form, but it is equally important to handle sparse text in poorly scanned documents and handwriting in widely varying styles. Extraction converts images into machine-encoded text, which can then be organized from unstructured data (text without a fixed format) into structured data for further analysis.
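As a rough illustration of the last step, going from machine-encoded text to structured data, the sketch below pulls a few fields out of a block of OCR output with regular expressions. The field names, the sample text, and the patterns are assumptions made for this example; real documents usually need more tolerant patterns.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class InvoiceRecord:
    """A hypothetical structured record built from raw OCR text."""
    invoice_number: Optional[str]
    date: Optional[str]
    total: Optional[float]


def parse_invoice_text(text: str) -> InvoiceRecord:
    """Extract a few fields from unstructured text using regular expressions."""
    number = re.search(r"Invoice\s*(?:No\.?|#)\s*:?\s*(\S+)", text, re.IGNORECASE)
    date = re.search(r"\b(\d{4}-\d{2}-\d{2})\b", text)
    total = re.search(r"Total\s*:?\s*\$?\s*([\d,]+\.\d{2})", text, re.IGNORECASE)
    return InvoiceRecord(
        invoice_number=number.group(1) if number else None,
        date=date.group(1) if date else None,
        # Drop thousands separators before converting to a number.
        total=float(total.group(1).replace(",", "")) if total else None,
    )


# Sample text standing in for the output of an OCR engine.
sample = "ACME Corp\nInvoice No: INV-0042\nDate 2021-03-15\nTotal: $1,234.50\n"
record = parse_invoice_text(sample)
print(record)
```

Fields that cannot be found are left as None rather than guessed, which makes it easy to route incomplete records to manual review.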

💡
Want to understand the deep learning algorithms that power such processes? Head over to our LayoutLM Explained blog

Tables

Tables are one of the most popular ways to store data, because the format is easy for people to read. Extracting tables from scanned documents requires more than character detection: one must also detect lines and other visual features to recover the table structure, and then convert that information into structured data for further computation.

Computer vision methods (described in detail in the following sections) are heavily used to achieve high-accuracy table extraction.
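To make the idea of recovering structure from visual positions concrete, here is a minimal sketch that rebuilds table rows from word positions alone. It is a simple heuristic that stands in for real line detection: it assumes an OCR step has already produced each word's text and top-left pixel coordinates, and the `row_tolerance` parameter is a made-up value for this example.

```python
from typing import List, Tuple

# (text, left, top) for each recognized word; coordinates in pixels.
Word = Tuple[str, int, int]


def words_to_rows(words: List[Word], row_tolerance: int = 10) -> List[List[str]]:
    """Group words into rows by vertical position, then order each row left to right."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w[2]):
        # A word joins the current row if its top edge is close to the row's first word.
        if rows and abs(word[2] - rows[-1][0][2]) <= row_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w[0] for w in sorted(row, key=lambda w: w[1])] for row in rows]


# Synthetic word boxes for a small two-column table, deliberately out of order.
words = [
    ("Qty", 200, 12), ("Item", 20, 10),
    ("2", 205, 41), ("Pen", 22, 40),
    ("Ink", 21, 72), ("1", 204, 70),
]
table = words_to_rows(words)
for row in table:
    print(row)
```

This heuristic breaks down on skewed scans, merged cells, and multi-line cells, which is where detecting ruling lines and cell boundaries becomes necessary.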

Extracting tables from scanned documents

Key-Value Pairs

Key-value pairs (KVPs) are a common alternative format used for data storage in documents.

Extract Key Value Pairs from Scanned Documents

KVPs are essentially two data items, a key and a value, linked together as one. The key is a unique identifier used to retrieve the value. A classic KVP example is the dictionary, where the words are the keys and the corresponding definitions are the values. These pairs often go unnoticed but appear very frequently in documents: survey fields such as name and age, and the prices of items in invoices, are all implicitly KVPs.

However, unlike tables, KVPs often appear in unknown formats and are sometimes partially handwritten. For example, keys may be pre-printed in boxes while values are handwritten when the form is completed. Finding the underlying structure needed to extract KVPs automatically therefore remains an ongoing research problem, even for the most advanced facilities and labs.
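For the easy case, where the keys are known in advance and each value sits to the right of its key on the same line, pairing can be sketched with plain geometry. Everything here is an assumption for illustration: the `Box` type, the known key list, and the `line_tolerance` value. It does not address the unknown-format case described above.

```python
from typing import Dict, List, NamedTuple, Optional


class Box(NamedTuple):
    """A recognized text fragment with its top-left pixel position."""
    text: str
    left: int
    top: int


def pair_keys_with_values(
    boxes: List[Box], keys: List[str], line_tolerance: int = 8
) -> Dict[str, Optional[str]]:
    """For each known key, pick the closest box to its right on the same text line."""
    pairs: Dict[str, Optional[str]] = {}
    for key in keys:
        key_box = next(
            (b for b in boxes if b.text.rstrip(":").lower() == key.lower()), None
        )
        if key_box is None:
            pairs[key] = None
            continue
        candidates = [
            b for b in boxes
            if b is not key_box
            and abs(b.top - key_box.top) <= line_tolerance
            and b.left > key_box.left
        ]
        nearest = min(candidates, key=lambda b: b.left - key_box.left, default=None)
        pairs[key] = nearest.text if nearest is not None else None
    return pairs


# Synthetic boxes for a small form; the signature field was left blank.
boxes = [
    Box("Name:", 10, 100), Box("Alice", 80, 102),
    Box("Age:", 10, 130), Box("34", 80, 131),
    Box("Signature:", 10, 160),
]
pairs = pair_keys_with_values(boxes, ["Name", "Age", "Signature"])
print(pairs)
```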

Figures

Finally, it is also important to capture data from figures within a scanned document. Charts such as pie charts and bar charts often carry crucial information. A good extraction process should be able to use legends and numeric labels to recover at least part of the data in such figures, and to read machine-readable graphics such as barcodes or QR codes, for further use.


Looking to automate data extraction from scanned documents? Give Nanonets™ a spin for higher accuracy, greater flexibility, post-processing, and a broad set of integrations!


Technologies Behind Scanned Document Data Extraction

Data extraction revolves around two main processes: Optical Character Recognition (OCR) followed by Natural Language Processing (NLP).

OCR converts images of text into machine-encoded text, while NLP analyzes the resulting words to infer meaning. OCR is often accompanied by other computer vision techniques, such as box and line detection, to extract the data types described above, such as tables and KVPs, for more comprehensive extraction.

The core improvements behind the data-extraction pipeline are tightly connected to the advances in deep learning that contributed greatly to the fields of computer vision and natural language processing (NLP).

What is deep learning?

Deep learning is a major driver of the current interest in artificial intelligence and sits at the forefront of numerous applications. In traditional engineering, the goal is to design a function that generates an output from a given input. Deep learning instead starts from example inputs and outputs and uses a neural network to learn the relationship between them, so that it can be extended to new, unseen data.

A neural network, of which the multi-layer perceptron (MLP) is the classic example, is a machine-learning architecture loosely inspired by how human brains learn. The network contains neurons, which mimic biological neurons and “activate” in response to different inputs. Sets of neurons form layers, and multiple layers are stacked to form a network that can make many kinds of predictions (e.g., image classes or bounding boxes for object detection).
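The layered structure described above can be sketched in a few lines of plain Python. This is a toy forward pass with hand-picked weights chosen only for illustration. A real network learns its weights from data and would use a numerical library.

```python
import math
from typing import List

Matrix = List[List[float]]


def relu(values: List[float]) -> List[float]:
    """Activation: a neuron 'fires' only when its input is positive."""
    return [max(0.0, v) for v in values]


def dense(inputs: List[float], weights: Matrix, biases: List[float]) -> List[float]:
    """One layer: each output neuron is a weighted sum of the inputs plus a bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(values: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hand-picked weights for a 2-input, 2-hidden-neuron, 2-class network.
W1, B1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]
W2, B2 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]


def mlp_forward(inputs: List[float]) -> List[float]:
    """Stack two layers: hidden layer with ReLU, output layer with softmax."""
    hidden = relu(dense(inputs, W1, B1))
    return softmax(dense(hidden, W2, B2))


probabilities = mlp_forward([2.0, 1.0])
print(probabilities)
```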


In computer vision, one variant of neural network is heavily used: the convolutional neural network (CNN). Instead of relying only on fully connected layers, a CNN uses convolutional kernels that slide across tensors (multi-dimensional arrays) to extract features. Combined with fully connected layers at the end, CNNs are very successful in image-related tasks and form the basis for OCR and other feature detection.
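The sliding-kernel operation is easy to write out by hand. The sketch below implements a single-channel "valid" convolution (strictly speaking a cross-correlation, which is what CNN layers compute) in plain Python. The image and the edge-detecting kernel are made up for this example; in a trained CNN the kernel values are learned.

```python
from typing import List

Grid = List[List[float]]


def convolve2d(image: Grid, kernel: Grid) -> Grid:
    """Slide the kernel over the image with no padding and stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output: Grid = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the kernel with the patch under it and sum the products.
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        output.append(row)
    return output


# A 4x4 image that is dark on the left and bright on the right.
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
# A kernel that responds to dark-to-bright transitions from left to right.
vertical_edge = [[-1.0, 1.0], [-1.0, 1.0]]

feature_map = convolve2d(image, vertical_edge)
for row in feature_map:
    print(row)
```

The feature map is non-zero only where the kernel straddles the edge, which is the sense in which a kernel "extracts a feature".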

NLP, on the other hand, relies on another family of networks designed for sequential data. Unlike image classification, where each image is treated independently, text prediction benefits greatly from taking the preceding and following words into account. For years, recurrent networks such as long short-term memory networks (LSTMs), which feed previous results back in as inputs when predicting the current result, were a standard choice. Bidirectional LSTMs were often adopted to improve predictions by considering both earlier and later context. More recently, transformers, which use an attention mechanism, have risen to prominence because of their greater flexibility, and they often produce better results than recurrent networks on sequential data.
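The attention mechanism at the heart of transformers can be sketched compactly. The code below is a plain-Python version of scaled dot-product attention for small lists of vectors. It omits the learned projections, multiple heads, and masking that a real transformer layer has, and the toy vectors are invented for illustration.

```python
import math
from typing import List

Vector = List[float]


def softmax(values: Vector) -> Vector:
    """Turn raw scores into weights that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(
    queries: List[Vector], keys: List[Vector], values: List[Vector]
) -> List[Vector]:
    """For each query, return a weighted average of the values.

    The weights come from how well the query matches each key.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


# Two positions with identical keys: the query attends to both equally.
averaged = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[2.0, 0.0], [4.0, 2.0]],
)
print(averaged)
```

Because every position can attend to every other position directly, context from both earlier and later words is available in one step, rather than being passed along a recurrent chain.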

Applications of Scanned Documents Data Extraction

The main goal of data extraction is to convert data from unstructured documents into structured formats. Highly accurate retrieval of text, figures, and data structures enables numerical and contextual analysis, which is especially useful for businesses:

Business

Businesses and large organizations handle thousands of similarly formatted documents every day: big banks receive numerous near-identical applications, and research teams work through piles of forms to conduct statistical analysis. Automating the initial step of extracting data from scanned documents therefore greatly reduces repetitive manual work and lets staff focus on analyzing data and reviewing applications instead of keying in information.

  • Verifying Applications – Companies receive large numbers of applications, whether handwritten or submitted through online application forms. These applications are often accompanied by personal IDs for verification. Scanned IDs such as passports or ID cards usually arrive in batches with similar formats. A well-written data extractor can therefore quickly convert the data (text, tables, figures, KVPs) into machine-readable text, substantially reducing the hours spent on these tasks and letting staff focus on selecting applications instead of extracting data.
  • Payment Reconciliation – Payment reconciliation is the process of comparing bank statements against internal records to make sure the numbers match between accounts. It relies heavily on data extracted from scanned documents, which is challenging for a sizeable company with many income streams. Data extraction eases this process and lets employees focus on mismatched entries and investigate potentially fraudulent activity in the cash flow.
  • Statistical Analysis – Corporations and organizations use feedback from customers or experiment participants to improve their products and services, and a comprehensive evaluation of that feedback usually requires statistical analysis. However, survey data may come in numerous formats or be buried in free text. Data extraction can ease the process by pulling readily identifiable data out of documents in batches, making useful information easier to find and ultimately increasing efficiency.
  • Sharing Past Records – From healthcare to banking, large industries often need information about new customers that is held elsewhere. For example, a patient who switches hospitals after moving may have existing medical records that would help the new hospital. In such cases, good data extraction software comes in handy: the patient only needs to bring a scanned history of their records, and the new hospital can fill in the information automatically. This is convenient, and it also reduces the risk, especially serious in healthcare, of important patient records being overlooked.
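As one concrete illustration of the payment reconciliation case above, the sketch below compares entries extracted from a bank statement with entries in an internal ledger and reports whatever appears on only one side. The `(reference, amount)` entry format and the sample data are assumptions for this example. A production system would also match on dates and tolerate OCR noise in the references.

```python
from collections import Counter
from typing import List, Tuple

Entry = Tuple[str, float]  # (payment reference, amount)


def reconcile(
    statement: List[Entry], ledger: List[Entry]
) -> Tuple[List[Entry], List[Entry]]:
    """Return (entries only in the statement, entries only in the ledger)."""
    # Counters handle legitimate duplicates, e.g. two identical payments.
    statement_counts = Counter((ref, round(amount, 2)) for ref, amount in statement)
    ledger_counts = Counter((ref, round(amount, 2)) for ref, amount in ledger)
    only_in_statement = list((statement_counts - ledger_counts).elements())
    only_in_ledger = list((ledger_counts - statement_counts).elements())
    return only_in_statement, only_in_ledger


# Synthetic data: the second ledger amount has two digits transposed.
statement = [("INV-1", 100.0), ("INV-2", 250.5), ("INV-3", 75.0)]
ledger = [("INV-1", 100.0), ("INV-2", 205.5), ("INV-3", 75.0)]

unmatched_statement, unmatched_ledger = reconcile(statement, ledger)
print("Only in statement:", unmatched_statement)
print("Only in ledger:", unmatched_ledger)
```

Only the mismatched entries reach a human reviewer, which is the time saving the bullet describes.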



Tutorials

To give a clearer view of how to perform data extraction, we show two methods for extracting data from scanned documents.

Building from Scratch

One can build a simple data extraction script with pytesseract, a Python wrapper for the Tesseract OCR engine, as follows:

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract

# If you don't have tesseract executable in your PATH, include the following:
pytesseract.pytesseract.tesseract_cmd = r'<full_path_to_your_tesseract_executable>'
# Example tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract'

# Simple image to string
print(pytesseract.image_to_string(Image.open('test.png')))

# List of available languages
print(pytesseract.get_languages(config=''))

# French text image to string
print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))

# In order to bypass the image conversions of pytesseract, just use relative or absolute image path
# NOTE: In this case you should provide tesseract supported images or tesseract will return error
print(pytesseract.image_to_string('test.png'))

# Batch processing with a single file containing the list of multiple image file paths
print(pytesseract.image_to_string('images.txt'))

# Timeout/terminate the tesseract job after a period of time
try:
    print(pytesseract.image_to_string('test.jpg', timeout=2)) # Timeout after 2 seconds
    print(pytesseract.image_to_string('test.jpg', timeout=0.5)) # Timeout after half a second
except RuntimeError as timeout_error:
    # Tesseract processing is terminated
    pass

# Get bounding box estimates
print(pytesseract.image_to_boxes(Image.open('test.png')))

# Get verbose data including boxes, confidences, line and page numbers
print(pytesseract.image_to_data(Image.open('test.png')))

# Get information about orientation and script detection
print(pytesseract.image_to_osd(Image.open('test.png')))

# Get a searchable PDF
pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
with open('test.pdf', 'w+b') as f:
    f.write(pdf) # pdf type is bytes by default

# Get HOCR output
hocr = pytesseract.image_to_pdf_or_hocr('test.png', extension='hocr')

# Get ALTO XML output
xml = pytesseract.image_to_alto_xml('test.png')

For more information regarding the code, you may check out the official pytesseract documentation.

In simple terms, the code extracts data such as text and bounding boxes from a given image. While fairly useful, the engine is nowhere near as strong as those offered by advanced solutions, which benefit from substantial computational power for training.
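One practical way to work with the engine's limits is to keep only the words it is reasonably sure about. The sketch below filters the dictionary returned by `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)` by confidence. The helper names and the threshold of 60 are assumptions for this example, not recommended values.

```python
from typing import Dict, List


def confident_words(data: Dict[str, list], min_conf: float = 60.0) -> List[dict]:
    """Keep non-empty words whose confidence is at least min_conf."""
    words = []
    for text, conf, left, top, width, height in zip(
        data["text"], data["conf"], data["left"],
        data["top"], data["width"], data["height"],
    ):
        # Rows for pages, blocks, and lines have empty text and a confidence of -1.
        # Confidence may arrive as a string or a number depending on the version.
        if text.strip() and float(conf) >= min_conf:
            words.append({
                "text": text,
                "conf": float(conf),
                "box": (left, top, width, height),
            })
    return words


def ocr_confident_words(image_path: str, min_conf: float = 60.0) -> List[dict]:
    """Run Tesseract on an image file and return only the confident words."""
    # Imported here so the filter above can be used without Tesseract installed.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    return confident_words(data, min_conf)
```

Calling `ocr_confident_words('test.png')` would return a list of dictionaries, each with the word's text, confidence, and bounding box. Words below the threshold can be sent for manual review instead of being silently trusted.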

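Using Google Document API

The following function runs OCR on a PDF or TIFF file stored in Google Cloud Storage (GCS), using the google-cloud-vision client library:

def async_detect_document(gcs_source_uri, gcs_destination_uri):
    """OCR with PDF/TIFF as source files on GCS"""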
    import json
    import re
    from google.cloud import vision
    from google.cloud import storage
    # Supported mime_types are: 'application/pdf' and 'image/tiff'
    mime_type = 'application/pdf'

    # How many pages should be grouped into each json output file.
    batch_size = 2

    client = vision.ImageAnnotatorClient()

    feature = vision.Feature(
        type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)

    gcs_source = vision.GcsSource(uri=gcs_source_uri)
    input_config = vision.InputConfig(
        gcs_source=gcs_source, mime_type=mime_type)

    gcs_destination = vision.GcsDestination(uri=gcs_destination_uri)
    output_config = vision.OutputConfig(
        gcs_destination=gcs_destination, batch_size=batch_size)

    async_request = vision.AsyncAnnotateFileRequest(
        features=[feature], input_config=input_config,
        output_config=output_config)

    operation = client.async_batch_annotate_files(
        requests=[async_request])

    print('Waiting for the operation to finish.')
    operation.result(timeout=420)

    # Once the request has completed and the output has been
    # written to GCS, we can list all the output files.
    storage_client = storage.Client()

    match = re.match(r'gs://([^/]+)/(.+)', gcs_destination_uri)
    bucket_name = match.group(1)
    prefix = match.group(2)

    bucket = storage_client.get_bucket(bucket_name)

    # List objects with the given prefix.
    blob_list = list(bucket.list_blobs(prefix=prefix))
    print('Output files:')
    for blob in blob_list:
        print(blob.name)

    # Process the first output file from GCS.
    # Since we specified batch_size=2, the first response contains
    # the first two pages of the input file.
    output = blob_list[0]

    # download_as_bytes() replaces the deprecated download_as_string()
    json_string = output.download_as_bytes()
    response = json.loads(json_string)

    # The actual response for the first page of the input file.
    first_page_response = response['responses'][0]
    annotation = first_page_response['fullTextAnnotation']

    # Here we print the full text from the first page.
    # The response contains more information:
    # annotation/pages/blocks/paragraphs/words/symbols
    # including confidence scores and bounding boxes
    print('Full text:\n')
    print(annotation['text'])

Ultimately, Google's Cloud Vision API, used in the code above, lets you extract a lot of information from documents with high accuracy. Google also offers the service for specific uses, including text extraction from both ordinary documents and in-the-wild images.
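The code above prints only the full text, but its comments note that the response also nests pages, blocks, paragraphs, words, and symbols, with confidence scores. The sketch below walks that hierarchy on the parsed JSON to list each word with its confidence. The exact key names (`pages`, `blocks`, `paragraphs`, `words`, `symbols`, `text`, `confidence`) are assumed from that comment and should be checked against an actual output file. The 0.8 threshold is arbitrary.

```python
from typing import Iterator, List, Tuple


def iter_words(annotation: dict) -> Iterator[Tuple[str, float]]:
    """Yield (word_text, confidence) for every word in the annotation."""
    for page in annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    # A word is stored as a list of symbols (characters).
                    text = "".join(
                        symbol.get("text", "") for symbol in word.get("symbols", [])
                    )
                    yield text, word.get("confidence", 0.0)


def low_confidence_words(annotation: dict, threshold: float = 0.8) -> List[str]:
    """Return the words whose confidence falls below the threshold."""
    return [text for text, conf in iter_words(annotation) if conf < threshold]


# A tiny synthetic annotation with the assumed structure.
sample_annotation = {
    "text": "Total 12.50",
    "pages": [{
        "blocks": [{
            "paragraphs": [{
                "words": [
                    {"confidence": 0.99,
                     "symbols": [{"text": c} for c in "Total"]},
                    {"confidence": 0.62,
                     "symbols": [{"text": c} for c in "12.50"]},
                ]
            }]
        }]
    }],
}

print(list(iter_words(sample_annotation)))
print(low_confidence_words(sample_annotation))
```

With the function above, `annotation` from `first_page_response['fullTextAnnotation']` could be passed straight to `iter_words`.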

Please refer to Google's Cloud Vision documentation for more.

Current Solutions Offering OCR Data Extraction

Besides the large corporations that offer APIs for document data extraction, several other solutions provide highly accurate PDF OCR services. We present several PDF OCR options that specialize in different areas, as well as recent research that shows promising results*:

*Side Note: Many OCR services target tasks such as reading text in images in the wild. We skip those services here because this article focuses on reading PDF documents only.

  • Google API -- As one of the biggest online service providers, Google offers strong results in document extraction with its computer vision technology. The service is free at low usage, but costs add up as the number of API calls increases.
  • Deep Reader -- Deep Reader is a research work published in ACCV Conference 2019. It incorporates multiple state-of-the-art network architectures to perform tasks such as document matching, text retrieval, and image denoising. Additional features such as table and key-value pair extraction allow data to be retrieved and saved in an organized manner.
  • Nanonets™ -- Built by a highly skilled deep learning team, Nanonets™ PDF OCR is completely template and rule independent. As a result, Nanonets™ is not limited to specific types of PDFs and can be applied to any document type for text retrieval.
Nanonets - Extract Data from Scanned Documents



Conclusion

In conclusion, this article explains data extraction from scanned documents, including the challenges behind it and the technology it requires.

We presented two tutorials that use different methods, along with current solutions that offer data extraction out of the box.