Introduction

The widespread use of the internet and the rise of crowdsourcing techniques have empowered Google's AI across a variety of fields. From the speech-to-text and language understanding behind a virtual assistant to the deep-network image classification that powers search, Google's deep learning techniques have shaped our daily routines behind the scenes.


Looking for an OCR solution that overcomes the shortcomings of Google Document AI? Give Nanonets a spin for higher accuracy, greater flexibility, and wider document types!


What one may not be aware of, however, is that Google offers a variety of hardware and software products as services that customers can purchase for various needs. For instance, the Google Vision API allows users to apply state-of-the-art vision methods to tasks such as object detection, segmentation, and even optical character recognition (OCR). Hardware such as GPUs and virtual machines is also available for rental, so companies can train their customized models and set up fast, computationally efficient servers.

One major service that may have been overlooked amid the kaleidoscopic range of services Google offers is Document AI. Similar to the Google Vision API, Google Document AI uses cutting-edge methods to extract information from piles of paperwork. This article dives into what Google Document AI is and the technology behind it, followed by short explanations of its capabilities and applications, and some competitors that may be useful in different scenarios.

What is Google Document AI?

Google Document AI automates data processing of documents at scale. It is built on Google's decades of AI research, and therefore provides detailed information about a document beyond its words.

Besides providing a generic document analysis and retrieval, Google Document AI also supports specific formats such as receipts, invoices, payslips, and specific forms that are often processed in large batches by organizations.

Getting Started with Document AI

You can head to the Google Document AI page and test one of their sample documents, or one of your own, to see the quality of extraction. The output is returned as JSON, which can be downloaded and analyzed.

Finally, besides the automatic approach to document retrieval, Google Document AI now incorporates a human-in-the-loop concept that allows users to flag mistakes in the extracted results. These corrections are fed back into the learning process, constantly improving the AI's extraction ability.




Types of Data Supported by Google Document AI

Text

The main goal of Google Document AI is to extract the text within a document, easing the process of scanning through forms that would otherwise require significant human effort. Beyond the raw text, Google Document AI also determines where line breaks and sentence breaks occur, which allows users to further personalize and process the JSON output after retrieving the important information. For example, depending on the industry or purpose, one may perform further data analysis or generate responses to forms after extracting the necessary information from a PDF document. The text can be typed or handwritten; Google Document AI handles both, providing extra flexibility.
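As a hedged sketch of the kind of post-processing this JSON output enables, the helper below reassembles per-line text from a Document AI-style response, where each layout points into the shared `text` field via `textAnchor` segments. The field names follow the documented response schema but should be checked against the API version you use; the sample response is fabricated for illustration.

```python
def anchor_text(document, layout):
    """Reassemble the text a layout's textAnchor points at."""
    out = []
    for seg in layout.get("textAnchor", {}).get("textSegments", []):
        start = int(seg.get("startIndex", 0))  # startIndex may be omitted when 0
        end = int(seg["endIndex"])
        out.append(document["text"][start:end])
    return "".join(out)

def page_lines(document):
    """Yield each detected line of text, page by page."""
    for page in document.get("pages", []):
        for line in page.get("lines", []):
            yield anchor_text(document, line["layout"])

# Hypothetical two-line response for illustration.
doc = {
    "text": "Invoice #42\nTotal: $19.99\n",
    "pages": [{"lines": [
        {"layout": {"textAnchor": {"textSegments": [{"startIndex": 0, "endIndex": 12}]}}},
        {"layout": {"textAnchor": {"textSegments": [{"startIndex": 12, "endIndex": 26}]}}},
    ]}],
}
print(list(page_lines(doc)))  # ['Invoice #42\n', 'Total: $19.99\n']
```

Because the detected line and sentence breaks are preserved in the anchors, downstream code can keep or discard the document's original layout as needed.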

Document overview information

Google Document AI also provides information about the given document itself, including the orientation of the pages, anchors, and the languages detected in the document (there may be more than one). This information can be useful when tidying or organizing batches of documents. For instance, documents can be automatically rotated to portrait before further OCR scanning, or separated into piles by the language of a particular response.

Key-Value Pairs and Table Extractions

Besides text and overview information, one of the most important things to extract from a document is its data. Manual data extraction is a repetitive process that can be daunting and error-prone, not to mention the added difficulty when documents are scanned as images rather than text.

In most cases, data is stored not in paragraphs or sentences but in tabular form and key-value pairs (KVPs): two linked data items, a key and a value, where the key serves as a unique identifier for the value (e.g., Name: John or Age: 19). In documents such as forms, these data types appear more often than not, and text extraction alone will simply not be enough. In addition, unlike tables, KVPs often appear in unpredictable formats and are frequently partially handwritten. Even with state-of-the-art text extraction, it can be difficult to identify KVPs from text alone, without taking into account visual features on the page (e.g., bounding boxes and lines).
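To make the KVP idea concrete, here is a hedged sketch of pulling key-value pairs out of a Document AI form-parser style JSON response. The field names used here ("formFields", "fieldName", "fieldValue", "textAnchor") follow the documented schema but may differ across API versions, and the sample response is fabricated.

```python
def anchor_text(document, layout):
    """Reassemble the text a layout's textAnchor points at."""
    segments = layout.get("textAnchor", {}).get("textSegments", [])
    return "".join(
        document["text"][int(s.get("startIndex", 0)):int(s["endIndex"])]
        for s in segments
    )

def form_kvps(document):
    """Collect every page's form fields into a {key: value} dict."""
    kvps = {}
    for page in document.get("pages", []):
        for field in page.get("formFields", []):
            key = anchor_text(document, field["fieldName"]).strip(": \n")
            value = anchor_text(document, field["fieldValue"]).strip()
            kvps[key] = value
    return kvps

# Hypothetical response for a form containing "Name: John" and "Age: 19".
doc = {
    "text": "Name: John\nAge: 19\n",
    "pages": [{"formFields": [
        {"fieldName": {"textAnchor": {"textSegments": [{"startIndex": 0, "endIndex": 5}]}},
         "fieldValue": {"textAnchor": {"textSegments": [{"startIndex": 6, "endIndex": 10}]}}},
        {"fieldName": {"textAnchor": {"textSegments": [{"startIndex": 11, "endIndex": 15}]}},
         "fieldValue": {"textAnchor": {"textSegments": [{"startIndex": 16, "endIndex": 18}]}}},
    ]}],
}
print(form_kvps(doc))  # {'Name': 'John', 'Age': '19'}
```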

Luckily, Google Document AI also performs extractions from the aforementioned two data types, allowing users to not only retrieve text in a clean and organized manner, but also automatically obtain data from underlying data structures for further use.

Example Uses of Google Document AI

Google Document AI

Prerequisites

To use any of the services provided by the Google Vision API, one must configure the Google Cloud Console and perform a series of authentication steps. The following is a step-by-step overview of setting up the Vision API service.

  1. Create a Project in Google Cloud Console: A project needs to be created in order to begin using any Vision service. The project organizes resources such as collaborators, APIs, and pricing information.
  2. Enable Billing: To enable the Vision API, you must first enable billing for your project. Pricing details are addressed in later sections.
  3. Enable Vision API
  4. Create Service Account: Create a service account, link it to the project you created, then create a service account key. The key will be downloaded to your computer as a JSON file.
  5. Set Up Environment Variable GOOGLE_APPLICATION_CREDENTIALS

A more detailed walkthrough of the steps above can be found in the official Google Cloud documentation:

https://cloud.google.com/vision/docs/quickstart-client-libraries
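The environment variable from step 5 can also be set from Python rather than the shell, as long as it happens before any client is constructed. A minimal sketch; the key path below is a placeholder, not a real file:

```python
import os

# Point the Google client libraries at the downloaded service-account key.
# The path is a placeholder; substitute the JSON key file from step 4.
key_path = "/path/to/service-account-key.json"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = key_path

# Client libraries read this variable when a client is instantiated,
# so it must be set before constructing any google.cloud client.
print(os.environ["GOOGLE_APPLICATION_CREDENTIALS"])
```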

Code

Document AI can be separated into two aspects: text extraction and text understanding. For text extraction, one can refer to the following code for detecting text in PDFs:

def async_detect_document(gcs_source_uri, gcs_destination_uri):
    """OCR with PDF/TIFF as source files on GCS"""
    import json
    import re
    from google.cloud import vision
    from google.cloud import storage

    # Supported mime_types are: 'application/pdf' and 'image/tiff'
    mime_type = 'application/pdf'

    # How many pages should be grouped into each json output file.
    batch_size = 2

    client = vision.ImageAnnotatorClient()

    feature = vision.Feature(
        type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)

    gcs_source = vision.GcsSource(uri=gcs_source_uri)
    input_config = vision.InputConfig(
        gcs_source=gcs_source, mime_type=mime_type)

    gcs_destination = vision.GcsDestination(uri=gcs_destination_uri)
    output_config = vision.OutputConfig(
        gcs_destination=gcs_destination, batch_size=batch_size)

    async_request = vision.AsyncAnnotateFileRequest(
        features=[feature], input_config=input_config,
        output_config=output_config)

    operation = client.async_batch_annotate_files(
        requests=[async_request])

    print('Waiting for the operation to finish.')
    operation.result(timeout=420)

    # Once the request has completed and the output has been
    # written to GCS, we can list all the output files.
    storage_client = storage.Client()

    match = re.match(r'gs://([^/]+)/(.+)', gcs_destination_uri)
    bucket_name = match.group(1)
    prefix = match.group(2)

    bucket = storage_client.get_bucket(bucket_name)

    # List objects with the given prefix.
    blob_list = list(bucket.list_blobs(prefix=prefix))
    print('Output files:')
    for blob in blob_list:
        print(blob.name)

    # Process the first output file from GCS.
    # Since we specified batch_size=2, the first response contains
    # the first two pages of the input file.
    output = blob_list[0]

    json_string = output.download_as_string()
    response = json.loads(json_string)

    # The actual response for the first page of the input file.
    first_page_response = response['responses'][0]
    annotation = first_page_response['fullTextAnnotation']

    # Here we print the full text from the first page.
    # The response contains more information:
    # annotation/pages/blocks/paragraphs/words/symbols
    # including confidence scores and bounding boxes
    print('Full text:\n')
    print(annotation['text'])

If the text is not in a PDF but in image format, one may also refer to the Cloud Vision API for text detection. Finally, to extract meaning from the text, one may use the NLP APIs provided by Google.

All the details regarding the code can be found here.




Applications of Google Document AI

Applications of Google Document AI are vast and in high demand among both industries and individual users. Here we roughly divide the applications into personal and business use and provide a few examples of each.

Personal

While automation of document reading is mostly used for large-scale production to reduce labour costs, fast and accurate extraction of data and text can also be beneficial for improving personal routine and organization.

  • ID-Scans and Data Conversion: Personal IDs and passports are often stored as PDF files and scanned across different websites. They contain various data, particularly KVPs (e.g., given name, date of birth), which are often needed for online applications, yet we have to manually find and retype the same information again and again. Proper data extraction from PDFs allows us to quickly convert this data into machine-readable text. Filling in forms then becomes a trivial task for programs, and the only manual effort left is a quick scan-through for double-checking.
  • Invoice Data Extraction: Budgeting is a crucial aspect of daily life. While spreadsheets have already simplified the task, automatic extraction of data, when handled by machines, eases much of the budgeting process. Users can quickly run analyses on the results of Google Document AI and flag purchases that are abnormal or unaffordable.
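To illustrate the kind of analysis that becomes trivial once invoice amounts are machine-readable, here is a toy budgeting check; the purchase data is fabricated for illustration:

```python
# Amounts as they might come out of an OCR pipeline (fabricated data).
purchases = [
    ("groceries", 82.40),
    ("rent", 1200.00),
    ("coffee", 4.50),
    ("television", 2999.00),
]

def flag_over_budget(items, limit):
    """Return the names of purchases whose amount exceeds the limit."""
    return [name for name, amount in items if amount > limit]

print(flag_over_budget(purchases, 2000))  # ['television']
```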

Business

Business corporations and large organizations deal with thousands of documents in similar formats every day: big banks receive numerous near-identical applications, and research teams analyse piles of forms to conduct statistical analysis. Automating the initial step of extracting data from documents therefore significantly reduces redundant human effort and lets workers focus on analysing data and reviewing applications instead of keying in information.

  • Payment Reconciliation: Payment reconciliation is the process of comparing bank statements against your accounting records to make sure amounts match correctly. For small firms whose clients and cash flows come from a few sources and banks, reconciliation may be fairly straightforward. As a company scales, however, and money inflows and outflows become more diverse, the process quickly becomes daunting and labour-intensive, sharply increasing the probability of error. Numerous automated methods have therefore been proposed to remove manual effort from the pipeline. The initial stage of payment reconciliation is data extraction from documents, which can be challenging for a company of considerable size with various sectors. Google Document AI can streamline this stage, allowing employees to focus on faulty data and on exploring potentially fraudulent events in the cash flow.
  • Statistical Analysis: Corporations and organizations rely on feedback from customers, citizens, or even experiment participants to improve their products, services, and planning. To evaluate feedback comprehensively, statistical analysis is often required. However, survey data may exist in numerous formats, or may be hidden within typed or handwritten text. Google Document AI can ease the process by surfacing the obvious data from documents in batches, reducing the effort of locating useful fields and ultimately increasing efficiency.
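As a toy illustration of the matching step that follows data extraction in payment reconciliation, the sketch below splits fabricated invoices into matched, mismatched, and missing payments:

```python
# Fabricated data: invoice amounts extracted from documents, and the
# corresponding entries found on a bank statement.
invoices = {"INV-001": 250.00, "INV-002": 99.90, "INV-003": 1200.00}
statement = {"INV-001": 250.00, "INV-002": 89.90}

def reconcile(invoices, statement):
    """Split invoice references into matched, mismatched, and missing."""
    matched, mismatched, missing = [], [], []
    for ref, amount in invoices.items():
        if ref not in statement:
            missing.append(ref)            # no payment found at all
        elif abs(statement[ref] - amount) < 0.01:
            matched.append(ref)            # amounts agree to the cent
        else:
            mismatched.append(ref)         # paid, but the wrong amount
    return matched, mismatched, missing

print(reconcile(invoices, statement))
# (['INV-001'], ['INV-002'], ['INV-003'])
```

Only the mismatched and missing buckets need human attention, which is exactly the division of labour the paragraph above describes.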

Technology Behind Google Document AI

Google Document AI OCR Technology

The proprietary AI technology behind Google Document AI falls within the fields of computer vision and natural language processing (NLP). Computer vision aims to let machines understand images, while NLP interprets useful information from words and sentences. Essentially, Google Document AI leverages computer vision, particularly optical character recognition, to detect the words and phrases in a given PDF, then feeds them into an NLP network to extract the meaning behind them. The following is a brief description of the basic techniques used in these fields.

Computer vision

Since the advent of deep learning, traditional image-processing methods for obtaining and detecting features have been superseded owing to large gaps in accuracy. Current techniques in computer vision are now mainly built upon convolutional neural networks (CNNs).

CNNs are a specific type of neural network that utilizes a traditional tool from image and signal processing: kernels. Kernels are small matrices that slide over an image, computing dot products that select certain features. One main difference between traditional kernels and those in CNNs, however, is that the weights within conventional image-processing kernels are pre-set, while in CNNs they are learnt. Pre-set kernel constants let machines perform only specific, simple tasks such as line and corner detection, but restrict performance on tasks such as text detection: the features of different text are so varied that kernel constants capturing the relationship between features and actual text are simply too hard to determine by hand.

An interesting fact worth noting is that the concept of the CNN came into existence decades ago, but was only put into wide use recently, once the exponential rise of computational hardware made deep learning feasible. Nowadays, state-of-the-art approaches to vision tasks, from classification and segmentation to anomaly detection and content generation, all revolve around CNNs.

In simple words, text, key-value pairs, and tables are all features of a PDF that Google Document AI can detect with the help of CNNs.

Natural Language Processing

Mirroring computer vision's recent progress, deep learning has also shed light on a long-standing research field in computer science: NLP. NLP is the process of understanding words, and series of words combined into paragraphs, to derive meaning. This task is often considered even harder than understanding images, since the same phrase can be interpreted differently in different contexts.

In the past few years, research focused on a type of neural network called long short-term memory (LSTM), which determines the output of the next event based not only on the current input but also on previous inputs along time-series data. Recently, however, the focus has shifted to a different family of networks called transformers. Transformers learn the attention over a series of events: particular words within a sentence may deserve more attention than others, regardless of how far they sit from the current word under investigation. Transformers largely outperform previous networks on numerous language tasks, including understanding semantics.
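The attention idea can be sketched in a few lines: scaled dot-product scores between a query and each key are softmax-normalized, so a word's weight depends on its similarity to the query rather than its distance from it. The 2-d embeddings below are hypothetical toy values:

```python
import math

def attention_weights(query, keys):
    """Scaled dot-product attention weights for one query over all keys."""
    d = len(query)
    # Similarity score between the query and each key, scaled by sqrt(d).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    # Softmax: exponentiate and normalize so the weights sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2-d embeddings for three words in a sentence.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights = attention_weights(query, keys)
print([round(w, 3) for w in weights])
```

The key most similar to the query receives the largest weight, and the least similar the smallest, independent of where each word sits in the sentence.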




Competition and Alternatives

Although Google Document AI has achieved stunning success and partners with numerous industries to deliver fast and accurate analyses of documents, several alternatives provide similar services. The following are several competitors to Google's Document AI service:

Amazon Textract

Similar to Google, Amazon Web Services has served large corporations on the internet for a long time, accumulating years of AI research experience that went into Amazon Textract, an engine that performs similarly to Google's Document AI. Amazon Textract goes beyond words and text to the meaning behind them, with features such as table extraction. The processor is also bound by security regulations, and users can easily get insight into its security processes.

ABBYY and Kofax

On the other hand, there are also solutions such as ABBYY and Kofax that are dedicated to PDF OCR. Their products include friendly UIs for reading PDFs with OCR. However, given their non-engineering nature, they are harder to incorporate directly into other programs to form a fully automated pipeline. In addition, their services are OCR-only: data extraction and more advanced deep learning techniques for fully understanding documents are not offered.

Nanonets™

Nanonets™ is a company that specializes in OCR across all kinds of documents, from receipts and statements to invoices. Its deep learning models are trained on hundreds of thousands of specific targets, allowing them to perform extremely well not only on specialized tasks but also to generalize across unseen documents. Non-engineers can use the friendly user interface, while computer scientists can easily use the APIs to incorporate OCR capabilities into other PDF document-reading tasks.

Conclusion

In conclusion, this article described the concepts behind Google Document AI, along with simple code suggesting how to use the service. Comparisons with alternative services, including their pros and cons, were provided to show which service is best for specific purposes.