Extracting Data From Scanned Documents
Looking to extract data from scanned documents? Try Nanonets™ advanced AI-based OCR Scanner to extract and organize information from scanned documents automatically.
As the world has moved from paper and handwriting to digital documents for convenience, the importance of converting images and scanned documents into meaningful data has skyrocketed.
To keep up with the need for highly accurate document data extraction, numerous research facilities and corporations (e.g., Google, AWS, and Nanonets) have invested heavily in the technologies of computer vision and Natural Language Processing (NLP).
The rise of deep learning has brought a giant leap in the kinds of data that can be extracted: we are no longer limited to extracting text, but can also recover other data structures such as tables and key-value pairs. Many solutions now offer products that meet the document data extraction needs of individuals and business owners.
This article dives into the current technology used for data extraction from scanned documents, followed by a short hands-on tutorial in Python. We’ll also look at some of the popular solutions currently in the market providing the best offerings in this field.
What is Data Extraction?
Data extraction is the process of using programs to convert unstructured data into interpretable information that humans can then process further. Here we list several of the most common types of data extracted from scanned documents.
The most common and most important task in data extraction from scanned documents is extracting text. This process, while seemingly straightforward, is in fact very difficult because scanned documents are usually stored as images. In addition, the extraction method depends heavily on the type of text. While text appears in dense printed form most of the time, the ability to extract sparse text from poorly scanned documents, or from handwritten letters with drastically varying styles, is equally important. Such a process allows programs to convert images into machine-encoded text, which we can then organize from unstructured data (without any particular formatting) into structured data for further analysis.
Tables are one of the most popular formats for storing data, as they are easy for the human eye to interpret. Extracting tables from scanned documents requires technology beyond character detection: one must detect the lines and other visual features in order to extract the table properly and convert that information into structured data for further computation. Computer vision methods (described in detail in the following sections) are heavily used to achieve high-accuracy table extraction.
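To make the idea of line detection more concrete, here is a minimal sketch in pure Python. It assumes the scan has already been binarized into a grid of 0s and 1s, and it simply looks for long runs of dark pixels. The function names and the tiny sample grid are made up for illustration; production systems typically rely on more robust computer vision techniques.

```python
from typing import List, Tuple

Grid = List[List[int]]  # 1 = dark (ink) pixel, 0 = background


def find_horizontal_lines(image: Grid, min_length: int) -> List[Tuple[int, int, int]]:
    """Return (row, start_col, end_col) for every horizontal run of dark
    pixels that is at least min_length pixels long."""
    lines = []
    for r, row in enumerate(image):
        start = None
        # A trailing 0 acts as a sentinel so runs touching the right edge are closed.
        for c, pixel in enumerate(row + [0]):
            if pixel and start is None:
                start = c
            elif not pixel and start is not None:
                if c - start >= min_length:
                    lines.append((r, start, c - 1))
                start = None
    return lines


def find_vertical_lines(image: Grid, min_length: int) -> List[Tuple[int, int, int]]:
    """Return (col, start_row, end_row) for every long vertical run.
    A vertical line is a horizontal line in the transposed image."""
    transposed = [list(col) for col in zip(*image)]
    return find_horizontal_lines(transposed, min_length)


# Hypothetical 5x7 "scan" of a table with 2 rows and 2 columns.
page = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]

h_lines = find_horizontal_lines(page, min_length=5)
v_lines = find_vertical_lines(page, min_length=5)
# Each pair of neighbouring ruling lines bounds one row or column of cells.
cell_count = (len(h_lines) - 1) * (len(v_lines) - 1)
print(h_lines, v_lines, cell_count)
```

Once the ruling lines are known, the text recognized inside each cell can be assigned to its row and column to rebuild the table.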
An alternative format that we often adopt in documents for data storage is key-value pairs (KVPs).
KVPs are essentially two data items, a key and a value, linked together as one. The key serves as a unique identifier for retrieving the value. A classic KVP example is the dictionary, where the words are the keys and the corresponding definitions are the values. These pairs, while usually unnoticed, are used very frequently in documents: survey questions such as name and age, and the prices of items in invoices, are all implicitly KVPs.
However, unlike tables, KVPs often exist in unknown formats and are sometimes even partially handwritten. For example, keys could be pre-printed in boxes and values are handwritten when completing the form. Therefore, finding the underlying structures to automatically perform KVP extraction is an ongoing research process even for the most advanced facilities and labs.
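For the simplest case, where keys and values appear on one line separated by a colon, a KVP extractor can be sketched with a regular expression. This is only an illustration under that layout assumption; the pattern, the function name `extract_kvps`, and the sample text are hypothetical, and the free-form or handwritten layouts described above need far more than this.

```python
import re
from typing import Dict

# Assumed layout: "<key made of letters and spaces> : <value>" on a single line.
KVP_PATTERN = re.compile(r"^\s*([A-Za-z][A-Za-z ]*?)\s*:\s*(.+?)\s*$")


def extract_kvps(ocr_text: str) -> Dict[str, str]:
    """Collect key-value pairs from OCR text, one candidate pair per line."""
    pairs = {}
    for line in ocr_text.splitlines():
        match = KVP_PATTERN.match(line)
        if match:
            key, value = match.groups()
            pairs[key.lower()] = value  # normalize keys for easier lookup
    return pairs


# Made-up OCR output from a simple form.
sample = """Name: Jane Doe
Age : 34
This line has no key.
Total: $120.50"""

print(extract_kvps(sample))
```

Lines without a recognizable key are skipped rather than guessed at, which keeps the output trustworthy at the cost of missing some pairs.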
Finally, it is also very important to extract or capture data from figures within a scanned document. Statistical indicators such as pie charts and bar charts often include crucial information for documents. A good data extracting process should be able to infer from the legends and numbers to partially extract data from figures for further use.
Looking to automate data extraction from scanned documents? Give Nanonets™ a spin for higher accuracy, greater flexibility, post-processing, and a broad set of integrations!
Technologies Behind Data Extraction
Data extraction revolves around two main processes: Optical Character Recognition (OCR) followed by Natural Language Processing (NLP).
OCR is the process of converting images of text into machine-encoded text, while NLP analyzes the resulting words to infer their meaning. OCR is often accompanied by other computer vision techniques, such as box and line detection, to extract the aforementioned data types such as tables and KVPs for more comprehensive extraction.
The core improvements behind the data-extraction pipeline are tightly connected to the advances in deep learning that contributed greatly to the fields of computer vision and natural language processing (NLP).
What is deep learning?
Deep learning has played a major role in the hype around the artificial intelligence era and has been constantly pushed to the forefront in numerous applications. In traditional engineering, our goal is to design a system or function that generates an output from a given input; deep learning, on the other hand, relies on example inputs and outputs to learn the intermediate relationship, which can then be extended to new, unseen data through a so-called neural network.
A neural network, of which the multi-layer perceptron (MLP) is the classic example, is a machine learning architecture inspired by how human brains learn. The network contains neurons, which mimic biological neurons and “activate” when given different information. Sets of neurons form layers, and multiple layers are stacked together to form a network that can produce predictions of many forms (e.g., image classifications or bounding boxes for object detection).
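The layered structure described above can be sketched as a forward pass through a tiny MLP. The weights below are hand-picked purely for illustration; in practice they are learned from data, and real networks are built with a deep learning framework rather than plain Python lists.

```python
import math
from typing import Callable, List


def relu(x: float) -> float:
    return max(0.0, x)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs: List[float], weights: List[List[float]], biases: List[float],
          activation: Callable[[float], float]) -> List[float]:
    """One fully connected layer: each neuron computes activation(w . x + b)."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


# Hypothetical parameters: 2 inputs -> 2 hidden neurons -> 1 output neuron.
W1, B1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, -0.5]
W2, B2 = [[1.0, -2.0]], [0.0]


def mlp_forward(x: List[float]) -> List[float]:
    hidden = dense(x, W1, B1, relu)        # hidden layer
    return dense(hidden, W2, B2, sigmoid)  # output layer, squashed into (0, 1)


print(mlp_forward([1.0, 0.0]))
print(mlp_forward([0.0, 0.0]))
```

Training consists of adjusting `W1`, `B1`, `W2`, and `B2` so that the outputs match known examples, which is the step this sketch leaves out.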
In the field of computer vision, one variation of the neural network is heavily applied: the convolutional neural network (CNN). Instead of traditional fully connected layers, a CNN uses convolutional kernels that slide across tensors (multi-dimensional arrays) to extract features. Combined with traditional network layers at the end, CNNs are very successful in image-related tasks and form the basis for OCR extraction and other feature detection.
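The sliding-kernel operation can be illustrated in a few lines of plain Python. This is a sketch of a single convolution without padding or stride, using a hand-written kernel; in a real CNN the kernel values are learned and the operation is run by an optimized library.

```python
from typing import List

Matrix = List[List[float]]


def convolve2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over the image (no padding, stride 1) and return
    the sum of element-wise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# A small image whose right half is dark, and a kernel that responds
# to dark-on-the-right vertical edges.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge_kernel = [
    [-1, 1],
    [-1, 1],
]

print(convolve2d(image, vertical_edge_kernel))  # [[0, 2, 0], [0, 2, 0]]
```

The output is largest exactly where the edge sits, which is the kind of feature map later layers build on to recognize strokes and characters.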
On the other hand, NLP relies on another set of networks, which focus on sequential data. Unlike images, which are typically treated independently of one another, text prediction benefits greatly when the preceding and following words are also taken into account. For the past several years, a family of networks known as long short-term memory networks (LSTMs), which take previous results as inputs when predicting the current result, was widely used. Bidirectional LSTMs were also often adopted to improve predictions by considering both the preceding and the following context. More recently, however, transformers, which use an attention mechanism, have been on the rise because their greater flexibility leads to better results than traditional networks for sequential data.
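The attention mechanism mentioned above can be sketched as scaled dot-product attention, the formulation commonly used in transformers. The vectors below are toy values chosen by hand; a real model learns how to produce the queries, keys, and values from the input tokens.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """For each query, return a weighted average of the values, where the
    weights come from the (scaled) similarity between the query and each key."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Toy example: the query matches the first key far better than the second,
# so the output is almost entirely the first value.
keys = [[10.0, 0.0], [0.0, 10.0]]
values = [[1.0], [5.0]]
print(attention([[10.0, 0.0]], keys, values))
```

Because every position can attend to every other position directly, the model is not forced to pass information step by step through the sequence as an LSTM does.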
Applications of Data Extraction
The main goal of data extraction is to convert data from unstructured documents into structured formats, so that accurately retrieved text, figures, and data structures can feed numerical and contextual analysis. These analyses can be especially helpful for businesses:
Business corporations and large organizations deal with thousands of similarly formatted documents on a daily basis: big banks receive numerous identical applications, and research teams have to analyse piles of forms to conduct statistical analysis. Automating the initial step of extracting data from documents therefore significantly reduces redundant manual work and allows workers to focus on analysing data and reviewing applications instead of keying in information.
- Verifying Applications -- Companies receive tons of applications, whether handwritten or submitted through online application forms. These applications are often accompanied by personal IDs for verification purposes. Scanned IDs such as passports or cards usually come in batches with similar formats. A well-written data extractor can therefore quickly convert the data (text, tables, figures, KVPs) into machine-readable text, which substantially reduces the man-hours spent on these tasks and lets staff focus on application selection instead of extraction.
- Payment Reconciliation -- Payment reconciliation is the process of comparing bank statements against internal records to ensure that the numbers match between accounts. It revolves heavily around data extraction from documents, which is a challenge for a company of considerable size with various income streams. Data extraction can ease this process and allow employees to focus on faulty data and investigate potentially fraudulent cash flow events.
- Statistical Analysis -- Feedback from customers or experiment participants is used by corporations and organizations to improve their products and services, and a comprehensive feedback evaluation usually requires statistical analysis. However, survey data may exist in numerous formats or be hidden within text of various formats. Data extraction can ease the process by pulling the obvious data out of documents in batches, making useful information easier to find, and ultimately increasing efficiency.
- Sharing Past Records -- From healthcare to switching bank services, large industries often require information from new customers that already exists elsewhere. For example, a patient who switches hospitals after moving may have pre-existing medical records that would be helpful to the new hospital. In such cases, good data extraction software comes in handy: all that is required is for the individual to bring scanned records to the new hospital, which can then fill in all the information automatically. Not only is this convenient, it also reduces the serious risk, especially in the healthcare industry, of important patient records being overlooked.
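As an illustration of the payment reconciliation use case above, here is a minimal sketch of the comparison step that runs once the amounts have been extracted. It assumes each transaction carries a reference ID that appears in both the bank statement and the internal ledger; the function name, references, and amounts are all made up.

```python
from decimal import Decimal
from typing import Dict, Optional, Tuple

Amounts = Dict[str, Decimal]


def reconcile(statement: Amounts, ledger: Amounts) -> Dict[str, Tuple[Optional[Decimal], Optional[Decimal]]]:
    """Return every reference whose amount differs between the two sources,
    mapped to (statement_amount, ledger_amount). None means 'missing'."""
    issues = {}
    for ref in sorted(set(statement) | set(ledger)):
        bank_amount = statement.get(ref)
        book_amount = ledger.get(ref)
        if bank_amount != book_amount:
            issues[ref] = (bank_amount, book_amount)
    return issues


# Hypothetical values extracted from a scanned bank statement and a ledger.
statement = {
    "INV-001": Decimal("120.50"),
    "INV-002": Decimal("75.00"),
    "INV-003": Decimal("10.00"),
}
ledger = {
    "INV-001": Decimal("120.50"),
    "INV-002": Decimal("57.00"),  # digits transposed during manual entry
}

for ref, (bank, book) in reconcile(statement, ledger).items():
    print(ref, "statement:", bank, "ledger:", book)
```

`Decimal` is used instead of `float` so that currency amounts compare exactly. Only the flagged references need human attention.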
To provide a clearer view of how to perform data extraction, we show two methods for extracting data from scanned documents.
Building from Scratch
One may build a simple OCR data extractor with the PyTesseract engine as follows:
    try:
        from PIL import Image
    except ImportError:
        import Image
    import pytesseract

    # If you don't have tesseract executable in your PATH, include the following:
    pytesseract.pytesseract.tesseract_cmd = r'<full_path_to_your_tesseract_executable>'
    # Example tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract'

    # Simple image to string
    print(pytesseract.image_to_string(Image.open('test.png')))

    # List of available languages
    print(pytesseract.get_languages(config=''))

    # French text image to string
    print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))

    # In order to bypass the image conversions of pytesseract, just use relative or absolute image path
    # NOTE: In this case you should provide tesseract supported images or tesseract will return error
    print(pytesseract.image_to_string('test.png'))

    # Batch processing with a single file containing the list of multiple image file paths
    print(pytesseract.image_to_string('images.txt'))

    # Timeout/terminate the tesseract job after a period of time
    try:
        print(pytesseract.image_to_string('test.jpg', timeout=2))  # Timeout after 2 seconds
        print(pytesseract.image_to_string('test.jpg', timeout=0.5))  # Timeout after half a second
    except RuntimeError as timeout_error:
        # Tesseract processing is terminated
        pass

    # Get bounding box estimates
    print(pytesseract.image_to_boxes(Image.open('test.png')))

    # Get verbose data including boxes, confidences, line and page numbers
    print(pytesseract.image_to_data(Image.open('test.png')))

    # Get information about orientation and script detection
    print(pytesseract.image_to_osd(Image.open('test.png')))

    # Get a searchable PDF
    pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
    with open('test.pdf', 'w+b') as f:
        f.write(pdf)  # pdf type is bytes by default

    # Get HOCR output
    hocr = pytesseract.image_to_pdf_or_hocr('test.png', extension='hocr')

    # Get ALTO XML output
    xml = pytesseract.image_to_alto_xml('test.png')
For more information regarding the code, you may check out the official pytesseract documentation.
In simple terms, the code extracts data such as text and bounding boxes from a given image. While fairly useful, the engine is nowhere near as strong as those offered by advanced solutions, which are trained with substantial computational power.
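As a follow-up to the `image_to_data` call above, the sketch below shows one way to turn its tab-separated output into a list of words with bounding boxes, dropping low-confidence results. To stay runnable without Tesseract installed, it parses a small hand-written sample in the same column layout instead of calling the engine; the helper name `words_from_tsv` and the 60-point confidence threshold are assumptions made for this example.

```python
import csv
import io
from typing import Dict, List


def words_from_tsv(tsv_text: str, min_conf: float = 60.0) -> List[Dict]:
    """Parse image_to_data-style TSV and keep words at or above min_conf."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])  # structural rows (page, block, line) carry -1
        if text and conf >= min_conf:
            words.append({
                "text": text,
                "conf": conf,
                "box": (int(row["left"]), int(row["top"]),
                        int(row["width"]), int(row["height"])),
            })
    return words


# Hand-written sample in the image_to_data column layout (not real OCR output).
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t200\t50\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t12\t60\t18\t96.5\tInvoice\n"
    "5\t1\t1\t1\t1\t2\t80\t12\t40\t18\t41.0\t#l23\n"
)

print(words_from_tsv(SAMPLE_TSV))
```

With a real image, the same function could be fed the string returned by `pytesseract.image_to_data(Image.open('test.png'))`. The right threshold depends on scan quality and is best tuned on your own documents.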
Using Google Document API
    def async_detect_document(gcs_source_uri, gcs_destination_uri):
        """OCR with PDF/TIFF as source files on GCS"""
        import json
        import re
        from google.cloud import vision
        from google.cloud import storage

        # Supported mime_types are: 'application/pdf' and 'image/tiff'
        mime_type = 'application/pdf'

        # How many pages should be grouped into each json output file.
        batch_size = 2

        client = vision.ImageAnnotatorClient()

        feature = vision.Feature(
            type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)

        gcs_source = vision.GcsSource(uri=gcs_source_uri)
        input_config = vision.InputConfig(
            gcs_source=gcs_source, mime_type=mime_type)

        gcs_destination = vision.GcsDestination(uri=gcs_destination_uri)
        output_config = vision.OutputConfig(
            gcs_destination=gcs_destination, batch_size=batch_size)

        async_request = vision.AsyncAnnotateFileRequest(
            features=[feature], input_config=input_config,
            output_config=output_config)

        operation = client.async_batch_annotate_files(
            requests=[async_request])

        print('Waiting for the operation to finish.')
        operation.result(timeout=420)

        # Once the request has completed and the output has been
        # written to GCS, we can list all the output files.
        storage_client = storage.Client()

        match = re.match(r'gs://([^/]+)/(.+)', gcs_destination_uri)
        bucket_name = match.group(1)
        prefix = match.group(2)

        bucket = storage_client.get_bucket(bucket_name)

        # List objects with the given prefix.
        blob_list = list(bucket.list_blobs(prefix=prefix))
        print('Output files:')
        for blob in blob_list:
            print(blob.name)

        # Process the first output file from GCS.
        # Since we specified batch_size=2, the first response contains
        # the first two pages of the input file.
        output = blob_list[0]

        json_string = output.download_as_bytes().decode('utf-8')
        response = json.loads(json_string)

        # The actual response for the first page of the input file.
        first_page_response = response['responses'][0]
        annotation = first_page_response['fullTextAnnotation']

        # Here we print the full text from the first page.
        # The response contains more information:
        # annotation/pages/blocks/paragraphs/words/symbols
        # including confidence scores and bounding boxes
        print('Full text:\n')
        print(annotation['text'])
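The code comment above mentions that the annotation nests pages, blocks, paragraphs, words, and symbols. The sketch below shows one way to walk that hierarchy and rebuild each word together with its confidence score. It runs on a small hand-made dictionary shaped like that hierarchy rather than on a real API response, and the helper name `iter_words` is made up for this example.

```python
from typing import Dict, Iterator, Optional, Tuple


def iter_words(annotation: Dict) -> Iterator[Tuple[str, Optional[float]]]:
    """Yield (word_text, confidence) by walking
    pages -> blocks -> paragraphs -> words -> symbols."""
    for page in annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    # A word is stored as a list of symbols (characters).
                    text = "".join(s.get("text", "") for s in word.get("symbols", []))
                    yield text, word.get("confidence")


# Hand-made stand-in for annotation = first_page_response['fullTextAnnotation'].
sample_annotation = {
    "text": "Total 42",
    "pages": [{
        "blocks": [{
            "paragraphs": [{
                "words": [
                    {"confidence": 0.98,
                     "symbols": [{"text": c} for c in "Total"]},
                    {"confidence": 0.91,
                     "symbols": [{"text": c} for c in "42"]},
                ]
            }]
        }]
    }],
}

for text, confidence in iter_words(sample_annotation):
    print(text, confidence)
```

Using `.get()` throughout keeps the walk from failing when a level or a confidence value is absent from a given response.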
Ultimately, Google's document AI lets you extract a wide range of information from documents with high accuracy. The service is also offered for more specific uses, including text extraction from both ordinary documents and in-the-wild images.
Please refer to Google's official documentation for more.
Current Solutions Offering Data Extraction
Besides large corporations with APIs for document data extraction, several solutions provide highly accurate PDF OCR services. We present several PDF OCR options that specialize in different aspects, as well as some recent research prototypes that seem to provide promising results*:
*Side Note: There are multiple OCR services that target tasks such as reading text from images in the wild. We skipped those services as we are currently focusing on PDF document reading only.
- Google API -- As one of the biggest online service providers, Google offers stunning results in document extraction with their pioneering computer vision technology. Their services are free at low usage, but the cost adds up as the number of API calls increases.
- Deep Reader -- Deep Reader is a research work published in ACCV Conference 2019. It incorporates multiple state-of-the-art network architectures to perform tasks such as document matching, text retrieval, and denoising images. There are additional features such as tables and key-value-pair extraction that allow data to be retrieved and saved in an organized manner.
- Nanonets™ -- With a highly skilled deep learning team, Nanonets™ PDF OCR is completely template- and rule-independent. Therefore, Nanonets™ not only works on specific types of PDFs but can also be applied to any document type for text retrieval.
In conclusion, this article presents a thorough explanation of data extraction from scanned documents, including the challenges behind it and the technology it requires.
Two tutorials using different methods are presented, along with current solutions that offer data extraction out of the box, for reference.