How to use deep learning for data extraction from financial documents
Importance of the problem
If you have a relative working in the banking industry, ask them what annoys them most about the job. You will very likely hear about data entry, i.e. the practice of manually entering serial numbers and names from financial documents into the bank’s database. Manual data entry is not only tedious, it also leads to the following problems:
- Lapses in Data Entry: Manual data entry would result in a high proportion of errors due to mistakes made by the employees. These mistakes would take a considerable amount of time and capital to correct.
- Cost Incurred: The financial institution would have to spend a considerable amount of capital to train employees when their time could be spent in more productive ventures.
- Maintenance of hard copies: Banks are forced to maintain physical copies of all the financial documents that are processed. This can lead to problems regarding scalability.
It might interest your relative to know that the entire process of data entry can be automated. How, they might ask? By using Optical Character Recognition (OCR), computer vision and deep learning.
Let’s take a short detour and define what Optical Character Recognition is. In simple terms, Optical Character Recognition involves examining a document and identifying the text which is present within the document. It can be thought of as text recognition.
In this article, we will go over the process of applying OCR to financial documents and the various steps involved in this process. Let us try and get a brief understanding of the process from the following story.
In the movie Charlie and the Chocolate Factory, Willy Wonka shows off a machine that he claims can send a bar of chocolate to the customer via television. He brings in a giant bar of chocolate, points his machine at it and with the press of a button is able to transmit that piece of chocolate into a television screen at the other end of the room. Although we haven’t been able to replicate this experiment with chocolates, OCR will help us achieve a similar effect on paper documents.
We can follow exactly the same process: we bring in our document, place it in the scanner, and with the press of a button we get a digitized copy of the document (meaning that each character on the document has been converted into its corresponding machine-encoded form) that can be easily processed and stored. The rest of this article will focus on algorithms that enable us to carry out this conversion from written or printed text into its corresponding digital form.
The tools that we will employ for carrying out OCR in the first part of this article are OpenCV (an open-source computer vision and machine learning library), Tesseract (an open-source OCR engine) and regular expressions in Python. In the second part of the article, we make use of Named Entity Recognition (NER) with spaCy (an open-source library for natural language processing) to extract useful information from the obtained raw text. To understand the process of OCR clearly, we will use a financial document like the one shown in the figure below as an input to the text extraction algorithm.
Fig.2: Loan document
Before getting into any of it further, let’s get an overview of steps. The individual steps that are used to carry out OCR on the loan document are illustrated using the flowchart given below:
Go-To Deep Learning Approach
Deep learning, when fed with large amounts of clean, processed data, can produce strong results. In this section, we briefly describe a few recent approaches.
EATEN: Entity-aware Attention for Single Shot Visual Text Extraction
- A usual approach to the problem consists of extracting text by making use of OCR and then identifying the various fields like name, phone number (entities) by using either Named Entity Recognition (NER) or regular expressions. The unstable performance of NER and the fragility of the post-processing algorithms are the main bottlenecks in the approach.
- EATEN tries to alleviate this problem by proposing an end-to-end trainable system which can be used to extract entities from an image in a single-shot.
Fig 9: Architecture of EATEN (source: EATEN: Entity-aware Attention for Single Shot Visual Text Extraction. https://arxiv.org/abs/1909.09380)
- The model consists of a feature extractor, which extracts visual features from the image, followed by an entity-aware attention network made up of a series of entity-aware decoders. Each decoder is responsible for predicting one or more predefined entities.
- EATEN reports a significant improvement over general OCR methods (the evaluation metric is mean entity accuracy). It also performs better when parts of the image are blurred, a case in which traditional OCR methods tend to fail.
The figure and results stated above were taken from the following paper: https://arxiv.org/abs/1909.09380
dhSegment: A generic deep-learning approach for document segmentation
- A major drawback of the simple pipeline demonstrated in this article is that it relies on prior information regarding the position of the region of interest (the table) in the document. Document analysis could help overcome this problem by automatically performing page extraction, text line detection, document layout analysis etc.
- Most traditional methods for document analysis employ a large number of varied segmentation techniques that are specific to a type of document and the class of problem. This paper proposes a generic deep learning model that can outperform these traditional approaches.
Fig 10: Architecture of dhSegment (source: dhSegment: A generic deep-learning approach for document segmentation. https://arxiv.org/abs/1804.10371)
- As shown in Fig 10, the network architecture consists of a CNN which employs a ResNet architecture followed by contracting and expansive paths (characteristic of the U-Net architecture). Given an input image, the network outputs the attributes associated with each pixel in the image.
- A minimal number of standard image processing techniques are applied to the predictions of the network to get the desired output. The figure below illustrates this process.
Fig 11: Document analysis using dhSegment
The figure and results stated above were taken from the following paper: dhSegment: A generic deep-learning approach for document segmentation. https://arxiv.org/abs/1804.10371
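The post-processing step mentioned above (turning per-pixel predictions into regions) can be sketched as follows. This is only an illustration under stated assumptions, not the dhSegment implementation: it assumes the network output is a 2D grid of foreground probabilities, and the function name and the 0.5 threshold are hypothetical. It is written in plain Python for readability; in practice one would use OpenCV or a similar library for thresholding and connected-component analysis.

```python
from collections import deque


def regions_from_probabilities(prob_map, threshold=0.5):
    """Threshold a per-pixel probability map and return one bounding box
    (x_min, y_min, x_max, y_max) per 4-connected foreground region."""
    height, width = len(prob_map), len(prob_map[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or prob_map[y][x] < threshold:
                continue
            # Breadth-first flood fill over one connected region.
            queue = deque([(x, y)])
            seen[y][x] = True
            x_min = x_max = x
            y_min = y_max = y
            while queue:
                cx, cy = queue.popleft()
                x_min, x_max = min(x_min, cx), max(x_max, cx)
                y_min, y_max = min(y_min, cy), max(y_max, cy)
                neighbours = ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1))
                for nx, ny in neighbours:
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx]
                            and prob_map[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes
```

Each returned box could then be used to crop a page, a text line or a table from the original image, which removes the need to hard-code the position of the region of interest.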
From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction
- As noted elsewhere in this article, OCR tends to produce erroneous output when the quality of the input image is poor or when the input image has not been preprocessed correctly. This paper proposes a method for correcting these errors.
- The paper presents an unsupervised method for extracting data and training a sequence-to-sequence neural machine translation (NMT) model.
Architecture and Training
- As mentioned above, the model uses NMT in building a sequence-to-sequence model. The data required to train the model is automatically extracted from a corpus containing OCR errors.
- The list of OCR errors and their corresponding correctly spelt words is extracted using two similarity measures: similarity in meaning and similarity on a character level. Following this, a dictionary is used to distinguish the correctly spelt word from the erroneous word.
- By coupling the above-mentioned steps with other well-known language processing techniques, the authors were able to obtain an accuracy of 59.5% (accuracy indicates the percentage of erroneous words that were fixed correctly) on a corpus containing 200 hand-selected words with OCR errors and their correct spellings.
The results stated above were taken from the following paper: From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction. https://arxiv.org/pdf/1910.05535v1.pdf
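To make the character-level similarity and dictionary steps more concrete, here is a minimal sketch. It is not the authors' code: the similarity-in-meaning step (word embeddings) is omitted, the candidate words are assumed to be given, and the function names and the `max_distance` cut-off are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def pair_errors(candidates, dictionary, max_distance=2):
    """Map each out-of-dictionary word to its closest dictionary word."""
    pairs = {}
    for word in candidates:
        if word in dictionary:
            continue  # already a correctly spelt form
        # sorted() keeps the choice deterministic when distances tie.
        best = min(sorted(dictionary), key=lambda entry: edit_distance(word, entry))
        if edit_distance(word, best) <= max_distance:
            pairs[word] = best
    return pairs
```

Pairs produced this way (erroneous word, correct word) are the kind of parallel data a sequence-to-sequence model can then be trained on.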
A simple demonstration on a financial document
In this section, we’ll see how simple techniques using OpenCV, combined with an OCR engine like Tesseract, can give you considerable results. All of the code used in this section can be found on my github page: https://github.com/Varghese-Kuruvilla/OCR
1. Extracting the Region of Interest from the image:
Although there are multiple ways of extracting the desired region of interest from the image, we have made use of OpenCV’s contour detection APIs to extract the table from the document.
After extracting the entire table, each cell is extracted from the table by making use of the same technique. Finally, neighbouring cells are merged to obtain a series of images, each containing a single row of the table.
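The table extraction described above can be sketched as follows. This is a minimal sketch rather than the code from the linked repository: it assumes OpenCV (`cv2`) is installed and that the table is the contour with the largest bounding box on the page. The function names and the threshold value are hypothetical.

```python
def largest_box(boxes):
    """Return the (x, y, w, h) box with the largest area, or None if empty."""
    return max(boxes, key=lambda box: box[2] * box[3], default=None)


def extract_table(image_path, threshold=127):
    """Crop the largest contour (assumed to be the table) from a scanned page."""
    import cv2  # imported here so largest_box() can be used without OpenCV

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Invert so that dark table lines become white foreground pixels.
    _, binary = cv2.threshold(gray, threshold, 255, cv2.THRESH_BINARY_INV)
    # Indexing with [-2] picks the contour list in both OpenCV 3 and OpenCV 4.
    contours = cv2.findContours(
        binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )[-2]
    box = largest_box([cv2.boundingRect(contour) for contour in contours])
    if box is None:
        raise ValueError("no contours found in " + image_path)
    x, y, w, h = box
    return image[y:y + h, x:x + w]
```

The same function could be applied again to the cropped table, with a different retrieval mode, to pull out individual cells.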
2. Performing Optical Character Recognition using Tesseract:
For performing Optical Character Recognition, we have made use of Tesseract, one of the most popular open-source OCR engines. (For more details about Tesseract check out this great Nanonets blog post: https://nanonets.com/blog/ocr-with-tesseract/).
An important step is to preprocess the image before handing it over to Tesseract. Some of the image processing operations that would aid in obtaining better quality output from Tesseract are image rescaling, binarization, noise removal and deskewing to name a few. We have processed the image by simply converting the cropped image to grayscale followed by simple binary thresholding before passing it to Tesseract.
After passing the preprocessed image through Tesseract, the output looks like the raw text shown below:
Fig3: The image on the top shows the .png file and the image on the bottom is the output from Tesseract
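The grayscale conversion, binary thresholding and Tesseract call described in this step can be sketched as below. This is an illustrative sketch, not the repository code: it assumes the `pytesseract` wrapper and OpenCV are installed along with the Tesseract binary, and the function names, the threshold value and the page segmentation mode are assumptions.

```python
def clean_ocr_text(raw):
    """Drop blank lines and the form-feed character that can end the output."""
    lines = (line.strip() for line in raw.replace("\x0c", "").splitlines())
    return [line for line in lines if line]


def ocr_row(image_path, threshold=127):
    """Grayscale and binarize a cropped table row, then run Tesseract on it."""
    import cv2          # lazy imports: only needed when an image is processed
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, threshold, 255, cv2.THRESH_BINARY)
    # --psm 6 asks Tesseract to assume a single uniform block of text.
    raw = pytesseract.image_to_string(binary, config="--psm 6")
    return clean_ocr_text(raw)
```

Rescaling, noise removal and deskewing could be added before the threshold call if the scans are of poor quality.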
3. Parsing the output from Tesseract:
We make use of regular expressions to parse the output from Tesseract and create a CSV file containing all the required information.
Fig4: CSV File obtained after parsing the output from tesseract
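A minimal sketch of this parsing step is shown here. The field labels ("Name", "Tel No", "PAN No") and the patterns are hypothetical stand-ins for whatever labels appear on the actual form; only the overall idea (one regular expression per field, results written out as CSV) comes from the article.

```python
import csv
import io
import re

# Hypothetical labels and patterns; adapt them to the labels on the real form.
FLAGS = re.IGNORECASE | re.MULTILINE
FIELD_PATTERNS = {
    "name": re.compile(r"^Name[ \t]*:[ \t]*(.+)$", FLAGS),
    "tel_no": re.compile(r"^Tel(?:ephone)?[ \t]*No\.?[ \t]*:[ \t]*([\d \-+]+)$", FLAGS),
    "pan_no": re.compile(r"^PAN[ \t]*No\.?[ \t]*:[ \t]*(\S+)$", FLAGS),
}


def parse_fields(ocr_text):
    """Return {field: [matches]}; a field that was not recognized maps to []."""
    return {
        field: [match.strip() for match in pattern.findall(ocr_text)]
        for field, pattern in FIELD_PATTERNS.items()
    }


def to_csv(fields):
    """Serialize the parsed fields as two-column CSV text (field, values)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["field", "values"])
    for field, values in fields.items():
        writer.writerow([field, "; ".join(values)])
    return buffer.getvalue()
```

Because `findall` returns an empty list when nothing matches, a field that Tesseract failed to read shows up as an empty entry rather than raising an error.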
On observing Fig4, we can see that the tel_no and pan_no fields have empty lists associated with them. Upon closer inspection, it was found that Tesseract simply failed to recognize the text in this case. This is illustrated by the figure below:
Fig5: Tesseract failed to recognize the text in this case
4. NER using spaCy:
Let’s overlook the limitations of our OCR algorithm for the time being and assume that we have obtained a reasonable output. Simply storing this data in a database without validating it is unreasonable. The customer may have filled in their name instead of their PAN number. Prompting the customer to correct it at this stage would prevent a lot of confusion later in the pipeline.
Data validation can be performed using numerous Natural Language Processing tools. We employ a combination of Named Entity Recognition and regular expressions for data validation. Named Entity Recognition (NER) aims to classify words in a document into certain pre-defined categories. We implement NER by making use of spaCy, an open-source Natural Language Processing library. The figure below illustrates NER carried out by spaCy:
Fig 7: NER carried out by spaCy. The algorithm has successfully identified different entities in the text. (Source https://miro.medium.com/max/2594/1*rq7FCkcq4sqUY9IgfsPEOg.png )
Readers who are keen on understanding how spaCy’s NER algorithm works should refer to the explosion AI blog (https://explosion.ai/blog/deep-learning-formula-nlp). The main steps that are followed are clearly summarized by the authors.
To grasp all the different steps mentioned in the blog post, a clear understanding of RNNs, LSTMs and related architectures is necessary. I found the following resources to be extremely helpful when I was starting out.
- [1706.03762] Attention Is All You Need
Phew, all that reading must have left you exhausted (provided you bothered to spend time going through the above articles and papers). The good news is that even if you didn’t understand a single word of how the model works, you can still implement NER by making use of spaCy’s incredibly easy to use API calls.
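As a rough illustration of how little code this takes, the sketch below runs a pretrained spaCy pipeline over a sentence and lists the entities it finds. It assumes spaCy and the `en_core_web_sm` model are installed; the helper name and the sample sentence are made up for this example.

```python
def extract_entities(text, nlp):
    """Return (entity text, entity label) pairs found by a spaCy pipeline."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


def main():
    # Requires: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    sentence = "Mr Peter Brown has lived in London for five years."
    for text, label in extract_entities(sentence, nlp):
        print(text, label)
```

Calling `main()` prints one line per detected entity; the exact labels depend on the model version.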
Let’s come back to our original problem. Our loan document contains fields like name, father’s/husband’s name and residential address. We validate these fields by making use of NER to predict labels for these fields. If the labels turn out to be correct, we can be reasonably confident that the information entered by the customer is of the right type. For other fields such as relationship with co-applicant, period of stay, email address and mobile number, which have a standard pattern, we employ regular expressions to parse and validate them.
The code snippet below illustrates the patterns fed to spaCy’s matcher object to validate our data using NER.
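The original snippet is not reproduced here, so the following is a hedged reconstruction of what such patterns might look like. It assumes the spaCy v3 `Matcher` API and a pretrained English pipeline; the field names and the mapping of address fields to the GPE, LOC and FAC labels are assumptions made for this sketch.

```python
# Token patterns keyed by form field. Each pattern requires one or more
# consecutive tokens carrying the expected entity label.
NER_PATTERNS = {
    "name": [[{"ENT_TYPE": "PERSON", "OP": "+"}]],
    "father_or_husband_name": [[{"ENT_TYPE": "PERSON", "OP": "+"}]],
    "address": [[{"ENT_TYPE": {"IN": ["GPE", "LOC", "FAC"]}, "OP": "+"}]],
}


def validate_with_ner(fields, nlp):
    """Return {field: True/False} depending on whether the field's text
    contains an entity of the type expected for that field."""
    from spacy.matcher import Matcher  # spaCy v3 API assumed

    matcher = Matcher(nlp.vocab)
    for field, patterns in NER_PATTERNS.items():
        matcher.add(field, patterns)

    results = {}
    for field, value in fields.items():
        doc = nlp(value)
        matched = {nlp.vocab.strings[match_id] for match_id, _, _ in matcher(doc)}
        results[field] = field in matched
    return results
```

With a loaded pipeline, `validate_with_ner({"name": "Peter Brown"}, nlp)` would return a dictionary flagging whether the name field was labelled as a person.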
The regular expressions used to parse more structured fields (like pan number, mobile number etc) are shown in the following code snippet.
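A sketch of such validation patterns follows. These are illustrative patterns, not the ones from the original code: the PAN pattern follows the standard five letters, four digits, one letter layout, while the mobile number, email and period-of-stay formats are assumptions.

```python
import re

VALIDATORS = {
    # PAN: five letters, four digits, one letter (e.g. ABCDE1234F).
    "pan_no": re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]"),
    # Assumed Indian mobile format: optional +91 or 0 prefix, then ten digits.
    "mobile_no": re.compile(r"(?:\+91[\s-]?|0)?[6-9]\d{9}"),
    # Deliberately loose email check; strict validation is out of scope.
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    # Assumed "period of stay" format such as "5 years" or "18 months".
    "period_of_stay": re.compile(r"\d{1,2}\s*(?:years?|months?)", re.IGNORECASE),
}


def validate_fields(fields):
    """Return {field: True/False} for every field that has a validator."""
    return {
        name: bool(VALIDATORS[name].fullmatch(value.strip()))
        for name, value in fields.items()
        if name in VALIDATORS
    }
```

Using `fullmatch` rather than `search` means stray characters introduced by OCR, such as a space in the middle of a phone number, cause the field to be flagged as invalid instead of silently passing.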
Based on the output of the matcher object and python Regex we populate a dictionary which indicates if the corresponding fields in the loan document contain valid entries. This is clearly illustrated by the following figure:
From the figure shown above, it is evident that we have successfully validated a majority of the fields that we obtained from the output of OCR. Tesseract simply fails to extract the pan_no (PAN number of the customer) and tel_no (customer’s telephone number). Also, the email id fails to produce a match with its corresponding regular expression. By tweaking the regular expressions and training spaCy’s model with custom data it is possible to obtain reasonable output from the validation algorithm.
On the face of it, it seems like we have successfully set up a reasonable OCR pipeline. Well, not really. Take a closer look at the code and you start to see the skeletons in the closet. The above approach is a rather simple one, and it is important to understand its limitations:
- Quality of the input image: The performance of Tesseract largely depends upon the quality of the input image. An image that is cleanly thresholded, has a minimal amount of noise and is properly deskewed will yield better results than one that lacks these properties.
- Scalability: In the article, we have made use of a large amount of prior information regarding the layout of our document. For example, we made use of the fact that our region of interest (table) is the largest contour in the document. Such a model wouldn’t scale well for other types of documents.
- Dependence on external factors: Factors like the customer’s handwriting, type or make of the scanner used etc can affect the performance of Tesseract to a large degree. The figure below illustrates the output of Tesseract when the image contains samples of bad handwriting.
Fig 6: Output from tesseract when the handwriting is poor
- A keen reader might have noticed that I have made use of names like Mr Peter Brown and Mr John Watson. My first choice was Mr Prasad Kumar, but it turns out that the model didn’t contain a lot of Indian names and addresses as part of its training dataset. The good news is that training spaCy on a custom dataset is incredibly easy. The bad news is that finding such a dataset is incredibly tough.
- Using NER for validation is also not a very good idea. If you went through the architecture of the model, it is clear that spaCy predicts labels by taking into account contextual information. It would perform much better if we fed it an entire sentence instead of individual words.
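One possible way to act on the last point is to wrap each field value in a short carrier sentence before running NER, so that the model sees some context. The sketch below is only an idea under stated assumptions: the templates and function names are invented for illustration, and whether this improves accuracy on a given form has not been tested here.

```python
# Hypothetical carrier sentences that give the NER model some context.
SENTENCE_TEMPLATES = {
    "name": "The applicant's name is {value}.",
    "father_or_husband_name": "The applicant's father or husband is {value}.",
    "address": "The applicant lives at {value}.",
}


def to_sentence(field, value):
    """Embed a raw field value in a full sentence (or return it unchanged)."""
    template = SENTENCE_TEMPLATES.get(field, "{value}")
    return template.format(value=value.strip())


def label_field(field, value, nlp):
    """Run NER on the carrier sentence and return the labels of the
    entities whose text lies inside the original field value."""
    doc = nlp(to_sentence(field, value))
    return [ent.label_ for ent in doc.ents if ent.text in value]
```

Here `nlp` is assumed to be a loaded spaCy pipeline; a name field would be considered valid if the returned labels include PERSON.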
In this article, we went through the various steps involved in performing Optical Character Recognition using Tesseract. Further, we tried validating the data using Named Entity Recognition and regular expressions. We found that the pipeline performs reasonably well in a controlled environment but isn’t very robust. We also reviewed some of the current state-of-the-art deep learning approaches employed in the field of Optical Character Recognition.
OCR with Nanonets
The Nanonets OCR API allows you to build OCR models with ease. You can upload your data, annotate it, set the model to train and then get predictions, all through a browser-based UI.
1. Using a GUI: https://app.nanonets.com
2. Using NanoNets API: https://github.com/NanoNets/nanonets-ocr-sample-python
A step-by-step guide to training your own model using the Nanonets API -
Step 1: Clone the Repo
git clone https://github.com/NanoNets/nanonets-ocr-sample-python
cd nanonets-ocr-sample-python
sudo pip install requests
sudo pip install tqdm
Step 2: Get your free API Key
Get your free API Key from https://app.nanonets.com/#/keys
Step 3: Set the API key as an Environment Variable
Step 4: Create a New Model
Note: This generates a MODEL_ID that you need for the next step
Step 5: Add Model ID as Environment Variable
Step 6: Upload the Training Data
Collect the images of the object you want to detect. Once you have the dataset ready in the images folder (image files), start uploading it.
Step 7: Train Model
Step 8: Get Model State
The model takes ~30 minutes to train. You will get an email once the model is trained. You can check the state of the model using:
watch -n 100 python ./code/model-state.py
Step 9: Make Prediction
python ./code/prediction.py PATH_TO_YOUR_IMAGE.jpg