This blog discusses Named-entity Recognition (NER) - a method of structured data information extraction from documents. Using Natural language processing it classifies named entities mentioned in unstructured text into structured pre-defined categories.

The problem is quite simple to define. You have documents say invoices, receipts, Purchase orders, etc that come into your company's numerous workflows. These documents are manually digitized by a human operator and fed into a software system which is time consuming and error-prone. You want to automate this digitization using Deep Learning.

The solution to this problem is not that simple, unfortunately. This blog will aim to give you an overview as well as the caveats of the techniques that can be used to solve this problem. So in the future, you know what to pick from your arsenal and know where to get started. Within this blog we start from understanding the structure of the data you get, explore the various techniques available to solve this problem and give our take on what works best. Hope you like it!

Table of Contents


Have an OCR problem in mind? Want to extract data from documents? Head over to Nanonets and build OCR models for free!

Understanding The Data

Firstly we need to get the OCR results on the images. Since we’ll be attempting to use visual as well as text features. OCR for the most part can be assumed to be a solved problem and there are many OCR APIs like Tesseract.

Different OCR engines provide us with a different kind of output. Take for example your OCR outputs consist of blocks of information. Each block is associated with a type. For us the relevant ones are the ones of block type Text. Extracting all blocks with block type text lets us see what each block of text looks like. Inside each block of text, there are lines and within each line, there are characters. The blocks, lines as well as characters are associated with bounding box information. For Tesseract, the format is different and the bounding box information can be acquired with line numbers, paragraph numbers. It is also possible to obtain character boxes as well as text boxes. To learn more, visit this blog post.

We can build a pipeline that utilizes said bounding box information or we can build an engine completely only based on line and block information. Depending on the use-case and the kind of data one might have to deal with, the approach must change.

Template Matching & Why It Does Not Work

To build a template based OCR solution, we will have to create a list of patterns that each field might represent and write regular expressions for each. Every line would then be iterated over to search which lines match which pattern. Depending on which pattern is matched, we will be able to infer which field an extracted line belongs to. Once we have all the fields extracted, we can finally put them in an organised format for future use. You can find a great resource to pattern matching here.

The above mentioned approach has many limitations like being vulnerable to changing layouts, getting confused with multiline items and the risk of ending up with a product built on lines and lines of unreadable code. Pattern matching is not advisable for data which is highly unstructured and when the structure itself varies often. Also creating the list of patterns that might work for a generalised invoice can be a daunting task in itself and often, this is only applied in situations where a tool is being utilized in-house for specific kinds of documents that are set in their layouts.

Solving the Problem with Machine Learning

Machine learning approaches provide us with an alternative solution that allows us to circumvent our dependence on knowing document templates to extract text information and automate processes. We will look at a few of these approaches, try to understand the concepts associated with them and try to chalk out an end to end solution for our problem.

Named Entity Recognition

Named Entity Recognition allows us to evaluate a chunk of text and find out different entities from it - entities that don't just correspond to a category of a token but applies to variable lengths of phrases. The models take into consideration the start and end of every relevant phrase according to the classification categories the model is trained for.

All the lines we extracted and put into a dataframe can instead be passed through a NER model that will classify different words and phrases in each line into, if it does find any, different invoice fields.

The SpaCy NER tags includes DATE for invoice dates, ORG for vendor names, MONEY for prices and sum total, PRODUCT for items sold, CARDINAL for phone numbers, invoice numbers, etc. We can apply transfer learning and retrain NER models provided by popular NLP libraries like SpaCy, NLTK and Stanford CoreNLP by adding our new entities in their vocabulary and retraining the models. We can also search for patterns based on POS tagging using regular expressions or one of the above mentioned tools.

The key here is designing a strong NER tagging model. To do so, we need to be sure of two things -

  • How we represent word and character level features
  • What algorithms and architectures are used to make predictions.

Training Models for NER Tagging

We can utilize TF-IDF Vectorizer, n-grams or skip-grams to extract our feature representations, utilize GloVe Word2Vec for transfer word embeddings weights and re-train our embeddings using Keras, Tensorflow or PyTorch. Using embeddings will substantially increase our model strength since it will allow us to capture semantic information better.

We can build one our model  using traditional machine learning algorithms like SVMs and Naive Bayes as detailed here, deep learning architectures like RNNs, BiLSTMs, CRFs as is done here and here or using transformers like BERT/XLNet as done here. This repository provides all the transformer architectures and example training and inference scripts for several NLP tasks, including NER tagging. The example script utilizes NER tagging for text summarization but can be repurposed to fit our task.

Once each block is classified into an invoice field, we can put together all the text present in different blocks to produce our results, organised in a JSON or XML format.

Understanding the Output

The way to create our final outputs will depend on how we want to look at the fields extracted. Considering we are going line by line in our pipeline, it would be easier and less prone to errors if we were to club together different extracted named entities into a sort of nested structure. How different lines and chunks of text are extracted and tagged into invoice fields from the predictions derived using NER tagging on phrases and words is explained in the inference evaluation section.

Take for example the items bought that are mentioned in the invoice. It is important that each item is mapped with its price and the quantity. To do this we can create an object called ‘Inventory’ which will include different named entities like the product, the price and the quantity. This is similar to what is described here as nested entities.

Building a Dataset

For all of these situations, besides the strength of our learning algorithms, a deciding factor on how our model performs will be the strength of the data we train our model on. We might need to create a custom dataset for different fields and the phrases that represent these fields. For example,

The field ‘Balance’ can be associated with

  • ‘Balance due’,
  • ‘Payment due’,
  • ‘Amount’,
  • ‘Amount to be paid’,
  • ‘To be paid’,
  • ‘Amount due’, etc.

This process can be done semi-automatically by extracting OCR outputs, finding chunks of text that contain some key words like ‘Amount’, ‘Due’, ‘Balance’, etc. and extract all the phrases found in all the invoices that match the patterns. But creating an exhaustive list is usually a difficult task and we would want to consider using word embeddings and document embeddings to extract features out of these phrases and map them to our fields. So if we encounter a new phrase or word, we can compare our word embeddings to find if the field is similar in its semantic information to do our classification. This is similar to what SpaCy documentation called entity linking using a knowledge base.

Evaluating the NER Tagging Model

When we are able to extract named entities, it is usually done by classifying words or phrases into different fields. Besides finding out the accuracy of our named entity recognition during training, we also need some way of knowing how our tagging model is performing on unseen data. We will, in the coming sections, look at how to evaluate our training process, how to evaluate a continuous training loop and how to measure our inference performance.

NER Training

We evaluate the training process by creating a testing dataset and finding some key metrics -

  • Accuracy: the ratio of correct predictions made against the size of the test data.
  • Precision: the ratio of true positives and total predicted positives.
  • Recall:  the ratio of true positives and total actual positives.
  • F1-Score: harmonic mean of precision and recall.

Different metrics take precedence when considering different use-cases. In the case of invoice processing, we know that a goof-up in the numbers or missing an item can lead to losses for the company. This means that besides needing a good accuracy, we also need to make sure the false positives for money related fields are at a minimum - so aiming for a high precision value might be ideal. We also need to make sure that details like invoice number and dates are always extracted and done correctly since they are needed for legal and compliance purposes, so maintaining a high recall value for these fields might take precedence.

NER Re-training

Digitization of documents is a difficult task to completely automate and trying to aim for 100% automation is seldom practical. When models are encountered with unseen data, building humans in the loop solutions that allow humans to review the information extracted from text, correct predictions and allow for re-training can help constantly improve model accuracy while minimizing the costs algorithmic errors might incur.

Ideally, you’d want to set up a pipeline where you collect data for a few days/weeks/months (depending on the scale you are operating), retrain your model on this new data and evaluate the predictions to see if your model has improved or not.

Here’s what the DevOps pipeline for Microsoft Azure looks like -

source

Versioning datasets, writing differential tests and including them in your continuous integration pipelines can help keeping the most stable algorithms deployed. The differential tests are meant to evaluate, in this case, the metrics we mentioned above, compare the models before and after re-training and decide if deploying the new models is a viable option. The differential tests will not allow models that are inferior post re-training to get deployed.

A note on human error and it’s costs

Though automating the process of digitizing invoices reduces workload for humans and the chances of fatigue related errors, wrong annotations can cost the model, and not always equally. This paper talks about calculating annotation costs (they use a metric based on the length of the text re-annotated incorrectly) and trying to minimize the same in your retraining loop. The metric mentioned in the above paper will not be applicable to the use-case of invoice processing but devising an algorithm that limits re-annotations with higher possible costs can help maintain your prediction quality for a long time.

NER Inference

Remember how we split our XML data into lines and chunks? OCR solutions, especially the commercial ones produce one of the most reliable OCR outputs for digital documents in the industry. Often, an entire chunk will belong to the same invoice field. For an unseen document, assuming our OCR predictions are satisfactorily accurate and that we have a different way of dealing with nested entities (more on this in the next paragraph), we can devise a way to measure confidence in our predictions for each chunk instead of just relying on the confidence provided by our NER tagging model.

We can extrapolate the named entity prediction information for words and phrases to assign invoice fields to each of the chunks and lines extracted from text. We can measure confidence for each chunk by aggregating the named entity prediction confidence for each field that occured in the chunk.

For chunks with single field predictions, the mechanism is pretty straight forward. An average value for all the named entity confidence scores is returned.

If multiple fields occur multiple times in the same chunk, we extract as many confidence scores as the number of fields present in the chunk. Here we need to define a threshold value.

Let’s do this with a basic example.

Here, we pick a threshold confidence of 0.5.
Say our algorithm detected 2 fields.
We have two NEs for field 1 with confidence scores 0.6 and 0.7.
We have three NEs for field 2 with confidence scores 0.3, 0.5 and 0.4.

Clearly, the mean confidence for field 2 is below the threshold, so the block would be classified as field 1 with a confidence of 0.65.

What if for the second field, the confidence scores are 0.5, 0.5 and 0.6?

This would mean two fields cross the threshold. In this case we would check if the two fields together fall into a nested field template. For example, a line item would consist of a product as well as the price. If that is the case, we aggregate all the confidence values and return the line item as the predicted field for the block with a confidence of, in this case, 0.58.

If it doesn’t match any such nested field template, we simply return the field with the higher confidence, like we did in the first thresholding example.

We can also devise more complex block inference values calculations and the corresponding metrics on a weighted average approach instead of a thresholding approach where relations between different entities are either defined via rules according to our use-case and the data we might encounter more frequently or can be learned using relation extraction methods as is done here.

Using Embeddings and CNNs

Tesseract OCR provides us with bounding box information of every chunk of text, every line and every character detected in invoices. This provides us with rich spatial information which unfortunately has been left unexploited in all the approaches mentioned above. This spatial information can be utilized by placing our OCR text in a grid form and using convolutional networks to learn how different named entities correlate with the spatial information as is done here.

The gridded texts are formed with the proposed grid positional mapping method, where the grid is generated with the principle that is preserving texts' relative spatial relationship in the original scanned document image. The rich semantic information is encoded from the gridded texts at the very beginning stage of the convolutional neural network with a word embedding layer. This allows it to access spatial as well as semantic information of the document and indeed, for receipts and invoices, the model has performed better than NER tagging based approaches. The authors use a spatial pyramid based CNN architecture for this purpose.

Using Graph Convolutional Networks

The idea of merging spatial and semantic information for information extraction has been applied in this paper as well. The graph embeddings produced by graph convolution summarize the context of a text segment in the document, which are further combined with text embeddings for entity extraction using a standard BiLSTM-CRF model.

Here each word is modeled as a node and its spatial relationships with its neighboring words modeled as edges. Every node in the graph is associated with a label, which in our case would be our invoice fields, and we want to predict the label of the nodes without ground-truth. The ablation studies revealed that their field extraction did not get too affected if only graph embeddings were used but saw a substantial setback in performance when only word embeddings were utilized.

Summary

We looked at several different machine learning approaches to extract relevant fields from invoices that included out of the box NER tagging methods, training our own NER tagging models, using machine learning classifiers as well as deep learning architectures for the same. We also looked at a few methods for information extraction from text that made use of spatial information along with word embeddings. We discussed creating custom datasets and evaluating our training procedures as well as inference procedures. We looked at continuous learning loops and spoke of some best practices regarding the same.

OCR with Nanonets

The Nanonets Platform allows you to build OCR models with ease. You can upload your data, annotate it, set the model to train and wait for getting predictions through a browser based UI without writing a single line of code, worrying about GPUs or finding the right architectures for your deep learning models.

Start using Nanonets for Automation

Try out the model or request a demo today!

TRY NOW