Invoice Parser - Invoice Data Extraction for PDFs and Scanned Documents
If you've ever had to process an invoice manually, you know just how time-consuming and tedious the process can be. Not to mention, it's prone to mistakes since it's easy to miss something when you're doing everything by hand.
That's where invoice parsers come in. These tools automate the process of extracting data from invoices, making it quick and easy to get the information you need. This can save you a lot of time and hassle and help ensure that your invoices are processed accurately. Let’s dive right in.
What is an invoice parser?
An invoice parser is a type of software that is designed to read and interpret invoice documents. This can include PDFs, images and other types of files.
The purpose of an invoice parser is to extract key information from an invoice, such as the invoice ID, total amount due, invoice date, and customer name. Invoice parsers can help ensure accuracy by avoiding the mistakes that creep in during manual data extraction.
Invoice parsers can be standalone programs or be integrated into larger business software systems. These tools make it easier for teams to generate reports or export the data to other applications, such as Excel, and are often used alongside other business management applications.
There are many different invoice parsing software solutions on the market, so choosing one that meets your specific needs is essential.
How does an invoice parser work?
To understand how invoice parsers work, it is important to have a working knowledge of parsers.
Parsers interpret and process text written in a formal language, such as a markup language. They break the document down into smaller pieces, called tokens, and then analyze each token to determine its meaning and how it fits into the overall structure of the document.
To do this, a parser needs the grammar of the language being used. The grammar lets it identify individual tokens and work out the relationships between them. Depending on the parser, this process can be manual or automatic: in a manual approach, someone steps through the document and identifies each token, while automatic parsers use algorithms to detect and process tokens. Either way, parsers play an essential role in making sense of structured documents.
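As a rough illustration of tokenizing and then applying a grammar, here is a minimal Python sketch. The toy grammar (a line has the form label, colon, value), the token names, and the `tokenize` and `parse_line` helpers are assumptions made for this example; they are not part of any invoice parsing library.

```python
import re

# Toy token patterns, tried in order. The names are illustrative assumptions.
TOKEN_SPEC = [
    ("AMOUNT", r"\$\d+(?:\.\d{2})?"),
    ("NUMBER", r"\d+"),
    ("COLON", r":"),
    ("WORD", r"[A-Za-z]+"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Split text into (kind, value) tokens, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"Unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse_line(text):
    """Apply the toy grammar: one or more WORD tokens, a COLON, then a value."""
    tokens = tokenize(text)
    kinds = [kind for kind, _ in tokens]
    if "COLON" not in kinds:
        raise ValueError("Expected a line of the form 'label: value'")
    split = kinds.index("COLON")
    label_tokens, value_tokens = tokens[:split], tokens[split + 1:]
    if not label_tokens or not value_tokens:
        raise ValueError("Both a label and a value are required")
    if any(kind != "WORD" for kind, _ in label_tokens):
        raise ValueError("A label may only contain words")
    label = " ".join(value for _, value in label_tokens)
    value = " ".join(value for _, value in value_tokens)
    return label, value


print(tokenize("Total Due: $93.50"))
print(parse_line("Total Due: $93.50"))
```

The tokenizer only knows what a token looks like; the grammar check in `parse_line` is what decides how the tokens relate to each other.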
In data extraction, invoice parsing can analyze an invoice document and extract relevant information.
Consider, for example, the case where you have been given many invoices and want to store data from them in a structured format. Invoice parsing enables you to load all the files and run optical character recognition (OCR) so that the data can be read and all the key-value pairs extracted within a few minutes. Next, you can use some post-processing algorithms to store them into more readable formats like JSON or CSV. You can also build processes and workflows using invoice parsing to automate the extraction of invoices from your business's records.
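The post-processing step mentioned above might look something like the following sketch. It assumes the OCR stage has already produced key-value pairs as plain strings; the field names, the second sample record, and the `normalize`, `to_json`, and `to_csv` helpers are hypothetical and only serve the illustration.

```python
import csv
import io
import json
from datetime import datetime

# Hypothetical key-value pairs, as an OCR step might return them (all strings).
extracted = [
    {"order_number": "12345", "invoice_date": "January 25, 2016", "total_due": "$93.50"},
    {"order_number": "12346", "invoice_date": "February 2, 2016", "total_due": "$1,120.00"},
]


def normalize(record):
    """Turn raw OCR strings into typed, consistently formatted values."""
    invoice_date = datetime.strptime(record["invoice_date"], "%B %d, %Y").date()
    amount = float(record["total_due"].replace("$", "").replace(",", ""))
    return {
        "order_number": record["order_number"],
        "invoice_date": invoice_date.isoformat(),
        "total_due": amount,
    }


def to_json(records):
    """Serialize the records as a JSON array."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = [normalize(record) for record in extracted]
json_text = to_json(records)
csv_text = to_csv(records)
print(json_text)
print(csv_text)
```

Amounts are stored as floats here to keep the sketch short; a production pipeline would more likely keep currency values as decimals or integer cents.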
Invoice parsing with Python
Python is a programming language widely used for data extraction tasks, including invoice parsing. This section shows how to use Python libraries to extract data from invoices.
Building a generic state-of-the-art invoice parser that can run on all data types is difficult, as it includes various tasks such as reading text, handling languages, fonts, document alignment, and extracting key-value pairs. However, with help from open-source projects and some ingenuity, we could at least solve a few of these problems and get started.
For example, we'll use tabula, a Python library for extracting tables from PDFs, on a sample invoice. To run the code snippet below, make sure Python and the tabula-py and tabulate packages are installed on the local machine; tabula-py also needs a Java runtime, since it wraps tabula-java.
from tabula import read_pdf
from tabulate import tabulate

# PDF file to extract tables from
file = "sample-invoice.pdf"

# read_pdf returns a list with one DataFrame per table found in the PDF
tables = read_pdf(file, pages="all")

# print each extracted table
for table in tables:
    print(tabulate(table))
-  ------------  ----------------
0  Order Number  12345
1  Invoice Date  January 25, 2016
2  Due Date      January 31, 2016
3  Total Due     $93.50
-  ------------  ----------------
-  -  -------------------------------  ------  -----  ------
0  1  Web Design                       $85.00  0.00%  $85.00
      This is a sample description...
-  -  -------------------------------  ------  -----  ------
We could extract the tables from a PDF file with a few lines of code. This is because the PDF file was well formatted, aligned, and electronically created (not captured by camera). In contrast, if the document had been captured by a camera instead of being electronically produced, it would have been much harder for these algorithms to extract the data—this is where optical character recognition comes into play.
Let's use Tesseract, a popular OCR engine, through its Python wrapper pytesseract to parse an invoice image.
import cv2
import pytesseract
from pytesseract import Output

# load the scanned invoice and run OCR, returning word-level data as a dict
img = cv2.imread('sample-invoice.jpg')
d = pytesseract.image_to_data(img, output_type=Output.DICT)
print(d.keys())
This should give you the following output:
dict_keys(['level', 'page_num', 'block_num', 'par_num', 'line_num', 'word_num', 'left', 'top', 'width', 'height', 'conf', 'text'])
Using this dictionary, we can get each detected word along with its bounding box, its text, and its confidence score.
You can plot the boxes by using the code below:
n_boxes = len(d['text'])
for i in range(n_boxes):
    # only draw boxes for words detected with confidence above 60
    if float(d['conf'][i]) > 60:
        (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
        img = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imshow('img', img)
cv2.waitKey(0)
The result is the invoice image with a green box drawn around every word detected with a confidence above 60.
This is how we can detect and recognize the text regions of an invoice. However, custom algorithms must be built for key-value pair extraction. We’ll learn more about this in the following sections.
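To give a feel for what such a custom algorithm could look like, here is a minimal sketch that groups OCR words into lines and then matches each line against regular expressions. The dictionary `d` below is a hand-written stand-in containing only the keys the sketch uses, not real Tesseract output, and the `group_lines` and `extract_fields` helpers and the field patterns are assumptions for illustration.

```python
import re
from collections import defaultdict

# Hand-written stand-in for the dictionary returned by
# pytesseract.image_to_data(img, output_type=Output.DICT).
# The first entry mimics a non-word row (empty text, confidence -1).
d = {
    "block_num": [1, 1, 1, 1, 1, 1, 1, 1],
    "par_num": [1, 1, 1, 1, 1, 1, 1, 1],
    "line_num": [0, 1, 1, 1, 2, 2, 2, 2],
    "left": [0, 40, 110, 200, 40, 100, 150, 200],
    "conf": [-1, 96, 95, 93, 91, 90, 12, 94],
    "text": ["", "Order", "Number", "12345", "Total", "Due", "|", "$93.50"],
}

# Hypothetical patterns for two fields on this particular invoice layout.
FIELD_PATTERNS = {
    "order_number": re.compile(r"^Order Number\s+(\d+)$"),
    "total_due": re.compile(r"^Total Due\s+(\$[\d,]+\.\d{2})$"),
}


def group_lines(data, min_conf=60.0):
    """Rebuild text lines from confident words, ordered left to right."""
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        if not word.strip() or float(data["conf"][i]) < min_conf:
            continue  # skip empty entries and low-confidence noise
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((data["left"][i], word))
    return [
        " ".join(word for _, word in sorted(words))
        for _, words in sorted(lines.items())
    ]


def extract_fields(lines):
    """Return the first match for each field pattern."""
    fields = {}
    for line in lines:
        for name, pattern in FIELD_PATTERNS.items():
            match = pattern.search(line)
            if match and name not in fields:
                fields[name] = match.group(1)
    return fields


lines = group_lines(d)
fields = extract_fields(lines)
print(lines)
print(fields)
```

Because the patterns are tied to exact labels such as "Total Due", this approach only works while the invoice layout stays the same.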
Challenges with legacy rule-based invoice parsers
Today, many organizations still rely on legacy systems for invoice-data extraction.
These "rule-based" systems parse each line item on invoices and then compare them against a set of rules to determine whether the information should be added to their database.
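A rule-based check of this kind can be sketched in a few lines of Python. The line items, the three rules, and the `apply_rules` helper are all hypothetical and stand in for whatever rules a real legacy system would encode.

```python
# Hypothetical line items, as they might come out of an earlier extraction step.
line_items = [
    {"description": "Web Design", "quantity": 1, "unit_price": 85.00, "total": 85.00},
    {"description": "", "quantity": 2, "unit_price": 10.00, "total": 20.00},
    {"description": "Hosting", "quantity": 3, "unit_price": 5.00, "total": 20.00},
]

# Each rule is a (name, predicate) pair; an item must satisfy every rule.
RULES = [
    ("description is present", lambda item: bool(item["description"].strip())),
    ("quantity is positive", lambda item: item["quantity"] > 0),
    (
        "total equals quantity x unit price",
        lambda item: abs(item["quantity"] * item["unit_price"] - item["total"]) < 0.005,
    ),
]


def apply_rules(items):
    """Split items into those that pass every rule and those that fail any."""
    accepted, rejected = [], []
    for item in items:
        failed = [name for name, check in RULES if not check(item)]
        if failed:
            rejected.append((item, failed))
        else:
            accepted.append(item)
    return accepted, rejected


accepted, rejected = apply_rules(line_items)
print("accepted:", [item["description"] for item in accepted])
for item, failed in rejected:
    print("rejected:", item, "because", failed)
```

Only the accepted items would be written to the database; every new invoice quirk means another hand-written rule.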
This method has been used for a long time but has several drawbacks. Let's look at some common problems faced by legacy invoice parsers.
- Page tilt while scanning: Rule-based invoice parsers often struggle with "page tilt," where the fields on an invoice are not positioned in a straight line, which makes it hard for the parser to locate and extract the data accurately. Tilt can be caused by printers that do not print evenly or by manually entered data that is not aligned correctly.
- Format change: One of the most common issues a business faces is invoices that do not follow a standard format, which causes problems when extracting data. Different fonts may be used, and the invoice layout may change from one month to the next, making it difficult to parse the data and determine what each column represents. New fields may be added, existing fields may move to different positions, or the structure may change completely, in which case an ordinary rule-based parser will not recognize the invoice correctly.
- Table Extraction: Rule-based table extractors are often the simplest way to pull data out of a table, but they have limitations. Tables that have no headers, or that contain null values in some columns, can send the extractor into an infinite loop, which either wastes time and memory loading endlessly long rows or produces no output at all, particularly when other expressions depend on those columns. Additionally, when a table spans multiple pages, rule-based parsers treat each page's portion as a separate table, which misleads the extraction process.
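The multi-page problem from the last bullet can be illustrated with a small sketch that stitches per-page tables back together. It assumes each page's table arrives as a list of rows and uses a deliberately simple heuristic (same column count as the previous table) to decide that a table is a continuation; the `stitch_tables` helper and the sample rows are hypothetical.

```python
def stitch_tables(page_tables):
    """Merge per-page tables into logical tables.

    A table is treated as a continuation of the previous one when it has
    the same number of columns. A repeated header row is dropped.
    """
    merged = []
    for table in page_tables:
        if not table:
            continue  # nothing was extracted from this page
        if merged and len(table[0]) == len(merged[-1][0]):
            header = merged[-1][0]
            rows = table[1:] if table[0] == header else table
            merged[-1].extend(list(row) for row in rows)
        else:
            merged.append([list(row) for row in table])
    return merged


# Hypothetical extraction result: one line-item table split over three pages,
# followed by an unrelated two-column table.
page_tables = [
    [["Item", "Qty", "Amount"], ["Web Design", "1", "$85.00"]],
    [["Item", "Qty", "Amount"], ["Hosting", "2", "$10.00"]],  # header repeated
    [["Support", "1", "$20.00"]],  # continuation without a header
    [["Order Number", "12345"]],
]

tables = stitch_tables(page_tables)
for table in tables:
    print(table)
```

The column-count heuristic will wrongly merge two unrelated tables that happen to have the same width, which is exactly the kind of edge case that makes hand-written rules brittle.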
Build an AI-based invoice parser with Nanonets
Invoice parsers with optical character recognition (OCR) and deep learning can extract data from invoices that have been scanned or converted to PDFs. This data can then populate accounting software, track expenses, and generate reports.
Deep learning algorithms can learn how to identify specific elements in an invoice, such as the customer's name, address, and product information. This allows for more accurate data extraction and can reduce the time needed to manually input data into a system. Building such algorithms requires a lot of time and expertise, but don’t worry; Nanonets has your back!
Nanonets is an OCR software that uses artificial intelligence to automate the extraction of tables from PDF documents, images, and scanned files. Unlike other solutions, it doesn’t require separate rules and templates for each new document type. Instead, it relies on cognitive intelligence to handle semi-structured and unseen documents while improving over time. You can also customize the output to only extract tables or data entries of your interest.
It is fast, accurate, easy to use, allows users to build custom OCR models from scratch, and has some neat Zapier integrations. Digitize documents, extract tables or data fields, and integrate with your everyday apps via APIs in a simple, intuitive interface.
Why is Nanonets the Best PDF Parser?
- Nanonets can extract on-page data, while command-line PDF parsers only extract objects, headers, and metadata (title, number of pages, encryption status, etc.).
- Nanonets PDF parsing technology isn't template-based. Apart from offering pre-trained models for popular use cases, the Nanonets PDF parsing algorithm can also handle unseen document types!
- Apart from handling native PDF documents, Nanonets' in-built OCR capabilities allow it to handle scanned documents and images as well!
- Robust automation features with AI and ML capabilities.
- Nanonets handles unstructured data, common data constraints, multi-page PDF documents, tables, and multi-line items with ease.
- Nanonets is a no-code tool that can continuously learn and re-train itself on custom data to provide outputs requiring no post-processing.
Automated invoice parsing with Nanonets
Integrate your existing tools with Nanonets and automate data collection, export storage, and bookkeeping.
Create completely touchless invoice processing workflows.
Nanonets can also help in automating invoice parsing workflows by:
- Importing and consolidating invoice data from multiple sources - email, scanned documents, digital files/images, cloud storage, ERP, API, etc.
- Capturing and extracting invoice data intelligently from invoices, receipts, bills, and other financial documents.
- Extracting data from barcodes or QR codes
- Categorizing and coding transactions based on business rules.
- Setting up automated approval workflows to get internal approvals and manage exceptions.
- Reconciling all transactions
- Integrating seamlessly with ERPs or accounting software such as Quickbooks, Sage, Xero, Netsuite, and more.