PDF and JSON Formats

PDFs are one of the most widely used data formats for business documents. Many businesses and organizations depend on various tools to create and read these PDF documents. However, it is often hard to access specific, important information buried inside these PDFs.

This is where JavaScript Object Notation (JSON) comes into the picture. It is one of the best-loved data formats for information exchange; in web applications especially, most data travels as JSON through APIs and database queries.


In this blog post, we'll look at different techniques for exporting PDFs into JSON data. We'll also learn to extract complex parts of PDFs, such as tables and particular text. Lastly, we'll look at some custom workflows that can help automate the process of converting PDFs to JSON using OCR and Machine Learning.

The Need for PDF to JSON Conversion

Almost every business relies on documents for information sharing: documentation, invoices, tax filings, receipts, medical reports and many more. Most of these documents exist as PDFs. But if you want to search for critical information across them, or build a dashboard that helps you analyse and store all this information, collecting data from the PDFs can be a complex task.


Want to extract information from PDF documents and convert them into a JSON format? Check out Nanonets to automate export of any information from any PDF document into JSON format!


If the PDFs are electronically generated, we can copy-paste information into data sources; otherwise, we might have to use OCR and machine learning techniques to extract the information, because these PDFs are not editable. Also, the data in a PDF is not organised: all the text and tables are simply laid out on the page, so we may have to search for information manually. In JSON, by contrast, everything is organised in key-value pairs. As an example, consider the following PDF invoice.

If you want to build a web-based dashboard to store all your invoices and see how your business is performing, you might have to upload all the information from the PDFs to your database manually. Inside PDFs we find different font sizes and tables with several rows and columns, which often feels unorganised when we want to extract information. In JSON, everything is organised in key-value pairs, so searching, storing, and extending the information becomes much easier for companies.

{
  "company_name": "Company Name",
  "invoice_date": "Date",
  "invoice_total": "$0.00",
  "invoice_line_items": "",
  "invoice_tax": ""
}

As you can see in the JSON above, the data is far more organised, and you can also share this information on the web more conveniently. This is why exporting crucial data from PDFs into JSON is helpful for a lot of companies.

Business Benefits that Come with JSON

The JSON data format has a lot of advantages over PDFs for businesses. Here's why:

  1. JSON is Faster: JSON syntax is lightweight and easy to use, so parsing JSON data executes much faster than digging the same information out of PDFs and other data formats.
  2. More Readable: JSON data is more readable, with a straightforward mapping of keys to values. If you're searching for something or organising the data pulled from PDFs, JSON is more convenient. Additionally, JSON supports nesting, so data from tables can be stored more efficiently.
  3. Convenient Schema: JSON is supported on virtually every operating system and programming language, so if you're building software or a web application to automate your business, JSON is the right data format. Most web browsers also parse JSON natively, so no third-party software is needed to read through JSON data.
  4. Easy Sharing: JSON is a great format for sharing data of any size, including large tables and long text. Because JSON stores data in arrays and objects, transferring and accessing it is straightforward, which is why JSON is such a popular format for web APIs and web development.

These are some of the reasons to choose JSON over PDFs for storing crucial information from documents. That said, JSON has no built-in validation or error handling, so when integrating with web services you must take care to store and transmit the JSON data securely. In the next section, let's look at some of the challenges we may face when converting PDFs to JSON format.




Challenges with Converting from PDF to JSON

As mentioned earlier, the information in a PDF is laid out for human reading rather than structured for machines: we see text in different font sizes and alignments. That makes it genuinely complicated for parsers to read through PDFs and convert them into JSON format. Also, before exporting PDFs into JSON, one must check whether the PDFs were electronically created or not.

Electronically created PDFs are documents that were first made with software like MS Word or Google Docs and then exported to PDF. In this case, we can use an algorithm or simply copy-paste data into JSON format. If the PDFs were instead created by scanning or capturing images with a camera, we'll need tools like OCR to read the data before exporting it into JSON format. Let's look at some of the challenges in exporting from PDFs to JSON.

Extracting Text:

  1. Detecting fonts: People use different fonts, colours, and alignments inside PDF documents, which makes them genuinely hard for parsers to read. While exporting, we also have to define specific rules so that, after the parser extracts the data, every piece of information is mapped to the correct key in the JSON output. Regular expressions are widely used here to pick out specific text and route it to the right JSON key.
  2. Detecting text from scanned documents: As discussed, when the PDFs are not electronically generated we have to use an OCR, and choosing an OCR is crucial. Many users try open-source tools like Tesseract, but these have their own limitations: if the text was captured poorly or is misaligned, Tesseract may fail, and commercial alternatives can be expensive.
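The rule-based mapping described in point 1 can be sketched in a few lines of Python; the field names and regex patterns below are hypothetical and would need tuning to real documents:

```python
import json
import re

# Hypothetical rules: each JSON key gets a regex that captures its value
# from the raw text a parser (or OCR) returned.
RULES = {
    "invoice_id": re.compile(r"Invoice\s*(?:No\.?|#)\s*[:\-]?\s*(\S+)", re.I),
    "invoice_date": re.compile(r"Date\s*[:\-]?\s*([\d/\-.]+)", re.I),
    "invoice_total": re.compile(r"Total\s*[:\-]?\s*(\$?[\d,.]+)", re.I),
}

def text_to_json(raw_text: str) -> str:
    """Apply each rule to the raw text; unmatched fields stay empty."""
    record = {}
    for key, pattern in RULES.items():
        match = pattern.search(raw_text)
        record[key] = match.group(1) if match else ""
    return json.dumps(record, indent=4)

sample = "Invoice #INV-234\nDate: 12/01/2022\nTotal: $345.00"
print(text_to_json(sample))
```

Each unmatched field is left empty rather than dropped, so downstream consumers always see the same JSON schema.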

Extracting Tables:

  1. Identifying Tables: Most business documents contain tabular information, and detecting these tables in PDF documents and converting them into JSON is a challenging task. There are libraries based on Python and Java that can extract tables from electronically made PDF documents.
  2. Identifying Tables from Scanned PDFs: When the PDFs are scanned, most packages don't work. If we pick an open-source OCR like Tesseract, it can extract the text but loses all the table formatting, which makes it hard to pick out line items correctly. This is where Machine Learning and Deep Learning-based algorithms come in. Popular approaches are based on CNNs, and research into improving them is ongoing. But to build these algorithms in-house, we need a lot of training data and custom pipelines. We'll look at these pipelines in detail in the following sections.
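Whichever extractor is used, its raw output is often just a header row plus data rows; turning that into JSON is then a small step. A minimal sketch (the table contents are invented for illustration):

```python
import json

def rows_to_json(rows):
    """Convert a table given as [header, row, row, ...] into a list of
    key-value records, i.e. one JSON object per table row."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]

# Hypothetical output of an OCR / table-extraction step:
table = [
    ["item", "qty", "price"],
    ["Paper", "10", "$2.00"],
    ["Toner", "1", "$45.00"],
]
print(json.dumps(rows_to_json(table), indent=4))
```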


In the next section, let’s look at how to parse data from PDF to generate JSON files.

Parsing Data from PDFs and Generating JSON Files

Parsing PDFs isn't a complicated task if you have some developer experience. First, we have to check whether our PDF files contain text data or consist of scanned images: try to extract text, and if nothing comes back, pipe the file through an OCR library instead. This can be achieved with a Python library or with some Linux command-line utilities.

pdftotext is one of the most popular utilities for parsing electronically made PDFs. We can use it to convert all of a PDF's data into text format and then push that into JSON. Here are the instructions for installing and using pdftotext on a Linux machine.

First, install command-line tools:

sudo apt-get install poppler-utils

Next, use the pdftotext command and add the PDF file’s source path and destination text file location.

pdftotext {PDF-file} {text-file}

With this, we should be able to extract all the readable text from the PDF files.
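The electronic-versus-scanned check can be wired around pdftotext with a small wrapper: if no text comes back, the file is likely a scan and should be routed to OCR. A sketch, assuming the poppler-utils pdftotext binary is on the PATH (passing '-' as the output file sends the text to stdout):

```python
import subprocess

def extract_text(pdf_path: str) -> str:
    """Run pdftotext; '-' sends the extracted text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def needs_ocr(extracted_text: str) -> bool:
    """No meaningful text back usually means a scanned (image-only) PDF."""
    return not extracted_text.strip()

# Usage (hypothetical file):
#   text = extract_text("invoice.pdf")
#   if needs_ocr(text):
#       ...route the file to an OCR step instead of a text parser
```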

To generate a JSON file, we again need a script, tailored to our data, that parses the text and exports it as relevant key-value pairs. Here's an example Python script that converts a simple .txt file of key-value lines into JSON format.

import json

filename = 'data.txt'
parsed = {}

with open(filename) as fh:
    for line in fh:
        # each line holds a key, some whitespace, then the value
        if not line.strip():
            continue  # skip blank lines
        key, value = line.strip().split(None, 1)
        parsed[key] = value.strip()

# write the dictionary out as test1.json
with open("test1.json", "w") as out_file:
    json.dump(parsed, out_file, indent=4, sort_keys=False)

Consider the data inside the text file to be:

invoice_id #234
invoice_name Invoice from AWS
invoice_total $345

Here, we first import the built-in json library and create a dictionary to hold all the key-value pairs from the text file. We then iterate through every line in the file, split it into a key and a value, and store the pair in the dictionary. Lastly, we make a new JSON file and use the json.dump method to dump the dictionary into it, with a configuration that controls key sorting and indentation.

However, data from real PDFs will rarely be as organised as in this example, so we might have to use custom pipelines and scripts to work through complicated text formatting. In such cases, tools like Nanonets are a great choice, and we'll look at how Nanonets solves this problem in a much easier way in the following sections.

Before that, let’s look at one more library that converts PDF to JSON using node.js:

pdf2json is a node.js module that parses and converts PDF from binary to JSON format; it's built with pdf.js and extends it with interactive form elements and text-content parsing outside the browser.

Here’s an example of using this module to parse your PDF files:

First, make sure you have npm installed, then install the module using the following command:

npm install pdf2json

Next, in your node server, you can use the following snippet, which loads pdf2json and exports a PDF to JSON:

const fs = require('fs');
const PDFParser = require('pdf2json');

const pdfParser = new PDFParser();

pdfParser.on('pdfParser_dataError', errData => console.error(errData.parserError));
pdfParser.on('pdfParser_dataReady', pdfData => {
    // write the parsed PDF structure out as JSON
    fs.writeFile('./pdf2json/test/F1040EZ.json', JSON.stringify(pdfData),
        err => { if (err) console.error(err); });
});

pdfParser.loadPDF('./pdf2json/test/pdf/fd/form/F1040EZ.pdf');

The above code snippet loads an example PDF file that ships with the module and exports it as a JSON file; we can check the result in the ./test/target/ folder of your project. Below, you’ll find a screenshot of how the module exports the JSON files:

These libraries may not work for parsing tables inside PDFs. For tables, we'll have to either run an OCR and restructure its output into JSON ourselves, or combine OCRs with Machine Learning algorithms to extract the tabular data into JSON. Here's a screenshot of how Nanonets OCR returns JSON data:

As we can see, the Nanonets OCR engine identifies all the tables in the uploaded PDFs, and with a single click we can download the data in CSV or JSON format. We'll also be looking at other tools and their performance in the following sections.




Customised Data Conversion from PDF to JSON

Sometimes, extracting data from business documents requires customisation. For example, if we only want certain pages or tables, we can't do that directly; we need to provide additional rules to the parsers, which is again time-consuming. Let's look at the customisations people most often need and how we can achieve them.

Below are some of the actions that are required for customisation in PDF to JSON conversion:

  • Extract only particular text or pages from PDFs
  • Extract all the tables from PDF documents
  • Extract particular columns from certain tables in PDFs
  • Filter text from PDFs before exporting it into JSON
  • Create nested JSON based on the data extracted from PDFs
  • Format the JSON structure based on the data
  • Create, delete, and update values of certain fields in the JSON after extraction

These actions are often required when storing our data in different ways, or when building APIs for an application. Let's see how we can achieve them.

Extracting Particular Text: From electronically generated PDFs, we can extract particular text using regular expressions; for example, emails and phone numbers can be picked out with the right regex. If the PDFs are scanned, we instead need to train a deep learning algorithm that understands the layout of the PDFs and extracts fields based on coordinates and the annotations made to the training data. One of the most popular open-source projects for understanding document layouts and extracting text is LayoutLM, which builds on BERT-style models for custom text extraction. However, we need enough data to achieve high accuracy in extracting text.
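The email and phone-number example might look like this; note the phone pattern is deliberately simple and would need hardening for production data:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def contact_fields(text):
    """Pull every email address and phone-like number out of raw PDF text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }

sample = "Billing: accounts@example.com, support line +1 (555) 010-7788."
print(contact_fields(sample))
```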

Table Customisation: As discussed, tables can be extracted using libraries like Camelot and Tabula-py, or using OCR and deep learning-based algorithms. For customisation, we can then use a library like pandas, which allows us to create, update, and serialise the data from the tables. Its core data type, the DataFrame, is widely used for manipulating and customising table data. Another advantage of pandas is that we can write custom functions that perform math operations during the extraction process.
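A sketch of the pandas step (the line-item columns and values are invented for illustration): load the extracted rows into a DataFrame, compute a custom column, and serialise the table as nested JSON under the invoice record.

```python
import json

import pandas as pd

# Hypothetical line items recovered from an invoice table.
line_items = pd.DataFrame(
    {"item": ["Paper", "Toner"], "qty": [10, 1], "price": [2.0, 45.0]}
)

# A custom math operation during extraction: per-line totals.
line_items["line_total"] = line_items["qty"] * line_items["price"]

# One JSON object per table row, nested under the invoice record.
invoice = {
    "invoice_id": "INV-234",
    "line_items": json.loads(line_items.to_json(orient="records")),
}
print(json.dumps(invoice, indent=4))
```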

Formatting JSON Data: After exporting PDFs into JSON, formatting the result is a straightforward task, since key-value pairs are easy to manipulate. We can either write simple scripts or use online tools to search through these key-value pairs and format them. Common formatting parameters include indentation, separators, key sorting, and circular-reference checks. If the JSON is being served as an API, we can use Postman or a browser extension to format the data and interact with the API.
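In Python, the formatting parameters just mentioned map directly onto arguments of the built-in json.dumps:

```python
import json

record = {"invoice_total": "$345", "invoice_id": "#234"}

# Indentation and key sorting for human-readable output.
pretty = json.dumps(record, indent=4, sort_keys=True)
print(pretty)

# Compact separators for payloads sent over an API.
compact = json.dumps(record, separators=(",", ":"))
print(compact)

# check_circular (on by default) makes dumps raise instead of
# recursing forever when a structure references itself.
loop = {}
loop["self"] = loop
try:
    json.dumps(loop)
except ValueError as err:
    print("circular reference detected:", err)
```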

Automated PDF to JSON Converter

Some businesses need automated solutions for going through their documents to generate reports or extract data. For these use cases, they’ll have to build custom workflows or APIs that perform specific tasks. For example, say we’re going through a set of medical reports in PDF format and want to extract the patient details and the treatment provided; here’s what the workflow looks like:

  1. Read through the PDFs and pick out the text using programming libraries or OCR.
  2. Filter the text and extract the selected information, such as patient_id or patient_name. If the PDFs are scanned, we may have to build a program that extracts this text from images using DL algorithms.
  3. Parse the tables and pick out crucial information from them.
  4. Export all the data into the desired format, say a database or an Excel sheet.
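The four steps above can be sketched as a pipeline of small functions; the stub bodies below are placeholders for whatever library or OCR engine each step ends up using:

```python
import json

def extract_text(pdf_path):
    """Step 1: programmatic extraction or OCR (stubbed here)."""
    return "patient_id P-001\npatient_name Jane Doe"

def filter_fields(text, wanted):
    """Step 2: keep only the selected fields from the raw text."""
    pairs = (line.split(None, 1) for line in text.splitlines() if line.strip())
    return {key: value for key, value in pairs if key in wanted}

def extract_tables(pdf_path):
    """Step 3: table parsing (stubbed)."""
    return [{"treatment": "X-ray", "cost": "$120"}]

def export(record, out_path):
    """Step 4: write the combined record to its destination."""
    with open(out_path, "w") as fh:
        json.dump(record, fh, indent=4)

def run_pipeline(pdf_path, out_path):
    record = filter_fields(extract_text(pdf_path),
                           {"patient_id", "patient_name"})
    record["treatments"] = extract_tables(pdf_path)
    export(record, out_path)
    return record
```

In a real workflow, each stub would call into a concrete library (pdftotext, an OCR engine, a table extractor) while the pipeline shape stays the same.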

Most of these workflows are built as web applications, which lets businesses store all their documents in the cloud and process them quickly. If confidential data is involved, they either self-host all their services or build offline software with intelligent algorithms. Sometimes they utilise APIs and RPA solutions to develop robots and automate these workflows. Now let’s see how APIs and webhooks can power these automated solutions for converting PDFs into JSON.

To make each of these steps communicate with the next, we can use either APIs or webhooks.

Webhooks: Webhooks are automated messages sent from apps when something happens. They carry a payload and are delivered to a unique URL, which makes them ideal for connecting tasks in workflows. For example, say we have two tasks: in the first we extract text from a PDF, and in the second we extract tables. Normally, we’d have to trigger task two manually after task one finishes, but with a webhook, as soon as the text is extracted, the webhook automatically hands the PDF on to the table-extraction task. For more information about using webhooks, do check out this guide.
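The receiving side of such a hook can be sketched as a handler that reads the payload posted by the first task and kicks off the second; the payload shape and the trigger_table_extraction function are made up for illustration:

```python
import json

def trigger_table_extraction(pdf_id):
    """Placeholder for the second task in the workflow."""
    return f"table extraction queued for {pdf_id}"

def handle_webhook(raw_body: str) -> str:
    """Parse the webhook payload and route it to the next task."""
    payload = json.loads(raw_body)
    if payload.get("event") == "text_extracted":
        return trigger_table_extraction(payload["pdf_id"])
    return "ignored"

# Payload as the first task might POST it to the webhook URL:
body = json.dumps({"event": "text_extracted", "pdf_id": "invoice-42.pdf"})
print(handle_webhook(body))
```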

APIs: APIs are one of the most familiar ways to communicate information on the web. For tasks like converting PDF to JSON, building an API is one of the easiest approaches. To create an API server, we first have to choose a web framework; since we're working at the intersection of OCR and deep learning, Python is the go-to language, and frameworks like Django, Flask and FastAPI can achieve our task in a much easier way.

However, to build a well-organised API server, one has to be good at organising all the data schemas; this helps manage workflows much more smoothly. We can also connect our data to third-party APIs like Nanonets to perform extraction tasks. For example, say we have a workflow with logic to save all the PDFs arriving from emails, direct uploads and different software; we could simply use the Nanonets API to convert all the PDF data into the required JSON format.




Common Issues while Exporting PDF to JSON

  1. Configuring modules: If the PDFs are electronically created, most developers use modules or frameworks from various programming languages to extract text from these PDFs and export it to a JSON format. But they face several issues while setting these up in different environments.
  2. Multiple framework configurations: Building a custom workflow means combining several libraries, say pdftotext to extract PDF data, tabula to extract tables, pandas to process those tables, and finally json to export the data. That's a simple scenario in Python; in JavaScript, we might not find frameworks for table extraction at all. These modules are also limited, and some only return the metadata of PDFs.
  3. Language and Special Characters: When extracting text from PDFs, sometimes even popular libraries are unable to read special characters or a specific language. Most modules support only English; however, when using OCR tools, we can often work with more than 30 languages.

Nanonets™ Advantage in PDF to JSON Conversion

In this section, we’ll look at how Nanonets can help us make PDFs to JSON extraction more customisable and easier.

Nanonets is a cloud-based OCR service that helps automate manual data entry using AI. It provides a dashboard where we can build and train OCR models on our own data and export the results as JSON, CSV or any desired format. Here are some of the advantages of using Nanonets for PDF to JSON conversion.
  1. Custom Rules: We have an option to add custom rules to choose which fields on our documents to extract. For example, if your business documents have 100 fields and you only want around 30 of them, Nanonets™ can do that: just select the necessary fields on the model, and the selection applies to all the documents.
  2. Post-processing: On Nanonets™, you can also post-process your data after extraction. For example, if there are errors in the extracted data, you can write scripts to clean it and export it into the desired format.
  3. Fraud Checks: If there's any financial or confidential data in our documents, Nanonets™ models can also perform fraud checks. They look for edited or blurred text in scanned documents and notify the admins; duplicate documents or information can also be identified through these models.
  4. Table Extraction: One decisive advantage of using Nanonets™ as a PDF to JSON converter is that it also picks out tables with high accuracy and exports the data as nested JSON, even for complex tables. We therefore don't need to work on key-value pair extraction and table extraction separately; everything gets done in one shot.
  5. Extract from Poorly Scanned Images: Nanonets™ models can extract text from scanned PDFs even if the images are low-resolution or oriented at a slight angle, using powerful deep learning techniques.