This blog post is a starting point for anyone looking to perform OCR on PDF files and images. We start by introducing a set of free online OCR tools and links. We then walk through a Python code tutorial which takes you through the process of performing OCR on PDF files and images, and discuss more specific OCR functionalities and their implementation towards the end.


Have an OCR problem in mind? Want to reduce your organization's data entry costs? Head over to Nanonets and build OCR models to start automating manual effort and processes using advanced AI.


Introduction

The total number of PDF documents in the world is estimated to have crossed 3 trillion. Their adoption can be attributed to their platform independence, which gives a consistent and reliable rendering experience across environments.

Every day there are many instances where text and tabular information needs to be read and extracted from PDFs. People and organisations that traditionally did this manually have started looking at technological alternatives which can replace manual effort using AI.

A few use cases for extracting data from PDF documents are given below. If your use case is among them, we recommend following the links given below, which lead to our specialized blogs explaining and providing solutions for each of these use cases.

OCR stands for Optical Character Recognition. It employs AI to convert an image of printed or handwritten text into machine-readable text. Various open-source and closed-source OCR engines exist today. Note that the job is often not complete once the OCR engine has read the document and returned a stream of text: further layers of technology are built on top of it to take the now machine-readable text and extract relevant attributes in a structured format.
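To make the idea of a layer built on top of the raw text stream concrete, here is a minimal sketch that pulls a few fields out of OCR output with regular expressions. The sample text, the field names and the patterns are all made up for illustration; real documents are noisier and need patterns tuned to their layout.

```python
import re

# Made-up OCR output for illustration; real engines return noisier text.
ocr_text = """ACME Corp
Invoice No: INV-1042
Date: 2023-03-15
Total Due: $1,250.00
"""

# Hypothetical field patterns, not taken from any library.
FIELD_PATTERNS = {
    "invoice_number": r"Invoice\s*No[.:]?\s*([A-Z0-9-]+)",
    "date": r"Date[.:]?\s*(\d{4}-\d{2}-\d{2})",
    "total": r"Total\s*Due[.:]?\s*\$?([\d,]+\.\d{2})",
}


def extract_fields(text, patterns=FIELD_PATTERNS):
    """Return a dict of field name -> first regex match, or None when absent."""
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        fields[name] = match.group(1) if match else None
    return fields


print(extract_fields(ocr_text))
```

Fields that are not found come back as None, which makes it easy to flag documents that need a manual look.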

Free Online OCR Tools

There are a number of free online OCR tools available. Using one is simply a matter of uploading your input files, waiting for the tool to process them, and then downloading the output in the required format.

Here is a list of free online OCR Tools -

The next section is a Python code tutorial to perform OCR on PDFs and images.

Python Code - Read your first PDF File Using Pytesseract

Tesseract is a popular OCR engine, and Pytesseract is a Python wrapper built around it. Let us take the example of the PDF invoice shown below and extract text from it.

invoice-sample.pdf

The first step is to install all prerequisites on your system.

Tesseract

Installing the Tesseract OCR Engine is the first step here.

  • Windows - installation is easy with the precompiled binaries found here. Do not forget to edit the “path” environment variable and add the Tesseract path.
  • Linux / Mac - can be installed with a few commands.
  • TIP - The easiest way to install on Mac is using Homebrew. Follow the steps here.

After the installation, verify that everything is working by typing the following command in the terminal or cmd:

$ tesseract --version

You should see output similar to:

tesseract 5.1.0
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 9e : libpng 1.6.37 : libtiff 4.4.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.5.0
 Found NEON
 Found libarchive 3.6.1 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.2
 Found libcurl/7.77.0 SecureTransport (LibreSSL/2.8.3) zlib/1.2.11 nghttp2/1.42.0
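The same check can be scripted from Python, which is handy before calling Pytesseract from your own code. Below is a minimal sketch using only the standard library; the function name tesseract_version is my own and not part of any package.

```python
import shutil
import subprocess


def tesseract_version(binary="tesseract"):
    """Return the first line of `<binary> --version`, or None if it is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some builds print the version banner to stderr, so fall back to it.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


print(tesseract_version() or "tesseract not found on PATH")
```

If this prints "tesseract not found on PATH", revisit the installation step and the “path” environment variable before going further.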

Pytesseract

Pytesseract is a Python wrapper for Tesseract. You can install it using pip.

$ pip install pytesseract

pdf2image

Tesseract takes images as input, which means that we need to convert our PDF files to images before running OCR on them. The pdf2image library helps us do this. Note that pdf2image relies on the poppler utilities being installed on your system. You can install pdf2image using pip.

$ pip install pdf2image

OCR using Pytesseract

Now we are good to go. Reading text from PDFs is possible in a few lines of Python code.

import pdf2image
import pytesseract

# Convert each page of the PDF into a PIL image
pages = pdf2image.convert_from_path('invoice-sample.pdf')

# Run OCR on every page and print the detected text
for page in pages:
    detected_text = pytesseract.image_to_string(page)
    print(detected_text)

Running the above Python code snippet on the PDF invoice example ('invoice-sample.pdf'), we obtain the below output from the OCR engine.

We can see that the detected_text variable in the above code snippet holds, for each page in turn, the text detected by the OCR engine.
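In the loop above, detected_text is overwritten on every iteration, so after the loop it holds only the last page. If you want to keep the text of every page, one option is to collect the results in a list. The sketch below assumes Tesseract, pytesseract and pdf2image are installed as described earlier; the function names ocr_pdf_pages and join_pages and the page marker format are my own choices.

```python
def ocr_pdf_pages(pdf_path, dpi=300, lang="eng"):
    """OCR every page of a PDF and return a list with one string per page."""
    # Imported here so join_pages stays usable without the OCR stack installed.
    import pdf2image
    import pytesseract

    pages = pdf2image.convert_from_path(pdf_path, dpi=dpi)
    return [pytesseract.image_to_string(page, lang=lang) for page in pages]


def join_pages(page_texts):
    """Join per-page strings, prefixing each with a 1-based page marker."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"--- page {number} ---\n{text.strip()}")
    return "\n\n".join(parts)
```

For example, join_pages(ocr_pdf_pages('invoice-sample.pdf')) returns the whole document as one string. A higher dpi usually helps with small print, at the cost of slower processing.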

This wraps up our section on reading text from PDF files using Tesseract. The next section demonstrates another Python library which does not require any of these prerequisites (Tesseract, pytesseract, pdf2image), is easier to install and use, and offers extra OCR functionalities such as spatial formatting, table extraction into Excel / CSV, creating searchable PDFs and more.

Note : If your use case is invoice OCR -

  • Read our blog on how to code an invoice parser. It guides you through creating your own invoice parser in Python, which performs OCR on invoice PDF / image files, detects relevant features (such as invoice amount, buyer, seller, date of invoice, etc.) and extracts them in a structured format.
  • If you want an automated hassle-free software which performs invoice OCR and feature extraction seamlessly using advanced AI models, try Nanonets Invoice OCR.
  • For other advanced OCR use cases and their solutions, explore our Products and Solutions using the dropdowns at the top right of the page.

Python Code - Advanced Functions for Image and PDF OCR in Python

Our team has released a free library to contribute towards the cause of quality free OCR tools being made available for educational and research purposes.

Note that you do not need to have any of the prerequisites (such as tesseract, pytesseract, pdf2image) which were required in the previous section to start using our library. It works perfectly as a standalone solution for a lot of basic free OCR needs.

Salient Features of the library -

  • Recognises PDF and image formats, no preprocessing required.
  • Retains spatial formatting of original document accurately.
  • Can detect and extract tables in Excel / CSV format from PDF / image.
  • Creates searchable PDFs from scanned PDFs on the fly.

I am sharing a small code snippet below to get you started.

You can install the package using pip.

pip install ocr-nanonets-wrapper

To get your first prediction, run the code snippet below. You have to add your API key in the third line to authenticate yourself.

This software is perpetually free. You can get your free API key (with unlimited requests allowed) by signing up on https://app.nanonets.com/#/keys.

from nanonets import NANONETSOCR
model = NANONETSOCR()
model.set_token('REPLACE_API_KEY')
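If you would rather not paste the key into your source file, one common pattern is to read it from an environment variable. This is a sketch of that idea, not something the library requires: NANONETS_API_KEY is a variable name chosen for this example, and load_api_key and build_model are hypothetical helpers.

```python
import os


def load_api_key(env=os.environ, name="NANONETS_API_KEY"):
    """Return the API key stored in an environment variable, or raise if unset."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable to your API key")
    return key


def build_model():
    """Create a NANONETSOCR client authenticated with the key from the environment."""
    # Imported lazily so load_api_key can be used without the package installed.
    from nanonets import NANONETSOCR

    model = NANONETSOCR()
    model.set_token(load_api_key())
    return model
```

With the variable set in your shell, model = build_model() gives you the same authenticated object as the three lines above.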

We are now all set to make the first prediction. You can give inputs by specifying a local file or a URL. Note that the file / URL can be either a PDF or an image, in .pdf, .jpg or .png format.

We will use the below image to make the first prediction.

pred.png
prediction_json = model.convert_to_prediction('pred.png')
prediction_json

We have stored the output of the OCR engine in prediction_json, which is a JSON object. This object contains the predicted words and their spatial positioning in the document. You can store this JSON and create your own methods to interpret and format the OCR output.
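Storing the prediction can be as simple as writing it to disk with the standard json module. The sketch below assumes only that the object is JSON-serializable, as described above; save_prediction and load_prediction are hypothetical helper names.

```python
import json
from pathlib import Path


def save_prediction(prediction, path):
    """Write a JSON-serializable prediction to disk and return the path."""
    path = Path(path)
    text = json.dumps(prediction, indent=2, ensure_ascii=False)
    path.write_text(text, encoding="utf-8")
    return path


def load_prediction(path):
    """Read a prediction previously written by save_prediction."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Calling save_prediction(prediction_json, 'pred.json') once lets you experiment with your own parsing code later without sending the same file through OCR again.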

However, you can directly get OCR outputs in desired formats using other functions in the package.

1. Extract Text from File as String

Run the code snippet below after authenticating, to extract all text from your input file and store it in a string.

# formatting can be => none / lines / lines and spaces / pages
# output examples of these different formatting options shown below
string = model.convert_to_string('INPUT_FILE', formatting='none')
print(string)

You can change the formatting option. The default setting is 'lines and spaces', which extracts all text from your file and converts it into a string while retaining all spaces and newlines, thus maintaining the spatial structure of the original file.

Let us see how the formatting parameter works. We will read the below image using different formatting modes.

test.png

You can see how the formatting mode changes the output string in the screenshots below.

formatting = 'none'
formatting = 'lines'
formatting = 'lines and spaces'
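If you want to compare the modes on your own file rather than in screenshots, a small loop is enough. The sketch below uses only the convert_to_string call and the four mode names shown above; compare_formattings is a hypothetical helper, and model is assumed to be the authenticated object created earlier.

```python
# Formatting modes listed in the snippet above.
FORMATTING_MODES = ("none", "lines", "lines and spaces", "pages")


def compare_formattings(model, input_file, modes=FORMATTING_MODES):
    """Run convert_to_string once per formatting mode and return {mode: text}."""
    results = {}
    for mode in modes:
        if mode not in FORMATTING_MODES:
            raise ValueError(f"Unknown formatting mode: {mode!r}")
        results[mode] = model.convert_to_string(input_file, formatting=mode)
    return results
```

Each mode triggers a separate call, so compare_formattings(model, 'test.png') makes four requests for the same file.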

As you can see, the formatting = 'lines and spaces' mode works really well if you want to read your file and print it in a layout matching your original file. Let me share another example here. Consider the below file, on which we run the convert_to_string method with formatting = 'lines and spaces'.

multi.png

2. Convert PDF / Image to Text File

This method works similarly to the convert_to_string method shown above. The difference is that while convert_to_string returns a string, this method writes the output of convert_to_string directly to a .txt file.

The formatting parameter works the same way as it does for the convert_to_string method. You can optionally specify the file name for the output .txt file.

model.convert_to_txt('INPUT_FILEPATH', output_file_name = 'OUTPUT.txt')
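When you have a whole folder of files, the same call can be applied in a loop. This is a sketch under a few assumptions: find_inputs and convert_folder_to_txt are hypothetical helpers, the extension list mirrors the formats mentioned earlier, and output_file_name is assumed to accept a path that includes a folder.

```python
from pathlib import Path

# File formats mentioned earlier in this post.
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".png"}


def find_inputs(folder):
    """Return supported files directly inside folder, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


def convert_folder_to_txt(model, folder):
    """Call convert_to_txt for each supported file; return the .txt paths used."""
    outputs = []
    for source in find_inputs(folder):
        # Append ".txt" to the full name so a.pdf and a.png do not collide.
        target = source.with_name(source.name + ".txt")
        model.convert_to_txt(str(source), output_file_name=str(target))
        outputs.append(target)
    return outputs
```

The helper only looks at files directly inside the folder and does not descend into subfolders.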

3. Get Bounding Box Information

You can use the package to extract text from your files and store bounding box information. The output is a list of dictionaries containing each word and its spatial position in the file.

test.png
boxes = model.convert_to_boxes('test.png')
boxes
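Bounding boxes are useful when you only care about one area of the page, for example a header. The sketch below selects the words that fall inside a rectangle. The key names used here ('text', 'xmin', 'ymin', 'xmax', 'ymax') are an assumption about the shape of each dictionary, so print one element of your own boxes output and adjust the keys if they differ. The sample data is hand-written, not real library output.

```python
def words_in_region(boxes, left, top, right, bottom):
    """Return the text of every box lying fully inside the given rectangle."""
    selected = []
    for box in boxes:
        # Key names are assumed; check them against your own output.
        inside = (
            box["xmin"] >= left and box["xmax"] <= right
            and box["ymin"] >= top and box["ymax"] <= bottom
        )
        if inside:
            selected.append(box["text"])
    return selected


# Hand-written sample in the assumed shape.
sample_boxes = [
    {"text": "Invoice", "xmin": 10, "ymin": 10, "xmax": 80, "ymax": 30},
    {"text": "Total", "xmin": 10, "ymin": 200, "xmax": 60, "ymax": 220},
]
print(words_in_region(sample_boxes, 0, 0, 100, 50))
```

Only "Invoice" lies inside the rectangle from (0, 0) to (100, 50), so it is the only word selected from the sample.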

4. Extract Tables from File (Convert to CSV)

This method allows you to extract all tables from your file. You can either store the information in a json object, or you can directly get the results in a .csv file.

As an example, we will extract tables from the image below.

tables.png

You can run the below code snippet to get a .csv file with all tables extracted from the input file. I have run it on the above sample image and attached the output .csv.

model.convert_to_csv('tables.png',output_file_name='OUTPUTFILE.csv')
OUTPUTFILE.csv

If you instead want to get a JSON object containing all the tables, you can run the below snippet on the same file.

tables_json = model.convert_to_tables('tables.png')
tables_json

Note :

  • These functions (convert_to_csv() and convert_to_tables()) are a trial offering, limited to 1000 pages of use.
  • To use them at scale, please create your own model at app.nanonets.com --> New Model --> Tables.
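Once convert_to_csv has written a file such as OUTPUTFILE.csv, you can inspect it with Python's built-in csv module. This is a generic sketch: it makes no assumption about how multiple tables are laid out in the file, it assumes UTF-8 encoding, and read_csv_rows is a hypothetical helper name.

```python
import csv


def read_csv_rows(path):
    """Return the CSV as a list of rows, skipping rows where every cell is blank."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [
            row for row in csv.reader(handle)
            if any(cell.strip() for cell in row)
        ]
```

Printing the first few entries of read_csv_rows('OUTPUTFILE.csv') is a quick way to check that the extracted columns line up with the table in your image.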

5. Convert to Searchable PDF

You can directly convert your PDF or image file to a searchable PDF using the below code snippet. This will create a .pdf file as output. You will be able to search and detect all the text present in this output .pdf file.

inv.png
model.convert_to_searchable_pdf('inv.png',output_file_name='output.pdf')

This code snippet creates a searchable PDF named output.pdf, which contains machine-recognizable text. You can search for text / numbers using the search functionality of your PDF viewer.

searching in output.pdf



Have an enterprise OCR / Intelligent Document Processing use case? Try Nanonets

We provide OCR and IDP solutions customised for various use cases - accounts payable automation, invoice automation, Receipt / ID Card / DL / Passport OCR, accounting software integrations, BPO Automation, Table Extraction, PDF Extraction and many more. Explore our Products and Solutions using the dropdowns at the top right of the page.

For example, assume you have a large number of invoices that are generated every day. With Nanonets, you can upload these images and teach your own model what to look for. For instance, in invoices you can build a model to extract the product names and prices. Once your annotations are done and your model is built, integrating it is as easy as copying 2 lines of code.

Here are a few reasons you should consider using Nanonets -

  1. Nanonets makes it easy to extract text, structure the relevant data into the fields required and discard the irrelevant data extracted from the image.
  2. Works well with several languages
  3. Performs well on text in the wild
  4. Train on your own data to make it work for your use-case
  5. Nanonets OCR API allows you to re-train your models with new data with ease, so you can automate your operations anywhere faster.
  6. No in-house team of developers required

Visit Nanonets for enterprise OCR and IDP solutions.

Sign up to start a free trial.

