
Portable Document Format (PDF) files are commonly used for sharing documents electronically, by individuals and businesses alike. Often we need to extract information from PDF files for further processing. However, extracting text from a PDF can be challenging, especially when the document contains complex formatting and layout. Fortunately, there are several ways to do this.

Here, we will cover the most commonly used method for extracting text from PDFs with Python, which offers several libraries for efficient PDF text extraction.

This article explores some popular Python libraries for extracting text from PDF files and walks through the extraction process step by step.


Python Libraries for PDF Processing

Python has several well-integrated libraries that effectively handle unstructured data sources such as PDF files. Here is a list of a few Python libraries for PDF processing.

  • PyPDF2: It is a Python library for PDF that can help split, merge, crop, and transform pages of PDF files. PyPDF2 also allows you to extract text from PDF files.
  • PyMuPDF: PyMuPDF is a Python wrapper for the MuPDF C library. It allows you to read, write, and manipulate PDF files in Python. Also, you can access the PDF document metadata, extract text and images, and decrypt a PDF document with PyMuPDF.
  • ReportLab: It is an open-source Python library that can be used to create and manipulate PDF files. It provides a high-level API for creating PDF documents from scratch and supports embedding images and fonts.
  • pdf2docx: It is a Python library, built on PyMuPDF, that converts PDF files to Word documents and can extract their data in the process.

Convert complex PDF to text format instantly using PDF to Text converter. Try it for free.


Setting up the Development Environment

Before we discuss the steps for text extraction from PDF, it is essential to set up a development environment and install the required Python libraries to extract text.

  • Install Python: If you haven't already, you'll need to install Python on your system. You can download the latest version of Python from the official website, python.org.
  • Install pip: To check whether you have pip installed in Python, run
py -m ensurepip --default-pip

If pip doesn't run automatically, download get-pip.py and run the following command to install or upgrade pip:

python get-pip.py

  • Install the required library: Install a Python library for working with PDF files. Here, we will install the commonly used PyPDF2 library. To install it, run the following command.
pip install PyPDF2

Once you've installed Python and the required libraries, your development environment is set. You can use any text editor or IDE to write Python code, such as Visual Studio Code, PyCharm, or Sublime Text.

Extracting Text from PDF Using Python – Step-by-Step Process

We will use the PyPDF2 library to extract text from a PDF file.

Code:

# importing required modules
from PyPDF2 import PdfReader

# creating a pdf reader object
reader = PdfReader('nanonet.pdf')

# printing number of pages in pdf file
print(len(reader.pages))

# getting a specific page from the pdf file
page = reader.pages[0]

# extracting text from page
text = page.extract_text()
print(text)

Running this script prints the number of pages in the PDF, followed by the text of the first page.

Now, let's understand each line of code separately.

  • reader = PdfReader('nanonet.pdf')

From the PyPDF2 module, we create an object of the PdfReader class. It takes one required positional argument: the path to the PDF file.

  • print(len(reader.pages))

The pages property returns a list of PageObject instances, so we can use Python's built-in len() function to get the number of pages in the PDF file.

  • page = reader.pages[0]

We can also get a specific page of the PDF by indexing into this list. List indexing starts at 0 in Python, so this command gives us the file's first page.

  • text = page.extract_text()

This command extracts the text from the PDF page, and print(text) then displays it.

Pre-processing extracted text to clean and normalize it

Different pre-processing techniques, such as removing stop words, lowercasing, removing punctuation, stemming, or lemmatization, can be used to clean and normalize the extracted text in Python. The example below uses the NLTK library.

Input: Python is a popular programming language used for data analysis and machine learning. It is easy to learn and has a wide range of libraries for various applications.

Code:

# requires: nltk.download('punkt') and nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

text = "Python is a popular programming language used for data analysis and machine learning. It is easy to learn and has a wide range of libraries for various applications."

# split the text into word tokens
tokens = word_tokenize(text)

# remove English stop words, then keep only alphabetic words, lowercased
stop_words = set(stopwords.words('english'))
filtered_text = [word for word in tokens if not word.lower() in stop_words]
clean_text = [word.lower() for word in filtered_text if word.isalpha()]

print(clean_text)

Output: ['python', 'popular', 'programming', 'language', 'used', 'data', 'analysis', 'machine', 'learning', 'easy', 'learn', 'wide', 'range', 'libraries', 'various', 'applications']

This step has removed stop words like "is," "a," "for," "and," "it," and "has," and also lowercased all the words in the text.
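If you'd rather avoid the NLTK dependency, the same lowercasing, punctuation removal, and stop-word filtering can be sketched with the standard library alone. Note that the stop-word set below is a small hand-picked sample for illustration, not NLTK's full English list:

```python
import string

# a small illustrative stop-word sample; NLTK's English list is much longer
STOP_WORDS = {"is", "a", "for", "and", "it", "has", "to", "of", "the"}

def clean_text(text):
    # strip punctuation, lowercase, split into words, and drop stop words
    words = text.translate(str.maketrans("", "", string.punctuation)).lower().split()
    return [word for word in words if word not in STOP_WORDS]

print(clean_text("Python is easy to learn and has a wide range of libraries."))
# ['python', 'easy', 'learn', 'wide', 'range', 'libraries']
```

This version won't match NLTK's tokenization on edge cases (contractions, hyphens), but it is often enough for simple cleanup.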

Saving extracted text to a file or database

Run the following code:

with open('extracted_text.txt', 'w') as f:
    f.write(' '.join(clean_text))

This code will open a file named extracted_text.txt in write mode. The f.write() method writes the pre-processed text to the file. It converts the list of words in clean_text to a string by joining the words with a space character (' '), then writes the resulting string to the file.

So, the result is that the pre-processed text is saved to a file named extracted_text.txt in the current working directory.
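To save the text to a database instead, Python's built-in sqlite3 module is enough for a minimal sketch. The table and column names here are illustrative:

```python
import sqlite3

clean_text = ["python", "popular", "programming"]  # words from the previous step

# ':memory:' keeps this demo self-contained; use a path such as
# 'extracted_text.db' to persist the data on disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS documents (name TEXT, content TEXT)")
conn.execute(
    "INSERT INTO documents (name, content) VALUES (?, ?)",
    ("nanonet.pdf", " ".join(clean_text)),
)
conn.commit()

# read the row back to confirm the write
row = conn.execute("SELECT content FROM documents").fetchone()
print(row[0])  # python popular programming
```

Using `?` placeholders rather than string formatting keeps the insert safe if the extracted text ever contains quotes.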




Advanced Techniques for Improving Text Extraction Accuracy

Several advanced techniques can be used to improve text extraction accuracy. Here are some techniques:

  • Optical Character Recognition (OCR): OCR is a process that converts scanned images into machine-encoded text. OCR can be used to extract text from PDF files that contain images or scanned pages. Several PDF OCR engines are available, including Tesseract, Google Cloud Vision, and Amazon Textract.
  • Pre-processing Techniques: Pre-processing techniques involve manipulating the PDF file before the actual extraction process. This includes techniques like de-skewing, de-noising, and thresholding to remove noise, skew, and other distortions that may affect the accuracy of the extraction process.
  • Layout analysis: It involves identifying and classifying the different elements of a PDF file, such as text blocks, tables, and images. This information can improve text extraction accuracy by identifying the document's structure.
  • Machine learning tools: Several text extraction tools, such as Nanonets, use machine learning techniques to extract text from PDF files accurately.

Tips for Optimizing Performance and Reducing Memory Usage in Python

Effectively managing memory in Python can be complex, necessitating understanding Python's data structures and objects. Here are a few tips for optimizing performance and reducing memory usage while running code in Python.

1. Use Built-in Python Functions & Libraries

Using built-in Python functions is an effective way to accelerate your code. Incorporating these functions into your code when appropriate is recommended because they are optimized and well-tested for performance.

These functions are fast because they are executed in C, a high-performance language. Examples of these functions include max, min, all, map, and many others.
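A quick sketch of the difference: replacing a hand-written loop with built-ins like max and sum keeps the hot loop in C instead of Python bytecode.

```python
page_lengths = [1200, 850, 2300, 640]  # e.g. characters extracted per page

# hand-written loop, executed as Python bytecode
longest = 0
for n in page_lengths:
    if n > longest:
        longest = n

# the built-ins do the same work in optimized C
assert max(page_lengths) == longest
print(max(page_lengths), sum(page_lengths))  # 2300 4990
```

On large lists the built-in versions are typically several times faster, with no change to the result.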

2. Utilize PyTorch DataLoader

Training a large dataset can be memory-intensive. Using PyTorch's DataLoader provides a solution to this issue by enabling the creation of multiple mini-batches of data from the entire dataset. Each mini-batch, which can contain several samples determined by available memory, is loaded seamlessly into the model, allowing for the efficient training of large datasets.
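The mini-batch idea itself can be sketched in plain Python; PyTorch's DataLoader adds shuffling, parallel workers, and tensor collation on top of this pattern:

```python
def batches(dataset, batch_size):
    # yield one slice at a time so only a single
    # mini-batch has to sit in memory per training step
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]

data = list(range(10))
mini_batches = list(batches(data, 4))
print(mini_batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Because batches is a generator, a training loop can consume it one mini-batch at a time without ever materializing the whole list.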

3. Use List Comprehension Over Loops

In Python, loops are common, but list comprehensions offer a more concise and often faster way to build new lists than repeatedly calling the append method.
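For example, building a list of word lengths with append versus a comprehension:

```python
words = ["python", "pdf", "text", "extraction"]

# loop with append: repeated attribute lookup and method call per element
lengths = []
for w in words:
    lengths.append(len(w))

# equivalent list comprehension: shorter and typically faster
lengths_comp = [len(w) for w in words]

print(lengths_comp)  # [6, 3, 4, 10]
```

Both produce the same list; the comprehension avoids the per-iteration lookup and call of .append.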

4. Import Statement Overhead

In Python, the placement of the import statement can impact your code's performance and memory usage. Importing a package outside a function can result in faster code execution but may also require more memory than importing the package inside a function. Considering the trade-offs between performance and memory usage is important when deciding where to place your import statements in Python.
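The trade-off can be sketched as follows; the json module and function names are purely illustrative:

```python
import json  # module-level: resolved once when the module loads, stays in memory

def parse_global(payload):
    return json.loads(payload)

def parse_local(payload):
    # function-level: the sys.modules cache makes repeat imports cheap,
    # but every call still pays a small lookup cost
    import json
    return json.loads(payload)

print(parse_global('{"a": 1}'), parse_local('{"a": 1}'))
```

Both functions behave identically; the difference is only when the import cost is paid and how long the module stays resident.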

5. Data Chunks

Chunking or loading data in small batches is a useful technique to prevent memory errors when working with large datasets in Python. In many cases, all the data is not needed at once, and attempting to load everything in a single batch can cause the program to crash due to memory limitations. By processing the data in smaller chunks, it is possible to avoid these memory errors and save the results as needed. Therefore, chunking data is common in data processing and analysis to prevent memory-related issues.
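A common pattern is to read a large file a fixed number of characters at a time with a generator, instead of calling read() once. The file name below is illustrative:

```python
def read_in_chunks(path, chunk_size=1024):
    # generator: only one chunk is held in memory at a time
    with open(path, "r") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

# demo with a small file (the name 'big_text.txt' is illustrative)
with open("big_text.txt", "w") as f:
    f.write("abcdefghij" * 5)  # 50 characters

chunk_sizes = [len(c) for c in read_in_chunks("big_text.txt", chunk_size=20)]
print(chunk_sizes)  # [20, 20, 10]
```

Each chunk can be processed and discarded before the next is read, so peak memory stays near chunk_size regardless of the file's total size.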

6. String Concatenation

Two common ways to concatenate strings in Python are using the '+' operator or the join() method. While the '+' operator is widely used, the join() method is more effective and faster for concatenating many strings. The main reason is that the '+' operator creates a new string and copies the old one at every step, whereas join() computes the final size once and copies each piece a single time, resulting in faster concatenation.
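A minimal illustration; both approaches give the same result, but join() avoids building an intermediate string per step:

```python
words = ["extract", "text", "from", "pdf"]

# '+' builds and copies a brand-new string on every iteration
result_plus = ""
for w in words:
    result_plus = result_plus + " " + w
result_plus = result_plus.strip()

# join() computes the final size once and copies each piece a single time
result_join = " ".join(words)

print(result_join)  # extract text from pdf
```

For a handful of strings the difference is negligible; it becomes significant when concatenating thousands of pieces, such as the text of many PDF pages.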




Takeaway

Handling memory errors in Python can be challenging, but various methods exist to optimize memory usage and prevent memory overflows. The first step is identifying the root cause of the issue and applying the appropriate memory optimization methods. If the issue persists, the related processes can be optimized, or the operation can be broken into smaller chunks, possibly with the help of an external database service.

With these tips and techniques, it's possible to optimize memory usage and avoid memory-related issues when working with large datasets in Python. While Python libraries offer a convenient way to extract text from PDF files, it's worth considering other automated tools for text extraction, such as Nanonets.