Program to Extract Text From PDF in Python

Overview

PDFs are one of the most common ways to share data. PDF stands for Portable Document Format, a file format created by Adobe and published as an open ISO standard in 2008. PDF files use the .pdf extension. Python can read PDF files and turn their contents into text data, which can then be used for purposes such as text-based search, tuning the parameters of an algorithm, or checking a pdf for plagiarism. Although pdf files can contain images and other data types, we will focus on extracting text data from pdf files.

How to extract Text from PDF in Python?

Python can be used to extract text data from pdf files. Doing this from scratch means parsing the pdf file yourself, which requires file I/O and knowledge of how the data is stored inside the file. Fortunately, several Python libraries already handle this, including PyPDF2, Textract, tika, pdfPlumber, and pdfMiner. We will discuss two of them:

PyPDF2 is a free, open-source Python library for retrieving text data from a pdf file. It can also perform other operations on a pdf, such as splitting and merging pages.
pdfPlumber is a free, open-source library. In addition to text extraction, it can extract tables from a pdf file and supports visual debugging.

The libraries mentioned above are powerful and beginner-friendly. In this article, we will use only a small part of their functionality.
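To see why a library helps, it is worth looking at what plain file IO gives us. The sketch below is only an illustration and not part of the original example: it opens a file in binary mode and reads the header, which in a pdf file starts with the bytes %PDF- followed by a version number. The function name pdf_version is made up for this sketch.

```python
def pdf_version(path):
    """Return the version from a PDF header (e.g. "1.4"), or None if absent."""
    # PDFs are binary files, so open in "rb" mode rather than text mode.
    with open(path, "rb") as pdf_file:
        header = pdf_file.read(8)  # e.g. b"%PDF-1.4"

    # A PDF file is expected to begin with the marker %PDF-.
    if not header.startswith(b"%PDF-"):
        return None

    # The three bytes after the marker hold the version, such as 1.4 or 1.7.
    return header[5:8].decode("ascii")


# Usage (assumes example.pdf exists in the working directory):
#   print(pdf_version("example.pdf"))
```

Reading the header is the easy part. The page text itself sits in content streams that are usually compressed, and decoding those is the work that PyPDF2 and pdfPlumber do for us.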

Extract text from PDF in Python using PyPDF2

Installation of package

To install PyPDF2, we will use the pip package manager. Make sure that pip is already installed on your system, then run the following command to install the library.

pip install PyPDF2

Example

PDF Input

Let us create an example pdf file using Google Docs or Microsoft Word. To illustrate the use of the libraries, I have created a file named example.pdf with the following content.

History of PDF
Created by enhancing and merging two existing technologies – PostScript and Adobe
Illustrator – the new PDF file format was completed and released into the world on 15th June
1993 after three years of work. Little did Warnock know that his pet project would soon
change the way information was managed forever. In 2007, Adobe supplied its PDF format
to the International Organization for Standardisation (ISO), and in 2008, standardized the
PDF format, enabling it to become an open electronic document format.

Below is an image representing the pdf file.

[Image 1: the example.pdf file]

Code
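The original listing is not shown here, so the following is a minimal sketch reconstructed from the output and the explanation below, not the author's exact code. It assumes a recent PyPDF2 release, where the reader class is PdfReader, the page count is len(reader.pages), and text comes from extract_text(); older releases used the names numPages, getPage, and extractText for the same steps. The helper names first_page_summary and read_pdf are hypothetical.

```python
def first_page_summary(pdf_reader):
    """Return (number_of_pages, text_of_first_page) for a PdfReader-like object."""
    pages = pdf_reader.pages            # sequence of page objects
    first_page = pages[0]               # index 0 is the first page
    return len(pages), first_page.extract_text()


def read_pdf(path):
    """Open the PDF at `path` with PyPDF2 and summarize its first page."""
    # Imported inside the function; requires `pip install PyPDF2`.
    from PyPDF2 import PdfReader

    # Open the file for reading in binary mode.
    with open(path, "rb") as pdf_file:
        pdf_reader = PdfReader(pdf_file)
        # Extract while the file is still open, since pages are read on demand.
        return first_page_summary(pdf_reader)


# Usage (assumes example.pdf is in the working directory):
#   page_count, text = read_pdf("example.pdf")
#   print("Number of Pages:", page_count)
#   print("Text:")
#   print(text)
```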

Output

Number of Pages: 1
Text:
History of PDF
Created by enhancing and merging two existing technologies – PostScript and Adobe
Illustrator – the new PDF file format was completed and released into the world on 15th June
1993 after three years of work. Little did Warnock know that his pet project would soon
change the way information was managed forever. In 2007, Adobe supplied its PDF format
to the International Organization for Standardisation (ISO), and in 2008, standardized the
PDF format, enabling it to become an open electronic document format.

Explanation

Let us understand the code line by line.

First, we open the file with the built-in open function, in binary read mode.

Next, we pass the opened file to PyPDF2 to create a reader object.

The reader object's numPages property gives the number of pages. In our example, the number of pages is equal to 1.

Now we create a page object for the first page by passing the index 0 to the getPage function of the pdfReader object.

Finally, we extract the text from the page object using the function extractText(). In this way, we can extract text data from a pdf file using the library PyPDF2. Note that these camelCase names come from older PyPDF2 releases; PyPDF2 3.0 removed them in favor of len(pdfReader.pages), pdfReader.pages[0], and extract_text().
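The example above reads only the first page. As a rough sketch of how the same idea extends to a whole document, the hypothetical helper all_pages_text below loops over every page and joins the text. It assumes the newer PyPDF2 names (reader.pages and extract_text()) and works on any reader-like object that provides them.

```python
def all_pages_text(reader, separator="\n"):
    """Return the text of every page, joined by `separator`."""
    texts = []
    for page in reader.pages:
        # Fall back to "" if a page yields no text (e.g. a scanned image).
        texts.append(page.extract_text() or "")
    return separator.join(texts)


# Usage (assumes PyPDF2 is installed and example.pdf exists):
#   from PyPDF2 import PdfReader
#   print(all_pages_text(PdfReader("example.pdf")))
```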

Extract text from PDF in Python using pdfPlumber

Installation of package

We will use the pip package manager to install pdfPlumber. Run the following command in the terminal to install the package.

pip install pdfplumber

Example

To illustrate the use of this library, we will use the same file example.pdf as used in the above example.

Code
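The original listing is not shown here, so the following is a minimal sketch based on the output and the explanation below, not the author's exact code. It uses pdfplumber.open, the pages list, and extract_text() from pdfPlumber; the helper names page_texts and print_pdf_text are hypothetical.

```python
def page_texts(pdf):
    """Return a list of (page_index, text) pairs for an opened pdfplumber PDF."""
    # pdf.pages is a list of page objects; enumerate numbers them from 0.
    return [(index, page.extract_text()) for index, page in enumerate(pdf.pages)]


def print_pdf_text(path):
    """Print the page count and the text of each page of the PDF at `path`."""
    # Imported inside the function; requires `pip install pdfplumber`.
    import pdfplumber

    # The with-statement closes the file once we are done with it.
    with pdfplumber.open(path) as pdf:
        print("Number of Pages:", len(pdf.pages))
        for index, text in page_texts(pdf):
            print("Page Number:", index)
            print(text)


# Usage (assumes example.pdf is in the working directory):
#   print_pdf_text("example.pdf")
```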

Output

Number of Pages: 1
Page Number: 0
History of PDF
Created by enhancing and merging two existing technologies – PostScript and Adobe
Illustrator – the new PDF file format was completed and released into the world on 15th June
1993 after three years of work. Little did Warnock know that his pet project would soon
change the way information was managed forever. In 2007, Adobe supplied its PDF format
to the International Organization for Standardisation (ISO), and in 2008, standardized the
PDF format, enabling it to become an open electronic document format.

Explanation

First, we open the file with pdfPlumber, which gives us an object for reading it.

The pages property of this object is a list of all pages in the pdf file. We loop over the list and extract the text from each page using the function extract_text().

In this way, we can use pdfPlumber to extract text from a pdf in Python.

Conclusion

PyPDF2 and pdfPlumber are two popular libraries, both installable with pip, for extracting text from a pdf in Python.
Using PyPDF2, we open the file and create a reader object. We then use the reader object to get each page and extract its text data.
Using pdfPlumber, the pages of the pdf file are exposed as a list. We get the data by iterating over the list and extracting the text from each page.