How to Read a PDF File in Python

PDF (Portable Document Format) is one of the most widely used formats for digital documents. It is used to display and exchange documents reliably, independent of software, hardware, or operating system.
In this article, we will see how to read a PDF file in Python using the third-party module PyPDF2. This module can extract document information, split documents page by page, merge documents, crop pages, merge multiple pages into a single page, and encrypt and decrypt PDF files.
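To install PyPDF2, run the command below. The example in this article uses the older PdfFileReader interface, which was removed in PyPDF2 3.0, so a release below 3 is assumed here:
pip install "PyPDF2<3"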
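import PyPDF2

# Open the PDF in binary mode and hand the file object to the reader.
pdf_FileOb = open('test.pdf', 'rb')
pdf_Reader = PyPDF2.PdfFileReader(pdf_FileOb)
print("The number of pages: ", pdf_Reader.numPages)

# Pages are numbered from 0, so this is the first page.
page_Ob = pdf_Reader.getPage(0)
print(page_Ob.extractText())

# Close the file once all PDF operations are done.
pdf_FileOb.close()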
Output:
The number of pages: 1
Take Risks In Your Life If You Win, You Can Lead ! - Swami Vivekananda
Now let's see what this code does.
The first step is to import the PyPDF2 module. After that, we open our PDF file with the built-in open() function in binary mode ('rb').
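Binary mode matters because a PDF is not a plain text file: reading it in text mode could fail during decoding or change the bytes. As a minimal sketch using only the standard library, the helper below opens a file in binary mode and checks for the %PDF- marker that PDF files normally begin with. The name looks_like_pdf is hypothetical and not part of PyPDF2, and the check is only a quick sanity test, not a validation of the whole file.

```python
def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF header marker."""
    # 'rb' reads raw bytes: no text decoding and no newline translation.
    with open(path, "rb") as pdf_file:
        header = pdf_file.read(5)
    # PDF files normally begin with the bytes "%PDF-" followed by a version.
    return header == b"%PDF-"
```

For the file used in this article, looks_like_pdf('test.pdf') would be expected to return True if test.pdf is a regular PDF.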
The next step is to pass the opened file object to the PdfFileReader class of the PyPDF2 module, which gives us a PDF reader object. The numPages property gives the number of pages in the PDF file. The getPage() function takes a page number as an argument, counting from 0, and returns the corresponding page object. The extractText() function extracts the text from the selected page. Finally, after all operations on the PDF file are done, we have to close the file object by calling close().
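The same steps can also be written with a with statement, which closes the file automatically even if an error occurs partway through. The sketch below additionally assumes a newer PyPDF2 release (2.x or later), where the reader class is named PdfReader, the pages are available through reader.pages, and text is extracted with extract_text(); the older names used in the example above were removed in PyPDF2 3.0. The function names page_count_and_text and read_pdf are illustrative only.

```python
def page_count_and_text(reader, page_index=0):
    """Return (number of pages, text of the page at `page_index`)."""
    pages = reader.pages
    if not 0 <= page_index < len(pages):
        raise IndexError(f"page_index {page_index} is out of range")
    return len(pages), pages[page_index].extract_text()


def read_pdf(path, page_index=0):
    """Open the PDF at `path` and return its page count and one page's text."""
    # Assumption: PyPDF2 2.x or later is installed. The import is placed here
    # so the helper above can be used on its own without PyPDF2 installed.
    from PyPDF2 import PdfReader

    # The file is closed automatically when the with block ends, so the text
    # is extracted while the file is still open.
    with open(path, "rb") as pdf_file:
        return page_count_and_text(PdfReader(pdf_file), page_index)
```

Calling read_pdf('test.pdf') on the file from the example above would return the page count and the text of the first page as a tuple.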
You may notice some similarities between the PyPDF2 operations and Python's built-in file operations. Keep in mind that this module is not perfect: it may be unable to work with some PDF files.
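Because some files cannot be processed, and some pages contain no extractable text (for example, scanned pages stored as images), it can help to wrap text extraction defensively. The sketch below is one possible approach, not a PyPDF2 feature. The helper safe_extract_text is a hypothetical name. It accepts any page object that offers extract_text() (the newer PyPDF2 name) or extractText() (the older name used in this article), and it catches a broad Exception on purpose, on the assumption that the specific exception classes differ between versions.

```python
def safe_extract_text(page, default=""):
    """Return the page's text, or `default` if extraction fails or is empty."""
    # Use whichever extraction method the page object provides:
    # extract_text() in newer PyPDF2 releases, extractText() in older ones.
    extract = getattr(page, "extract_text", None) or getattr(page, "extractText", None)
    if extract is None:
        return default
    try:
        text = extract()
    except Exception:
        # Deliberately broad: exact exception classes vary between versions.
        return default
    # Pages without a text layer tend to yield empty or whitespace-only text.
    return text if text and text.strip() else default
```

With the reader from the example above, safe_extract_text(pdf_Reader.getPage(0)) would return the page text when extraction succeeds and an empty string otherwise.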