How to extract text from a PDF in Python

PDF (Portable Document Format) is one of the most widely used digital document formats.
In this article, we will see how to extract text from a PDF file in Python. For that, we use PyPDF2, a third-party Python module.
To install PyPDF2, run:
pip install PyPDF2
import PyPDF2

# creating a pdf file object
pdfFileOb = open('test.pdf', 'rb')

# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileOb)

# printing number of pages in the pdf file
print(pdfReader.numPages)

# creating a page object
pageOb = pdfReader.getPage(0)

# extracting text from page
print(pageOb.extractText())

# closing the pdf file object
pdfFileOb.close()
Output:
1
Take Risks In Your Life If You Win, You Can Lead ! - Swami Vivekananda
Now let's see what this code means.
pdfFileOb = open('test.pdf', 'rb')
- We open test.pdf in binary mode and save the file object as pdfFileOb.
pdfReader = PyPDF2.PdfFileReader(pdfFileOb)
- Here, we create an object of the PdfFileReader class of the PyPDF2 module, passing it the PDF file object, and get a PDF reader object.
print(pdfReader.numPages)
- The numPages property gives the number of pages in the PDF file. In our case it is 1 (see the first line of the output).
pageOb = pdfReader.getPage(0)
- Now we get an object of the PageObject class of the PyPDF2 module. The PDF reader object has a getPage() function, which takes a page number (starting from index 0) as its argument and returns the page object.
print(pageOb.extractText())
- The page object has an extractText() function that extracts the text from the PDF page.
pdfFileOb.close()
- Finally, we close the PDF file object.
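The walkthrough above reads only the first page. As a minimal sketch, the same calls can be wrapped in a loop to collect the text of every page. The helper names extract_all_text and extract_all_text_from_file are hypothetical, and the with statement is swapped in for the explicit close() call so the file is closed even if extraction raises an error. This assumes a PyPDF2 release older than 3.0, where the PdfFileReader, numPages, getPage() and extractText() names used in this article are still available.

```python
def extract_all_text(pdf_reader):
    """Return a list with the extracted text of every page, in page order.

    pdf_reader is expected to offer numPages and getPage(), like the
    PdfFileReader object used in this article.
    """
    pages_text = []
    for page_number in range(pdf_reader.numPages):
        page = pdf_reader.getPage(page_number)  # page numbers start at 0
        pages_text.append(page.extractText())
    return pages_text


def extract_all_text_from_file(path):
    """Open the PDF at path and return the text of all its pages."""
    # Imported here so extract_all_text() stays usable with any reader-like object.
    import PyPDF2

    # The with statement closes the file for us, even on errors.
    with open(path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfFileReader(pdf_file)
        return extract_all_text(pdf_reader)
```

Calling extract_all_text_from_file('test.pdf') should return one string per page, so for the one-page file used here the list would hold a single item.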
Note that PyPDF2 can make mistakes when extracting text from a PDF, and it may be unable to open some PDFs at all, so it might not work with every file you have.
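One way to cope with these failures is to catch the read error and to check which pages came back without any text (for example, a page that is only a scanned image). The sketch below is an illustration under stated assumptions, not the method of this article: it uses the newer names from PyPDF2 2.x and 3.x (PdfReader, reader.pages, extract_text() and PyPDF2.errors.PdfReadError) in place of the older names shown above, which PyPDF2 3.0 no longer accepts. The function names extract_text_safely and pages_without_text are hypothetical.

```python
def pages_without_text(pages_text):
    """Return the 0-based indexes of pages whose text is empty or only whitespace."""
    return [
        index
        for index, text in enumerate(pages_text)
        if not (text or "").strip()
    ]


def extract_text_safely(path):
    """Return a list of page texts, or None if PyPDF2 reports a read error."""
    # Imported here so pages_without_text() can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader
    from PyPDF2.errors import PdfReadError

    try:
        with open(path, "rb") as pdf_file:
            reader = PdfReader(pdf_file)
            # Extract while the file is still open.
            return [page.extract_text() for page in reader.pages]
    except PdfReadError as error:
        print(f"Could not read {path}: {error}")
        return None
```

A typical use would be to call extract_text_safely('test.pdf'), and if the result is not None, pass it to pages_without_text() to see which pages need a closer look. Only errors that PyPDF2 raises as PdfReadError are caught here; a missing file still raises the usual FileNotFoundError.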