I'm trying to extract the text from a PDF file using Python.

Using the PyPDF2 module, I tried the following code:

import PyPDF2

pdf_file = open('sample.pdf', 'rb')  # PDFs are binary files, so open in 'rb' mode

read_pdf = PyPDF2.PdfFileReader(pdf_file)

number_of_pages = read_pdf.getNumPages()

page = read_pdf.getPage(0)

page_content = page.extractText()

print(page_content)

But when I run the code, I get the output below, which is different from the text in the PDF document:

!"#$%#$%&%$&'()*%+,-%./01'*23%4

5'%1$#26%3/%7/))/8%&)/26%8#3"%3"*%313/9#&)

%

How can I extract the text from the PDF document?

1 Answer


I was looking for a simple solution for Python 3.x on Windows. textract doesn't appear to be an option there, which is unfortunate, but if you need something simple for Windows/Python 3, check out the tika package. It makes reading PDFs really straightforward.

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

from tika import parser # pip install tika

raw = parser.from_file('sample.pdf')

print(raw['content'])

Tika itself is written in Java, so you need a Java runtime installed.
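In case it helps, here is a minimal sketch that wraps the call above so the result is a bit easier to work with. It assumes that raw['content'] can come back as None when Tika finds no text (for example, an image-only PDF) and that the extracted text may contain long runs of blank lines. The helper names clean_tika_content and extract_pdf_text are made up for this example and are not part of the tika package.

```python
import re


def clean_tika_content(parsed):
    """Return tidy text from a Tika result dict, or '' if nothing was extracted."""
    # Assumption: 'content' may be missing or None when no text was found.
    content = (parsed or {}).get("content")
    if not content:
        return ""
    # Drop trailing whitespace on every line.
    lines = [line.rstrip() for line in content.splitlines()]
    text = "\n".join(lines)
    # Collapse three or more consecutive newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def extract_pdf_text(path):
    """Parse a file with Tika (needs a Java runtime) and return cleaned text."""
    # Imported here so the cleaning helper can be used without tika installed.
    from tika import parser  # pip install tika

    return clean_tika_content(parser.from_file(path))


# Example usage (hypothetical file name):
# text = extract_pdf_text("sample.pdf")
# print(text)
```

Keeping the cleanup separate from the Tika call means you can check it on plain dictionaries without starting the Tika server.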

