Extracting Text From PDF File in Python Using PyPDF2

In this post we will extract text from a PDF file using the PyPDF2 library.

What is PyPDF2?

PyPDF2 is a free and open source pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. PyPDF2 can retrieve text and metadata from PDFs as well.

Installation

There are several ways to install PyPDF2. The most common option is to use pip.

Using pip

PyPDF2 requires Python 3.6+ to run. Using pip we can install PyPDF2:

pip install PyPDF2

Anaconda

If the PyPDF2 library is not already installed on your system, install it before continuing.
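
Before installing, it can help to check whether the package is already importable. The following is a minimal sketch that uses only the standard library. The helper name `is_installed` is made up for this illustration and is not part of PyPDF2.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# Print the pip command from the section above only when it is needed.
if not is_installed("PyPDF2"):
    print("PyPDF2 is not installed; run: pip install PyPDF2")
```

The check looks the module up without importing it, so it is safe to run even when the package is missing.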

Import the library

In [17]:
import PyPDF2

PyPDF2 version

In [18]:
PyPDF2.__version__
Out[18]:
'1.26.0'

Open a PDF

Declare the path of the input file

In [19]:
inputFile = "input/extracting-text-from-pdf-file-in-python/sample.pdf"
In [20]:
pdf = open(inputFile, "rb")
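
The file is opened in binary mode ("rb") because a PDF is not plain text. As a rough sanity check before handing a file to a PDF library, one can look at its first bytes: PDF files begin with the marker `%PDF-`. The sketch below assumes nothing beyond the standard library; `looks_like_pdf` is a hypothetical helper name. It also uses a `with` block, which closes the file automatically.

```python
def looks_like_pdf(path: str) -> bool:
    """Return True if the file starts with the PDF header marker."""
    # Binary mode is required; the with block closes the file on exit.
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"
```

With the path declared above, the call would be `looks_like_pdf(inputFile)`. A True result only means the header is present, not that the whole file is valid.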

Read the PDF

Check the classes and functions available in PyPDF2

In [21]:
dir(PyPDF2)
Out[21]:
['PageRange',
 'PdfFileMerger',
 'PdfFileReader',
 'PdfFileWriter',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_version',
 'filters',
 'generic',
 'merger',
 'pagerange',
 'parse_filename_page_ranges',
 'pdf',
 'utils']

The listing shows the classes and submodules that the PyPDF2 library exposes. The one we need for reading a file is PdfFileReader, documented below.
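
The `dir()` listing mixes public names with internal ones that start with an underscore. A small filter makes it easier to scan. This is only a sketch: `public_names` is a made-up helper, and the demo object is a stand-in so that the block runs without PyPDF2 installed.

```python
import types


def public_names(obj) -> list:
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


# Stand-in object for demonstration only; not a real PyPDF2 module.
demo = types.SimpleNamespace(PdfFileReader=object, _version="1.26.0")
print(public_names(demo))
```

Calling `public_names(PyPDF2)` on the real module should leave the names from the listing above that have no leading underscore, such as PdfFileReader, PdfFileWriter and PdfFileMerger.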

class PdfFileReader(builtins.object)

 |  PdfFileReader(stream, strict=True, warndest=None, overwriteWarnings=True)
 |  
 |  Initializes a PdfFileReader object.  This operation can take some time, as
 |  the PDF stream's cross-reference tables are read into memory.
 |  
 |  :param stream: A File object or an object that supports the standard read
 |      and seek methods similar to a File object. Could also be a
 |      string representing a path to a PDF file.
 |  :param bool strict: Determines whether user should be warned of all
 |      problems and also causes some correctable problems to be fatal.
 |      Defaults to ``True``.
 |  :param warndest: Destination for logging warnings (defaults to
 |      ``sys.stderr``).
 |  :param bool overwriteWarnings: Determines whether to override Python's
 |      ``warnings.py`` module with a custom implementation (defaults to
 |      ``True``).
 |  
 |  Methods defined here:
 |  
 |  __init__(self, stream, strict=True, warndest=None, overwriteWarnings=True)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  cacheGetIndirectObject(self, generation, idnum)
 |  
 |  cacheIndirectObject(self, generation, idnum, obj)
 |  
 |  decrypt(self, password)
 |      When using an encrypted / secured PDF file with the PDF Standard
 |      encryption handler, this function will allow the file to be decrypted.
 |      It checks the given password against the document's user password and
 |      owner password, and then stores the resulting decryption key if either
 |      password is correct.
 |      
 |      It does not matter which password was matched.  Both passwords provide
 |      the correct decryption key that will allow the document to be used with
 |      this library.
 |      
 |      :param str password: The password to match.
 |      :return: ``0`` if the password failed, ``1`` if the password matched the user
 |          password, and ``2`` if the password matched the owner password.
 |      :rtype: int
 |      :raises NotImplementedError: if document uses an unsupported encryption
 |          method.
 |  
 |  getDestinationPageNumber(self, destination)
 |      Retrieve page number of a given Destination object
 |      
 |      :param Destination destination: The destination to get page number.
 |           Should be an instance of
 |           :class:`Destination<PyPDF2.pdf.Destination>`
 |      :return: the page number or -1 if page not found
 |      :rtype: int
 |  
 |  getDocumentInfo(self)
 |      Retrieves the PDF file's document information dictionary, if it exists.
 |      Note that some PDF files use metadata streams instead of docinfo
 |      dictionaries, and these metadata streams will not be accessed by this
 |      function.
 |      
 |      :return: the document information of this PDF file
 |      :rtype: :class:`DocumentInformation<pdf.DocumentInformation>` or ``None`` if none exists.
 |  
 |  getFields(self, tree=None, retval=None, fileobj=None)
 |      Extracts field data if this PDF contains interactive form fields.
 |      The *tree* and *retval* parameters are for recursive use.
 |      
 |      :param fileobj: A file object (usually a text file) to write
 |          a report to on all interactive form fields found.
 |      :return: A dictionary where each key is a field name, and each
 |          value is a :class:`Field<PyPDF2.generic.Field>` object. By
 |          default, the mapping name is used for keys.
 |      :rtype: dict, or ``None`` if form data could not be located.
 |  
 |  getFormTextFields(self)
 |      Retrieves form fields from the document with textual data (inputs, dropdowns)
 |  
 |  getIsEncrypted(self)
 |  
 |  getNamedDestinations(self, tree=None, retval=None)
 |      Retrieves the named destinations present in the document.
 |      
 |      :return: a dictionary which maps names to
 |          :class:`Destinations<PyPDF2.generic.Destination>`.
 |      :rtype: dict
 |  
 |  getNumPages(self)
 |      Calculates the number of pages in this PDF file.
 |      
 |      :return: number of pages
 |      :rtype: int
 |      :raises PdfReadError: if file is encrypted and restrictions prevent
 |          this action.
 |  
 |  getObject(self, indirectReference)
 |  
 |  getOutlines(self, node=None, outlines=None)
 |      Retrieves the document outline present in the document.
 |      
 |      :return: a nested list of :class:`Destinations<PyPDF2.generic.Destination>`.
 |  
 |  getPage(self, pageNumber)
 |      Retrieves a page by number from this PDF file.
 |      
 |      :param int pageNumber: The page number to retrieve
 |          (pages begin at zero)
 |      :return: a :class:`PageObject<pdf.PageObject>` instance.
 |      :rtype: :class:`PageObject<pdf.PageObject>`
 |  
 |  getPageLayout(self)
 |      Get the page layout.
 |      See :meth:`setPageLayout()<PdfFileWriter.setPageLayout>`
 |      for a description of valid layouts.
 |      
 |      :return: Page layout currently being used.
 |      :rtype: ``str``, ``None`` if not specified
 |  
 |  getPageMode(self)
 |      Get the page mode.
 |      See :meth:`setPageMode()<PdfFileWriter.setPageMode>`
 |      for a description of valid modes.
 |      
 |      :return: Page mode currently being used.
 |      :rtype: ``str``, ``None`` if not specified
 |  
 |  getPageNumber(self, page)
 |      Retrieve page number of a given PageObject
 |      
 |      :param PageObject page: The page to get page number. Should be
 |          an instance of :class:`PageObject<PyPDF2.pdf.PageObject>`
 |      :return: the page number or -1 if page not found
 |      :rtype: int
 |  
 |  getXmpMetadata(self)
 |      Retrieves XMP (Extensible Metadata Platform) data from the PDF document
 |      root.
 |      
 |      :return: a :class:`XmpInformation<xmp.XmpInformation>`
 |          instance that can be used to access XMP metadata from the document.
 |      :rtype: :class:`XmpInformation<xmp.XmpInformation>` or
 |          ``None`` if no metadata was found on the document root.
 |  
 |  read(self, stream)
 |  
 |  readNextEndLine(self, stream)
 |  
 |  readObjectHeader(self, stream)
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  documentInfo
 |  
 |  isEncrypted
 |  
 |  namedDestinations
 |  
 |  numPages
 |  
 |  outlines
 |  
 |  pageLayout
 |      Get the page layout.
 |      See :meth:`setPageLayout()<PdfFileWriter.setPageLayout>`
 |      for a description of valid layouts.
 |      
 |      :return: Page layout currently being used.
 |      :rtype: ``str``, ``None`` if not specified
 |  
 |  pageMode
 |      Get the page mode.
 |      See :meth:`setPageMode()<PdfFileWriter.setPageMode>`
 |      for a description of valid modes.
 |      
 |      :return: Page mode currently being used.
 |      :rtype: ``str``, ``None`` if not specified
 |  
 |  pages
 |  
 |  xmpMetadata
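
The help text above says that `decrypt` returns 0, 1 or 2 depending on which password matched. A minimal sketch of how that return code could be interpreted is shown below. The helper `unlock` and the `FakeReader` class are hypothetical and exist only so the example runs without an encrypted PDF. The helper relies only on the `isEncrypted` attribute and the `decrypt` method listed in the help text.

```python
# Meaning of the integer codes, as described in the decrypt() help text.
DECRYPT_RESULTS = {
    0: "password failed",
    1: "matched user password",
    2: "matched owner password",
}


def unlock(reader, password: str) -> str:
    """Try to decrypt the reader and describe the outcome in words."""
    if not reader.isEncrypted:
        return "not encrypted"
    return DECRYPT_RESULTS[reader.decrypt(password)]


class FakeReader:
    """Hypothetical stand-in that mimics an encrypted PdfFileReader."""

    isEncrypted = True

    def decrypt(self, password):
        return 1 if password == "secret" else 0


print(unlock(FakeReader(), "secret"))
```

With a real file, the same helper would be called as `unlock(pdf_reader, password)` after creating the reader.
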
In [22]:
pdf_reader = PyPDF2.PdfFileReader(pdf)
In [23]:
pdf_reader
Out[23]:
<PyPDF2.pdf.PdfFileReader at 0x7f8e185362d0>
In [24]:
dir(pdf_reader)
Out[24]:
['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_authenticateUserPassword',
 '_buildDestination',
 '_buildField',
 '_buildOutline',
 '_checkKids',
 '_decrypt',
 '_decryptObject',
 '_flatten',
 '_getObjectFromStream',
 '_getPageNumberByIndirect',
 '_override_encryption',
 '_pageId2Num',
 '_pairs',
 '_writeField',
 '_zeroXref',
 'cacheGetIndirectObject',
 'cacheIndirectObject',
 'decrypt',
 'documentInfo',
 'flattenedPages',
 'getDestinationPageNumber',
 'getDocumentInfo',
 'getFields',
 'getFormTextFields',
 'getIsEncrypted',
 'getNamedDestinations',
 'getNumPages',
 'getObject',
 'getOutlines',
 'getPage',
 'getPageLayout',
 'getPageMode',
 'getPageNumber',
 'getXmpMetadata',
 'isEncrypted',
 'namedDestinations',
 'numPages',
 'outlines',
 'pageLayout',
 'pageMode',
 'pages',
 'read',
 'readNextEndLine',
 'readObjectHeader',
 'resolvedObjects',
 'stream',
 'strict',
 'trailer',
 'xmpMetadata',
 'xref',
 'xrefIndex',
 'xref_objStm']
In [25]:
print(pdf_reader.getNumPages())
2
In [26]:
totalPages = pdf_reader.numPages
print(totalPages)
2

Basic document metadata stored in the PDF file (documentInfo)

In [27]:
metadata = pdf_reader.getDocumentInfo()
metadata
Out[27]:
{'/Author': 'Nutan',
 '/Creator': 'Microsoft® Word 2019',
 '/CreationDate': "D:20220810231755+05'30'",
 '/ModDate': "D:20220810231755+05'30'",
 '/Producer': 'Microsoft® Word 2019'}
In [28]:
metadata['/Author']
Out[28]:
'Nutan'
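
The date fields in the metadata come back as raw PDF date strings such as "D:20220810231755+05'30'". The sketch below converts that form into a timezone-aware `datetime` using only the standard library. `parse_pdf_date` is a hypothetical helper, and it assumes the full form with an explicit UTC offset, as seen in the output above. PDF date strings can also be shorter or end in `Z`, which this sketch does not handle.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_pdf_date(value: str) -> datetime:
    """Parse a PDF date like D:YYYYMMDDHHmmSS+HH'mm' into a datetime."""
    match = re.fullmatch(r"D:(\d{14})([+-])(\d{2})'(\d{2})'?", value)
    if match is None:
        raise ValueError(f"unsupported PDF date string: {value!r}")
    stamp, sign, hours, minutes = match.groups()
    # Build the UTC offset, negating it for zones behind UTC.
    offset = timedelta(hours=int(hours), minutes=int(minutes))
    if sign == "-":
        offset = -offset
    naive = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return naive.replace(tzinfo=timezone(offset))


created = parse_pdf_date("D:20220810231755+05'30'")
print(created.isoformat())  # 2022-08-10T23:17:55+05:30
```

With the metadata object from the cell above, the call would be `parse_pdf_date(metadata['/CreationDate'])`.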

Extract text from the first page

In [29]:
page = pdf_reader.getPage(0)
In [30]:
print(page.extractText())
 
INTRODUCTION TO MACHINE LEARNING
 
 
O
ver the past two decades Machine Learning has become one of the main
-
 
stays of information technology and with that, a rather central, albeit usuall
y
 
hidden, part of our life. With the ever increasing amounts of data becomin
g
 
available there is good reason to believe that smart data analysis will becom
e
 
even more pervasive a
s a necessary ingredient for technological progress
.
 
The purpose of this chapter is to provide the reader with an overview ove
r
 
the vast range of applications which have at their heart a machine learnin
g
 
problem and to bring some degree of order to the zoo
 
of problems. Afte
r
 
that, we will discuss some basic tools from statistics and probability theory
,
 
since they form the language in which many machine learning problems mus
t
 
be phrased to become amenable to solving. Finally, we will outline a set o
f
 
fairly 
basic yet effective algorithms to solve an important problem, namel
y
 
that of classification. More sophisticated tools, a discussion of more genera
l
 
problems and a detailed analysis will follow in later parts of the book
.
 
1
 
A Taste of Machine Learnin
g
 
 
Mac
hine learning can appear in many guises. We now discuss a number o
f
 
applications, the types of data they deal with, and finally, we formalize th
e
 
problems in a somewhat more stylized fashion. The latter is key if we want t
o
 
avoid reinventing the wheel for 
every new application. Instead, much of th
e
 
ar
t
 
of machine learning is to reduce a range of fairly disparate problems t
o
 
a set of fairly narrow prototypes. Much of th
e
 
scienc
e
 
of machine learning i
s
 
then to solve those problems and provide good guarantees 
for the solutions
.
 
 
 
Most readers will be familiar with the concept of web pag
e
 
rankin
g
. Tha
t
 
is, the process of submitting a query to a search engine, which then find
s
 
webpages relevant to the query and which returns them in their order o
f
 

-
 

s
 

h
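
The output above shows that `extractText()` in this PyPDF2 version puts line breaks inside words and between lines. A rough post-processing step can make the text readable again. The following is a heuristic sketch, not part of PyPDF2: `clean_extracted_text` is a made-up helper that removes the line breaks, rejoins words split by an end-of-line hyphen, and collapses repeated whitespace. It assumes the break pattern seen in this sample and may also join hyphenated words that happened to fall at a line end.

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Rejoin text that was split by stray line breaks during extraction."""
    # Dropping the newlines glues split words back together ("O\nver" -> "Over").
    joined = raw.replace("\n", "")
    # A hyphen followed by whitespace between letters marks a line-end hyphenation.
    dehyphenated = re.sub(r"(?<=\w)-\s+(?=\w)", "", joined)
    # Collapse any remaining runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", dehyphenated).strip()


sample = (
    "O\nver the past two decades Machine Learning has become one of the main\n-\n \n"
    "stays of information technology and with that, a rather central, albeit usuall\n"
    "y\n \nhidden, part of our life."
)
print(clean_extracted_text(sample))
```

For the page extracted above, the call would be `clean_extracted_text(page.extractText())`.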
 

Extract text from the first 5 pages

This PDF has only 2 pages, so the loop below prints both. For a longer document it stops after the first 5 pages.

In [31]:
# Print the text of at most the first 5 pages
for i in range(min(5, totalPages)):
    page = pdf_reader.getPage(i)
    print(page.extractText())
 
INTRODUCTION TO MACHINE LEARNING
 
 
O
ver the past two decades Machine Learning has become one of the main
-
 
stays of information technology and with that, a rather central, albeit usuall
y
 
hidden, part of our life. With the ever increasing amounts of data becomin
g
 
available there is good reason to believe that smart data analysis will becom
e
 
even more pervasive a
s a necessary ingredient for technological progress
.
 
The purpose of this chapter is to provide the reader with an overview ove
r
 
the vast range of applications which have at their heart a machine learnin
g
 
problem and to bring some degree of order to the zoo
 
of problems. Afte
r
 
that, we will discuss some basic tools from statistics and probability theory
,
 
since they form the language in which many machine learning problems mus
t
 
be phrased to become amenable to solving. Finally, we will outline a set o
f
 
fairly 
basic yet effective algorithms to solve an important problem, namel
y
 
that of classification. More sophisticated tools, a discussion of more genera
l
 
problems and a detailed analysis will follow in later parts of the book
.
 
1
 
A Taste of Machine Learnin
g
 
 
Mac
hine learning can appear in many guises. We now discuss a number o
f
 
applications, the types of data they deal with, and finally, we formalize th
e
 
problems in a somewhat more stylized fashion. The latter is key if we want t
o
 
avoid reinventing the wheel for 
every new application. Instead, much of th
e
 
ar
t
 
of machine learning is to reduce a range of fairly disparate problems t
o
 
a set of fairly narrow prototypes. Much of th
e
 
scienc
e
 
of machine learning i
s
 
then to solve those problems and provide good guarantees 
for the solutions
.
 
 
 
Most readers will be familiar with the concept of web pag
e
 
rankin
g
. Tha
t
 
is, the process of submitting a query to a search engine, which then find
s
 
webpages relevant to the query and which returns them in their order o
f
 

-
 

s
 

h
 

pages are relevant and which pages match the query. Such knowledge can b
e
 
gained 
from sever
al sources: the link structure of webpages, their content
,
 
the frequency with which users will follow the suggested links in a query, o
r
 
from examples of queries in combination with manually ranked webpages
.
 
Increasingly machine learning rather than guessw
ork and clever engineerin
g
 
is used t
o
 
automat
e
 
the process of designing a good search engine [RPB06]
.
 
 
A rather related application i
s
 
collaborative filterin
g
. Internet book
-
 
stores such as Amazon, or video rental sites such as Netflix use this informa
-
 
tio
n extensively to entice users to purchase additional goods (or rent mor
e
 
movies). The problem is quite similar to the one of web page ranking. A
s
 
before, we want to obtain a sorted list (in this case of articles). The key dif
-
 
ference is that an explicit q
uery is missing and instead we can only use pas
t
 
purchase and viewing decisions of the user to predict future viewing an
d
 
purchase habits. The key side information here are the decisions made b
y
 
simila
r
 
users, hence the collaborative nature of the process. See Figure 1.
2
 
for an example. It is clearly desirable to have an automatic system to solv
e
 
this problem, thereby avoidin
g guesswork and time [BK07]
.
 
 
An equally ill
-
defined problem is that o
f
 
automatic translatio
n
 
of doc
-
 
uments. At one extreme, we could aim at full
y
 
understandin
g
 
a text befor
e
 
translating it using a curated set of rules crafted by a computational linguis
t
 
w
ell versed in the two languages we would like to translate. This is a rathe
r
 
arduous task, in particular given that text is not always grammatically cor
-
 
rect, nor is the document understanding part itself a trivial one. Instead, w
e
 
could simply us
e
 
exampl
e
s
 
of translated documents, such as the proceeding
s
 
of the Canadian parliament or other multilingual entities (United Nations
,
 
European Union, Switzerland) t
o
 
lear
n
 
how to translate between the tw
o
 
languages. In other words, we could use examples of translations to lear
n
 
how to translate. This machine learning approach proved quite successfu
l
 
[
?
]
.
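
The page loop can also be wrapped in a small function that returns the text instead of printing it. This is a sketch under assumptions: `extract_first_pages` is a hypothetical helper that relies only on the `getNumPages()` and `getPage()` methods from the help text and on `extractText()` as used above. The `FakePage` and `FakeReader` classes are stand-ins so the block runs without a PDF file.

```python
def extract_first_pages(reader, limit: int = 5) -> list:
    """Return the text of at most `limit` pages, starting from page 0."""
    count = min(limit, reader.getNumPages())
    return [reader.getPage(i).extractText() for i in range(count)]


class FakePage:
    """Hypothetical stand-in for a PyPDF2 page object."""

    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


class FakeReader:
    """Hypothetical stand-in for PdfFileReader, holding fixed page texts."""

    def __init__(self, texts):
        self._pages = [FakePage(text) for text in texts]

    def getNumPages(self):
        return len(self._pages)

    def getPage(self, pageNumber):
        return self._pages[pageNumber]


two_page_reader = FakeReader(["page one", "page two"])
print(extract_first_pages(two_page_reader))
```

With the real reader from this post, the call would be `extract_first_pages(pdf_reader)`, which should return one string per page.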
 

Close the file

In [32]:
pdf.close()
