Use PyPDF2 - open PDF file or encrypted PDF file
Motivation
Since I want to work with PDF files in Python at work, I investigated which library can do that and how to use it.
Preparation
The runtime and module versions are as below.
- Python 3.6
- PyPDF2 1.26.0
Install PyPDF2
PyPDF2 is often used to work with PDF files in Python.
PyPDF2 can
- Extract text from a PDF file
- Modify an existing PDF file and create a new one
Let's install it with the pip command.
pip install PyPDF2
Prepare PDF file
Prepare a PDF file to work with. This time, I downloaded an Executive Order document. It has three pages in all.
Read PDF file
In this section, we open and read a normal PDF file. The following sample code prints the number of pages in the PDF file.
import PyPDF2

FILE_PATH = './files/executive_order.pdf'

with open(FILE_PATH, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    print(f"Number of pages: {reader.getNumPages()}")
After importing PyPDF2, open the PDF file in binary read mode.
Then create a PdfFileReader object to work with the PDF.
Check the result.
Number of pages: 3
Read a PDF file with a password (Encrypted PDF)
In this section, we open and read an encrypted PDF file that requires a password to open. To create an encrypted PDF file, enable the encryption option and set a password when saving the PDF file.
Failed example
Save a PDF file named executive_order_encrypted.pdf with the password hoge1234.
Then run the previous code, which reads a PDF without a password, against this file.
# Failed example
import PyPDF2

FILE_PATH = './files/executive_order_encrypted.pdf'

with open(FILE_PATH, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    print(f"Number of pages: {reader.getNumPages()}")
The following error message will be printed.
PdfReadError: File has not been decrypted
Success example
The decrypt method, given the password string as its argument, decrypts an encrypted PDF file.
It is better to check whether the file is encrypted with the isEncrypted property before calling decrypt.
import PyPDF2

ENCRYPTED_FILE_PATH = './files/executive_order_encrypted.pdf'

with open(ENCRYPTED_FILE_PATH, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    # Decrypt only when the file is actually encrypted
    if reader.isEncrypted:
        reader.decrypt('hoge1234')
    print(f"Number of pages: {reader.getNumPages()}")
Number of pages: 3
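The sample above ignores the value returned by decrypt. In PyPDF2 1.x, decrypt returns 0 when the password matches neither the user password nor the owner password, so a wrong password would only surface later as a read error. The following is a minimal sketch of a check for that case. The helper name ensure_decrypted is hypothetical (it is not part of PyPDF2), and it only assumes that the reader object exposes isEncrypted and decrypt as PdfFileReader does.

```python
def ensure_decrypted(reader, password):
    """Decrypt `reader` in place if it is encrypted.

    Hypothetical helper, not part of PyPDF2. Returns True if decryption
    was performed, False if the file was not encrypted.
    Raises ValueError when the password is rejected.
    """
    if not reader.isEncrypted:
        return False
    # PyPDF2 1.x: decrypt() returns 0 if the password failed,
    # 1 if it matched the user password, 2 if it matched the owner password.
    if reader.decrypt(password) == 0:
        raise ValueError("wrong password for encrypted PDF")
    return True
```

With a real file, this would be called as ensure_decrypted(reader, 'hoge1234') right after creating the PdfFileReader and before getNumPages.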
Troubleshooting: NotImplementedError is raised when calling decrypt
The following error may be raised when working with an encrypted PDF file.
NotImplementedError: only algorithm code 1 and 2 are supported
The error message means that PyPDF2 does not implement the algorithm that was used to encrypt the PDF file. If this happens, it is difficult to open the PDF file with PyPDF2 alone.
Decrypt with qpdf
Using qpdf is a quick solution.
qpdf is a command-line tool for working with PDF files.
We can download its Windows installer from SourceForge, or install it on Mac with the brew install qpdf command.
The sample code below decrypts a PDF file with qpdf.
import PyPDF2
import os

ENCRYPTED_FILE_PATH = './files/executive_order_encrypted.pdf'
FILE_OUT_PATH = './files/executive_order_out.pdf'

PASSWORD = 'hoge1234'

with open(ENCRYPTED_FILE_PATH, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    if reader.isEncrypted:
        try:
            reader.decrypt(PASSWORD)
            print(f"Number of pages: {reader.getNumPages()}")
        except NotImplementedError:
            # PyPDF2 cannot decrypt this file, so let qpdf write a decrypted copy
            command = f"qpdf --password='{PASSWORD}' --decrypt {ENCRYPTED_FILE_PATH} {FILE_OUT_PATH}"
            os.system(command)
            # Read the decrypted copy while its file handle is still open
            with open(FILE_OUT_PATH, mode='rb') as fp:
                reader = PyPDF2.PdfFileReader(fp)
                print(f"Number of pages: {reader.getNumPages()}")
The point is that Python runs the qpdf command as an OS command and saves the decrypted PDF as a new file without a password. Then we create a PdfFileReader instance to work with that file in PyPDF2.
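Building the command as a single string depends on shell quoting, which can break when the password or a path contains spaces or quotes. As an alternative sketch, the same qpdf call can be made with subprocess and an argument list. The function names build_qpdf_command and decrypt_with_qpdf are hypothetical helpers introduced for illustration; the qpdf options are the same --password and --decrypt used above.

```python
import shutil
import subprocess


def build_qpdf_command(password, src_path, dst_path):
    """Return the qpdf argument list that writes a decrypted copy of src_path."""
    # A list of arguments is passed to qpdf as-is, without going through
    # a shell, so no quoting of the password or the paths is needed.
    return ["qpdf", f"--password={password}", "--decrypt", src_path, dst_path]


def decrypt_with_qpdf(password, src_path, dst_path):
    """Run qpdf to decrypt src_path into dst_path (hypothetical helper)."""
    if shutil.which("qpdf") is None:
        raise FileNotFoundError("qpdf was not found on PATH")
    # check=True raises CalledProcessError if qpdf exits with a non-zero status
    subprocess.run(build_qpdf_command(password, src_path, dst_path), check=True)
```

In the except NotImplementedError branch, decrypt_with_qpdf(PASSWORD, ENCRYPTED_FILE_PATH, FILE_OUT_PATH) would take the place of the os.system call.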
Conclusion
It is possible to
- Open a PDF file with PdfFileReader in PyPDF2
- Decrypt an encrypted PDF file with the decrypt method
- Decrypt an encrypted PDF file with qpdf when NotImplementedError is raised