I had one of those "Automate the Boring Stuff" problems this week. How do you convert a multi-page PDF into a folder of images, with one image for each page in the PDF? Read on to see how to solve this problem with Python.
So here is the basic problem:
Convert a multi-page PDF into a directory of images
We're going to solve this problem with Python.
If you don't have Python installed yet, I suggest you install the Anaconda distribution of Python. See this post to learn how to install Anaconda on your computer. Alternatively, you can download Python from Python.org or from the Microsoft Store.
Create a virtual environment and install pdf2image
Before we start writing Python code, it is a good idea to create a new virtual environment. A virtual environment is an isolated installation of Python that is separate from other Python installations running on your computer. See this post to learn how to create a virtual environment with the Anaconda Prompt.
To create a new Python virtual environment, open the Anaconda Prompt and type the following command. Note the prompt sign > is included to indicate the prompt, not a character you should type. The -n pdf portion of the command denotes the name of the virtual environment. python=3.7 ensures Python Version 3.7 is installed into the virtual environment.
> conda create -n pdf python=3.7
Type y for yes when prompted. Before we install any packages into the new environment, we need to activate it. When the environment is active, (pdf) is shown in parentheses before the prompt.
> conda activate pdf
(pdf)>
Install pdf2image and poppler
Next, we need to install the poppler package using conda and the pdf2image package using pip. The -c conda-forge portion of the conda install command indicates poppler will be installed from the conda-forge channel of packages.
(pdf)> conda install -c conda-forge poppler
(pdf)> pip install pdf2image
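Before moving on, it can be worth confirming that poppler's command-line tools are visible from the active environment, since pdf2image calls them under the hood. The snippet below is a minimal sketch using only the standard library. The helper name find_poppler_tools is made up for this example, and it assumes the tools are named pdfinfo and pdftoppm on your system.

```python
# check_poppler.py (hypothetical helper, not part of pdf2image)
import shutil


def find_poppler_tools(tools=("pdfinfo", "pdftoppm")):
    """Return a dict mapping each tool name to its full path, or None if not found."""
    return {tool: shutil.which(tool) for tool in tools}


# Report where each tool was found on the PATH of the current environment
for tool, path in find_poppler_tools().items():
    status = path if path else "NOT FOUND on PATH"
    print(f"{tool}: {status}")
```

If either tool is reported as not found, the most likely causes are that the (pdf) environment is not active or that the poppler install did not finish.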
Alright, it's time to write the code.
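At the top of the file, import the convert_from_path function from the pdf2image package and the PDFInfoNotInstalledError, PDFPageCountError, and PDFSyntaxError exceptions from pdf2image.exceptions.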
# pdf.py
from pdf2image import convert_from_path
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)
Below is the code to convert a PDF named myfile.pdf to multiple .png images. Save this code below the import lines.
images = convert_from_path('myfile.pdf')
for i, image in enumerate(images):
    fname = "image" + str(i) + ".png"
    image.save(fname, "PNG")
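The loop above writes image0.png, image1.png, and so on into the current folder. Since the goal is a directory of images, here is one possible variation, sketched under a few assumptions: the function names save_pages and convert_pdf, the pages output folder, and the page-001.png naming pattern are my own choices for illustration, not something pdf2image requires. Page numbers start at 1 and are zero-padded so the files sort in page order.

```python
from pathlib import Path


def save_pages(images, out_dir="pages", stem="page"):
    """Save each image as <out_dir>/<stem>-NNN.png and return the paths written.

    `images` can be any sequence of objects with a PIL-style
    .save(path, format) method, such as the list returned by
    pdf2image.convert_from_path.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder if needed

    # Pad page numbers to at least 3 digits so files sort in page order
    width = max(3, len(str(len(images))))
    written = []
    for number, image in enumerate(images, start=1):
        fname = out_path / f"{stem}-{number:0{width}d}.png"
        image.save(fname, "PNG")
        written.append(fname)
    return written


def convert_pdf(pdf_path, out_dir="pages"):
    """Convert every page of pdf_path to a PNG inside out_dir."""
    # Imported here so save_pages can be reused without pdf2image installed
    from pdf2image import convert_from_path

    return save_pages(convert_from_path(pdf_path), out_dir)
```

With this sketch, calling convert_pdf('myfile.pdf') on a three-page PDF would write page-001.png, page-002.png, and page-003.png into a pages folder next to the script.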
Run the code
Put a PDF file in the same folder as the code. Make sure the PDF file has the same name as used in the code above. I called my PDF file myfile.pdf. If your PDF file has a different name, either rename the PDF or use a different file name in the code above.
├───pdf
│   ├───pdf.py
│   └───myfile.pdf
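Instead of renaming the file or editing the script, the PDF name could also be passed on the command line. This is a small sketch using argparse from the standard library. The parse_args helper and the myfile.pdf default are assumptions for illustration and are not part of the original script.

```python
import argparse


def parse_args(argv=None):
    """Parse the PDF file name from the command line (defaults to myfile.pdf)."""
    parser = argparse.ArgumentParser(
        description="Convert a multi-page PDF to one PNG image per page"
    )
    parser.add_argument(
        "pdf",
        nargs="?",             # make the argument optional
        default="myfile.pdf",  # same file name used in this post
        help="path to the PDF file to convert",
    )
    return parser.parse_args(argv)


# Example: parse an explicit argument list instead of the real command line
args = parse_args(["report.pdf"])
print(args.pdf)  # report.pdf
```

In pdf.py, calling parse_args() with no arguments reads the real command line, and args.pdf would then take the place of the hard-coded 'myfile.pdf' in the call to convert_from_path.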
The pdf.py script can be run from the Anaconda Prompt. Make sure the (pdf) environment is active before the script is run.
(pdf)> python pdf.py
When the script finishes, you should see images in the same folder as your pdf.py script and myfile.pdf PDF file. My PDF had three pages, so three .png image files were created.
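If the script stops with an error instead, the three exception classes imported at the top of pdf.py can help narrow down the cause. The script in this post imports them but never catches them, so the following is only one way they might be used. The convert_with_messages helper and its messages are hypothetical. My reading of the exception names is that PDFInfoNotInstalledError points to poppler not being found, while the other two point to a problem reading the PDF itself.

```python
def convert_with_messages(pdf_path):
    """Convert pdf_path to a list of images, or print a hint and return None."""
    # Imported inside the function so this file can be loaded even when
    # pdf2image is missing; in pdf.py these imports sit at the top of the file
    from pdf2image import convert_from_path
    from pdf2image.exceptions import (
        PDFInfoNotInstalledError,
        PDFPageCountError,
        PDFSyntaxError,
    )

    try:
        return convert_from_path(pdf_path)
    except PDFInfoNotInstalledError:
        # Assumption: this means the poppler tools could not be run
        print("poppler was not found. Is the (pdf) environment active?")
    except (PDFPageCountError, PDFSyntaxError) as err:
        # Assumption: these indicate a missing, unreadable, or malformed PDF
        print(f"Could not read {pdf_path}: {err}")
    return None
```

A caller would check for None before looping over the result, for example by only saving images when convert_with_messages('myfile.pdf') returns a list.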
In this post, we used a Python package called pdf2image to convert a PDF file into a directory full of images. Big thanks to the maintainers of pdf2image for making such a useful package!
The complete Python script is below:
# pdf.py
"""
A Python script to convert a multi-page PDF to a directory of images
Uses the pdf2image package
"""
from pdf2image import convert_from_path
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)

images = convert_from_path('myfile.pdf')
for i, image in enumerate(images):
    fname = 'image' + str(i) + '.png'
    image.save(fname, "PNG")