Extract all Images from PDF in Python

2 min read · Aug 4, 2021

In this tutorial, we will write Python code to extract images from PDF files and save them to the local disk using the PyMuPDF and Pillow libraries.

With PyMuPDF, you can open PDF, XPS, OpenXPS, EPUB and several other document formats. It should run on all major platforms, including Windows, macOS and Linux.

Let’s get started!

First, install the required modules.

python -m pip install PyMuPDF Pillow

Now open or create your Python file and import the libraries.

import io
import fitz
from PIL import Image

To test the script, we are going to use this file. Feel free to choose any file; just make sure it is in your working directory, or that you have the correct path to the PDF file.

# file path you want to extract images from
file = "1770.521236.pdf"
# open the file
pdf_file = fitz.open(file)

Since we want to extract images from all pages, we need to iterate over every page and get the image objects on each one. The following code does that:

# iterate over pdf pages
for page_index in range(len(pdf_file)):
    # get the page itself
    page = pdf_file[page_index]
    # list the images on this page (getImageList() in older PyMuPDF releases)
    image_list = page.get_images()
    # print the number of images found on this page
    if image_list:
        print(f"[+] Found a total of {len(image_list)} images in page {page_index}")
    else:
        print("[!] No images found on page", page_index)
    for image_index, img in enumerate(image_list, start=1):
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes (extractImage() in older PyMuPDF releases)
        base_image = pdf_file.extract_image(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it to PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to local disk; the with block closes the file afterwards
        with open(f"image{page_index+1}_{image_index}.{image_ext}", "wb") as f:
            image.save(f)

We use the getImageList() method (named get_images() in recent PyMuPDF releases) to list all image objects on a given page as a list of tuples. The first element of each tuple is the image's xref (cross-reference number), which is what we need to extract it.
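
To make the tuple layout concrete, here is a small sketch that labels the leading fields of one entry. The field order (xref, smask, width, height, bpc, colorspace, ...) follows my reading of the PyMuPDF documentation, so treat it as an assumption and check it against your installed version. The helper describe_entry and the sample tuple are made up for illustration.

```python
def describe_entry(entry):
    """Label the leading fields of one image tuple (assumed PyMuPDF layout)."""
    xref, smask, width, height, bpc, colorspace = entry[:6]
    return {
        "xref": xref,              # cross-reference number used for extraction
        "smask": smask,            # xref of the soft mask, 0 if there is none
        "width": width,            # width in pixels
        "height": height,          # height in pixels
        "bpc": bpc,                # bits per color component
        "colorspace": colorspace,  # colorspace name
    }


# Hypothetical entry, not taken from a real PDF
sample = (12, 0, 640, 480, 8, "DeviceRGB", "", "Im1", "DCTDecode")
info = describe_entry(sample)
print(f"xref {info['xref']}: {info['width']}x{info['height']}")
```

Only the xref is needed for extraction, but the width and height can be handy if you want to skip tiny images such as icons.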

After that, we use the extractImage() method (extract_image() in recent releases), which returns a dictionary containing the image bytes along with additional information such as the image extension.
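
Because the returned dictionary already carries the encoded bytes and a matching extension, those bytes can also be written to disk directly, which avoids any re-encoding by Pillow. A minimal sketch under that assumption follows; save_raw_image and its stem and out_dir arguments are illustrative names, not part of PyMuPDF.

```python
from pathlib import Path


def save_raw_image(base_image, stem, out_dir="."):
    """Write the bytes of an extracted-image dict to <out_dir>/<stem>.<ext>."""
    out_path = Path(out_dir) / f"{stem}.{base_image['ext']}"
    out_path.write_bytes(base_image["image"])
    return out_path


# Possible use inside the tutorial's loop (assumes the names used there):
#   save_raw_image(base_image, f"image{page_index+1}_{image_index}")
```

This only relies on the "image" and "ext" keys that the tutorial already uses.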

Finally, we convert the image bytes to a PIL image instance and save it to the local disk using the save() method, which accepts a file pointer as an argument. The images are named after their page and image indices.
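
One thing the loop does not handle: an image reused on several pages (a logo, say) shows up in each page's list and is saved once per page. Here is a hedged sketch of skipping repeats by remembering the xrefs already seen. unique_xrefs is a hypothetical helper that takes the per-page lists as plain Python data, so it can be tried without a PDF.

```python
def unique_xrefs(pages_image_lists):
    """Yield (page_index, xref) for the first occurrence of each xref."""
    seen = set()
    for page_index, image_list in enumerate(pages_image_lists):
        for img in image_list:
            xref = img[0]
            if xref in seen:
                continue  # already handled on an earlier page
            seen.add(xref)
            yield page_index, xref


# Hypothetical data: xref 7 appears on both pages
pages = [[(7, 0), (9, 0)], [(7, 0), (11, 0)]]
print(list(unique_xrefs(pages)))  # [(0, 7), (0, 9), (1, 11)]
```

In the script, the input could be built with something like [page.get_images() for page in pdf_file], again assuming a recent PyMuPDF release.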

That's it!

After running the script you will get the following output:

[!] No images found on page 0
[+] Found a total of 3 images in page 1
[+] Found a total of 3 images in page 2
[!] No images found on page 3
[!] No images found on page 4

The images are also saved in the current directory.

Conclusion

Alright, we have successfully extracted the images from that PDF file without losing image quality. For more information on how the library works, I suggest you take a look at the documentation.

Written by Ali Aref
