PDF Image Extractor

A Python tool to extract images from PDF files.

Features

📄 Extract images from PDF files
📁 Organized output structure
🖼️ Preserve original image formats
🔍 Filter out small images and duplicates
🛠️ Simple command-line interface

Prerequisites

Python 3.6 or higher
pip (Python package installer)

Installation

Clone or download this repository
Navigate to the project directory

Create and activate a virtual environment (recommended):

python -m venv .venv
.venv\Scripts\activate  # On Windows
source .venv/bin/activate  # On Unix/macOS

Install required packages:
```
pip install -r requirements.txt
```

Usage

Basic Usage

Place your PDF files in the pdfs directory
Run the script:
```
python pdf-image-extractor.py
```
Extracted images will be saved in the extracted_images directory

Advanced Options

python pdf-image-extractor.py [INPUT_DIR] [--output_dir OUTPUT_DIR] [--min_size MIN_SIZE]

Arguments:

INPUT_DIR: Directory containing PDF files (optional, default: ./pdfs)
--output_dir: Directory to save extracted images (default: ./extracted_images)
--min_size: Minimum pixel dimension for images (default: 100)
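
The argument list above could be wired up with `argparse` roughly as follows. This is a minimal sketch based only on the options documented here, not the script's actual source; the function name `build_parser` and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Extract images from PDF files.")
    # Optional positional argument: falls back to ./pdfs when omitted.
    parser.add_argument("input_dir", nargs="?", default="./pdfs",
                        help="Directory containing PDF files")
    parser.add_argument("--output_dir", default="./extracted_images",
                        help="Directory to save extracted images")
    parser.add_argument("--min_size", type=int, default=100,
                        help="Minimum pixel dimension for images")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["my_pdfs", "--min_size", "200"])
print(args.input_dir, args.output_dir, args.min_size)
```

Making the positional argument optional with `nargs="?"` is what allows the script to run with no arguments at all.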

Examples:

# Use default pdfs directory
python pdf-image-extractor.py

# Specify custom input directory
python pdf-image-extractor.py my_pdfs

# Extract images with custom minimum size
python pdf-image-extractor.py --min_size 200

# Specify custom input and output directories
python pdf-image-extractor.py my_pdfs --output_dir my_images

Directory Structure

.
├── pdfs/                  # Place your PDF files here
├── extracted_images/     # Contains extracted images
│   └── pdf_name/        # Subdirectory for each PDF
│       └── pageX_imgY_WxH.ext  # Extracted images
├── pdf-image-extractor.py
├── requirements.txt
└── README.md
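
As an illustration of the layout above, the subdirectory for a given PDF could be computed like this. The sketch assumes the subdirectory name is the PDF file name without its extension; `output_dir_for` is a hypothetical helper, not a function from the script.

```python
from pathlib import Path


def output_dir_for(pdf_path, output_root="extracted_images"):
    """Return the per-PDF output directory (assumed: named after the PDF stem)."""
    return Path(output_root) / Path(pdf_path).stem


# The path is only computed here; nothing is created on disk.
print(output_dir_for("pdfs/report.pdf"))
```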

Output Format

Extracted images are named using the following format:

page{page_number}_img{image_index}_{width}x{height}.{extension}

Example: page1_img0_800x600.jpg
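
If you need to build or parse these names in your own code, something like the following may help. It is a small sketch of the naming pattern described above; `image_filename` and `parse_image_filename` are hypothetical helpers, not part of the script.

```python
import re

# Matches names such as page1_img0_800x600.jpg
NAME_PATTERN = re.compile(r"page(\d+)_img(\d+)_(\d+)x(\d+)\.(\w+)")


def image_filename(page_number, image_index, width, height, extension):
    """Format a file name following the documented pattern."""
    ext = extension.lstrip(".")  # accept "jpg" as well as ".jpg"
    return f"page{page_number}_img{image_index}_{width}x{height}.{ext}"


def parse_image_filename(name):
    """Split a file name back into its fields, or return None if it does not match."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        return None
    page, index, width, height, ext = match.groups()
    return {"page": int(page), "index": int(index),
            "width": int(width), "height": int(height), "extension": ext}


print(image_filename(1, 0, 800, 600, "jpg"))  # page1_img0_800x600.jpg
```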

Notes

Each PDF's images are extracted to a separate subdirectory
Small images and duplicates are automatically filtered
Original image formats are preserved
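
The filtering step might look roughly like the sketch below. It makes two assumptions that this README does not confirm: an image is kept only if both its width and height are at least `min_size`, and duplicates are detected by hashing the raw image bytes. `filter_images` is a hypothetical name.

```python
import hashlib


def filter_images(images, min_size=100):
    """Yield (width, height, data) tuples, skipping small images and duplicates.

    Assumptions: both dimensions must reach min_size, and two images count as
    duplicates when their bytes are identical.
    """
    seen = set()
    for width, height, data in images:
        if width < min_size or height < min_size:
            continue  # too small
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue  # same bytes already emitted
        seen.add(digest)
        yield width, height, data


candidates = [
    (800, 600, b"photo-bytes"),
    (16, 16, b"icon-bytes"),     # dropped: below min_size
    (800, 600, b"photo-bytes"),  # dropped: duplicate of the first
]
kept = list(filter_images(candidates))
print(len(kept))
```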

Troubleshooting

No PDFs found: Ensure your PDF files are in the specified input directory
Permission errors: Check write permissions for the output directory
Corrupted PDFs: The script will skip problematic pages and continue processing
Memory issues: Process large PDFs one at a time, for example by pointing INPUT_DIR at a directory that contains a single PDF
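
The first two problems can be checked before running the script. The following is a sketch of such a check, not something the script is known to do itself; `preflight` is a hypothetical helper, and it assumes PDFs are identified by a `.pdf` extension.

```python
import os
from pathlib import Path
from typing import List


def preflight(input_dir, output_dir) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    in_path = Path(input_dir)
    if not in_path.is_dir():
        problems.append(f"Input directory not found: {in_path}")
    elif not any(p.suffix.lower() == ".pdf" for p in in_path.iterdir()):
        problems.append(f"No PDFs found in: {in_path}")

    # The output directory may not exist yet, so test the nearest existing ancestor.
    target = Path(output_dir)
    while not target.exists() and target != target.parent:
        target = target.parent
    if not os.access(target, os.W_OK):
        problems.append(f"No write permission for: {target}")
    return problems


for problem in preflight("./pdfs", "./extracted_images"):
    print(problem)
```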

License

This project is licensed under the MIT License - see the LICENSE file for details.
