A Python tool to extract images from PDF files.
- 📄 Extract images from PDF files
- 📁 Organized output structure
- 🖼️ Preserve original image formats
- 🔍 Filter out small images and duplicates
- 🛠️ Simple command-line interface
- Python 3.6 or higher
- pip (Python package installer)
- Clone or download this repository
- Navigate to the project directory
- Create and activate a virtual environment (recommended):
python -m venv .venv
.venv\Scripts\activate # On Windows
source .venv/bin/activate # On Unix/MacOS
- Install required packages:
pip install -r requirements.txt
- Place your PDF files in the `pdfs` directory
- Run the script:
python pdf-image-extractor.py
- Extracted images will be saved in the `extracted_images` directory
python pdf-image-extractor.py [INPUT_DIR] [--output_dir OUTPUT_DIR] [--min_size MIN_SIZE]
- `INPUT_DIR`: Directory containing PDF files (optional, default: ./pdfs)
- `--output_dir`: Directory to save extracted images (default: ./extracted_images)
- `--min_size`: Minimum pixel dimension for images (default: 100)
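
The argument layout above can be reproduced with Python's standard `argparse` module. The following is only a sketch of how such an interface might be declared, not the script's actual source; the function name `build_parser` and the help strings are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(
        prog="pdf-image-extractor.py",
        description="Extract images from PDF files.",
    )
    # nargs="?" makes the positional argument optional, so the default applies
    # when no input directory is given.
    parser.add_argument("input_dir", nargs="?", default="./pdfs",
                        help="Directory containing PDF files")
    parser.add_argument("--output_dir", default="./extracted_images",
                        help="Directory to save extracted images")
    # type=int converts the command-line string into a number.
    parser.add_argument("--min_size", type=int, default=100,
                        help="Minimum pixel dimension for images")
    return parser


# Parse an explicit argument list instead of sys.argv to keep the demo deterministic.
args = build_parser().parse_args(["my_pdfs", "--min_size", "200"])
print(args.input_dir, args.output_dir, args.min_size)
# my_pdfs ./extracted_images 200
```

Because every argument has a default, running with no arguments at all is valid and falls back to `./pdfs`, `./extracted_images`, and `100`.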
# Use default pdfs directory
python pdf-image-extractor.py
# Specify custom input directory
python pdf-image-extractor.py my_pdfs
# Extract images with custom minimum size
python pdf-image-extractor.py --min_size 200
# Specify custom input and output directories
python pdf-image-extractor.py my_pdfs --output_dir my_images
.
├── pdfs/                         # Place your PDF files here
├── extracted_images/             # Contains extracted images
│   └── pdf_name/                 # Subdirectory for each PDF
│       └── pageX_imgY_WxH.ext    # Extracted images
├── pdf-image-extractor.py
├── requirements.txt
└── README.md
Extracted images are named using the following format:
page{page_number}_img{image_index}_{width}x{height}.{extension}
Example: page1_img0_800x600.jpg
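
As an illustration of the naming format, here is a minimal sketch that builds the output path for one image. The helper `image_output_path` and the sample file `pdfs/report.pdf` are hypothetical, and using the PDF's file name without its suffix as the subdirectory name is an assumption based on the directory layout shown earlier.

```python
from pathlib import Path


def image_output_path(output_dir, pdf_path, page_number, image_index,
                      width, height, extension):
    """Hypothetical helper: build the destination path for one extracted image."""
    # Assumption: each PDF gets a subdirectory named after the file, minus ".pdf".
    pdf_name = Path(pdf_path).stem
    filename = f"page{page_number}_img{image_index}_{width}x{height}.{extension}"
    return Path(output_dir) / pdf_name / filename


path = image_output_path("extracted_images", "pdfs/report.pdf", 1, 0, 800, 600, "jpg")
print(path.name)  # page1_img0_800x600.jpg
```

Encoding the page, index, and dimensions in the file name keeps names unique within a PDF's subdirectory and makes images easy to sort or filter by size afterwards.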
- Each PDF's images are extracted to a separate subdirectory
- Small images and duplicates are filtered out automatically
- Original image formats are preserved
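
The filtering described in the notes above could work along the following lines. This is a sketch under stated assumptions, not the script's actual logic: it treats an image as "small" when either dimension is below `min_size`, and it treats two images as duplicates only when their bytes are identical (compared via a SHA-256 digest). The function name `filter_images` and the tuple layout are hypothetical.

```python
import hashlib


def filter_images(images, min_size=100):
    """Keep images that are large enough and not byte-identical to an earlier one.

    `images` is an iterable of (width, height, data) tuples, where data is bytes.
    """
    seen_digests = set()
    kept = []
    for width, height, data in images:
        # Assumption: reject the image if either dimension is below min_size.
        if width < min_size or height < min_size:
            continue
        # Assumption: duplicates are detected by hashing the raw image bytes.
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen_digests:
            continue
        seen_digests.add(digest)
        kept.append((width, height, data))
    return kept


sample = [(800, 600, b"photo"), (16, 16, b"icon"), (800, 600, b"photo")]
print(len(filter_images(sample)))  # 1
```

Hashing only catches exact copies, such as a logo repeated on every page; visually similar images that were re-encoded would still be kept.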
- No PDFs found: Ensure your PDF files are in the specified input directory
- Permission errors: Check write permissions for the output directory
- Corrupted PDFs: The script will skip problematic pages and continue processing
- Memory issues: Process large PDFs one at a time, for example by placing each one in its own input directory
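
The first two checks in the list above can be run before any extraction starts. The sketch below is a hypothetical pre-flight helper (the name `preflight` is not part of the script) that uses only the standard library to confirm that the input directory contains PDFs and that the output directory is writable.

```python
import os
from pathlib import Path


def preflight(input_dir, output_dir):
    """Return the PDFs found in input_dir, raising early on common problems."""
    input_path = Path(input_dir)
    # Match the .pdf suffix case-insensitively so that "SCAN.PDF" is also found.
    pdfs = sorted(
        p for p in input_path.glob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    if not pdfs:
        raise FileNotFoundError(f"No PDF files found in {input_path}")

    output_path = Path(output_dir)
    # Create the output directory if it is missing, then confirm it is writable.
    output_path.mkdir(parents=True, exist_ok=True)
    if not os.access(output_path, os.W_OK):
        raise PermissionError(f"Cannot write to {output_path}")
    return pdfs
```

Calling `preflight("pdfs", "extracted_images")` either returns the list of PDF paths or raises an exception that names the directory at fault.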
This project is licensed under the MIT License - see the LICENSE file for details.