OCRex is a high-accuracy, non-destructive bulk PDF OCR processing tool built on OCRmyPDF. Designed for large-scale document analysis, it enhances low-quality declassified PDFs with advanced pre-processing such as automatic rotation, deskewing, and denoising. The tool leverages multiprocessing and robust error handling to ensure high performance even on low-end machines.
- Batch OCR Processing – Processes whole directories of PDFs in a single run, with several files handled in parallel.
- Non-Destructive – Preserves original PDFs by generating new output files.
- Advanced Pre-Processing – Automatically rotates, deskews, and denoises scanned pages.
- Multiprocessing – Optimized for high-speed processing using available CPU cores.
- Full OCRmyPDF Integration – Utilizes all features of OCRmyPDF for lossless text recognition.
- Cross-Platform – Works on Linux, macOS, and Windows.
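The batch and multiprocessing behaviour listed above can be pictured with a short sketch. This is not OCRex's actual source: the names `plan_jobs`, `ocr_one`, and `run_batch` are hypothetical, and the sketch assumes the `ocrmypdf` Python package is installed and calls its `ocrmypdf.ocr()` function with the `rotate_pages`, `deskew`, `optimize`, and `jobs` keyword arguments.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
import os


def plan_jobs(input_dir, output_dir):
    """Pair every PDF in input_dir with a new path in output_dir (originals are never overwritten)."""
    src_root, dst_root = Path(input_dir), Path(output_dir)
    return [(src, dst_root / src.name) for src in sorted(src_root.glob("*.pdf"))]


def ocr_one(job):
    """OCR a single file and return (source, error message or None)."""
    src, dst = job
    try:
        import ocrmypdf  # imported lazily so job planning works without it installed
        # jobs=1 keeps each file single-process; parallelism comes from the pool below.
        ocrmypdf.ocr(src, dst, rotate_pages=True, deskew=True, optimize=1, jobs=1)
        return src, None
    except Exception as exc:  # one bad PDF should not stop the whole batch
        return src, str(exc)


def run_batch(input_dir, output_dir, workers=None):
    """Run ocr_one over all planned jobs in a process pool and collect failures."""
    jobs = plan_jobs(input_dir, output_dir)
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    failures = []
    with ProcessPoolExecutor(max_workers=workers or os.cpu_count()) as pool:
        for src, error in pool.map(ocr_one, jobs):
            if error is not None:
                failures.append((src, error))
    return failures
```

When trying this out, call `run_batch("./pdfs", "./output")` from inside an `if __name__ == "__main__":` guard so that worker processes can be started safely on Windows and macOS.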
pip install -r requirements.txt
For system dependencies:
sudo apt install tesseract-ocr ghostscript poppler-utils -y # Debian/Ubuntu
brew install tesseract ghostscript poppler # macOS
git clone https://github.com/sbl8/ocrex.git
cd ocrex
pip install .
python -m ocrex.main --input_dir ./pdfs --output_dir ./output
pip install pyinstaller
pyinstaller --onefile --name ocrex ocrex/main.py
sudo mv dist/ocrex /usr/local/bin/
Now you can run:
ocrex --input_dir ./pdfs --output_dir ./output
ocrex --input_dir ./pdfs --output_dir ./output --enable_preprocessing --auto_rotate --deskew --denoise
ocrex --input_dir ./pdfs --output_dir ./output --workers 8
ocrex --input_dir ./pdfs --output_dir ./output --optimize_level 2
Use advanced pre-processing to correct skewed, rotated, or noisy scans:
ocrex --input_dir ./pdfs --output_dir ./output --enable_preprocessing --deskew --denoise --auto_rotate
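Disable optimization (level 0) to leave the page images untouched. Higher levels shrink the output file, and levels 2 and 3 may apply lossy image compression; the optimization level does not change the recognized text: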
ocrex --input_dir ./pdfs --output_dir ./output --optimize_level 0
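Search for key phrases in the OCR output to verify accuracy. Plain grep on a PDF is unreliable because the text layer is usually stored compressed, so extract the text first with pdftotext (part of poppler-utils) and print the files that match:
for f in ./output/*.pdf; do pdftotext "$f" - | grep -q "classified" && echo "$f"; done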
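The same check can be scripted. The sketch below is a minimal example, not part of OCRex: it assumes `pdftotext` from poppler-utils is on the PATH, and the function names `extract_text` and `files_containing` are hypothetical. The extractor is passed in as a parameter so a different one can be swapped in.

```python
import subprocess
from pathlib import Path


def extract_text(pdf_path):
    """Return the text layer of a PDF by running poppler's pdftotext and reading stdout."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def files_containing(pdf_paths, phrase, extractor=extract_text):
    """Return the paths whose extracted text contains phrase (case-insensitive)."""
    needle = phrase.lower()
    return [path for path in pdf_paths if needle in extractor(path).lower()]
```

For example, `files_containing(sorted(Path("./output").glob("*.pdf")), "classified")` lists the output files in which the phrase was recognized.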
Run the following to view all options:
ocrex --help
Option | Description |
---|---|
--input_dir | Path to input PDFs (required). |
--output_dir | Path to save OCR PDFs. |
--workers | Number of parallel processes (default: CPU count). |
--auto_rotate | Automatically rotate misaligned pages. |
--deskew | Enable deskewing (default: on). |
--denoise | Enable denoising (default: on). |
--optimize_level | OCRmyPDF optimization level (0-3, default: 1). |
--pdf_dpi | DPI for PDF-to-image conversion (default: 300). |
--verbose | Enable verbose logging. |
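To make the option table concrete, here is one way such a command line could be declared with Python's `argparse`. It is an illustrative sketch only and may differ from how OCRex defines its options: `build_parser` is a hypothetical name, and the `--no-deskew` / `--no-denoise` negations produced by `argparse.BooleanOptionalAction` (Python 3.9+) are an assumption, since the table does not say how the default-on options are switched off.

```python
import argparse
import os


def build_parser():
    """Build a parser mirroring the documented options and their stated defaults."""
    parser = argparse.ArgumentParser(prog="ocrex")
    parser.add_argument("--input_dir", required=True, help="Path to input PDFs.")
    parser.add_argument("--output_dir", help="Path to save OCR PDFs.")
    parser.add_argument("--workers", type=int, default=os.cpu_count(),
                        help="Number of parallel processes.")
    parser.add_argument("--auto_rotate", action="store_true",
                        help="Automatically rotate misaligned pages.")
    # Assumption: default-on switches get a --no-... form via BooleanOptionalAction.
    parser.add_argument("--deskew", action=argparse.BooleanOptionalAction, default=True)
    parser.add_argument("--denoise", action=argparse.BooleanOptionalAction, default=True)
    parser.add_argument("--optimize_level", type=int, choices=range(4), default=1,
                        help="OCRmyPDF optimization level (0-3).")
    parser.add_argument("--pdf_dpi", type=int, default=300,
                        help="DPI for PDF-to-image conversion.")
    parser.add_argument("--verbose", action="store_true", help="Enable verbose logging.")
    return parser
```

Parsing `["--input_dir", "./pdfs", "--workers", "8"]` with this parser yields `workers == 8` while leaving every other option at its documented default.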
- Tests Directory: All tests are located in the tests/ directory at the project root.
- Continuous Integration: GitHub Actions workflows run tests on every push and pull request.
- Automated Releases: Standalone binaries and source releases are built and tagged automatically via GitHub Actions.
Distributed under the MIT License. See LICENSE.
Contributions, bug reports, and feature requests are welcome. Please open an issue or submit a pull request on GitHub.