Skip to content
/ ocrex Public

OCRex is a high-accuracy, non-destructive bulk PDF OCR processing tool built on OCRmyPDF

License

Notifications You must be signed in to change notification settings

sbl8/ocrex

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

OCRex

OCRex Logo

Release License: MIT

OCRex is a high-accuracy, non-destructive bulk PDF OCR processing tool built on OCRmyPDF. Designed for large-scale document analysis, it enhances low-quality declassified PDFs with advanced pre-processing such as automatic rotation, deskewing, and denoising. The tool leverages multiprocessing and robust error handling to ensure high performance even on low-end machines.


Features

  • Batch OCR Processing – Processes hundreds of PDFs concurrently.
  • Non-Destructive – Preserves original PDFs by generating new output files.
  • Advanced Pre-Processing – Automatically rotates, deskews, and denoises scanned pages.
  • Multiprocessing – Optimized for high-speed processing using available CPU cores.
  • Full OCRmyPDF Integration – Utilizes all features of OCRmyPDF for lossless text recognition.
  • Cross-Platform – Works on Linux, macOS, and Windows.

Installation

1. Install Dependencies

pip install -r requirements.txt

For system dependencies:

sudo apt install tesseract-ocr ghostscript poppler-utils -y  # Debian/Ubuntu
brew install tesseract ghostscript poppler                   # macOS

2. Install from Source

git clone https://github.com/sbl8/ocrex.git
cd ocrex
python setup.py install

3. Running from Source

python -m ocrex.main --input_dir ./pdfs --output_dir ./output

Building a Standalone Binary

pip install pyinstaller
pyinstaller --onefile --name ocrex ocrex/main.py
mv dist/ocrex /usr/local/bin/

Now you can run:

ocrex --input_dir ./pdfs --output_dir ./output

Usage Examples

Standard OCR Processing

ocrex --input_dir ./pdfs --output_dir ./output

Correcting Skewed or Rotated Pages

ocrex --input_dir ./pdfs --output_dir ./output --enable_preprocessing --auto_rotate --deskew --denoise

Optimize Speed (More CPU Cores)

ocrex --input_dir ./pdfs --output_dir ./output --workers 8

Optimize File Size

ocrex --input_dir ./pdfs --output_dir ./output --optimize_level 2

Workflow for Declassified Documents

Step 1: Pre-Processing for Low-Quality Scans

Use advanced pre-processing to correct skewed, rotated, or noisy scans:

ocrex --input_dir ./pdfs --output_dir ./output --enable_preprocessing --deskew --denoise --auto_rotate

Step 2: OCR Processing with Enhanced Text Retention

Run OCR with a lower optimization level to preserve as much text as possible:

ocrex --input_dir ./pdfs --output_dir ./output --optimize_level 0

Step 3: Verifying OCR Output

Search for key phrases in the OCR output to verify accuracy:

grep "classified" ./output/*.pdf

Command-Line Options

Run the following to view all options:

ocrex --help
Option Description
--input_dir Path to input PDFs (required).
--output_dir Path to save OCR PDFs.
--workers Number of parallel processes (default: CPU count).
--auto_rotate Automatically rotate misaligned pages.
--deskew Enable deskewing (default: on).
--denoise Enable denoising (default: on).
--optimize_level OCRmyPDF optimization level (0-3, default: 1).
--pdf_dpi DPI for PDF-to-image conversion (default: 300).
--verbose Enable verbose logging.

Automated Testing & CI/CD

  • Tests Directory: All tests are located in the tests/ directory at the project root.
  • Continuous Integration: GitHub Actions workflows run tests on every push and pull request.
  • Automated Releases: Standalone binaries and source releases are built and tagged automatically via GitHub Actions.

License

Distributed under the MIT License. See LICENSE.


Contributing

Contributions, bug reports, and feature requests are welcome. Please open an issue or submit a pull request on GitHub.

About

OCRex is a high-accuracy, non-destructive bulk PDF OCR processing tool built on OCRmyPDF

Resources

License

Stars

Watchers

Forks

Languages