Zero Logic.
Pure Geometry.
Doesn't calculate balances. Doesn't categorise transactions. Whatever is printed on the PDF is exactly what appears in Excel. No more, no less.
Visual Grid Mapper
Detects the physical table structure using coordinate geometry, whether the table is bordered (ruled lines) or borderless (whitespace rivers), then builds a bounding-box grid over the page.
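For the borderless case, the whitespace rivers can be found with plain interval arithmetic. The sketch below is only an illustration, not the tool's actual code: `find_rivers`, `build_grid`, and the `min_gap` threshold are hypothetical names, and the input is assumed to be the horizontal spans of words already read from the page.

```python
from typing import List, Tuple

Span = Tuple[float, float]                # (start, end) along one axis
Box = Tuple[float, float, float, float]   # (x0, top, x1, bottom)


def find_rivers(spans: List[Span], min_gap: float) -> List[float]:
    """Return the midpoint of every empty gap at least min_gap wide."""
    cuts: List[float] = []
    covered_end = None  # right edge of the text seen so far
    for start, end in sorted(spans):
        if covered_end is not None and start - covered_end >= min_gap:
            cuts.append((covered_end + start) / 2)
        covered_end = end if covered_end is None else max(covered_end, end)
    return cuts


def build_grid(x_edges: List[float], y_edges: List[float]) -> List[List[Box]]:
    """Turn column and row edges into rows of cell bounding boxes."""
    xs, ys = sorted(x_edges), sorted(y_edges)
    return [
        [(x0, top, x1, bottom) for x0, x1 in zip(xs, xs[1:])]
        for top, bottom in zip(ys, ys[1:])
    ]


# Hypothetical word spans forming three columns on a 160-unit-wide page.
word_spans = [(10, 40), (12, 38), (60, 95), (62, 90), (120, 150)]
cuts = find_rivers(word_spans, min_gap=10)          # [50.0, 107.5]
grid = build_grid([0] + cuts + [160], [0, 20, 40])  # 2 rows x 3 columns
```

The same function can be run on vertical spans to find row boundaries. For bordered tables the edges would come from the ruled lines instead, and only `build_grid` would be needed.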
Dumb Text Extractor
Iterates every bounding box left-to-right, top-to-bottom and lifts the raw string within that exact rectangle. Multi-line cells → joined by a space. Empty cells → empty string.
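The extraction rule above can be sketched in a few lines. This is an illustration under stated assumptions rather than the tool's real implementation: `cell_text` and `extract_rows` are hypothetical helpers, words are plain dicts with `text`, `x0`, `x1`, `top`, and `bottom` keys (the same key names pdfplumber uses for extracted words), and a word belongs to the cell that contains its centre point.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def cell_text(words: List[Dict], box: Box) -> str:
    """Return the raw text inside one cell; an empty cell gives ''."""
    x0, top, x1, bottom = box
    inside = [
        w for w in words
        if x0 <= (w["x0"] + w["x1"]) / 2 < x1
        and top <= (w["top"] + w["bottom"]) / 2 < bottom
    ]
    # Reading order: top-to-bottom, then left-to-right. This assumes words
    # on the same text line share the same "top" value.
    inside.sort(key=lambda w: (w["top"], w["x0"]))
    # Lines within a cell are joined by a single space, like the words.
    return " ".join(w["text"] for w in inside)


def extract_rows(words: List[Dict], grid: List[List[Box]]) -> List[List[str]]:
    """Walk the grid row by row, left to right, lifting each cell's text."""
    return [[cell_text(words, box) for box in row] for row in grid]
```

Because every cell in the grid yields exactly one string, an empty cell stays in its own column and nothing shifts left to fill it.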
100% Local — No APIs
Runs entirely on your machine using pdfplumber, camelot-py, OpenCV, and local Tesseract OCR. Your PDF never leaves your computer.
Strict Guarantees
- Zero data manipulation — raw strings only
- Empty cell = empty cell (no shifting)
- Multi-page tables stitched seamlessly
- Duplicate page headers removed automatically
- Multi-line cell text joined with a space
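The stitching and header-removal guarantees could look roughly like the sketch below. It assumes, purely for illustration, that the first row of the first non-empty page is the header and that a repeat must match it exactly; `stitch_pages` is a hypothetical name, not a function from the tool.

```python
from typing import List, Optional

Row = List[str]


def stitch_pages(pages: List[List[Row]]) -> List[Row]:
    """Concatenate per-page rows, keeping the header row only once."""
    stitched: List[Row] = []
    header: Optional[Row] = None
    for rows in pages:
        if not rows:
            continue  # a page with no table rows contributes nothing
        if header is None:
            header = rows[0]
            stitched.extend(rows)
        elif rows[0] == header:
            stitched.extend(rows[1:])  # drop the repeated header
        else:
            stitched.extend(rows)      # continuation page without a header
    return stitched
```

Only the first row of each page is compared against the header, so a data row that happens to equal the header elsewhere on a page is left untouched.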
Detection Strategy Cascade
1. camelot lattice
Detects explicit ruled lines (horizontal + vertical); best for bank statements with visible grid lines.
2. camelot stream
Analyses whitespace gaps to infer column boundaries; handles clean borderless tables.
3. OpenCV HoughLines
Morphological image processing on a rasterised page; catches lines embedded as images.
4. Whitespace histogram
Pure coordinate math: x/y histogram gap analysis on character positions, with no rasterisation needed.
5. Local Tesseract OCR
For scanned / image-only pages; runs entirely on your machine via pytesseract.
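One way to picture the cascade is as an ordered list of detectors tried in turn. The sketch below assumes that the first strategy returning a non-empty table wins and that a failing strategy simply falls through to the next; the source does not state the exact stopping rule, so treat both as assumptions. `run_cascade` and the stub detectors are hypothetical stand-ins for the five real strategies.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Table = List[List[str]]
Strategy = Tuple[str, Callable[[str], Optional[Table]]]


def run_cascade(pdf_path: str, strategies: Sequence[Strategy]) -> Tuple[Optional[str], Table]:
    """Try each strategy in order; return the first non-empty result."""
    for name, detect in strategies:
        try:
            table = detect(pdf_path)
        except Exception:
            # Sketch only: any failure moves on to the next strategy.
            continue
        if table:
            return name, table
    return None, []  # nothing detected by any strategy


# Stub detectors standing in for the real ones (hypothetical).
def lattice_stub(path: str) -> Optional[Table]:
    return None  # no ruled lines found


def stream_stub(path: str) -> Optional[Table]:
    return [["Date", "Amount"], ["01/02", "10.00"]]


winner, table = run_cascade(
    "statement.pdf",
    [("camelot lattice", lattice_stub), ("camelot stream", stream_stub)],
)
```

Returning the name of the winning strategy alongside the table makes it easy to log which detector handled a given page.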
Geometric Extraction
Zero business logic — pure spatial analysis
Drop your PDF here
or click to browse — PDF files only, up to 100 MB
First-Time Setup
1. Install system dependencies
# Windows (recommended: Chocolatey or manual installers)
choco install tesseract poppler ghostscript
# macOS
brew install tesseract poppler ghostscript
# Ubuntu / Debian
sudo apt install tesseract-ocr poppler-utils ghostscript
2. Install Python dependencies & start the server
cd python-engine
pip install -r requirements.txt
python run_server.py
3. Alternatively — use the CLI directly (no server needed)
python run_cli.py extract --file MyBank_Statement.pdf --output-dir ./output