Column Definer — Geometric Extraction

Zero Logic.
Pure Geometry.

It doesn't calculate balances or categorise transactions. Whatever is printed on the PDF is exactly what appears in Excel. No more, no less.

Visual Grid Mapper

Detects the physical table structure using coordinate geometry — bordered (ruled lines) or borderless (whitespace rivers). Builds a bounding-box grid over the page.
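As a rough illustration of the bounding-box grid, the sketch below turns a set of column and row boundaries into a row-major grid of cell rectangles. The function name `build_grid` and the `(x0, top, x1, bottom)` tuple layout are assumptions made for this example, not the tool's actual internals.

```python
from typing import List, Tuple

# Hypothetical cell representation: (x0, top, x1, bottom) in page coordinates.
BBox = Tuple[float, float, float, float]


def build_grid(x_edges: List[float], y_edges: List[float]) -> List[List[BBox]]:
    """Turn column and row boundaries into a row-major grid of cell boxes."""
    xs = sorted(set(x_edges))  # vertical boundaries, left to right
    ys = sorted(set(y_edges))  # horizontal boundaries, top to bottom
    return [
        [(x0, y0, x1, y1) for x0, x1 in zip(xs, xs[1:])]
        for y0, y1 in zip(ys, ys[1:])
    ]


# Four vertical boundaries and three horizontal ones give 2 rows of 3 cells.
grid = build_grid([50, 120, 400, 560], [100, 120, 140])
```

The same grid shape works whether the boundaries came from ruled lines or from whitespace gaps, which is why the grid is kept separate from the detection step in this sketch.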

Dumb Text Extractor

Iterates every bounding box left-to-right, top-to-bottom and lifts the raw string within that exact rectangle. Multi-line cells → joined by a space. Empty cells → empty string.
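A minimal sketch of that extraction loop follows. It assumes words arrive as dictionaries with `text`, `x0`, `x1`, `top` and `bottom` keys (the shape pdfplumber's `extract_words()` produces) and that a word belongs to the cell containing its centre point. The helper names `cell_text` and `extract_rows` are hypothetical.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def cell_text(words: List[Dict], box: BBox) -> str:
    """Join every word whose centre falls inside the box, in reading order."""
    x0, top, x1, bottom = box
    inside = [
        w for w in words
        if x0 <= (w["x0"] + w["x1"]) / 2 < x1
        and top <= (w["top"] + w["bottom"]) / 2 < bottom
    ]
    # Reading order: top to bottom, then left to right within a line.
    inside.sort(key=lambda w: (round(w["top"]), w["x0"]))
    # Multi-line cells are joined by a space; an empty cell yields "".
    return " ".join(w["text"] for w in inside)


def extract_rows(words: List[Dict], grid: List[List[BBox]]) -> List[List[str]]:
    """Visit cells left-to-right, top-to-bottom and lift the raw strings."""
    return [[cell_text(words, box) for box in row] for row in grid]


words = [
    {"text": "01/02", "x0": 52, "x1": 80, "top": 102, "bottom": 112},
    {"text": "CARD", "x0": 125, "x1": 160, "top": 102, "bottom": 110},
    {"text": "PAYMENT", "x0": 125, "x1": 180, "top": 111, "bottom": 119},
]
grid = [[(50, 100, 120, 120), (120, 100, 400, 120), (400, 100, 560, 120)]]
rows = extract_rows(words, grid)  # [["01/02", "CARD PAYMENT", ""]]
```

Nothing in the loop interprets the text: strings are copied as-is, and a cell with no words stays an empty string rather than pulling content from a neighbour.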

100% Local — No APIs

Runs entirely on your machine using pdfplumber, camelot-py, OpenCV, and local Tesseract OCR. Your PDF never leaves your computer.

Strict Guarantees

Zero data manipulation — raw strings only
Empty cell = empty cell (no shifting)
Multi-page tables stitched seamlessly
Duplicate page headers removed automatically
Multi-line cell text joined with a space
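The stitching and header guarantees could be implemented along these lines. This is a sketch under two assumptions not stated above: the header is the first row of the first non-empty page, and a repeated header is an exact match at the top of a later page. `stitch_pages` is a hypothetical name.

```python
from typing import List, Optional

Row = List[str]


def stitch_pages(pages: List[List[Row]]) -> List[Row]:
    """Concatenate per-page rows, dropping repeats of the first header row."""
    stitched: List[Row] = []
    header: Optional[Row] = None
    for rows in pages:
        if not rows:
            continue  # a page with no table rows contributes nothing
        if header is None:
            header = rows[0]
            stitched.extend(rows)
        elif rows[0] == header:
            stitched.extend(rows[1:])  # skip the duplicate page header
        else:
            stitched.extend(rows)
    return stitched


page1 = [["Date", "Details", "Amount"], ["01/02", "CARD PAYMENT", "12.50"]]
page2 = [["Date", "Details", "Amount"], ["02/02", "", "3.00"]]
table = stitch_pages([page1, page2])
```

Note that the empty `Details` cell on the second page stays in its column, in line with the "no shifting" guarantee.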

Detection Strategy Cascade

1. camelot lattice
Detects explicit ruled lines (horizontal and vertical). Best for bank statements with visible grid lines.
2. camelot stream
Analyses whitespace gaps to infer column boundaries. Handles clean borderless tables.
3. OpenCV HoughLines
Morphological image processing on a rasterised page. Catches lines embedded as images.
4. Whitespace histogram
Pure coordinate math: x/y histogram gap analysis on character positions. No rasterisation needed.
5. Local Tesseract OCR
For scanned / image-only pages. Runs entirely on your machine via pytesseract.
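The sketch below illustrates two ideas from the list. First, it assumes the cascade tries each strategy in order and stops at the first one that yields a result; the page does not spell out the fall-through rule, so treat that as an assumption. Second, it shows a simplified stand-in for the whitespace analysis of strategy 4, merging horizontal text spans and reporting the gaps between them rather than building a literal histogram. `find_column_gaps`, `run_cascade` and the `min_gap` threshold are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Span = Tuple[float, float]  # horizontal extent (x0, x1) of one word or character


def find_column_gaps(spans: Sequence[Span], min_gap: float = 8.0) -> List[Span]:
    """Return whitespace intervals at least min_gap wide between text spans."""
    gaps: List[Span] = []
    covered_to: Optional[float] = None  # right edge of the text seen so far
    for x0, x1 in sorted(spans):
        if covered_to is not None and x0 - covered_to >= min_gap:
            gaps.append((covered_to, x0))
        covered_to = x1 if covered_to is None else max(covered_to, x1)
    return gaps


def run_cascade(strategies: Sequence[Tuple[str, Callable[[], list]]]):
    """Try each (name, detector) in order; return the first non-empty result."""
    for name, detect in strategies:
        try:
            result = detect()
        except Exception:
            result = []  # a failing strategy simply falls through to the next
        if result:
            return name, result
    return None, []


spans = [(50, 80), (52, 78), (125, 180), (130, 160), (400, 450)]
winner, gaps = run_cascade([
    ("lattice", lambda: []),  # stand-in: no ruled lines found
    ("whitespace", lambda: find_column_gaps(spans)),
])
```

In a real pipeline each detector would wrap the corresponding library call; here they are plain callables so the ordering logic can be read on its own.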

First-Time Setup

1. Install system dependencies

# Windows (recommended: Chocolatey or manual installers)
choco install tesseract poppler ghostscript

# macOS
brew install tesseract poppler ghostscript

# Ubuntu / Debian
sudo apt install tesseract-ocr poppler-utils ghostscript
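
Once the packages are installed, a quick way to confirm the executables are reachable is to look them up on `PATH`. The sketch below uses only the standard library. The binary names are assumptions: `pdftoppm` is one of the tools shipped with poppler, and Ghostscript is usually `gs` on macOS and Linux but typically `gswin64c` on Windows.

```python
import shutil
from typing import Dict, Optional, Sequence


def find_binaries(
    names: Sequence[str] = ("tesseract", "pdftoppm", "gs"),
) -> Dict[str, Optional[str]]:
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_binaries().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```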

2. Install Python dependencies & start the server

cd python-engine
pip install -r requirements.txt
python run_server.py

3. Alternatively — use the CLI directly (no server needed)

python run_cli.py extract --file MyBank_Statement.pdf --output-dir ./output

Other Extraction Nodes

Universal Parser

Region Extractor

Scan Extractor
