StatementConverter
Testing Engine — Geometric Extraction

Zero Logic.
Pure Geometry.

Doesn't calculate balances. Doesn't categorise transactions. Whatever is printed on the PDF is exactly what appears in Excel. No more, no less.

Visual Grid Mapper

Detects the physical table structure using coordinate geometry — bordered (ruled lines) or borderless (whitespace rivers). Builds a bounding-box grid over the page.
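
As a rough illustration of the borderless case, the sketch below finds whitespace rivers along one axis and turns the resulting edges into a bounding-box grid. The names `find_gaps`, `build_grid`, and the `min_gap` threshold are hypothetical and chosen for this example only; the engine's actual implementation may differ.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def find_gaps(spans: List[Tuple[float, float]], min_gap: float) -> List[float]:
    """Return the midpoint of every empty stretch ("river") between occupied spans.

    `spans` are (start, end) extents of text along one axis, e.g. the x0/x1
    of each word. `min_gap` is an assumed tuning parameter.
    """
    boundaries: List[float] = []
    covered_to = None  # furthest coordinate already covered by text
    for start, end in sorted(spans):
        if covered_to is not None and start - covered_to >= min_gap:
            boundaries.append((covered_to + start) / 2)
        covered_to = end if covered_to is None else max(covered_to, end)
    return boundaries


def build_grid(x_edges: List[float], y_edges: List[float]) -> List[List[Box]]:
    """Turn sorted column and row edges into rows of cell bounding boxes."""
    return [
        [(x0, top, x1, bottom) for x0, x1 in zip(x_edges, x_edges[1:])]
        for top, bottom in zip(y_edges, y_edges[1:])
    ]
```

Running `find_gaps` once on horizontal extents and once on vertical extents gives the inner column and row boundaries; adding the page or table edges on either side yields the full edge lists for `build_grid`.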

Dumb Text Extractor

Iterates every bounding box left-to-right, top-to-bottom and lifts the raw string within that exact rectangle. Multi-line cells → joined by a space. Empty cells → empty string.
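
A minimal sketch of that loop, assuming words arrive as dictionaries with `text`, `x0`, `x1`, `top`, and `bottom` keys (the shape pdfplumber's `extract_words()` returns). The helpers `cell_text` and `extract_grid` are hypothetical names, and assigning a word to a cell by its centre point is an assumption made for this example.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def cell_text(words: List[Dict], box: Box) -> str:
    """Lift the raw text whose centre point falls inside `box`."""
    x0, top, x1, bottom = box
    inside = []
    for w in words:
        cx = (w["x0"] + w["x1"]) / 2
        cy = (w["top"] + w["bottom"]) / 2
        if x0 <= cx < x1 and top <= cy < bottom:
            inside.append(w)
    # Reading order: line by line (top), then left to right (x0).
    inside.sort(key=lambda w: (round(w["top"]), w["x0"]))
    # Multi-line cells are joined by a space; an empty cell yields "".
    return " ".join(w["text"] for w in inside)


def extract_grid(words: List[Dict], grid: List[List[Box]]) -> List[List[str]]:
    """Visit every box top-to-bottom, left-to-right; no value is interpreted."""
    return [[cell_text(words, box) for box in row] for row in grid]
```

Because every box produces exactly one string, an empty cell stays in its column and nothing shifts sideways.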

100% Local — No APIs

Runs entirely on your machine using pdfplumber, camelot-py, OpenCV, and local Tesseract OCR. Your PDF never leaves your computer.

Strict Guarantees

  • Zero data manipulation — raw strings only
  • Empty cell = empty cell (no shifting)
  • Multi-page tables stitched seamlessly
  • Duplicate page headers removed automatically
  • Multi-line cell text joined with a space
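
The stitching and header guarantees could look roughly like the sketch below. `stitch_pages` is a hypothetical helper; it assumes the first row of the first non-empty page is the header and that a repeated header matches it exactly, which may be stricter than what the engine does.

```python
from typing import List, Optional

Row = List[str]


def stitch_pages(pages: List[List[Row]]) -> List[Row]:
    """Concatenate per-page rows, keeping the header row only once."""
    stitched: List[Row] = []
    header: Optional[Row] = None
    for rows in pages:
        if not rows:
            continue  # page without table rows
        if header is None:
            header = rows[0]  # assumption: first row seen is the header
            stitched.extend(rows)
        elif rows[0] == header:
            stitched.extend(rows[1:])  # drop the repeated page header
        else:
            stitched.extend(rows)  # continuation page without a header
    return stitched
```

Rows are compared as raw strings and copied unchanged, so empty cells remain empty strings in their original positions.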

Detection Strategy Cascade

  1. camelot lattice

    Detects explicit ruled lines (horizontal + vertical) — best for bank statements with visible grid lines.

  2. camelot stream

    Analyses whitespace gaps to infer column boundaries — handles clean borderless tables.

  3. OpenCV HoughLines

    Morphological image processing on a rasterised page — catches lines embedded as images.

  4. Whitespace histogram

    Pure coordinate math: x/y histogram gap analysis on character positions — no rasterisation needed.

  5. Local Tesseract OCR

    For scanned / image-only pages — runs entirely on your machine via pytesseract.
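
The cascade itself can be pictured as an ordered list of detectors tried one after another. The sketch below is an assumption about the control flow, not the engine's code: `run_cascade` and the strategy callables are hypothetical, and "first non-empty result wins" is a guess at the fall-through rule.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Table = List[List[str]]
Strategy = Tuple[str, Callable[[str], Optional[Table]]]


def run_cascade(pdf_path: str, strategies: Sequence[Strategy]) -> Tuple[str, Table]:
    """Try each detection strategy in order; return the first non-empty table."""
    for name, detect in strategies:
        try:
            table = detect(pdf_path)
        except Exception:
            continue  # a failing strategy hands over to the next one
        if table:
            return name, table
    raise ValueError(f"no strategy found a table in {pdf_path}")
```

In practice the first two callables might wrap `camelot.read_pdf(path, flavor="lattice")` and `camelot.read_pdf(path, flavor="stream")`, with the OpenCV, histogram, and Tesseract detectors following in the order listed above.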

Drop your PDF here

or click to browse — PDF files only, up to 100 MB

First-Time Setup

1. Install system dependencies

# Windows (recommended: Chocolatey or manual installers)
choco install tesseract poppler ghostscript

# macOS
brew install tesseract poppler ghostscript

# Ubuntu / Debian
sudo apt install tesseract-ocr poppler-utils ghostscript

2. Install Python dependencies & start the server

cd python-engine
pip install -r requirements.txt
python run_server.py

3. Alternatively — use the CLI directly (no server needed)

python run_cli.py extract --file MyBank_Statement.pdf --output-dir ./output