
finbrain — AI-Native Transaction Intelligence for Google Pay Statements

finbrain transforms Google Pay statement PDFs into high-fidelity, analytics-ready transaction data.

It combines robust PDF parsing, strict verification, LLM-powered categorization, and semantic QA to turn static statements into a queryable financial intelligence layer.

Problem Statement

Digital wallet statements are usually delivered as PDFs, which are difficult to query, aggregate, or analyze programmatically. Manual categorization and monthly spend analysis are repetitive and error-prone.

Objectives

  • Convert Google Pay statement PDFs into structured, validated transaction data.
  • Provide a backend API for extraction, review, semantic retrieval, and question answering.
  • Support scalable post-processing with asynchronous embedding jobs for semantic finance use cases.

Proposed Solution

finbrain combines deterministic PDF parsing with LLM-powered categorization and semantic QA:

  • Parse statement transactions and validate extracted totals against statement headers.
  • Label transactions into canonical spending categories.
  • Store records in PostgreSQL with extraction-run scoping and durable embedding job queues.
  • Enable semantic question answering over transaction history using vector search + LLM reasoning.

How Extraction Works

Google Pay statement PDFs store text as positioned glyphs (characters with x/y coordinates), not as semantic tables with explicit spaces. In many PDFs, a naive extract_text() call can collapse adjacent words (for example, payee names) because spacing is visual rather than encoded as space characters.

This project uses character-level extraction with pdfplumber:

  • Reads each character with its position.
  • Groups characters into lines by y-coordinate.
  • Reconstructs spaces when the horizontal gap between consecutive characters exceeds a small threshold (about 2 units in PDF coordinate space).

This approach preserves names and transaction fields more reliably for downstream parsing and validation.
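The grouping described above can be sketched in pure Python. The char dicts below mirror the shape of pdfplumber's page.chars entries (text, x0, x1, top); the threshold value, bucket size, and function name are illustrative assumptions, not the project's actual code.

```python
# Illustrative sketch of line grouping and space reconstruction over
# positioned characters, as described above. Values and names are
# assumptions; the real parser may differ.

GAP_THRESHOLD = 2.0   # horizontal gap (PDF units) that implies a space
LINE_BUCKET = 3.0     # bucket size for grouping chars by vertical position


def chars_to_lines(chars):
    """Group positioned characters into text lines, inserting a space
    wherever the horizontal gap between neighbours exceeds the threshold."""
    lines = {}
    for ch in chars:
        # Bucket characters by rounded vertical position (a simple
        # stand-in for tolerance-based line clustering).
        key = round(ch["top"] / LINE_BUCKET)
        lines.setdefault(key, []).append(ch)

    out = []
    for key in sorted(lines):
        row = sorted(lines[key], key=lambda c: c["x0"])
        text = row[0]["text"]
        for prev, cur in zip(row, row[1:]):
            if cur["x0"] - prev["x1"] > GAP_THRESHOLD:
                text += " "
            text += cur["text"]
        out.append(text)
    return out
```

Running this over glyphs whose visual gap exceeds the threshold yields "AB CD" rather than the collapsed "ABCD" a naive extractor would produce.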

FastAPI Service

The project exposes a FastAPI service covering extraction, review, semantic retrieval, and question answering.

  • Endpoint: POST /extract
  • Endpoint: GET /transactions
  • Endpoint: POST /transactions/categories
  • Endpoint: POST /qa
  • Upload field: statement (multipart/form-data)
  • Max file size: 10 MB
  • Behavior: parses uploaded PDF, labels transaction categories using LLM, persists in PostgreSQL

Start API server:

uv run uvicorn main:app --reload

Open docs:

  • Swagger UI: http://127.0.0.1:8000/docs
  • ReDoc: http://127.0.0.1:8000/redoc

Example request:

curl -X POST "http://127.0.0.1:8000/extract" \
  -H "accept: application/json" \
  -F "statement=@data/gpay_statement.pdf;type=application/pdf"

Documentation

Architecture

flowchart TB
  User["User / App"] <--> API["FastAPI Service"]

  subgraph Extraction ["Extraction Pipeline"]
    direction TB
    E["POST /extract"]
    P["PDF Parsing and Verification"]
    L["LLM Category Labeling"]

    subgraph DB1 ["Datastore"]
      direction LR
      XRUN[(extraction_runs)]
      T1[(transactions)]
    end

    E -->|parse + verify| P
    E -->|categorize transactions| L
    E -->|create extraction run| XRUN
    P -->|validated rows| T1
    L -->|category labels| T1
  end

  subgraph Review ["User Review and Embedding Pipeline"]
    direction TB
    C["POST /transactions/categories"]
    W["Background Embedding Worker"]

    subgraph DB2 ["Datastore"]
      direction LR
      J[(embedding_jobs)]
      T2[(transactions)]
    end

    C -->|apply category corrections| T2
    C -->|enqueue embedding jobs| J
    J -->|process pending jobs| W
    W -->|persist embeddings| T2
  end

  subgraph QA ["Semantic Question Answering Pipeline"]
    direction TB
    Q["POST /qa"]
    S["Constraint Extraction and Vector Retrieval"]
    A["LLM Answer Synthesis"]

    subgraph DB3 ["Datastore"]
      T3[(transactions)]
    end

    Q -->|parse question intent| S
    S -->|retrieve semantic matches| T3
    S -->|build answer context| A
  end

  API -->|route| E
  API -->|route| C
  API -->|route| Q

Database Diagram

erDiagram
  EXTRACTION_RUNS ||--o{ TRANSACTIONS : contains
  TRANSACTIONS ||--o| EMBEDDING_JOBS : queues

  EXTRACTION_RUNS {
    uuid id PK
    timestamptz created_at
  }

  TRANSACTIONS {
    uuid id PK
    uuid extraction_id FK
    timestamp transaction_date
    string payee
    string type
    string category
    numeric amount
    string upi_id
    vector embedding
  }

  EMBEDDING_JOBS {
    uuid id PK
    uuid transaction_id FK
    string status
    int attempts
    text last_error
    timestamptz created_at
    timestamptz updated_at
    timestamptz started_at
    timestamptz completed_at
  }

API Sequence

sequenceDiagram
  participant Client
  participant API as FastAPI
  participant DB as PostgreSQL
  participant Worker as Embedding Worker

  Client->>API: POST /extract (statement.pdf)
  API->>DB: Insert extraction_runs row
  API->>DB: Insert transactions (LLM category labels)
  API-->>Client: extraction_id + extracted data

  Client->>API: POST /transactions/categories (extraction_id, updates[])
  API->>DB: Update categories (if provided)
  API->>DB: Enqueue embedding_jobs
  API-->>Client: updated_count, queued_embedding_jobs

  Worker->>DB: Process pending embedding_jobs
  Worker->>DB: Store embeddings on transactions

  Client->>API: POST /qa (question)
  API->>DB: Retrieve constrained semantic matches
  API-->>Client: Answer + useful supporting matches
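The embedding-job lifecycle shown in the diagrams (pending → processing → completed/failed) can be sketched with an in-memory stand-in. The real worker persists these rows in PostgreSQL; field names follow the EMBEDDING_JOBS table, but MAX_ATTEMPTS and the embed() stub are assumptions.

```python
# Minimal in-memory sketch of the durable embedding-job queue described
# above. MAX_ATTEMPTS and embed() are illustrative assumptions.

MAX_ATTEMPTS = 3


def embed(payee: str) -> list[float]:
    # Stand-in for a real embedding call (the project stores 768-dim vectors).
    return [float(len(payee))]


def process_pending(jobs: list[dict], transactions: dict) -> None:
    """Claim each pending job, attach an embedding to its transaction,
    and record success or a retryable failure."""
    for job in jobs:
        if job["status"] != "pending":
            continue
        job["status"] = "processing"
        job["attempts"] += 1
        try:
            txn = transactions[job["transaction_id"]]
            txn["embedding"] = embed(txn["payee"])
            job["status"] = "completed"
        except Exception as exc:
            job["last_error"] = str(exc)
            # Retry later unless the attempt budget is exhausted.
            job["status"] = "pending" if job["attempts"] < MAX_ATTEMPTS else "failed"
```

Keeping status and attempts on the job row is what makes the queue durable: a crashed worker leaves enough state behind for the next run to resume or give up cleanly.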

Features

  • Extracts transactions into JSON (date, payee, amount, upiTransactionId)
  • Adds type as credit/debit (a transaction is marked credit when its source label contains "Received from")
  • Adds semantic category labels using a canonical category set via LLM
  • Captures statement period (start, end, totalSent, totalReceived) when present
  • Verifies UPI ID presence and uniqueness
  • Reconciles extracted totals with statement header totals
  • Supports bulk user category verification/update and asynchronous embedding generation
  • Stores 768-dim embeddings in PostgreSQL (pgvector) for semantic QA retrieval
  • Returns exit code 2 when extraction succeeds but verification fails
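To illustrate what the pgvector-backed retrieval does, here is the same nearest-neighbour ranking in pure Python. Real queries run inside PostgreSQL using pgvector's cosine-distance operator; the toy vectors and top_k parameter below are illustrative only.

```python
# Pure-Python illustration of cosine-distance ranking, the operation
# pgvector performs over the 768-dim embedding column during semantic QA.
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity, matching pgvector's cosine-distance semantics."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


def semantic_matches(query_vec, rows, top_k=3):
    """Rank transaction rows by cosine distance to the query embedding."""
    ranked = sorted(rows, key=lambda r: cosine_distance(query_vec, r["embedding"]))
    return ranked[:top_k]
```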

Requirements

  • Python 3.12+, managed with uv (recommended) or pip
  • Project dependency: pdfplumber (installed via project setup)
  • PostgreSQL connection string via DATABASE_URL
  • Google API key via GOOGLE_API_KEY (Gemini categorization, embeddings, QA)
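Both variables must be present at startup. A hedged sketch of fail-fast settings loading (the project's actual config.py may differ; the function name and error message are illustrative):

```python
# Read the two required settings from the environment, failing fast when
# one is missing. Names follow the Requirements list above.
import os


def load_settings() -> dict:
    """Return required configuration, raising if a variable is unset."""
    settings = {}
    for name in ("DATABASE_URL", "GOOGLE_API_KEY"):
        value = os.environ.get(name)
        if not value:
            raise RuntimeError(f"Missing required environment variable: {name}")
        settings[name] = value
    return settings
```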

Getting the Statement PDF (Google Pay)

Tested against Google Pay Android version 317.1.1.

Export statement PDF

  1. Open Google Pay.
  2. Go to See Transaction History.
  3. Tap See All next to Transaction History.
  4. In the top search bar area, open the menu (...) and choose Get Statement.
  5. Select Last Six Months and continue to download.

Check app version

  1. Open Google Pay.
  2. Tap your profile photo (bottom-right).
  3. Go to Settings.
  4. Open About to view the app version.

Quick Start (uv recommended)

From the project root (env vars shown with Windows cmd syntax; use export on macOS/Linux):

uv sync
set DATABASE_URL=postgresql+psycopg://postgres:postgres@localhost:5432/finbrain
set GOOGLE_API_KEY=<your_google_api_key>
uv run alembic upgrade head
uv run extract_gpay.py data\gpay_statement.pdf

Expected output:

  • JSON file at output/gpay_statement.json (default path pattern: output/<pdf_name>.json)
  • Verification report printed to terminal (unless quiet mode is used)

Installation

Option 1: uv (recommended)

uv sync

Option 2: pip + virtual environment

python -m venv .venv
.venv\Scripts\activate
pip install -e .

Docker (Cloud Run)

The included multi-stage Dockerfile is optimized for Google Cloud Run deployment.

Build image

docker build -t finbrain:latest .

Run locally

docker run --rm -p 8080:8080 --env-file .env finbrain:latest

Deploy to Google Cloud Run

gcloud builds submit --tag gcr.io/<PROJECT_ID>/finbrain:latest
gcloud run deploy finbrain \
  --image gcr.io/<PROJECT_ID>/finbrain:latest \
  --platform managed \
  --region <REGION> \
  --allow-unauthenticated

Required runtime env vars on Cloud Run:

  • DATABASE_URL
  • GOOGLE_API_KEY

Usage

API flow for semantic QA

  1. Upload statement to POST /extract (response includes categories per transaction and extraction_id).
  2. Verify/correct categories with POST /transactions/categories using extraction_id.
  3. Embeddings are generated asynchronously and stored in DB (vector(768)) without blocking response.
  4. Ask semantic questions with POST /qa.
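Step 2 can be sketched as payload construction. The exact field names accepted by POST /transactions/categories are assumptions inferred from the flow above (extraction_id plus a list of updates), not a confirmed schema.

```python
# Build the review payload for POST /transactions/categories.
# Field names are assumptions based on the documented flow.

def build_category_update(extraction_id: str, corrections: dict[str, str]) -> dict:
    """Map {upi_transaction_id: corrected_category} into a request body."""
    return {
        "extraction_id": extraction_id,
        "updates": [
            {"upiTransactionId": upi_id, "category": category}
            for upi_id, category in corrections.items()
        ],
    }
```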

Direct script

uv run extract_gpay.py data\gpay_statement.pdf
uv run extract_gpay.py data\gpay_statement.pdf --output output\custom.json
uv run extract_gpay.py data\gpay_statement.pdf --quiet

Windows PowerShell wrapper

.\scripts\extract-gpay.ps1 data\gpay_statement.pdf
.\scripts\extract-gpay.ps1 data\gpay_statement.pdf -Output output\custom.json
.\scripts\extract-gpay.ps1 data\gpay_statement.pdf -Quiet

Bash wrapper (Git Bash / WSL)

bash ./scripts/extract-gpay.sh data/gpay_statement.pdf
bash ./scripts/extract-gpay.sh data/gpay_statement.pdf --output output/custom.json
bash ./scripts/extract-gpay.sh data/gpay_statement.pdf --quiet

CLI Reference

extract_gpay.py <pdf_path> [--output <json_path>] [--quiet]

Arguments:

  • pdf_path (required): path to the Google Pay PDF statement
  • --output, -o: output JSON file path (default: output/<pdf_name>.json)
  • --quiet, -q: suppress verification output

For complete CLI documentation (commands, exit codes, troubleshooting), see docs/cli.md.
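The CLI surface above maps directly onto argparse; a minimal sketch (the project's actual extract_gpay.py may structure this differently):

```python
# Minimal argparse sketch matching the documented CLI reference.
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="extract_gpay.py")
    parser.add_argument("pdf_path", type=Path,
                        help="path to the Google Pay PDF statement")
    parser.add_argument("--output", "-o", type=Path, default=None,
                        help="output JSON path (default: output/<pdf_name>.json)")
    parser.add_argument("--quiet", "-q", action="store_true",
                        help="suppress verification output")
    return parser
```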

Output Format

Generated JSON shape:

{
  "transactionPeriod": {
    "start": "01 August 2025",
    "end": "31 January 2026",
    "totalSent": 782334.17,
    "totalReceived": 101.0
  },
  "totalTransactions": 123,
  "transactions": [
    {
      "date": "01 Aug, 2025 06:20 PM",
      "payee": "NAVEEN KUMAR S",
      "type": "debit",
      "category": "rent",
      "amount": 630.0,
      "upiTransactionId": "521314926792"
    }
  ]
}

Notes:

  • transactionPeriod is an empty object if period metadata is not detected in the PDF
  • amount values are numeric (float)
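Given that shape, downstream analysis is straightforward. A small sketch that sums debit amounts per category from a loaded output file (the helper name is illustrative):

```python
# Aggregate spend per category from the documented JSON output shape.
from collections import defaultdict


def spend_by_category(doc: dict) -> dict[str, float]:
    """Sum debit amounts for each category in an extraction output."""
    totals: dict[str, float] = defaultdict(float)
    for txn in doc["transactions"]:
        if txn["type"] == "debit":
            totals[txn["category"]] += txn["amount"]
    return dict(totals)
```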

Verification and Exit Codes

When not using --quiet, the script applies three verification checks:

  • Totals reconciliation: compares extracted sent and received sums against the corresponding totals printed in the statement header. A paisa-level match across many rows is a strong signal that extraction is complete and not double-counted.
  • UPI ID uniqueness: ensures each transaction has a unique UPI Transaction ID.
  • Completeness: confirms every extracted transaction includes a UPI Transaction ID.

Exit codes:

  • 0: extraction succeeded (and verification passed, if enabled)
  • 1: invalid input (missing file, wrong extension, etc.)
  • 2: extraction succeeded but verification failed
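The three checks and the pass/fail exit codes above can be sketched as follows. Rounding to two decimals stands in for the paisa-level totals comparison; the function name and the debit-only reconciliation shown here are illustrative simplifications.

```python
# Sketch of the verification checks described above, mapped to exit codes.

def verify(doc: dict) -> int:
    """Return 0 when all checks pass, 2 when verification fails."""
    txns = doc["transactions"]
    upi_ids = [t.get("upiTransactionId") for t in txns]

    # Completeness: every row carries a UPI Transaction ID.
    complete = all(upi_ids)
    # Uniqueness: no UPI Transaction ID appears twice.
    unique = len(set(upi_ids)) == len(upi_ids)
    # Totals reconciliation against the statement header, at paisa precision.
    period = doc.get("transactionPeriod", {})
    sent = round(sum(t["amount"] for t in txns if t["type"] == "debit"), 2)
    totals_ok = ("totalSent" not in period) or sent == round(period["totalSent"], 2)

    return 0 if (complete and unique and totals_ok) else 2
```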

Project Structure

finbrain
├─ alembic/
│  ├─ versions/
│  ├─ env.py
│  └─ script.py.mako
├─ alembic.ini
├─ categories.py
├─ config.py
├─ db/
│  ├─ base.py
│  └─ session.py
├─ extract_gpay.py
├─ main.py
├─ models/
│  ├─ embedding_job.py
│  ├─ extraction_run.py
│  └─ transaction.py
├─ services/
│  ├─ categorization.py
│  ├─ embeddings.py
│  └─ semantic_qa.py
├─ pyproject.toml
├─ schemas.py
├─ docs/
├─ output/
└─ scripts/
   ├─ extract-gpay.ps1
   └─ extract-gpay.sh

Troubleshooting

  • No transactions extracted: confirm the PDF is a Google Pay statement format supported by this parser.
  • Verification mismatches: review top-ups/received entries in output and compare with statement period totals.
  • Duplicate imports: /extract skips duplicate rows for the same extraction run (extraction_id + upi_id).
  • LLM errors: ensure GOOGLE_API_KEY is configured for category labeling, embeddings, and QA.
  • Dependency errors: re-run uv sync (or reinstall via pip environment setup).

Author

Sujeet Gund

GitHub: github.com/sujeetgund

License

This project is licensed under the MIT License.

