
finbrain — AI-Native Transaction Intelligence for Google Pay Statements

finbrain transforms Google Pay statement PDFs into high-fidelity, analytics-ready transaction data.

It combines robust PDF parsing, strict verification, LLM-powered categorization, and semantic QA to turn static statements into a queryable financial intelligence layer.

Problem Statement

Digital wallet statements are usually delivered as PDFs, which are difficult to query, aggregate, or analyze programmatically. Manual categorization and monthly spend analysis are repetitive and error-prone.

Objectives

  • Convert Google Pay statement PDFs into structured, validated transaction data.
  • Provide a backend API for extraction, review, semantic retrieval, and question answering.
  • Support scalable post-processing with asynchronous embedding jobs for semantic finance use cases.

Proposed Solution

finbrain combines deterministic PDF parsing with LLM-powered categorization and semantic QA:

  • Parse statement transactions and validate extracted totals against statement headers.
  • Label transactions into canonical spending categories.
  • Store records in PostgreSQL with extraction-run scoping and durable embedding job queues.
  • Enable semantic question answering over transaction history using vector search + LLM reasoning.

How Extraction Works

Google Pay statement PDFs store text as positioned glyphs (characters with x/y coordinates), not as semantic tables with explicit spaces. In many PDFs, a naive extract_text() call can collapse adjacent words (for example, payee names) because spacing is visual rather than encoded as space characters.

This project uses character-level extraction with pdfplumber:

  • Reads each character with its position.
  • Groups characters into lines by y-coordinate.
  • Reconstructs spaces when the horizontal gap between consecutive characters exceeds a small threshold (about 2 units in PDF coordinate space).

This approach preserves names and transaction fields more reliably for downstream parsing and validation.
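The grouping described above can be sketched in pure Python. The char dicts below mirror the shape of pdfplumber's page.chars entries (text, x0, x1, top); the threshold value, bucket size, and function name are illustrative assumptions, not the project's actual code.

```python
# Illustrative sketch of line grouping and space reconstruction over
# positioned characters, as described above. Values and names are
# assumptions; the real parser may differ.

GAP_THRESHOLD = 2.0   # horizontal gap (PDF units) that implies a space
LINE_BUCKET = 3.0     # bucket size for grouping chars by vertical position


def chars_to_lines(chars):
    """Group positioned characters into text lines, inserting a space
    wherever the horizontal gap between neighbours exceeds the threshold."""
    lines = {}
    for ch in chars:
        # Bucket characters by rounded vertical position (a simple
        # stand-in for tolerance-based line clustering).
        key = round(ch["top"] / LINE_BUCKET)
        lines.setdefault(key, []).append(ch)

    out = []
    for key in sorted(lines):
        row = sorted(lines[key], key=lambda c: c["x0"])
        text = row[0]["text"]
        for prev, cur in zip(row, row[1:]):
            if cur["x0"] - prev["x1"] > GAP_THRESHOLD:
                text += " "
            text += cur["text"]
        out.append(text)
    return out
```

Running this over glyphs whose visual gap exceeds the threshold yields "AB CD" rather than the collapsed "ABCD" a naive extractor would produce.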

FastAPI Service

The project exposes a FastAPI service covering extraction, review, semantic retrieval, and question answering.

  • Endpoint: POST /extract
  • Endpoint: GET /transactions
  • Endpoint: POST /transactions/categories
  • Endpoint: POST /qa
  • Upload field: statement (multipart/form-data)
  • Max file size: 10 MB
  • Behavior: parses uploaded PDF, labels transaction categories using LLM, persists in PostgreSQL

Start API server:

uv run uvicorn main:app --reload

Open docs:

  • Swagger UI: http://127.0.0.1:8000/docs
  • ReDoc: http://127.0.0.1:8000/redoc

Example request:

curl -X POST "http://127.0.0.1:8000/extract" \
  -H "accept: application/json" \
  -F "statement=@data/gpay_statement.pdf;type=application/pdf"

Documentation

Architecture

flowchart TB
  User["User / App"] <--> API["FastAPI Service"]

  subgraph Extraction ["Extraction Pipeline"]
    direction TB
    E["POST /extract"]
    P["PDF Parsing and Verification"]
    L["LLM Category Labeling"]

    subgraph DB1 ["Datastore"]
      direction LR
      XRUN[(extraction_runs)]
      T1[(transactions)]
    end

    E -->|parse + verify| P
    E -->|categorize transactions| L
    E -->|create extraction run| XRUN
    P -->|validated rows| T1
    L -->|category labels| T1
  end

  subgraph Review ["User Review and Embedding Pipeline"]
    direction TB
    C["POST /transactions/categories"]
    W["Background Embedding Worker"]

    subgraph DB2 ["Datastore"]
      direction LR
      J[(embedding_jobs)]
      T2[(transactions)]
    end

    C -->|apply category corrections| T2
    C -->|enqueue embedding jobs| J
    J -->|process pending jobs| W
    W -->|persist embeddings| T2
  end

  subgraph QA ["Semantic Question Answering Pipeline"]
    direction TB
    Q["POST /qa"]
    S["Constraint Extraction and Vector Retrieval"]
    A["LLM Answer Synthesis"]

    subgraph DB3 ["Datastore"]
      T3[(transactions)]
    end

    Q -->|parse question intent| S
    S -->|retrieve semantic matches| T3
    S -->|build answer context| A
  end

  API -->|route| E
  API -->|route| C
  API -->|route| Q

Database Diagram

erDiagram
  EXTRACTION_RUNS ||--o{ TRANSACTIONS : contains
  TRANSACTIONS ||--o| EMBEDDING_JOBS : queues

  EXTRACTION_RUNS {
    uuid id PK
    timestamptz created_at
  }

  TRANSACTIONS {
    uuid id PK
    uuid extraction_id FK
    timestamp transaction_date
    string payee
    string type
    string category
    numeric amount
    string upi_id
    vector embedding
  }

  EMBEDDING_JOBS {
    uuid id PK
    uuid transaction_id FK
    string status
    int attempts
    text last_error
    timestamptz created_at
    timestamptz updated_at
    timestamptz started_at
    timestamptz completed_at
  }

API Sequence

sequenceDiagram
  participant Client
  participant API as FastAPI
  participant DB as PostgreSQL
  participant Worker as Embedding Worker

  Client->>API: POST /extract (statement.pdf)
  API->>DB: Insert extraction_runs row
  API->>DB: Insert transactions (LLM category labels)
  API-->>Client: extraction_id + extracted data

  Client->>API: POST /transactions/categories (extraction_id, updates[])
  API->>DB: Update categories (if provided)
  API->>DB: Enqueue embedding_jobs
  API-->>Client: updated_count, queued_embedding_jobs

  Worker->>DB: Process pending embedding_jobs
  Worker->>DB: Store embeddings on transactions

  Client->>API: POST /qa (question)
  API->>DB: Retrieve constrained semantic matches
  API-->>Client: Answer + useful supporting matches
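The embedding-job lifecycle shown in the diagrams (pending → processing → completed/failed) can be sketched with an in-memory stand-in. The real worker persists these rows in PostgreSQL; field names follow the EMBEDDING_JOBS table, but MAX_ATTEMPTS and the embed() stub are assumptions.

```python
# Minimal in-memory sketch of the durable embedding-job queue described
# above. MAX_ATTEMPTS and embed() are illustrative assumptions.

MAX_ATTEMPTS = 3


def embed(payee: str) -> list[float]:
    # Stand-in for a real embedding call (the project stores 768-dim vectors).
    return [float(len(payee))]


def process_pending(jobs: list[dict], transactions: dict) -> None:
    """Claim each pending job, attach an embedding to its transaction,
    and record success or a retryable failure."""
    for job in jobs:
        if job["status"] != "pending":
            continue
        job["status"] = "processing"
        job["attempts"] += 1
        try:
            txn = transactions[job["transaction_id"]]
            txn["embedding"] = embed(txn["payee"])
            job["status"] = "completed"
        except Exception as exc:
            job["last_error"] = str(exc)
            # Retry later unless the attempt budget is exhausted.
            job["status"] = "pending" if job["attempts"] < MAX_ATTEMPTS else "failed"
```

Keeping status and attempts on the job row is what makes the queue durable: a crashed worker leaves enough state behind for the next run to resume or give up cleanly.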

Features

  • Extracts transactions into JSON (date, payee, amount, upiTransactionId)
  • Adds type as credit/debit (a transaction is marked credit when its source label contains "Received from")
  • Adds semantic category labels using a canonical category set via LLM
  • Captures statement period (start, end, totalSent, totalReceived) when present
  • Verifies UPI ID presence and uniqueness
  • Reconciles extracted totals with statement header totals
  • Supports bulk user category verification/update and asynchronous embedding generation
  • Stores 768-dim embeddings in PostgreSQL (pgvector) for semantic QA retrieval
  • Returns exit code 2 when extraction succeeds but verification fails
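To illustrate what the pgvector-backed retrieval does, here is the same nearest-neighbour ranking in pure Python. Real queries run inside PostgreSQL using pgvector's cosine-distance operator; the toy vectors and top_k parameter below are illustrative only.

```python
# Pure-Python illustration of cosine-distance ranking, the operation
# pgvector performs over the 768-dim embedding column during semantic QA.
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity, matching pgvector's cosine-distance semantics."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


def semantic_matches(query_vec, rows, top_k=3):
    """Rank transaction rows by cosine distance to the query embedding."""
    ranked = sorted(rows, key=lambda r: cosine_distance(query_vec, r["embedding"]))
    return ranked[:top_k]
```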

Requirements

  • Python 3.12+, managed with uv (recommended) or pip
  • Project dependency: pdfplumber (installed via project setup)
  • PostgreSQL connection string via DATABASE_URL
  • Google API key via GOOGLE_API_KEY (Gemini categorization, embeddings, QA)
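Both variables must be present at startup. A hedged sketch of fail-fast settings loading (the project's actual config.py may differ; the function name and error message are illustrative):

```python
# Read the two required settings from the environment, failing fast when
# one is missing. Names follow the Requirements list above.
import os


def load_settings() -> dict:
    """Return required configuration, raising if a variable is unset."""
    settings = {}
    for name in ("DATABASE_URL", "GOOGLE_API_KEY"):
        value = os.environ.get(name)
        if not value:
            raise RuntimeError(f"Missing required environment variable: {name}")
        settings[name] = value
    return settings
```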

Getting the Statement PDF (Google Pay)

Tested against Google Pay Android version 317.1.1.

Export statement PDF

  1. Open Google Pay.
  2. Go to See Transaction History.
  3. Tap See All next to Transaction History.
  4. In the top search bar area, open the menu (...) and choose Get Statement.
  5. Select Last Six Months and continue to download.

Check app version

  1. Open Google Pay.
  2. Tap your profile photo (bottom-right).
  3. Go to Settings.
  4. Open About to view the app version.

Quick Start (uv recommended)

From the project root (env vars shown with Windows cmd syntax; use export on macOS/Linux):

uv sync
set DATABASE_URL=postgresql+psycopg://postgres:postgres@localhost:5432/finbrain
set GOOGLE_API_KEY=<your_google_api_key>
uv run alembic upgrade head
uv run extract_gpay.py data\gpay_statement.pdf

Expected output:

  • JSON file at output/gpay_statement.json (default path pattern: output/<pdf_name>.json)
  • Verification report printed to terminal (unless quiet mode is used)

Installation

Option 1: uv (recommended)

uv sync

Option 2: pip + virtual environment

python -m venv .venv
.venv\Scripts\activate
pip install -e .

Docker (Cloud Run)

The included multi-stage Dockerfile is optimized for Google Cloud Run deployment.

Build image

docker build -t finbrain:latest .

Run locally

docker run --rm -p 8080:8080 --env-file .env finbrain:latest

Deploy to Google Cloud Run

gcloud builds submit --tag gcr.io/<PROJECT_ID>/finbrain:latest
gcloud run deploy finbrain \
  --image gcr.io/<PROJECT_ID>/finbrain:latest \
  --platform managed \
  --region <REGION> \
  --allow-unauthenticated

Required runtime env vars on Cloud Run:

  • DATABASE_URL
  • GOOGLE_API_KEY

Usage

API flow for semantic QA

  1. Upload statement to POST /extract (response includes categories per transaction and extraction_id).
  2. Verify/correct categories with POST /transactions/categories using extraction_id.
  3. Embeddings are generated asynchronously and stored in DB (vector(768)) without blocking response.
  4. Ask semantic questions with POST /qa.
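Step 2 can be sketched as payload construction. The exact field names accepted by POST /transactions/categories are assumptions inferred from the flow above (extraction_id plus a list of updates), not a confirmed schema.

```python
# Build the review payload for POST /transactions/categories.
# Field names are assumptions based on the documented flow.

def build_category_update(extraction_id: str, corrections: dict[str, str]) -> dict:
    """Map {upi_transaction_id: corrected_category} into a request body."""
    return {
        "extraction_id": extraction_id,
        "updates": [
            {"upiTransactionId": upi_id, "category": category}
            for upi_id, category in corrections.items()
        ],
    }
```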

Direct script

uv run extract_gpay.py data\gpay_statement.pdf
uv run extract_gpay.py data\gpay_statement.pdf --output output\custom.json
uv run extract_gpay.py data\gpay_statement.pdf --quiet

Windows PowerShell wrapper

.\scripts\extract-gpay.ps1 data\gpay_statement.pdf
.\scripts\extract-gpay.ps1 data\gpay_statement.pdf -Output output\custom.json
.\scripts\extract-gpay.ps1 data\gpay_statement.pdf -Quiet

Bash wrapper (Git Bash / WSL)

bash ./scripts/extract-gpay.sh data/gpay_statement.pdf
bash ./scripts/extract-gpay.sh data/gpay_statement.pdf --output output/custom.json
bash ./scripts/extract-gpay.sh data/gpay_statement.pdf --quiet

CLI Reference

extract_gpay.py <pdf_path> [--output <json_path>] [--quiet]

Arguments:

  • pdf_path (required): path to the Google Pay PDF statement
  • --output, -o: output JSON file path (default: output/<pdf_name>.json)
  • --quiet, -q: suppress verification output

For complete CLI documentation (commands, exit codes, troubleshooting), see docs/cli.md.
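The CLI surface above maps directly onto argparse; a minimal sketch (the project's actual extract_gpay.py may structure this differently):

```python
# Minimal argparse sketch matching the documented CLI reference.
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="extract_gpay.py")
    parser.add_argument("pdf_path", type=Path,
                        help="path to the Google Pay PDF statement")
    parser.add_argument("--output", "-o", type=Path, default=None,
                        help="output JSON path (default: output/<pdf_name>.json)")
    parser.add_argument("--quiet", "-q", action="store_true",
                        help="suppress verification output")
    return parser
```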

Output Format

Generated JSON shape:

{
  "transactionPeriod": {
    "start": "01 August 2025",
    "end": "31 January 2026",
    "totalSent": 782334.17,
    "totalReceived": 101.0
  },
  "totalTransactions": 123,
  "transactions": [
    {
      "date": "01 Aug, 2025 06:20 PM",
      "payee": "NAVEEN KUMAR S",
      "type": "debit",
      "category": "rent",
      "amount": 630.0,
      "upiTransactionId": "521314926792"
    }
  ]
}

Notes:

  • transactionPeriod is an empty object if period metadata is not detected in the PDF
  • amount values are numeric (float)
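Given that shape, downstream analysis is straightforward. A small sketch that sums debit amounts per category from a loaded output file (the helper name is illustrative):

```python
# Aggregate spend per category from the documented JSON output shape.
from collections import defaultdict


def spend_by_category(doc: dict) -> dict[str, float]:
    """Sum debit amounts for each category in an extraction output."""
    totals: dict[str, float] = defaultdict(float)
    for txn in doc["transactions"]:
        if txn["type"] == "debit":
            totals[txn["category"]] += txn["amount"]
    return dict(totals)
```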

Verification and Exit Codes

When not using --quiet, the script applies three verification checks:

  • Totals reconciliation: compares extracted sent and received sums against the corresponding totals printed in the statement header. A paisa-level match across many rows is a strong signal that extraction is complete and not double-counted.
  • UPI ID uniqueness: ensures each transaction has a unique UPI Transaction ID.
  • Completeness: confirms every extracted transaction includes a UPI Transaction ID.

Exit codes:

  • 0: extraction succeeded (and verification passed, if enabled)
  • 1: invalid input (missing file, wrong extension, etc.)
  • 2: extraction succeeded but verification failed
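The three checks and the pass/fail exit codes above can be sketched as follows. Rounding to two decimals stands in for the paisa-level totals comparison; the function name and the debit-only reconciliation shown here are illustrative simplifications.

```python
# Sketch of the verification checks described above, mapped to exit codes.

def verify(doc: dict) -> int:
    """Return 0 when all checks pass, 2 when verification fails."""
    txns = doc["transactions"]
    upi_ids = [t.get("upiTransactionId") for t in txns]

    # Completeness: every row carries a UPI Transaction ID.
    complete = all(upi_ids)
    # Uniqueness: no UPI Transaction ID appears twice.
    unique = len(set(upi_ids)) == len(upi_ids)
    # Totals reconciliation against the statement header, at paisa precision.
    period = doc.get("transactionPeriod", {})
    sent = round(sum(t["amount"] for t in txns if t["type"] == "debit"), 2)
    totals_ok = ("totalSent" not in period) or sent == round(period["totalSent"], 2)

    return 0 if (complete and unique and totals_ok) else 2
```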

Project Structure

finbrain
├─ alembic/
│  ├─ versions/
│  ├─ env.py
│  └─ script.py.mako
├─ alembic.ini
├─ categories.py
├─ config.py
├─ db/
│  ├─ base.py
│  └─ session.py
├─ extract_gpay.py
├─ main.py
├─ models/
│  ├─ embedding_job.py
│  ├─ extraction_run.py
│  └─ transaction.py
├─ services/
│  ├─ categorization.py
│  ├─ embeddings.py
│  └─ semantic_qa.py
├─ pyproject.toml
├─ schemas.py
├─ docs/
├─ output/
└─ scripts/
   ├─ extract-gpay.ps1
   └─ extract-gpay.sh

Troubleshooting

  • No transactions extracted: confirm the PDF is a Google Pay statement format supported by this parser.
  • Verification mismatches: review top-ups/received entries in output and compare with statement period totals.
  • Duplicate imports: /extract skips duplicate rows for the same extraction run (extraction_id + upi_id).
  • LLM errors: ensure GOOGLE_API_KEY is configured for category labeling, embeddings, and QA.
  • Dependency errors: re-run uv sync (or reinstall via pip environment setup).

Author

Sujeet Gund

GitHub: github.com/sujeetgund

License

This project is licensed under the MIT License.

