finbrain transforms Google Pay statement PDFs into high-fidelity, analytics-ready transaction data.
It combines robust PDF parsing, strict verification, LLM-powered categorization, and semantic QA to turn static statements into a queryable financial intelligence layer.
Digital wallet statements are usually delivered as PDFs, which are difficult to query, aggregate, or analyze programmatically. Manual categorization and monthly spend analysis are repetitive and error-prone.
- Convert Google Pay statement PDFs into structured, validated transaction data.
- Provide a backend API for extraction, review, semantic retrieval, and question answering.
- Support scalable post-processing with asynchronous embedding jobs for semantic finance use cases.
finbrain combines deterministic PDF parsing with LLM-powered categorization and semantic QA:
- Parse statement transactions and validate extracted totals against statement headers.
- Label transactions into canonical spending categories.
- Store records in PostgreSQL with extraction-run scoping and durable embedding job queues.
- Enable semantic question answering over transaction history using vector search + LLM reasoning.
Google Pay statement PDFs store text as positioned glyphs (characters with x/y coordinates), not as semantic tables with explicit spaces. In many PDFs, a naive `extract_text()` call can collapse adjacent words (for example, payee names) because spacing is visual rather than encoded as space characters.
This project uses character-level extraction with `pdfplumber`:
- Reads each character with its position.
- Groups characters into lines by y-coordinate.
- Reconstructs spaces when the horizontal gap between consecutive characters exceeds a threshold (2 pixels).
This approach preserves names and transaction fields more reliably for downstream parsing and validation.
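The steps above can be sketched as follows. This is a simplified illustration, not the project's actual parser: it assumes character dicts shaped like `pdfplumber`'s `page.chars` entries (`text`, `x0`, `x1`, `top`), and the 2-pixel gap threshold and line-grouping tolerance are illustrative.

```python
# Sketch of character-level line reconstruction: bucket glyphs into lines by
# y-coordinate, sort left-to-right, and re-insert spaces where the visual gap
# between adjacent characters exceeds a threshold.
from collections import defaultdict

GAP_THRESHOLD = 2.0  # horizontal gap (px) that implies a missing space


def rebuild_lines(chars, y_tolerance=2.0):
    """Group positioned glyphs into text lines, re-inserting lost spaces."""
    # Bucket characters into lines by quantized y-coordinate.
    rows = defaultdict(list)
    for ch in chars:
        rows[round(ch["top"] / y_tolerance)].append(ch)

    lines = []
    for _, row in sorted(rows.items()):  # top-to-bottom
        row.sort(key=lambda c: c["x0"])  # left-to-right
        text = row[0]["text"]
        for prev, cur in zip(row, row[1:]):
            # Insert a space when the visual gap exceeds the threshold.
            if cur["x0"] - prev["x1"] > GAP_THRESHOLD:
                text += " "
            text += cur["text"]
        lines.append(text)
    return lines
```

With a real statement you would feed this `page.chars` from `pdfplumber` for each page.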
The project also exposes an API service for statement extraction.
- Endpoints: `POST /extract`, `GET /transactions`, `POST /transactions/categories`, `POST /qa`
- Upload field: `statement` (multipart/form-data)
- Max file size: 10 MB
- Behavior: parses the uploaded PDF, labels transaction categories using an LLM, and persists results in PostgreSQL
Start the API server:

```shell
uv run uvicorn main:app --reload
```

Open the docs:

- Swagger UI: http://127.0.0.1:8000/docs
- ReDoc: http://127.0.0.1:8000/redoc
Example request:

```shell
curl -X POST "http://127.0.0.1:8000/extract" \
  -H "accept: application/json" \
  -F "statement=@data/gpay_statement.pdf;type=application/pdf"
```

- API endpoints: docs/api-endpoints.md
- CLI guide: docs/cli.md
- Database guide: docs/database.md
```mermaid
flowchart TB
    User["User / App"] <--> API["FastAPI Service"]

    subgraph Extraction ["Extraction Pipeline"]
        direction TB
        E["POST /extract"]
        P["PDF Parsing and Verification"]
        L["LLM Category Labeling"]
        subgraph DB1 ["Datastore"]
            direction LR
            XRUN[(extraction_runs)]
            T1[(transactions)]
        end
        E -->|parse + verify| P
        E -->|categorize transactions| L
        E -->|create extraction run| XRUN
        P -->|validated rows| T1
        L -->|category labels| T1
    end

    subgraph Review ["User Review and Embedding Pipeline"]
        direction TB
        C["POST /transactions/categories"]
        W["Background Embedding Worker"]
        subgraph DB2 ["Datastore"]
            direction LR
            J[(embedding_jobs)]
            T2[(transactions)]
        end
        C -->|apply category corrections| T2
        C -->|enqueue embedding jobs| J
        J -->|process pending jobs| W
        W -->|persist embeddings| T2
    end

    subgraph QA ["Semantic Question Answering Pipeline"]
        direction TB
        Q["POST /qa"]
        S["Constraint Extraction and Vector Retrieval"]
        A["LLM Answer Synthesis"]
        subgraph DB3 ["Datastore"]
            T3[(transactions)]
        end
        Q -->|parse question intent| S
        S -->|retrieve semantic matches| T3
        S -->|build answer context| A
    end

    API -->|route| E
    API -->|route| C
    API -->|route| Q
```
```mermaid
erDiagram
    EXTRACTION_RUNS ||--o{ TRANSACTIONS : contains
    TRANSACTIONS ||--o| EMBEDDING_JOBS : queues

    EXTRACTION_RUNS {
        uuid id PK
        timestamptz created_at
    }
    TRANSACTIONS {
        uuid id PK
        uuid extraction_id FK
        timestamp transaction_date
        string payee
        string type
        string category
        numeric amount
        string upi_id
        vector embedding
    }
    EMBEDDING_JOBS {
        uuid id PK
        uuid transaction_id FK
        string status
        int attempts
        text last_error
        timestamptz created_at
        timestamptz updated_at
        timestamptz started_at
        timestamptz completed_at
    }
```
```mermaid
sequenceDiagram
    participant Client
    participant API as FastAPI
    participant DB as PostgreSQL
    participant Worker as Embedding Worker

    Client->>API: POST /extract (statement.pdf)
    API->>DB: Insert extraction_runs row
    API->>DB: Insert transactions (LLM category labels)
    API-->>Client: extraction_id + extracted data

    Client->>API: POST /transactions/categories (extraction_id, updates[])
    API->>DB: Update categories (if provided)
    API->>DB: Enqueue embedding_jobs
    API-->>Client: updated_count, queued_embedding_jobs

    Worker->>DB: Process pending embedding_jobs
    Worker->>DB: Store embeddings on transactions

    Client->>API: POST /qa (question)
    API->>DB: Retrieve constrained semantic matches
    API-->>Client: Answer + useful supporting matches
```
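The retrieval step of the QA pipeline can be illustrated with a small in-memory version: structured constraints (for example a `type` filter) narrow the candidates first, then cosine similarity to the question embedding ranks the survivors. In the real service the ranking would be done by `pgvector` inside PostgreSQL; the function and field names here are illustrative only.

```python
# In-memory sketch of constrained vector retrieval: filter, then rank by
# cosine similarity to the question embedding, and keep the top_k matches.
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(transactions, question_vec, txn_type=None, top_k=5):
    """Return the top_k transactions most similar to the question embedding."""
    # Structured constraints first (here, just an optional type filter).
    candidates = [t for t in transactions if txn_type is None or t["type"] == txn_type]
    # Then semantic ranking.
    candidates.sort(key=lambda t: cosine(t["embedding"], question_vec), reverse=True)
    return candidates[:top_k]
```

The retrieved rows would then be formatted into the answer-synthesis prompt.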
- Extracts transactions into JSON (`date`, `payee`, `amount`, `upiTransactionId`)
- Adds `type` as `credit`/`debit` (`credit` when the source label contains "Received from")
- Adds semantic `category` labels using a canonical category set via LLM
- Captures statement period (`start`, `end`, `totalSent`, `totalReceived`) when present
- Verifies UPI ID presence and uniqueness
- Reconciles extracted totals with statement header totals
- Supports bulk user category verification/update and asynchronous embedding generation
- Stores 768-dim embeddings in PostgreSQL (`pgvector`) for semantic QA retrieval
- Returns exit code `2` when extraction succeeds but verification fails
- Python 3.12+ (`uv` recommended)
- Project dependency: `pdfplumber` (installed via project setup)
- PostgreSQL connection string via `DATABASE_URL`
- Google API key via `GOOGLE_API_KEY` (Gemini categorization, embeddings, QA)
Tested against Google Pay Android version 317.1.1.
- Open Google Pay.
- Go to See Transaction History.
- Tap See All next to Transaction History.
- In the top search bar area, open the menu (`...`) and choose Get Statement.
- Select Last Six Months and continue to download.
- Open Google Pay.
- Tap your profile photo (bottom-right).
- Go to Settings.
- Open About to view the app version.
From the project root:

```shell
uv sync
set DATABASE_URL=postgresql+psycopg://postgres:postgres@localhost:5432/finbrain
set GOOGLE_API_KEY=<your_google_api_key>
uv run alembic upgrade head
uv run extract_gpay.py data\gpay_statement.pdf
```

Expected output:
- JSON file at `output/gpay_statement.json` (default path pattern: `output/<pdf_name>.json`)
- Verification report printed to the terminal (unless quiet mode is used)
Install with uv:

```shell
uv sync
```

Or with a plain virtualenv and pip:

```shell
python -m venv .venv
.venv\Scripts\activate
pip install -e .
```

The included multi-stage Dockerfile is optimized for Google Cloud Run deployment.

```shell
docker build -t finbrain:latest .
docker run --rm -p 8080:8080 --env-file .env finbrain:latest
gcloud builds submit --tag gcr.io/<PROJECT_ID>/finbrain:latest
gcloud run deploy finbrain \
  --image gcr.io/<PROJECT_ID>/finbrain:latest \
  --platform managed \
  --region <REGION> \
  --allow-unauthenticated
```

Required runtime env vars on Cloud Run: `DATABASE_URL`, `GOOGLE_API_KEY`
- Upload a statement to `POST /extract` (the response includes per-transaction categories and an `extraction_id`).
- Verify/correct categories with `POST /transactions/categories` using the `extraction_id`.
- Embeddings are generated asynchronously and stored in the DB (`vector(768)`) without blocking the response.
- Ask semantic questions with `POST /qa`.
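As an illustration of the review step, a hypothetical helper that assembles the correction payload for `POST /transactions/categories`. The field names are an assumption based on the sequence diagram (`extraction_id`, `updates[]`), not the authoritative request schema; check docs/api-endpoints.md for the real contract.

```python
# Hypothetical payload builder for the category-correction request.
# Field names are assumptions inferred from the flow, not the real schema.
def build_category_update(extraction_id, corrections):
    """corrections: mapping of transaction id -> corrected category."""
    return {
        "extraction_id": extraction_id,
        "updates": [
            {"transaction_id": txn_id, "category": category}
            for txn_id, category in corrections.items()
        ],
    }
```

The resulting dict would be POSTed as the JSON body of `POST /transactions/categories`.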
Using uv:

```shell
uv run extract_gpay.py data\gpay_statement.pdf
uv run extract_gpay.py data\gpay_statement.pdf --output output\custom.json
uv run extract_gpay.py data\gpay_statement.pdf --quiet
```

Using the PowerShell wrapper:

```shell
.\scripts\extract-gpay.ps1 data\gpay_statement.pdf
.\scripts\extract-gpay.ps1 data\gpay_statement.pdf -Output output\custom.json
.\scripts\extract-gpay.ps1 data\gpay_statement.pdf -Quiet
```

Using the bash wrapper:

```shell
bash ./scripts/extract-gpay.sh data/gpay_statement.pdf
bash ./scripts/extract-gpay.sh data/gpay_statement.pdf --output output/custom.json
bash ./scripts/extract-gpay.sh data/gpay_statement.pdf --quiet
```

Usage:

```
extract_gpay.py <pdf_path> [--output <json_path>] [--quiet]
```

Arguments:
- `pdf_path` (required): path to the Google Pay PDF statement
- `--output`, `-o`: output JSON file path (default: `output/<pdf_name>.json`)
- `--quiet`, `-q`: suppress verification output
For complete CLI documentation (commands, exit codes, troubleshooting), see docs/cli.md.
Generated JSON shape:

```json
{
  "transactionPeriod": {
    "start": "01 August 2025",
    "end": "31 January 2026",
    "totalSent": 782334.17,
    "totalReceived": 101.0
  },
  "totalTransactions": 123,
  "transactions": [
    {
      "date": "01 Aug, 2025 06:20 PM",
      "payee": "NAVEEN KUMAR S",
      "type": "debit",
      "category": "rent",
      "amount": 630.0,
      "upiTransactionId": "521314926792"
    }
  ]
}
```

Notes:
- `transactionPeriod` is an empty object if period metadata is not detected in the PDF
- `amount` values are numeric (float)
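With this shape, the output is easy to consume downstream. A small sketch that totals debits per category (the file path in the usage comment follows the default output pattern; the helper name is ours):

```python
# Consume the generated statement JSON: sum debit amounts per category.
import json
from collections import defaultdict


def debit_totals_by_category(statement):
    """Return {category: total debit amount} for a parsed statement dict."""
    totals = defaultdict(float)
    for txn in statement["transactions"]:
        if txn["type"] == "debit":
            totals[txn["category"]] += txn["amount"]
    return dict(totals)


# Usage (default output path):
# with open("output/gpay_statement.json") as f:
#     print(debit_totals_by_category(json.load(f)))
```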
When not using `--quiet`, the script applies three verification checks:
- Totals reconciliation: compares extracted sent and received sums against the corresponding totals printed in the statement header. A paisa-level match across many rows is a strong signal that extraction is complete and not double-counted.
- UPI ID uniqueness: ensures each transaction has a unique UPI Transaction ID.
- Completeness: confirms every extracted transaction includes a UPI Transaction ID.
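The totals-reconciliation check can be sketched as a paisa-level comparison (tolerance of 0.01). This illustrates the idea rather than reproducing the script's exact implementation:

```python
# Reconcile extracted debit/credit sums against the statement header totals.
# A match within one paisa (0.01) across many rows strongly suggests the
# extraction is complete and not double-counted.
def reconcile_totals(transactions, total_sent, total_received, tolerance=0.01):
    """Return True when extracted sums match the header totals within tolerance."""
    sent = sum(t["amount"] for t in transactions if t["type"] == "debit")
    received = sum(t["amount"] for t in transactions if t["type"] == "credit")
    return (abs(sent - total_sent) <= tolerance
            and abs(received - total_received) <= tolerance)
```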
Exit codes:
- `0`: extraction succeeded (and verification passed, if enabled)
- `1`: invalid input (missing file, wrong extension, etc.)
- `2`: extraction succeeded but verification failed
```
finbrain
├─ alembic/
│  ├─ versions/
│  ├─ env.py
│  └─ script.py.mako
├─ alembic.ini
├─ categories.py
├─ config.py
├─ db/
│  ├─ base.py
│  └─ session.py
├─ extract_gpay.py
├─ main.py
├─ models/
│  ├─ embedding_job.py
│  ├─ extraction_run.py
│  └─ transaction.py
├─ services/
│  ├─ categorization.py
│  ├─ embeddings.py
│  └─ semantic_qa.py
├─ pyproject.toml
├─ schemas.py
├─ docs/
├─ output/
└─ scripts/
   ├─ extract-gpay.ps1
   └─ extract-gpay.sh
```
- No transactions extracted: confirm the PDF is a Google Pay statement format supported by this parser.
- Verification mismatches: review top-ups/received entries in output and compare with statement period totals.
- Duplicate imports: `/extract` skips duplicate rows for the same extraction run (`extraction_id` + `upi_id`).
- LLM errors: ensure `GOOGLE_API_KEY` is configured for category labeling, embeddings, and QA.
- Dependency errors: re-run `uv sync` (or reinstall via the pip environment setup).
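The duplicate-skip behavior can be shown in miniature. In the real service the (`extraction_id`, `upi_id`) uniqueness would typically be enforced at the database layer; this helper is only a sketch of the rule:

```python
# Skip rows whose (extraction_id, upi_id) pair has already been seen
# within the same extraction run.
def dedupe_rows(extraction_id, rows):
    """Keep the first row per (extraction_id, upi_id), drop later duplicates."""
    seen, kept = set(), []
    for row in rows:
        key = (extraction_id, row["upi_id"])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept
```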
Sujeet Gund
GitHub: github.com/sujeetgund
This project is licensed under the MIT License.
