
Vaulta AI — Real-Time Streaming Financial Voice Assistant

Vaulta AI is a hybrid voice authentication and AI service system designed to simulate a secure, real-world digital banking assistant.



Project Description

Vaulta AI is a real-time, voice-enabled banking agent that combines secure authentication with intelligent financial assistance. The system uses a hybrid architecture that begins with a deterministic authentication state machine verifying user identity through guided voice interaction before transitioning into an AI-powered conversational service mode.

Built using WebSocket-based audio streaming, speech-to-text processing, and local language models, Vaulta AI simulates a modern digital banking call center experience. After successful verification, users can interact naturally to check balances, review transactions, or perform account-related actions.

The project demonstrates secure AI system design principles by separating authentication logic from conversational intelligence, implementing retry limits, input validation, and session-based state management. It is designed as a technical showcase of how voice interfaces, AI agents, and financial security layers can be integrated into a cohesive, production-inspired architecture.


Features

  • Chunked voice streaming — Browser streams audio chunks to backend; backend buffers until end-of-utterance (1.2 s silence or explicit end); STT runs in memory; no WAV files written for normal flow.
  • Phase 1 guided authentication — Deterministic flow: language, full name, card type, last 4 digits, DOB; intent rules (parse_language, parse_full_name, parse_card_type, parse_four_digits, parse_dob, parse_menu_*, parse_end_call); verification lockout after 2 failures ("Please call back later.").
  • Phase 2 open conversation — Balance, last transaction, credit score, credit limit left, account status from SQLite; freeform questions via Ollama with conversation memory + DB context.
  • Bilingual (EN/FR) — Prompts in English or French from config/prompts.json and config/prompts_fr.json.
  • Sleek UI (Vaulta AI) — Single page: Start Call / End Call, MediaRecorder, WebSocket, AudioContext with audio queue to prevent overlap; dark theme; no step labels or file links.
  • Log masking — Card-like numbers masked in logs (e.g. ****1234) via utils/mask.py.
  • Local-first — Faster-Whisper, Piper (or edge-tts fallback), Ollama, SQLite; no OpenAI or cloud APIs.
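The log-masking feature can be sketched with a single regex. This is an illustrative stand-in for utils/mask.py, assuming a standard "mask all but the last four digits" rule; the project's actual implementation may differ:

```python
import re

# Match a card-like run of 13-19 digits (optionally separated by
# spaces or hyphens) and capture the last four digits.
CARD_RE = re.compile(r"\b(?:\d[ -]?){9,15}(\d{4})\b")

def mask_pan(text: str) -> str:
    """Mask card-like numbers in log text, e.g. '4111111111111234' -> '****1234'."""
    return CARD_RE.sub(lambda m: "****" + m.group(1), text)
```

Short digit runs such as a spoken "1234" fall below the 13-digit minimum and are left untouched, so Phase 1 last-4 transcripts remain readable in logs.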

Technology Stack

Frontend & Interface

| Component | Technology |
|---|---|
| UI | HTML5 + vanilla JS — single page: MediaRecorder, WebSocket, AudioContext (audio queue); Start/End Call; sleek dark theme. |

Backend & Transport

| Component | Technology |
|---|---|
| Backend | FastAPI — serves the static frontend and WebSocket /ws; receives binary chunks; sends TTS bytes + JSON (state, transcript). |
| Transport | WebSocket (binary + JSON) — browser ↔ FastAPI; audio chunks in; audio bytes + events out. |

STT, TTS & LLM

| Component | Technology |
|---|---|
| STT | Faster-Whisper — in-memory transcription; input PCM 16 kHz mono (see Audio conversion). |
| Audio conversion | pydub + imageio-ffmpeg (or system ffmpeg) — convert browser webm/opus chunks to PCM 16 kHz mono before STT. Optional: PyAV for decoding without ffmpeg. |
| TTS | Piper (local) — synthesize(text, lang) → WAV bytes; edge-tts fallback when Piper voices are not available. |
| LLM | Ollama (local) — Phase 2 freeform and optional extraction; no API key. |
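The in-memory audio formats in this table can be illustrated with a tiny stdlib helper that wraps raw 16-bit, 16 kHz mono PCM in a WAV container without touching disk. This is purely illustrative (the project's audio/convert.py handles the webm→PCM direction via pydub/ffmpeg):

```python
import io
import wave

def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 16000) -> bytes:
    """Wrap raw 16-bit mono PCM bytes in an in-memory WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)           # mono
        wf.setsampwidth(2)           # 16-bit samples
        wf.setframerate(sample_rate) # 16 kHz, matching the STT input
        wf.writeframes(pcm)
    return buf.getvalue()
```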

Data & Security

| Component | Technology |
|---|---|
| Data | SQLite — customers, cards, transactions; seeded via scripts/init_db.py and scripts/generate_synthetic_data.py. |
| Security | Log masking (utils/mask.py) — no full PAN in logs. |

Tools & Utilities

| Item | Purpose |
|---|---|
| python-dotenv | Loads .env for OLLAMA_HOST, DB_PATH, PIPER_VOICE_DIR. |
| rapidfuzz | Fuzzy name matching for the Phase 1 full-name step. |
| dateparser | Natural-language and numeric date parsing for DOB. |

System Architecture

  • User opens the app in browser; clicks Start Call; allows mic; hears welcome TTS; speaks; audio is streamed via WebSocket.
  • FastAPI serves static frontend and WebSocket /ws; on connect sends welcome TTS; receives binary chunks; on 1.2 s silence or end_utterance JSON: converts webm → PCM, runs STT, calls orchestrator.
  • Orchestrator (conversation/orchestrator.py): process_turn(audio_bytes, session) → STT → intent/state (via agent) → TTS → returns (audio_bytes, updated_session, transcript_masked). Enforces lockout when retry_count >= 2.
  • Agent (agent/agent.py): Intent rules per state; DB helpers (data/db.py); Phase 1 prompts from config; Phase 2 tool-calling (get_balance, get_last_transaction, get_credit_score, get_credit_limit_left, get_account_status). Optional: Ollama for freeform.
  • SQLite — data/finance.db (customers, cards, transactions). No file storage for audio in normal flow.
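The retry/lockout rule the orchestrator enforces can be sketched in a few lines. The Session fields mirror conversation/session_data.py as described above, but the function name and messages here are illustrative:

```python
from dataclasses import dataclass

MAX_RETRIES = 2  # from the README: lockout after 2 verification failures

@dataclass
class Session:
    """Simplified stand-in for the session dataclass (phase, state, etc. omitted)."""
    state: str = "welcome"
    retry_count: int = 0

def on_verify_failure(session: Session) -> str:
    """Increment the retry counter; enter lockout once it reaches the limit."""
    session.retry_count += 1
    if session.retry_count >= MAX_RETRIES:
        session.state = "end"  # lockout: call ends
        return "Please call back later."
    return "Sorry, that didn't match. Please try again."
```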

Architecture Diagrams

High-level data flow

```mermaid
flowchart LR
  subgraph Browser [Browser]
    UI[UI]
    MR[MediaRecorder]
    WSC[WebSocket Client]
    AC[AudioContext]
  end
  subgraph Backend [FastAPI Backend]
    WSS[WebSocket Server]
    Buf[Audio Buffer]
    Convert[webm to PCM 16kHz]
    STT[Faster-Whisper]
    Orch[Orchestrator]
    Agent[Agent]
    TTS[Piper / edge-tts]
    DB[(SQLite)]
  end
  UI -->|Start Call| WSC
  MR -->|Chunks| WSC
  WSC <-->|Binary + JSON| WSS
  WSS --> Buf
  Buf --> Convert
  Convert --> STT
  STT --> Orch
  Orch --> Agent
  Agent --> DB
  Orch --> TTS
  TTS -->|Bytes| WSS
  WSS --> WSC
  WSC --> AC
  AC --> User[User]
```

Layers and responsibilities

```mermaid
flowchart TB
  subgraph Presentation [Presentation Layer]
    Frontend[Frontend SPA]
  end
  subgraph Transport [Transport Layer]
    WebSocket[FastAPI WebSocket]
  end
  subgraph Orchestration [Orchestration Layer]
    Orchestrator[Orchestrator]
    StateMachine[State Machine]
    Session[Session Data]
  end
  subgraph AgentLayer [Agent and Intent Layer]
    Intent[Intent Rules]
    Agent[Agent]
  end
  subgraph Data [Data Layer]
    SQLite[(SQLite)]
  end
  subgraph Models [Models Layer]
    STT[STT]
    TTS[TTS]
  end
  Frontend --> WebSocket
  WebSocket --> Orchestrator
  Orchestrator --> StateMachine
  Orchestrator --> Session
  Orchestrator --> STT
  Orchestrator --> TTS
  Orchestrator --> Intent
  Orchestrator --> Agent
  Intent --> Agent
  Agent --> SQLite
```

Request flow (per turn)

```mermaid
flowchart TB
  User[User message] --> Frontend[Frontend]
  Frontend -->|Binary chunks| WS[WebSocket]
  WS --> Buffer[Buffer until 1.2s silence or end_utterance]
  Buffer --> Convert[audio/convert: webm to PCM 16kHz mono]
  Convert --> STT[stt: transcribe]
  STT --> Orch[orchestrator: process_turn]
  Orch --> Agent[agent: run_agent]
  Agent --> Intent[intent/rules]
  Agent --> State[state_machine]
  Agent --> DB[(data/db)]
  Agent --> Reply[reply_text]
  Orch --> Retry[retry or lockout if verify fail]
  Reply --> Retry
  Retry --> TTS[tts: synthesize]
  TTS --> Send[Send bytes + JSON to client]
  Send --> Frontend
  Frontend --> Play[AudioContext queue play]
  Play --> User
```

Conversation flow (Phase 1 and Phase 2)

User journey (UI)

```mermaid
flowchart TD
  A[Open app] --> B[Click Start Call]
  B --> C[Allow mic]
  C --> D[Hear welcome TTS]
  D --> E[Speak]
  E --> F[Backend: buffer then STT then orchestrator then TTS]
  F --> G[Hear response]
  G --> H{End call?}
  H -->|No| E
  H -->|Yes| I[Click End Call]
  I --> J[Hear goodbye]
```

Phase 1 state machine (guided auth)

```mermaid
stateDiagram-v2
  [*] --> welcome
  welcome --> full_name: language en/fr
  full_name --> account_type: name matched
  account_type --> verify_number: debit/credit
  verify_number --> verify_dob: last4 matched
  verify_dob --> menu: DOB matched
  menu --> action: balance/transaction/score/status/other
  action --> menu_again: TTS result
  menu_again --> action: same options
  menu_again --> end_state: end_call
  menu --> end_state: end_call
  verify_number --> verify_number: retry
  verify_dob --> verify_dob: retry
  verify_number --> end_state: lockout
  verify_dob --> end_state: lockout
  end_state --> [*]
```
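The Phase 1 state machine maps naturally onto a transition table. This sketch is hypothetical: the event names are invented for illustration, and conversation/state_machine.py may structure its next_state logic differently:

```python
# Transition table derived from the state diagram above.
# Keys are (current_state, event); values are the next state.
TRANSITIONS = {
    ("welcome", "language_ok"): "full_name",
    ("full_name", "name_matched"): "account_type",
    ("account_type", "card_type_ok"): "verify_number",
    ("verify_number", "last4_matched"): "verify_dob",
    ("verify_dob", "dob_matched"): "menu",
    ("menu", "action_chosen"): "action",
    ("action", "result_spoken"): "menu_again",
    ("menu_again", "action_chosen"): "action",
    ("menu_again", "end_call"): "end_state",
    ("menu", "end_call"): "end_state",
}

def next_state(state: str, event: str) -> str:
    # Unknown events keep the current state (the "retry" self-loops);
    # lockout is handled separately by the orchestrator's retry counter.
    return TRANSITIONS.get((state, event), state)
```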

Phase 2 (after authentication)

```mermaid
flowchart LR
  Auth[Authenticated] --> Menu[Menu: balance, last transaction, credit score, limit, status, other]
  Menu --> Tool[Tool: get_balance, get_last_transaction, etc.]
  Tool --> DB[(SQLite)]
  DB --> Reply[TTS reply]
  Reply --> Menu
  Menu --> End[End call]
```

Data Flow and Orchestration

```mermaid
sequenceDiagram
  participant F as Frontend
  participant WS as WebSocket
  participant Main as Backend
  participant Convert as Convert
  participant STT as STT
  participant Orch as Orchestrator
  participant Agent as Agent
  participant DB as SQLite
  participant TTS as TTS

  F->>WS: Binary audio chunks
  F->>WS: end_utterance (optional)
  WS->>Main: Buffer then 1.2s silence or end_utterance
  Main->>Convert: webm_to_pcm_16k_mono / webm_to_wav_16k_mono
  Convert->>STT: PCM 16kHz mono bytes
  STT->>STT: transcribe
  STT->>Orch: transcript
  Orch->>Orch: mask_pan(transcript)
  Orch->>Agent: run_agent(state, transcript, session)
  Agent->>Agent: intent rules per state
  Agent->>DB: get_customer_by_name_and_dob, get_card_*, get_last_transaction_for_card, etc.
  DB->>Agent: row(s)
  Agent->>Orch: reply_text, session_dict, next_state
  Orch->>Orch: retry/lockout if verify_number or verify_dob failed
  Orch->>Orch: get prompt if reply_text empty
  Orch->>TTS: synthesize(reply_text, lang)
  TTS->>Orch: audio bytes
  Orch->>WS: audio bytes then JSON state/phase/transcript
  WS->>F: Binary then JSON
  F->>F: Audio queue decode and play
```

Steps in short:

  1. Frontend sends binary audio chunks over WebSocket; optionally {"type": "end_utterance"} to flush buffer.
  2. Backend buffers chunks; after 1.2 s silence or end_utterance, converts webm → PCM 16 kHz mono via audio/convert.py.
  3. STT (stt/stt.py): transcribe(audio_bytes) → transcript; transcript is masked for logs.
  4. Orchestrator loads prompts by language; calls agent run_agent(state, transcript, session).
  5. Agent runs intent parser for current state (parse_language, parse_full_name, parse_card_type, parse_four_digits, parse_dob, parse_menu_debit / parse_menu_credit, parse_end_call); updates session; on verification failure increments retry_count — if retry_count >= 2 sets state=end (lockout).
  6. Phase 1: Next prompt from config or DB-backed reply; Phase 2: tool-calling (get_balance, get_last_transaction, etc.).
  7. TTS synthesize(reply_text, session.language) → WAV or MP3 bytes.
  8. Backend sends audio bytes then JSON { state, phase, transcript, error } to client.
  9. Frontend pushes decoded audio to queue; plays sequentially (no overlap).

Project Structure

ai-banking-agent/
├── README.md
├── requirements.txt
├── .env.example
├── .gitignore
├── main.py                          # FastAPI app + WebSocket /ws
├── config/
│   ├── prompts.json                 # EN prompts per state/menu
│   └── prompts_fr.json             # FR prompts
├── data/
│   ├── __init__.py
│   ├── db.py                        # SQLite helpers (customers, cards, transactions)
│   └── finance.db                   # (gitignore) SQLite DB
├── conversation/
│   ├── __init__.py
│   ├── state_machine.py             # States, next_state
│   ├── session_data.py              # Session dataclass (phase, state, language, customer_id, card_id, etc.)
│   └── orchestrator.py              # process_turn: STT → agent → TTS
├── intent/
│   ├── __init__.py
│   └── rules.py                     # parse_language, parse_full_name, parse_card_type, parse_four_digits, parse_dob, parse_menu_*, parse_end_call
├── agent/
│   ├── __init__.py
│   └── agent.py                     # Ollama + DB; Phase 1 & Phase 2
├── stt/
│   ├── __init__.py
│   └── stt.py                       # transcribe(audio_bytes) in-memory, Faster-Whisper
├── tts/
│   ├── __init__.py
│   └── tts.py                       # synthesize(text, lang) → bytes (Piper / edge-tts)
├── audio/
│   ├── __init__.py
│   └── convert.py                   # webm/opus → PCM 16 kHz mono
├── utils/
│   ├── __init__.py
│   ├── mask.py                      # mask_pan() for logs
│   └── logger.py
├── scripts/
│   ├── init_db.py                   # Create tables (optional --drop)
│   └── generate_synthetic_data.py  # Seed customers, cards, transactions
├── static/
│   ├── index.html                   # Vaulta AI UI
│   ├── app.js                       # MediaRecorder, WebSocket, AudioContext queue
│   └── styles.css
├── voices/                          # Piper voice files (optional)
│   ├── en_US-lessac-medium.onnx (+ .json)
│   └── fr_FR-siwis-medium.onnx (+ .json)
├── tests/
│   └── test_e2e_voice.py            # E2E / voice tests
└── docs/
    └── scripted_flows.md            # Demo identities and exact phrases (UC1–UC8)

Data Strategy

| Data | Source | Location | How created |
|---|---|---|---|
| SQLite | Synthetic + demo | data/finance.db | python scripts/init_db.py then python scripts/generate_synthetic_data.py |
| Demo users | See scripted flows | Same DB | John Smith (DOB 1990-03-15, credit last4 1234), Jane Doe (DOB 1985-07-22, debit last4 5678), etc. |
  • Tables: customers (customer_id, name, dob, phone_last4), cards (card_id, customer_id, card_last4, card_type, status, balance, credit_limit, credit_score, …), transactions (transaction_id, card_id, merchant, amount, date, status).
  • Helpers in data/db.py: get_customer_by_name_and_dob, get_cards_by_customer_id, get_card_by_type_last4_and_customer, get_last_transaction_for_card, get_credit_limit_left, etc.
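One of the helpers named above can be sketched against the customers schema described in this section. This is a hedged sketch, assuming a plain sqlite3 connection; data/db.py may manage connections and queries differently:

```python
import sqlite3

def get_customer_by_name_and_dob(conn: sqlite3.Connection, name: str, dob: str):
    """Look up a customer by (case-insensitive) name and ISO DOB; None if absent."""
    conn.row_factory = sqlite3.Row
    row = conn.execute(
        "SELECT * FROM customers WHERE lower(name) = lower(?) AND dob = ?",
        (name, dob),
    ).fetchone()
    return dict(row) if row else None
```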

Demo

Below are snapshots of Vaulta AI: the chat UI and the demo video.

Vaulta UI

Demo video:

▶️ Click here to download the demo video


Installation

  1. Clone the repository

    git clone <your-repo-url>
    cd ai-banking-agent
  2. Create virtual environment and install dependencies
    requirements.txt is minimal: only packages used by the app, scripts, and tests (FastAPI, Faster-Whisper, edge-tts, pydub, imageio-ffmpeg, rapidfuzz, dateparser, Faker, pytest, etc.).

    python -m venv venv
    .\venv\Scripts\Activate.ps1   # Windows
    # source venv/bin/activate    # Linux/macOS
    pip install --upgrade pip
    pip install -r requirements.txt

    Audio (WebM→PCM): The app uses PyAV first, then pydub and imageio-ffmpeg (bundled ffmpeg) so conversion works without ffmpeg on PATH. If you still see "Audio conversion failed", install ffmpeg and add it to your PATH.

  3. Environment

    • Copy .env.example to .env.
    • Set (defaults shown):
      • OLLAMA_HOST=http://localhost:11434
      • DB_PATH=data/finance.db
      • PIPER_VOICE_DIR=voices
      • PIPER_VOICE_DIR_FR=voices
  4. Ollama (optional; for Phase 2 freeform)

    ollama serve
    ollama pull llama3.2
  5. Database

    python scripts/init_db.py --drop
    python scripts/generate_synthetic_data.py
  6. Piper (optional)
    TTS uses edge-tts by default (included in requirements). For local Piper TTS, install Piper on your PATH and place voice files in voices/ (see project structure). The voices/ folder is in .gitignore.

  7. ffmpeg
    For webm → PCM conversion, the app uses imageio-ffmpeg (bundled in requirements) when ffmpeg is not on system PATH. You can also install ffmpeg system-wide.


Configuration

Configuration is via .env (see .env.example). No cloud API keys are required for the core flow.

| Variable | Description | Example |
|---|---|---|
| OLLAMA_HOST | Ollama server URL | http://localhost:11434 |
| DB_PATH | SQLite database path | data/finance.db |
| PIPER_VOICE_DIR | Directory with the Piper EN voice | voices |
| PIPER_VOICE_DIR_FR | Directory with the Piper FR voice | voices |
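Reading these settings can be as simple as os.getenv with the defaults from the table (python-dotenv's load_dotenv would populate the environment from .env first). This is a sketch of how such a settings block might look, not the project's exact code:

```python
import os

# Defaults match the table above; a call to dotenv.load_dotenv() beforehand
# would let .env override them.
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://localhost:11434")
DB_PATH = os.getenv("DB_PATH", "data/finance.db")
PIPER_VOICE_DIR = os.getenv("PIPER_VOICE_DIR", "voices")
PIPER_VOICE_DIR_FR = os.getenv("PIPER_VOICE_DIR_FR", "voices")
```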

Usage

  1. Start the server

    uvicorn main:app --reload --host 0.0.0.0 --port 8000
  2. Open the app in your browser (e.g. http://localhost:8000).

  3. Click Start Call — allow microphone access; the welcome TTS plays automatically.

  4. Phase 1: Say e.g. "English" → "John Smith" → "Credit" → "One two three four" (or "1234") → "March 15 1990" → then choose "Balance", "Credit score", "Last transaction", etc., depending on menu.

  5. Phase 2: After "You're authenticated. How can I help you today?" — ask balance, last transaction, or freeform questions; say "End call" when done.

  6. Click End Call to disconnect.


Testing

pytest tests/ -v
  • Manual E2E: Run server, complete UC1–UC8 from docs/scripted_flows.md (e.g. happy path EN credit balance, French, debit last transaction, verification failure).
  • Note: Chunked voice streaming with turn-based processing — backend buffers until ~1.2 s silence or end_utterance, then runs STT and responds. Not true real-time streaming transcription.

License

This project is licensed under the MIT License.


FAQs

Q: What is Vaulta AI?
A: Vaulta AI is a real-time streaming financial voice assistant. You start a call, speak, and the assistant guides you through authentication (Phase 1) then answers balance, transactions, credit score, and freeform questions (Phase 2) using SQLite and a local LLM (Ollama). All processing is local; no cloud APIs.

Q: Is internet required?
A: For Piper + Faster-Whisper + Ollama, no. If you use the edge-tts fallback for TTS, that uses a Microsoft API and requires internet.

Q: Why do I not see WAV files or step numbers?
A: By design: no file save for the user, no download links, no step labels in the UI — only Start Call / End Call and status.

Q: What happens after two verification failures?
A: The session enters lockout: the assistant says "Please call back later." and ends the call (state=end).

Q: How do I add more demo users?
A: Edit scripts/generate_synthetic_data.py to add customers/cards/transactions, or insert directly into data/finance.db, and document them in docs/scripted_flows.md.


