# Vaulta AI

Vaulta AI is a hybrid voice authentication and AI service system designed to simulate a secure, real-world digital banking assistant.
## Table of Contents

- Project Description
- Features
- Technology Stack
- System Architecture
- Architecture Diagrams
- Conversation Flow
- Data Flow and Orchestration
- Project Structure
- Data Strategy
- Demo
- Installation
- Configuration
- Usage
- Testing
- License
- FAQs
- Contact Information
## Project Description

Vaulta AI is a real-time, voice-enabled banking agent that combines secure authentication with intelligent financial assistance. The system uses a hybrid architecture that begins with a deterministic authentication state machine, verifying user identity through guided voice interaction before transitioning into an AI-powered conversational service mode.
Built using WebSocket-based audio streaming, speech-to-text processing, and local language models, Vaulta AI simulates a modern digital banking call center experience. After successful verification, users can interact naturally to check balances, review transactions, or perform account-related actions.
The project demonstrates secure AI system design principles by separating authentication logic from conversational intelligence, implementing retry limits, input validation, and session-based state management. It is designed as a technical showcase of how voice interfaces, AI agents, and financial security layers can be integrated into a cohesive, production-inspired architecture.
## Features

- Chunked voice streaming — Browser streams audio chunks to the backend; the backend buffers until end-of-utterance (1.2 s silence or explicit end); STT runs in memory; no WAV files are written in the normal flow.
- Phase 1 guided authentication — Deterministic flow: language, full name, card type, last 4 digits, DOB; intent rules (`parse_language`, `parse_full_name`, `parse_card_type`, `parse_four_digits`, `parse_dob`, `parse_menu_*`, `parse_end_call`); verification lockout after 2 failures ("Please call back later.").
- Phase 2 open conversation — Balance, last transaction, credit score, credit limit left, account status from SQLite; freeform questions via Ollama with conversation memory + DB context.
- Bilingual (EN/FR) — Prompts in English or French from `config/prompts.json` and `config/prompts_fr.json`.
- Sleek UI (Vaulta AI) — Single page: Start Call / End Call, MediaRecorder, WebSocket, AudioContext with an audio queue to prevent overlap; dark theme; no step labels or file links.
- Log masking — Card-like numbers masked in logs (e.g. `****1234`) via `utils/mask.py`.
- Local-first — Faster-Whisper, Piper (or edge-tts fallback), Ollama, SQLite; no OpenAI or cloud APIs.
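The log-masking feature can be sketched as follows. This is a minimal illustration, not the project's actual `utils/mask.py`; the `mask_pan` name comes from the project structure, but the regex is an assumption:

```python
import re

# Matches 13-19 digit card-like numbers, allowing optional spaces or
# dashes between digits (pattern is an assumption, not the real one).
_CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")

def mask_pan(text: str) -> str:
    """Replace card-like numbers with ****<last4> for safe logging."""
    def _mask(match):
        digits = re.sub(r"\D", "", match.group(0))
        return "****" + digits[-4:]
    return _CARD_RE.sub(_mask, text)
```

Short runs of digits (such as a spoken "1234" on its own) are left untouched, so only full card numbers are redacted.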
## Technology Stack

| Component | Technology |
|---|---|
| UI | HTML5 + vanilla JS — single page: MediaRecorder, WebSocket, AudioContext (audio queue); Start/End Call; sleek dark theme. |
| Backend | FastAPI — static frontend + WebSocket `/ws`; receives binary chunks; sends TTS bytes + JSON (state, transcript). |
| Transport | WebSocket (binary + JSON) — browser ↔ FastAPI; audio chunks in; audio bytes + events out. |
| STT | Faster-Whisper — in-memory transcription; input PCM 16 kHz mono (see Audio conversion). |
| Audio conversion | pydub + imageio-ffmpeg (or system ffmpeg) — converts browser webm/opus chunks to PCM 16 kHz mono before STT. Optional: PyAV for decoding without ffmpeg. |
| TTS | Piper (local) — `synthesize(text, lang)` → WAV bytes; edge-tts fallback when Piper voices are not available. |
| LLM | Ollama (local) — Phase 2 freeform and optional extraction; no API key. |
| Data | SQLite — customers, cards, transactions; seeded via `scripts/init_db.py` and `scripts/generate_synthetic_data.py`. |
| Security | Log masking (`utils/mask.py`) — no full PAN in logs. |

Supporting libraries:

| Item | Purpose |
|---|---|
| python-dotenv | `.env` for `OLLAMA_HOST`, `DB_PATH`, `PIPER_VOICE_DIR`. |
| rapidfuzz | Fuzzy name matching for the Phase 1 full name. |
| dateparser | Natural-language and numeric date parsing for DOB. |
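The two supporting libraries above can be approximated with the standard library to show what they do in Phase 1. This sketch uses `difflib` as a stand-in for rapidfuzz and `strptime` as a stand-in for dateparser, so it runs without extra dependencies; the threshold and date formats are illustrative assumptions:

```python
import difflib
from datetime import datetime
from typing import Optional

def name_matches(spoken: str, on_file: str, threshold: float = 0.8) -> bool:
    """Fuzzy full-name check (rapidfuzz would do real token-aware scoring)."""
    ratio = difflib.SequenceMatcher(
        None, spoken.lower().strip(), on_file.lower().strip()
    ).ratio()
    return ratio >= threshold

def parse_dob(text: str) -> Optional[str]:
    """DOB parsing (dateparser would handle far more natural phrasings)."""
    for fmt in ("%Y-%m-%d", "%B %d %Y", "%d/%m/%Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

The fuzzy threshold lets small STT errors ("Jon Smith" for "John Smith") still verify, while clearly different names fail.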
## System Architecture

- User opens the app in a browser, clicks Start Call, allows the mic, hears the welcome TTS, and speaks; audio is streamed via WebSocket.
- FastAPI serves the static frontend and WebSocket `/ws`; on connect it sends the welcome TTS; it receives binary chunks; on 1.2 s silence or an `end_utterance` JSON message it converts webm → PCM, runs STT, and calls the orchestrator.
- Orchestrator (`conversation/orchestrator.py`): `process_turn(audio_bytes, session)` → STT → intent/state (via agent) → TTS → returns `(audio_bytes, updated_session, transcript_masked)`. Enforces lockout when `retry_count >= 2`.
- Agent (`agent/agent.py`): intent rules per state; DB helpers (`data/db.py`); Phase 1 prompts from config; Phase 2 tool-calling (get_balance, get_last_transaction, get_credit_score, get_credit_limit_left, get_account_status). Optional: Ollama for freeform.
- SQLite — `data/finance.db` (customers, cards, transactions). No file storage for audio in the normal flow.
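The orchestrator's turn loop can be sketched as a small skeleton. This is a hedged illustration of the `process_turn` flow described above, not the real `conversation/orchestrator.py`; the `stt`, `run_agent`, and `tts` parameters stand in for the actual modules:

```python
from dataclasses import dataclass

MAX_RETRIES = 2  # verification lockout threshold from the flow above

@dataclass
class Session:
    state: str = "welcome"
    language: str = "en"
    retry_count: int = 0

def process_turn(audio_bytes, session, stt, run_agent, tts):
    """Skeleton of one turn: STT -> agent -> lockout check -> TTS."""
    transcript = stt(audio_bytes)
    reply, session = run_agent(session, transcript)
    if session.retry_count >= MAX_RETRIES:   # enforce lockout
        reply, session.state = "Please call back later.", "end"
    return tts(reply, session.language), session, transcript
```

In the real system the returned transcript is masked before logging, and the audio bytes are streamed back over the WebSocket.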
## Architecture Diagrams

```mermaid
flowchart LR
subgraph Browser [Browser]
UI[UI]
MR[MediaRecorder]
WSC[WebSocket Client]
AC[AudioContext]
end
subgraph Backend [FastAPI Backend]
WSS[WebSocket Server]
Buf[Audio Buffer]
Convert[webm to PCM 16kHz]
STT[Faster-Whisper]
Orch[Orchestrator]
Agent[Agent]
TTS[Piper / edge-tts]
DB[(SQLite)]
end
UI -->|Start Call| WSC
MR -->|Chunks| WSC
WSC <-->|Binary + JSON| WSS
WSS --> Buf
Buf --> Convert
Convert --> STT
STT --> Orch
Orch --> Agent
Agent --> DB
Orch --> TTS
TTS -->|Bytes| WSS
WSS --> WSC
WSC --> AC
AC --> User[User]
```
```mermaid
flowchart TB
subgraph Presentation [Presentation Layer]
Frontend[Frontend SPA]
end
subgraph Transport [Transport Layer]
WebSocket[FastAPI WebSocket]
end
subgraph Orchestration [Orchestration Layer]
Orchestrator[Orchestrator]
StateMachine[State Machine]
Session[Session Data]
end
subgraph AgentLayer [Agent and Intent Layer]
Intent[Intent Rules]
Agent[Agent]
end
subgraph Data [Data Layer]
SQLite[(SQLite)]
end
subgraph Models [Models Layer]
STT[STT]
TTS[TTS]
end
Frontend --> WebSocket
WebSocket --> Orchestrator
Orchestrator --> StateMachine
Orchestrator --> Session
Orchestrator --> STT
Orchestrator --> TTS
Orchestrator --> Intent
Orchestrator --> Agent
Intent --> Agent
Agent --> SQLite
```
```mermaid
flowchart TB
User[User message] --> Frontend[Frontend]
Frontend -->|Binary chunks| WS[WebSocket]
WS --> Buffer[Buffer until 1.2s silence or end_utterance]
Buffer --> Convert[audio/convert: webm to PCM 16kHz mono]
Convert --> STT[stt: transcribe]
STT --> Orch[orchestrator: process_turn]
Orch --> Agent[agent: run_agent]
Agent --> Intent[intent/rules]
Agent --> State[state_machine]
Agent --> DB[(data/db)]
Agent --> Reply[reply_text]
Orch --> Retry[retry or lockout if verify fail]
Reply --> Retry
Retry --> TTS[tts: synthesize]
TTS --> Send[Send bytes + JSON to client]
Send --> Frontend
Frontend --> Play[AudioContext queue play]
Play --> User
```
## Conversation Flow

```mermaid
flowchart TD
A[Open app] --> B[Click Start Call]
B --> C[Allow mic]
C --> D[Hear welcome TTS]
D --> E[Speak]
E --> F[Backend: buffer then STT then orchestrator then TTS]
F --> G[Hear response]
G --> H{End call?}
H -->|No| E
H -->|Yes| I[Click End Call]
I --> J[Hear goodbye]
```
```mermaid
stateDiagram-v2
[*] --> welcome
welcome --> full_name: language en/fr
full_name --> account_type: name matched
account_type --> verify_number: debit/credit
verify_number --> verify_dob: last4 matched
verify_dob --> menu: DOB matched
menu --> action: balance/transaction/score/status/other
action --> menu_again: TTS result
menu_again --> action: same options
menu_again --> end_state: end_call
menu --> end_state: end_call
verify_number --> verify_number: retry
verify_dob --> verify_dob: retry
verify_number --> end_state: lockout
verify_dob --> end_state: lockout
end_state --> [*]
```
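The state diagram above maps naturally to a transition table. This is a hypothetical sketch of what `conversation/state_machine.py` might look like (the event names are assumptions derived from the diagram labels, not the project's actual identifiers):

```python
# Transition table keyed by (state, event), mirroring the diagram above.
TRANSITIONS = {
    ("welcome", "language_ok"): "full_name",
    ("full_name", "name_matched"): "account_type",
    ("account_type", "card_type_ok"): "verify_number",
    ("verify_number", "last4_matched"): "verify_dob",
    ("verify_number", "retry"): "verify_number",
    ("verify_number", "lockout"): "end_state",
    ("verify_dob", "dob_matched"): "menu",
    ("verify_dob", "retry"): "verify_dob",
    ("verify_dob", "lockout"): "end_state",
    ("menu", "action_chosen"): "action",
    ("menu", "end_call"): "end_state",
    ("action", "tts_done"): "menu_again",
    ("menu_again", "action_chosen"): "action",
    ("menu_again", "end_call"): "end_state",
}

def next_state(state: str, event: str) -> str:
    # Unknown events keep the caller in the current state (re-prompt).
    return TRANSITIONS.get((state, event), state)
```

A table-driven machine keeps the authentication flow deterministic and auditable: every reachable state and event pair is visible in one place.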
```mermaid
flowchart LR
Auth[Authenticated] --> Menu[Menu: balance, last transaction, credit score, limit, status, other]
Menu --> Tool[Tool: get_balance, get_last_transaction, etc.]
Tool --> DB[(SQLite)]
DB --> Reply[TTS reply]
Reply --> Menu
Menu --> End[End call]
```
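The Phase 2 tool step above is essentially a name-to-function dispatch. This illustrative sketch uses a dict in place of the SQLite card row and simplified reply strings; the tool names follow the diagram, but the wording and signatures are assumptions:

```python
# A dict stands in for the card row the real agent reads from SQLite.
CARD = {"balance": 1250.50, "credit_limit": 5000.0, "status": "active"}

TOOLS = {
    "get_balance": lambda card: f"Your balance is ${card['balance']:.2f}.",
    "get_account_status": lambda card: f"Your account is {card['status']}.",
    "get_credit_limit_left": lambda card: (
        f"You have ${card['credit_limit'] - card['balance']:.2f} left."
    ),
}

def run_tool(name: str, card: dict) -> str:
    """Dispatch a Phase 2 tool by name; unknown names get a fallback reply."""
    tool = TOOLS.get(name)
    return tool(card) if tool else "Sorry, I can't help with that."
```

The returned string is what would be handed to TTS, closing the Menu → Tool → DB → Reply loop in the diagram.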
## Data Flow and Orchestration

```mermaid
sequenceDiagram
participant F as Frontend
participant WS as WebSocket
participant Main as Backend
participant Convert as Convert
participant STT as STT
participant Orch as Orchestrator
participant Agent as Agent
participant DB as SQLite
participant TTS as TTS
F->>WS: Binary audio chunks
F->>WS: end_utterance (optional)
WS->>Main: Buffer then 1.2s silence or end_utterance
Main->>Convert: webm_to_pcm_16k_mono / webm_to_wav_16k_mono
Convert->>STT: PCM 16kHz mono bytes
STT->>STT: transcribe
STT->>Orch: transcript
Orch->>Orch: mask_pan(transcript)
Orch->>Agent: run_agent(state, transcript, session)
Agent->>Agent: intent rules per state
Agent->>DB: get_customer_by_name_and_dob, get_card_*, get_last_transaction_for_card, etc.
DB->>Agent: row(s)
Agent->>Orch: reply_text, session_dict, next_state
Orch->>Orch: retry/lockout if verify_number or verify_dob failed
Orch->>Orch: get prompt if reply_text empty
Orch->>TTS: synthesize(reply_text, lang)
TTS->>Orch: audio bytes
Orch->>WS: audio bytes then JSON state/phase/transcript
WS->>F: Binary then JSON
F->>F: Audio queue decode and play
```
Steps in short:

- The frontend sends binary audio chunks over WebSocket; optionally `{"type": "end_utterance"}` to flush the buffer.
- The backend buffers chunks; after 1.2 s silence or `end_utterance`, it converts webm → PCM 16 kHz mono via `audio/convert.py`.
- STT (`stt/stt.py`): `transcribe(audio_bytes)` → transcript; the transcript is masked for logs.
- The orchestrator loads prompts by language and calls the agent: `run_agent(state, transcript, session)`.
- The agent runs the intent parser for the current state (`parse_language`, `parse_full_name`, `parse_card_type`, `parse_four_digits`, `parse_dob`, `parse_menu_debit`/`parse_menu_credit`, `parse_end_call`); updates the session; on verification failure it increments `retry_count` — if `retry_count >= 2` it sets `state=end` (lockout).
- Phase 1: next prompt from config or a DB-backed reply; Phase 2: tool-calling (get_balance, get_last_transaction, etc.).
- TTS: `synthesize(reply_text, session.language)` → WAV or MP3 bytes.
- The backend sends the audio bytes, then JSON `{ state, phase, transcript, error }` to the client.
- The frontend pushes decoded audio to a queue and plays it sequentially (no overlap).
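The buffering rule in the steps above (flush on 1.2 s of silence or an explicit `end_utterance`) can be sketched as a small helper. This is an illustration of the policy, not the project's actual backend code; the class name and clock-injection style are assumptions made for testability:

```python
import time

SILENCE_SEC = 1.2  # end-of-utterance threshold described above

class UtteranceBuffer:
    """Collect audio chunks; flush after 1.2 s without new audio
    or on an explicit end_utterance message."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._chunks = []
        self._last_chunk_at = None

    def add_chunk(self, chunk):
        self._chunks.append(chunk)
        self._last_chunk_at = self._clock()

    def should_flush(self, explicit_end=False):
        if not self._chunks:
            return False
        return explicit_end or (self._clock() - self._last_chunk_at >= SILENCE_SEC)

    def flush(self):
        data, self._chunks = b"".join(self._chunks), []
        return data
```

Injecting the clock makes the silence timeout deterministic to test, and the flushed bytes are what would be handed to the webm → PCM converter.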
## Project Structure

```
ai-banking-agent/
├── README.md
├── requirements.txt
├── .env.example
├── .gitignore
├── main.py                        # FastAPI app + WebSocket /ws
├── config/
│   ├── prompts.json               # EN prompts per state/menu
│   └── prompts_fr.json            # FR prompts
├── data/
│   ├── __init__.py
│   ├── db.py                      # SQLite helpers (customers, cards, transactions)
│   └── finance.db                 # (gitignore) SQLite DB
├── conversation/
│   ├── __init__.py
│   ├── state_machine.py           # States, next_state
│   ├── session_data.py            # Session dataclass (phase, state, language, customer_id, card_id, etc.)
│   └── orchestrator.py            # process_turn: STT → agent → TTS
├── intent/
│   ├── __init__.py
│   └── rules.py                   # parse_language, parse_full_name, parse_card_type, parse_four_digits, parse_dob, parse_menu_*, parse_end_call
├── agent/
│   ├── __init__.py
│   └── agent.py                   # Ollama + DB; Phase 1 & Phase 2
├── stt/
│   ├── __init__.py
│   └── stt.py                     # transcribe(audio_bytes) in-memory, Faster-Whisper
├── tts/
│   ├── __init__.py
│   └── tts.py                     # synthesize(text, lang) → bytes (Piper / edge-tts)
├── audio/
│   ├── __init__.py
│   └── convert.py                 # webm/opus → PCM 16 kHz mono
├── utils/
│   ├── __init__.py
│   ├── mask.py                    # mask_pan() for logs
│   └── logger.py
├── scripts/
│   ├── init_db.py                 # Create tables (optional --drop)
│   └── generate_synthetic_data.py # Seed customers, cards, transactions
├── static/
│   ├── index.html                 # Vaulta AI UI
│   ├── app.js                     # MediaRecorder, WebSocket, AudioContext queue
│   └── styles.css
├── voices/                        # Piper voice files (optional)
│   ├── en_US-lessac-medium.onnx (+ .json)
│   └── fr_FR-siwis-medium.onnx (+ .json)
├── tests/
│   └── test_e2e_voice.py          # E2E / voice tests
└── docs/
    └── scripted_flows.md          # Demo identities and exact phrases (UC1–UC8)
```
## Data Strategy

| Data | Source | Location | How created |
|---|---|---|---|
| SQLite | Synthetic + demo | `data/finance.db` | `python scripts/init_db.py` then `python scripts/generate_synthetic_data.py`. |
| Demo users | See scripted flows | Same DB | John Smith (DOB 1990-03-15, credit last4 1234), Jane Doe (DOB 1985-07-22, debit last4 5678), etc. |

- Tables: `customers(customer_id, name, dob, phone_last4)`, `cards(card_id, customer_id, card_last4, card_type, status, balance, credit_limit, credit_score, …)`, `transactions(transaction_id, card_id, merchant, amount, date, status)`.
- Helpers in `data/db.py`: `get_customer_by_name_and_dob`, `get_cards_by_customer_id`, `get_card_by_type_last4_and_customer`, `get_last_transaction_for_card`, `get_credit_limit_left`, etc.
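Two of the helpers listed above can be sketched against an in-memory database with the table shapes just described. This is illustrative, not the project's actual `data/db.py` (column subsets and the demo rows are assumptions; John Smith's DOB matches the demo user above):

```python
import sqlite3

# In-memory stand-in for data/finance.db with a reduced schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, dob TEXT);
CREATE TABLE transactions (transaction_id INTEGER PRIMARY KEY, card_id INTEGER,
                           merchant TEXT, amount REAL, date TEXT);
INSERT INTO customers VALUES (1, 'John Smith', '1990-03-15');
INSERT INTO transactions VALUES (1, 1, 'Grocery Mart', -42.10, '2026-01-02');
INSERT INTO transactions VALUES (2, 1, 'Coffee Shop', -4.75, '2026-01-09');
""")

def get_customer_by_name_and_dob(name, dob):
    """Case-insensitive exact lookup; the real helper may fuzzy-match."""
    return conn.execute(
        "SELECT customer_id, name FROM customers "
        "WHERE lower(name) = lower(?) AND dob = ?", (name, dob)).fetchone()

def get_last_transaction_for_card(card_id):
    """Most recent transaction for a card, by date."""
    return conn.execute(
        "SELECT merchant, amount FROM transactions "
        "WHERE card_id = ? ORDER BY date DESC LIMIT 1", (card_id,)).fetchone()
```

Parameterized queries (the `?` placeholders) keep spoken, STT-derived input from ever being interpolated into SQL.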
## Demo

Below are snapshots of Vaulta AI: the chat UI and the demo video.
Demo video:
-
Clone the repository
git clone <your-repo-url> cd ai-banking-agent
-
Create virtual environment and install dependencies
requirements.txtis minimal: only packages used by the app, scripts, and tests (FastAPI, Faster-Whisper, edge-tts, pydub, imageio-ffmpeg, rapidfuzz, dateparser, Faker, pytest, etc.).python -m venv venv .\venv\Scripts\Activate.ps1 # Windows # source venv/bin/activate # Linux/macOS pip install --upgrade pip pip install -r requirements.txt
Audio (WebM→PCM): The app uses PyAV first, then pydub and imageio-ffmpeg (bundled ffmpeg) so conversion works without ffmpeg on PATH. If you still see "Audio conversion failed", install ffmpeg and add it to your PATH.
-
Environment
- Copy
.env.exampleto.env. - Set (defaults shown):
OLLAMA_HOST=http://localhost:11434DB_PATH=data/finance.dbPIPER_VOICE_DIR=voicesPIPER_VOICE_DIR_FR=voices
- Copy
-
Ollama (optional; for Phase 2 freeform)
ollama serve ollama pull llama3.2
-
Database
python scripts/init_db.py --drop python scripts/generate_synthetic_data.py
-
Piper (optional)
TTS uses edge-tts by default (included in requirements). For local Piper TTS, install Piper on your PATH and place voice files invoices/(see project structure). Thevoices/folder is in.gitignore. -
ffmpeg
For webm → PCM conversion, the app uses imageio-ffmpeg (bundled in requirements) when ffmpeg is not on system PATH. You can also install ffmpeg system-wide.
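The ffmpeg-resolution order described above (system binary first, then the imageio-ffmpeg bundle) can be sketched like this. The function name is hypothetical; `imageio_ffmpeg.get_ffmpeg_exe()` is the library's real entry point:

```python
import shutil

def resolve_ffmpeg():
    """Return a usable ffmpeg path: system binary first, then the one
    bundled with imageio-ffmpeg, else None."""
    system = shutil.which("ffmpeg")
    if system:
        return system
    try:
        import imageio_ffmpeg  # ships a prebuilt ffmpeg in its wheel
        return imageio_ffmpeg.get_ffmpeg_exe()
    except (ImportError, RuntimeError):
        # Not installed, or no binary available for this platform.
        return None
```

A fallback chain like this is why the install instructions do not strictly require a system-wide ffmpeg.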
## Configuration

Configuration is via `.env` (see `.env.example`). No cloud API keys are required for the core flow.
| Variable | Description | Example |
|---|---|---|
| OLLAMA_HOST | Ollama server URL | http://localhost:11434 |
| DB_PATH | SQLite database path | data/finance.db |
| PIPER_VOICE_DIR | Directory with Piper EN voice | voices |
| PIPER_VOICE_DIR_FR | Directory with Piper FR voice | voices |
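The variables above are typically consumed like this (python-dotenv would load the `.env` file first; `os.getenv` supplies the documented defaults). A minimal sketch, not the project's actual config module:

```python
import os

# Defaults mirror the table above; a prior load_dotenv() call would
# populate os.environ from .env in the real app.
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://localhost:11434")
DB_PATH = os.getenv("DB_PATH", "data/finance.db")
PIPER_VOICE_DIR = os.getenv("PIPER_VOICE_DIR", "voices")
PIPER_VOICE_DIR_FR = os.getenv("PIPER_VOICE_DIR_FR", "voices")
```

Keeping every external location behind an environment variable means the local-first stack can be repointed (e.g. a remote Ollama host) without code changes.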
## Usage

1. **Start the server**

   ```bash
   uvicorn main:app --reload --host 0.0.0.0 --port 8000
   ```

2. Open the app in your browser (e.g. `http://localhost:8000`).
3. Click **Start Call** and allow microphone access; the welcome TTS plays automatically.
4. **Phase 1:** say, for example, "English" → "John Smith" → "Credit" → "One two three four" (or "1234") → "March 15 1990" → then choose "Balance", "Credit score", "Last transaction", etc., depending on the menu.
5. **Phase 2:** after "You're authenticated. How can I help you today?", ask for your balance or last transaction, or ask freeform questions; say "End call" when done.
6. Click **End Call** to disconnect.
## Testing

- Automated tests:

  ```bash
  pytest tests/ -v
  ```

- Manual E2E: run the server and complete UC1–UC8 from `docs/scripted_flows.md` (e.g. happy-path EN credit balance, French, debit last transaction, verification failure).
- Note: chunked voice streaming uses turn-based processing — the backend buffers until ~1.2 s of silence or `end_utterance`, then runs STT and responds. This is not true real-time streaming transcription.
## License

This project is licensed under the MIT License.
## FAQs

Q: What is Vaulta AI?
A: Vaulta AI is a real-time, voice-driven banking assistant. You start a call, speak, and the assistant guides you through authentication (Phase 1), then answers balance, transaction, credit score, and freeform questions (Phase 2) using SQLite and a local LLM (Ollama). All core processing runs locally; the optional edge-tts fallback is the only cloud dependency.
Q: Is internet required?
A: For Piper + Faster-Whisper + Ollama, no. If you use the edge-tts fallback for TTS, that uses a Microsoft API and requires internet.
Q: Why do I not see WAV files or step numbers?
A: By design: no file save for the user, no download links, no step labels in the UI — only Start Call / End Call and status.
Q: What happens after two verification failures?
A: The session enters lockout: the assistant says "Please call back later." and ends the call (state=end).
Q: How do I add more demo users?
A: Edit scripts/generate_synthetic_data.py to add customers/cards/transactions, or insert directly into data/finance.db, and document them in docs/scripted_flows.md.
## Contact Information

- Email: vidhithakkar.ca@gmail.com
- LinkedIn: Vidhi Thakkar