# Vaulta AI

Vaulta AI is a hybrid voice authentication and AI service system designed to simulate a secure, real-world digital banking assistant.
## Table of Contents

- Project Description
- Features
- Technology Stack
- System Architecture
- Architecture Diagrams
- Conversation Flow
- Data Flow and Orchestration
- Project Structure
- Data Strategy
- Demo
- Installation
- Configuration
- Usage
- Testing
- License
- FAQs
- Contact Information
## Project Description

Vaulta AI is a real-time, voice-enabled banking agent that combines secure authentication with intelligent financial assistance. The system uses a hybrid architecture that begins with a deterministic authentication state machine, verifying user identity through guided voice interaction before transitioning into an AI-powered conversational service mode.
Built using WebSocket-based audio streaming, speech-to-text processing, and local language models, Vaulta AI simulates a modern digital banking call center experience. After successful verification, users can interact naturally to check balances, review transactions, or perform account-related actions.
The project demonstrates secure AI system design principles by separating authentication logic from conversational intelligence, implementing retry limits, input validation, and session-based state management. It is designed as a technical showcase of how voice interfaces, AI agents, and financial security layers can be integrated into a cohesive, production-inspired architecture.
## Features

- Chunked voice streaming — Browser streams audio chunks to the backend; the backend buffers until end-of-utterance (1.2 s silence or explicit end); STT runs in memory; no WAV files are written in the normal flow.
- Phase 1 guided authentication — Deterministic flow: language, full name, card type, last 4 digits, DOB; intent rules (`parse_language`, `parse_full_name`, `parse_card_type`, `parse_four_digits`, `parse_dob`, `parse_menu_*`, `parse_end_call`); verification lockout after 2 failures ("Please call back later.").
- Phase 2 open conversation — Balance, last transaction, credit score, credit limit left, account status from SQLite; freeform questions via Ollama with conversation memory + DB context.
- Bilingual (EN/FR) — Prompts in English or French from `config/prompts.json` and `config/prompts_fr.json`.
- Sleek UI (Vaulta AI) — Single page: Start Call / End Call, MediaRecorder, WebSocket, AudioContext with an audio queue to prevent overlap; dark theme; no step labels or file links.
- Log masking — Card-like numbers masked in logs (e.g. `****1234`) via `utils/mask.py`.
- Local-first — Faster-Whisper, Piper (or edge-tts fallback), Ollama, SQLite; no OpenAI or cloud APIs.
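The log-masking feature can be sketched as follows. This is a minimal illustration, not the project's actual `utils/mask.py`; the `mask_pan` name comes from the project structure, but the regex is an assumption:

```python
import re

# Matches 13-19 digit card-like numbers, allowing optional spaces or
# dashes between digits (pattern is an assumption, not the real one).
_CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")

def mask_pan(text: str) -> str:
    """Replace card-like numbers with ****<last4> for safe logging."""
    def _mask(match):
        digits = re.sub(r"\D", "", match.group(0))
        return "****" + digits[-4:]
    return _CARD_RE.sub(_mask, text)
```

Short runs of digits (such as a spoken "1234" on its own) are left untouched, so only full card numbers are redacted.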
## Technology Stack

| Component | Technology |
|---|---|
| UI | HTML5 + vanilla JS — single page: MediaRecorder, WebSocket, AudioContext (audio queue); Start/End Call; sleek dark theme. |
| Backend | FastAPI — static frontend + WebSocket `/ws`; receives binary chunks; sends TTS bytes + JSON (state, transcript). |
| Transport | WebSocket (binary + JSON) — browser ↔ FastAPI; audio chunks in; audio bytes + events out. |
| STT | Faster-Whisper — in-memory transcription; input PCM 16 kHz mono (see Audio conversion). |
| Audio conversion | pydub + imageio-ffmpeg (or system ffmpeg) — converts browser webm/opus chunks to PCM 16 kHz mono before STT. Optional: PyAV for decoding without ffmpeg. |
| TTS | Piper (local) — `synthesize(text, lang)` → WAV bytes; edge-tts fallback when Piper voices are not available. |
| LLM | Ollama (local) — Phase 2 freeform and optional extraction; no API key. |
| Data | SQLite — customers, cards, transactions; seeded via `scripts/init_db.py` and `scripts/generate_synthetic_data.py`. |
| Security | Log masking (`utils/mask.py`) — no full PAN in logs. |

Supporting libraries:

| Item | Purpose |
|---|---|
| python-dotenv | `.env` for `OLLAMA_HOST`, `DB_PATH`, `PIPER_VOICE_DIR`. |
| rapidfuzz | Fuzzy name matching for the Phase 1 full name. |
| dateparser | Natural-language and numeric date parsing for DOB. |
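The two supporting libraries above can be approximated with the standard library to show what they do in Phase 1. This sketch uses `difflib` as a stand-in for rapidfuzz and `strptime` as a stand-in for dateparser, so it runs without extra dependencies; the threshold and date formats are illustrative assumptions:

```python
import difflib
from datetime import datetime
from typing import Optional

def name_matches(spoken: str, on_file: str, threshold: float = 0.8) -> bool:
    """Fuzzy full-name check (rapidfuzz would do real token-aware scoring)."""
    ratio = difflib.SequenceMatcher(
        None, spoken.lower().strip(), on_file.lower().strip()
    ).ratio()
    return ratio >= threshold

def parse_dob(text: str) -> Optional[str]:
    """DOB parsing (dateparser would handle far more natural phrasings)."""
    for fmt in ("%Y-%m-%d", "%B %d %Y", "%d/%m/%Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

The fuzzy threshold lets small STT errors ("Jon Smith" for "John Smith") still verify, while clearly different names fail.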
## System Architecture

- User opens the app in a browser, clicks Start Call, allows the mic, hears the welcome TTS, and speaks; audio is streamed via WebSocket.
- FastAPI serves the static frontend and WebSocket `/ws`; on connect it sends the welcome TTS; it receives binary chunks; on 1.2 s silence or an `end_utterance` JSON message it converts webm → PCM, runs STT, and calls the orchestrator.
- Orchestrator (`conversation/orchestrator.py`): `process_turn(audio_bytes, session)` → STT → intent/state (via agent) → TTS → returns `(audio_bytes, updated_session, transcript_masked)`. Enforces lockout when `retry_count >= 2`.
- Agent (`agent/agent.py`): intent rules per state; DB helpers (`data/db.py`); Phase 1 prompts from config; Phase 2 tool-calling (get_balance, get_last_transaction, get_credit_score, get_credit_limit_left, get_account_status). Optional: Ollama for freeform.
- SQLite — `data/finance.db` (customers, cards, transactions). No file storage for audio in the normal flow.
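The orchestrator's turn loop can be sketched as a small skeleton. This is a hedged illustration of the `process_turn` flow described above, not the real `conversation/orchestrator.py`; the `stt`, `run_agent`, and `tts` parameters stand in for the actual modules:

```python
from dataclasses import dataclass

MAX_RETRIES = 2  # verification lockout threshold from the flow above

@dataclass
class Session:
    state: str = "welcome"
    language: str = "en"
    retry_count: int = 0

def process_turn(audio_bytes, session, stt, run_agent, tts):
    """Skeleton of one turn: STT -> agent -> lockout check -> TTS."""
    transcript = stt(audio_bytes)
    reply, session = run_agent(session, transcript)
    if session.retry_count >= MAX_RETRIES:   # enforce lockout
        reply, session.state = "Please call back later.", "end"
    return tts(reply, session.language), session, transcript
```

In the real system the returned transcript is masked before logging, and the audio bytes are streamed back over the WebSocket.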
## Architecture Diagrams

```mermaid
flowchart LR
subgraph Browser [Browser]
UI[UI]
MR[MediaRecorder]
WSC[WebSocket Client]
AC[AudioContext]
end
subgraph Backend [FastAPI Backend]
WSS[WebSocket Server]
Buf[Audio Buffer]
Convert[webm to PCM 16kHz]
STT[Faster-Whisper]
Orch[Orchestrator]
Agent[Agent]
TTS[Piper / edge-tts]
DB[(SQLite)]
end
UI -->|Start Call| WSC
MR -->|Chunks| WSC
WSC <-->|Binary + JSON| WSS
WSS --> Buf
Buf --> Convert
Convert --> STT
STT --> Orch
Orch --> Agent
Agent --> DB
Orch --> TTS
TTS -->|Bytes| WSS
WSS --> WSC
WSC --> AC
AC --> User[User]
```
```mermaid
flowchart TB
subgraph Presentation [Presentation Layer]
Frontend[Frontend SPA]
end
subgraph Transport [Transport Layer]
WebSocket[FastAPI WebSocket]
end
subgraph Orchestration [Orchestration Layer]
Orchestrator[Orchestrator]
StateMachine[State Machine]
Session[Session Data]
end
subgraph AgentLayer [Agent and Intent Layer]
Intent[Intent Rules]
Agent[Agent]
end
subgraph Data [Data Layer]
SQLite[(SQLite)]
end
subgraph Models [Models Layer]
STT[STT]
TTS[TTS]
end
Frontend --> WebSocket
WebSocket --> Orchestrator
Orchestrator --> StateMachine
Orchestrator --> Session
Orchestrator --> STT
Orchestrator --> TTS
Orchestrator --> Intent
Orchestrator --> Agent
Intent --> Agent
Agent --> SQLite
```
```mermaid
flowchart TB
User[User message] --> Frontend[Frontend]
Frontend -->|Binary chunks| WS[WebSocket]
WS --> Buffer[Buffer until 1.2s silence or end_utterance]
Buffer --> Convert[audio/convert: webm to PCM 16kHz mono]
Convert --> STT[stt: transcribe]
STT --> Orch[orchestrator: process_turn]
Orch --> Agent[agent: run_agent]
Agent --> Intent[intent/rules]
Agent --> State[state_machine]
Agent --> DB[(data/db)]
Agent --> Reply[reply_text]
Orch --> Retry[retry or lockout if verify fail]
Reply --> Retry
Retry --> TTS[tts: synthesize]
TTS --> Send[Send bytes + JSON to client]
Send --> Frontend
Frontend --> Play[AudioContext queue play]
Play --> User
```
## Conversation Flow

```mermaid
flowchart TD
A[Open app] --> B[Click Start Call]
B --> C[Allow mic]
C --> D[Hear welcome TTS]
D --> E[Speak]
E --> F[Backend: buffer then STT then orchestrator then TTS]
F --> G[Hear response]
G --> H{End call?}
H -->|No| E
H -->|Yes| I[Click End Call]
I --> J[Hear goodbye]
```
```mermaid
stateDiagram-v2
[*] --> welcome
welcome --> full_name: language en/fr
full_name --> account_type: name matched
account_type --> verify_number: debit/credit
verify_number --> verify_dob: last4 matched
verify_dob --> menu: DOB matched
menu --> action: balance/transaction/score/status/other
action --> menu_again: TTS result
menu_again --> action: same options
menu_again --> end_state: end_call
menu --> end_state: end_call
verify_number --> verify_number: retry
verify_dob --> verify_dob: retry
verify_number --> end_state: lockout
verify_dob --> end_state: lockout
end_state --> [*]
```
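The state diagram above maps naturally to a transition table. This is a hypothetical sketch of what `conversation/state_machine.py` might look like (the event names are assumptions derived from the diagram labels, not the project's actual identifiers):

```python
# Transition table keyed by (state, event), mirroring the diagram above.
TRANSITIONS = {
    ("welcome", "language_ok"): "full_name",
    ("full_name", "name_matched"): "account_type",
    ("account_type", "card_type_ok"): "verify_number",
    ("verify_number", "last4_matched"): "verify_dob",
    ("verify_number", "retry"): "verify_number",
    ("verify_number", "lockout"): "end_state",
    ("verify_dob", "dob_matched"): "menu",
    ("verify_dob", "retry"): "verify_dob",
    ("verify_dob", "lockout"): "end_state",
    ("menu", "action_chosen"): "action",
    ("menu", "end_call"): "end_state",
    ("action", "tts_done"): "menu_again",
    ("menu_again", "action_chosen"): "action",
    ("menu_again", "end_call"): "end_state",
}

def next_state(state: str, event: str) -> str:
    # Unknown events keep the caller in the current state (re-prompt).
    return TRANSITIONS.get((state, event), state)
```

A table-driven machine keeps the authentication flow deterministic and auditable: every reachable state and event pair is visible in one place.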
```mermaid
flowchart LR
Auth[Authenticated] --> Menu[Menu: balance, last transaction, credit score, limit, status, other]
Menu --> Tool[Tool: get_balance, get_last_transaction, etc.]
Tool --> DB[(SQLite)]
DB --> Reply[TTS reply]
Reply --> Menu
Menu --> End[End call]
```
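The Phase 2 tool step above is essentially a name-to-function dispatch. This illustrative sketch uses a dict in place of the SQLite card row and simplified reply strings; the tool names follow the diagram, but the wording and signatures are assumptions:

```python
# A dict stands in for the card row the real agent reads from SQLite.
CARD = {"balance": 1250.50, "credit_limit": 5000.0, "status": "active"}

TOOLS = {
    "get_balance": lambda card: f"Your balance is ${card['balance']:.2f}.",
    "get_account_status": lambda card: f"Your account is {card['status']}.",
    "get_credit_limit_left": lambda card: (
        f"You have ${card['credit_limit'] - card['balance']:.2f} left."
    ),
}

def run_tool(name: str, card: dict) -> str:
    """Dispatch a Phase 2 tool by name; unknown names get a fallback reply."""
    tool = TOOLS.get(name)
    return tool(card) if tool else "Sorry, I can't help with that."
```

The returned string is what would be handed to TTS, closing the Menu → Tool → DB → Reply loop in the diagram.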
## Data Flow and Orchestration

```mermaid
sequenceDiagram
participant F as Frontend
participant WS as WebSocket
participant Main as Backend
participant Convert as Convert
participant STT as STT
participant Orch as Orchestrator
participant Agent as Agent
participant DB as SQLite
participant TTS as TTS
F->>WS: Binary audio chunks
F->>WS: end_utterance (optional)
WS->>Main: Buffer then 1.2s silence or end_utterance
Main->>Convert: webm_to_pcm_16k_mono / webm_to_wav_16k_mono
Convert->>STT: PCM 16kHz mono bytes
STT->>STT: transcribe
STT->>Orch: transcript
Orch->>Orch: mask_pan(transcript)
Orch->>Agent: run_agent(state, transcript, session)
Agent->>Agent: intent rules per state
Agent->>DB: get_customer_by_name_and_dob, get_card_*, get_last_transaction_for_card, etc.
DB->>Agent: row(s)
Agent->>Orch: reply_text, session_dict, next_state
Orch->>Orch: retry/lockout if verify_number or verify_dob failed
Orch->>Orch: get prompt if reply_text empty
Orch->>TTS: synthesize(reply_text, lang)
TTS->>Orch: audio bytes
Orch->>WS: audio bytes then JSON state/phase/transcript
WS->>F: Binary then JSON
F->>F: Audio queue decode and play
```
Steps in short:

- The frontend sends binary audio chunks over WebSocket; optionally `{"type": "end_utterance"}` to flush the buffer.
- The backend buffers chunks; after 1.2 s silence or `end_utterance`, it converts webm → PCM 16 kHz mono via `audio/convert.py`.
- STT (`stt/stt.py`): `transcribe(audio_bytes)` → transcript; the transcript is masked for logs.
- The orchestrator loads prompts by language and calls the agent: `run_agent(state, transcript, session)`.
- The agent runs the intent parser for the current state (`parse_language`, `parse_full_name`, `parse_card_type`, `parse_four_digits`, `parse_dob`, `parse_menu_debit`/`parse_menu_credit`, `parse_end_call`); updates the session; on verification failure it increments `retry_count` — if `retry_count >= 2` it sets `state=end` (lockout).
- Phase 1: next prompt from config or a DB-backed reply; Phase 2: tool-calling (get_balance, get_last_transaction, etc.).
- TTS: `synthesize(reply_text, session.language)` → WAV or MP3 bytes.
- The backend sends the audio bytes, then JSON `{ state, phase, transcript, error }` to the client.
- The frontend pushes decoded audio to a queue and plays it sequentially (no overlap).
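The buffering rule in the steps above (flush on 1.2 s of silence or an explicit `end_utterance`) can be sketched as a small helper. This is an illustration of the policy, not the project's actual backend code; the class name and clock-injection style are assumptions made for testability:

```python
import time

SILENCE_SEC = 1.2  # end-of-utterance threshold described above

class UtteranceBuffer:
    """Collect audio chunks; flush after 1.2 s without new audio
    or on an explicit end_utterance message."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._chunks = []
        self._last_chunk_at = None

    def add_chunk(self, chunk):
        self._chunks.append(chunk)
        self._last_chunk_at = self._clock()

    def should_flush(self, explicit_end=False):
        if not self._chunks:
            return False
        return explicit_end or (self._clock() - self._last_chunk_at >= SILENCE_SEC)

    def flush(self):
        data, self._chunks = b"".join(self._chunks), []
        return data
```

Injecting the clock makes the silence timeout deterministic to test, and the flushed bytes are what would be handed to the webm → PCM converter.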
## Project Structure

```
ai-banking-agent/
├── README.md
├── requirements.txt
├── .env.example
├── .gitignore
├── main.py                        # FastAPI app + WebSocket /ws
├── config/
│   ├── prompts.json               # EN prompts per state/menu
│   └── prompts_fr.json            # FR prompts
├── data/
│   ├── __init__.py
│   ├── db.py                      # SQLite helpers (customers, cards, transactions)
│   └── finance.db                 # (gitignore) SQLite DB
├── conversation/
│   ├── __init__.py
│   ├── state_machine.py           # States, next_state
│   ├── session_data.py            # Session dataclass (phase, state, language, customer_id, card_id, etc.)
│   └── orchestrator.py            # process_turn: STT → agent → TTS
├── intent/
│   ├── __init__.py
│   └── rules.py                   # parse_language, parse_full_name, parse_card_type, parse_four_digits, parse_dob, parse_menu_*, parse_end_call
├── agent/
│   ├── __init__.py
│   └── agent.py                   # Ollama + DB; Phase 1 & Phase 2
├── stt/
│   ├── __init__.py
│   └── stt.py                     # transcribe(audio_bytes) in-memory, Faster-Whisper
├── tts/
│   ├── __init__.py
│   └── tts.py                     # synthesize(text, lang) → bytes (Piper / edge-tts)
├── audio/
│   ├── __init__.py
│   └── convert.py                 # webm/opus → PCM 16 kHz mono
├── utils/
│   ├── __init__.py
│   ├── mask.py                    # mask_pan() for logs
│   └── logger.py
├── scripts/
│   ├── init_db.py                 # Create tables (optional --drop)
│   └── generate_synthetic_data.py # Seed customers, cards, transactions
├── static/
│   ├── index.html                 # Vaulta AI UI
│   ├── app.js                     # MediaRecorder, WebSocket, AudioContext queue
│   └── styles.css
├── voices/                        # Piper voice files (optional)
│   ├── en_US-lessac-medium.onnx (+ .json)
│   └── fr_FR-siwis-medium.onnx (+ .json)
├── tests/
│   └── test_e2e_voice.py          # E2E / voice tests
└── docs/
    └── scripted_flows.md          # Demo identities and exact phrases (UC1–UC8)
```
## Data Strategy

| Data | Source | Location | How created |
|---|---|---|---|
| SQLite | Synthetic + demo | `data/finance.db` | `python scripts/init_db.py` then `python scripts/generate_synthetic_data.py`. |
| Demo users | See scripted flows | Same DB | John Smith (DOB 1990-03-15, credit last4 1234), Jane Doe (DOB 1985-07-22, debit last4 5678), etc. |

- Tables: `customers(customer_id, name, dob, phone_last4)`, `cards(card_id, customer_id, card_last4, card_type, status, balance, credit_limit, credit_score, …)`, `transactions(transaction_id, card_id, merchant, amount, date, status)`.
- Helpers in `data/db.py`: `get_customer_by_name_and_dob`, `get_cards_by_customer_id`, `get_card_by_type_last4_and_customer`, `get_last_transaction_for_card`, `get_credit_limit_left`, etc.
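Two of the helpers listed above can be sketched against an in-memory database with the table shapes just described. This is illustrative, not the project's actual `data/db.py` (column subsets and the demo rows are assumptions; John Smith's DOB matches the demo user above):

```python
import sqlite3

# In-memory stand-in for data/finance.db with a reduced schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, dob TEXT);
CREATE TABLE transactions (transaction_id INTEGER PRIMARY KEY, card_id INTEGER,
                           merchant TEXT, amount REAL, date TEXT);
INSERT INTO customers VALUES (1, 'John Smith', '1990-03-15');
INSERT INTO transactions VALUES (1, 1, 'Grocery Mart', -42.10, '2026-01-02');
INSERT INTO transactions VALUES (2, 1, 'Coffee Shop', -4.75, '2026-01-09');
""")

def get_customer_by_name_and_dob(name, dob):
    """Case-insensitive exact lookup; the real helper may fuzzy-match."""
    return conn.execute(
        "SELECT customer_id, name FROM customers "
        "WHERE lower(name) = lower(?) AND dob = ?", (name, dob)).fetchone()

def get_last_transaction_for_card(card_id):
    """Most recent transaction for a card, by date."""
    return conn.execute(
        "SELECT merchant, amount FROM transactions "
        "WHERE card_id = ? ORDER BY date DESC LIMIT 1", (card_id,)).fetchone()
```

Parameterized queries (the `?` placeholders) keep spoken, STT-derived input from ever being interpolated into SQL.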
## Demo

Below are snapshots of Vaulta AI: the chat UI and the demo video.
Demo video:
-
Clone the repository
git clone <your-repo-url> cd ai-banking-agent
-
Create virtual environment and install dependencies
requirements.txtis minimal: only packages used by the app, scripts, and tests (FastAPI, Faster-Whisper, edge-tts, pydub, imageio-ffmpeg, rapidfuzz, dateparser, Faker, pytest, etc.).python -m venv venv .\venv\Scripts\Activate.ps1 # Windows # source venv/bin/activate # Linux/macOS pip install --upgrade pip pip install -r requirements.txt
Audio (WebM→PCM): The app uses PyAV first, then pydub and imageio-ffmpeg (bundled ffmpeg) so conversion works without ffmpeg on PATH. If you still see "Audio conversion failed", install ffmpeg and add it to your PATH.
-
Environment
- Copy
.env.exampleto.env. - Set (defaults shown):
OLLAMA_HOST=http://localhost:11434DB_PATH=data/finance.dbPIPER_VOICE_DIR=voicesPIPER_VOICE_DIR_FR=voices
- Copy
-
Ollama (optional; for Phase 2 freeform)
ollama serve ollama pull llama3.2
-
Database
python scripts/init_db.py --drop python scripts/generate_synthetic_data.py
-
Piper (optional)
TTS uses edge-tts by default (included in requirements). For local Piper TTS, install Piper on your PATH and place voice files invoices/(see project structure). Thevoices/folder is in.gitignore. -
ffmpeg
For webm → PCM conversion, the app uses imageio-ffmpeg (bundled in requirements) when ffmpeg is not on system PATH. You can also install ffmpeg system-wide.
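The ffmpeg-resolution order described above (system binary first, then the imageio-ffmpeg bundle) can be sketched like this. The function name is hypothetical; `imageio_ffmpeg.get_ffmpeg_exe()` is the library's real entry point:

```python
import shutil

def resolve_ffmpeg():
    """Return a usable ffmpeg path: system binary first, then the one
    bundled with imageio-ffmpeg, else None."""
    system = shutil.which("ffmpeg")
    if system:
        return system
    try:
        import imageio_ffmpeg  # ships a prebuilt ffmpeg in its wheel
        return imageio_ffmpeg.get_ffmpeg_exe()
    except (ImportError, RuntimeError):
        # Not installed, or no binary available for this platform.
        return None
```

A fallback chain like this is why the install instructions do not strictly require a system-wide ffmpeg.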
## Configuration

Configuration is via `.env` (see `.env.example`). No cloud API keys are required for the core flow.
| Variable | Description | Example |
|---|---|---|
| OLLAMA_HOST | Ollama server URL | http://localhost:11434 |
| DB_PATH | SQLite database path | data/finance.db |
| PIPER_VOICE_DIR | Directory with Piper EN voice | voices |
| PIPER_VOICE_DIR_FR | Directory with Piper FR voice | voices |
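The variables above are typically consumed like this (python-dotenv would load the `.env` file first; `os.getenv` supplies the documented defaults). A minimal sketch, not the project's actual config module:

```python
import os

# Defaults mirror the table above; a prior load_dotenv() call would
# populate os.environ from .env in the real app.
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://localhost:11434")
DB_PATH = os.getenv("DB_PATH", "data/finance.db")
PIPER_VOICE_DIR = os.getenv("PIPER_VOICE_DIR", "voices")
PIPER_VOICE_DIR_FR = os.getenv("PIPER_VOICE_DIR_FR", "voices")
```

Keeping every external location behind an environment variable means the local-first stack can be repointed (e.g. a remote Ollama host) without code changes.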
## Usage

1. **Start the server**

   ```bash
   uvicorn main:app --reload --host 0.0.0.0 --port 8000
   ```

2. Open the app in your browser (e.g. `http://localhost:8000`).
3. Click **Start Call** and allow microphone access; the welcome TTS plays automatically.
4. **Phase 1:** say, for example, "English" → "John Smith" → "Credit" → "One two three four" (or "1234") → "March 15 1990" → then choose "Balance", "Credit score", "Last transaction", etc., depending on the menu.
5. **Phase 2:** after "You're authenticated. How can I help you today?", ask for your balance or last transaction, or ask freeform questions; say "End call" when done.
6. Click **End Call** to disconnect.
## Testing

- Automated tests:

  ```bash
  pytest tests/ -v
  ```

- Manual E2E: run the server and complete UC1–UC8 from `docs/scripted_flows.md` (e.g. happy-path EN credit balance, French, debit last transaction, verification failure).
- Note: chunked voice streaming uses turn-based processing — the backend buffers until ~1.2 s of silence or `end_utterance`, then runs STT and responds. This is not true real-time streaming transcription.
## License

This project is licensed under the MIT License.
## FAQs

Q: What is Vaulta AI?
A: Vaulta AI is a real-time, voice-driven banking assistant. You start a call, speak, and the assistant guides you through authentication (Phase 1), then answers balance, transaction, credit score, and freeform questions (Phase 2) using SQLite and a local LLM (Ollama). All core processing runs locally; the optional edge-tts fallback is the only cloud dependency.
Q: Is internet required?
A: For Piper + Faster-Whisper + Ollama, no. If you use the edge-tts fallback for TTS, that uses a Microsoft API and requires internet.
Q: Why do I not see WAV files or step numbers?
A: By design: no file save for the user, no download links, no step labels in the UI — only Start Call / End Call and status.
Q: What happens after two verification failures?
A: The session enters lockout: the assistant says "Please call back later." and ends the call (state=end).
Q: How do I add more demo users?
A: Edit scripts/generate_synthetic_data.py to add customers/cards/transactions, or insert directly into data/finance.db, and document them in docs/scripted_flows.md.
## Contact Information

- Email: vidhithakkar.ca@gmail.com
- LinkedIn: Vidhi Thakkar