This project has gone through multiple production-grade model iterations, with a consistent focus on low latency, deterministic behavior, and voice-first reliability. Early versions of the system ran almost exclusively on OpenAI gpt-4.1-nano, which proved to be a true workhorse and established the baseline for acceptable performance. As the system matured and multilingual and RAG complexity increased, model consistency and tail latency (p50 / p95) became just as important as raw speed.
To validate current model choices, a standardized smoke test suite was executed across multiple providers and models. Each test represents a real IVR path (inventory lookup, hours, staff routing, weather, city RAG, chat memory, etc.) and reflects end-to-end request latency, not just model inference time.
Provider / Model: BEDROCK / us.meta.llama4-scout-17b-instruct-v1:0
| Test | Duration (ms) | Approx RPS |
|---|---|---|
| SmokeTests.chucklesCandyTest | 2131 | 0.47 |
| SmokeTests.restaurantTest | 1070 | 0.93 |
| SmokeTests.addressTest | 839 | 1.19 |
| SmokeTests.staffTest | 1658 | 0.60 |
| SmokeTests.weatherTest | 2259 | 0.44 |
| SmokeTests.cityRagTest | 1511 | 0.66 |
| SmokeTests.chatMemoryTest | 832 | 1.20 |
Notes:
Good overall throughput, but earlier runs showed language drift in non-English scenarios.
Provider / Model: BEDROCK / us.anthropic.claude-3-5-haiku-20241022-v1:0
| Test | Duration (ms) | Approx RPS |
|---|---|---|
| SmokeTests.chucklesCandyTest | 2881 | 0.35 |
| SmokeTests.restaurantTest | 994 | 1.01 |
| SmokeTests.addressTest | 822 | 1.22 |
| SmokeTests.staffTest | 1650 | 0.61 |
| SmokeTests.weatherTest | 2453 | 0.41 |
| SmokeTests.cityRagTest | 2013 | 0.50 |
| SmokeTests.chatMemoryTest | 754 | 1.33 |
Notes:
Strong reasoning quality, but consistently slower p50 on conversational paths.
Provider / Model: BEDROCK / us.amazon.nova-2-lite-v1:0
| Test | Duration (ms) | Approx RPS |
|---|---|---|
| SmokeTests.chucklesCandyTest | 1876 | 0.53 |
| SmokeTests.restaurantTest | 961 | 1.04 |
| SmokeTests.addressTest | 659 | 1.52 |
| SmokeTests.staffTest | 1481 | 0.68 |
| SmokeTests.weatherTest | 2327 | 0.43 |
| SmokeTests.cityRagTest | 2323 | 0.43 |
| SmokeTests.chatMemoryTest | 672 | 1.49 |
Why this matters:
- Fastest or near-fastest p50 across most conversational paths
- Zero language leakage (no English responses when operating in Swedish or other locales)
- Extremely consistent behavior under IVR load
- Best balance of latency, determinism, and multilingual safety
Provider / Model: OPENAI / gpt-5-nano
| Test | Duration (ms) | Approx RPS |
|---|---|---|
| SmokeTests.chucklesCandyTest | 2190 | 0.46 |
| SmokeTests.restaurantTest | 1154 | 0.87 |
| SmokeTests.addressTest | 825 | 1.21 |
| SmokeTests.staffTest | 1517 | 0.66 |
| SmokeTests.weatherTest | 2130 | 0.47 |
| SmokeTests.cityRagTest | 1737 | 0.58 |
| SmokeTests.chatMemoryTest | 766 | 1.31 |
Notes:
Solid successor to 4.1-nano, but not materially faster for IVR-style workloads.
Provider / Model: OPENAI / gpt-4.1-nano
| Test | Duration (ms) | Approx RPS |
|---|---|---|
| SmokeTests.chucklesCandyTest | 2146 | 0.47 |
| SmokeTests.restaurantTest | 1011 | 0.99 |
| SmokeTests.addressTest | 817 | 1.22 |
| SmokeTests.staffTest | 1686 | 0.59 |
| SmokeTests.weatherTest | 2194 | 0.46 |
| SmokeTests.cityRagTest | 1646 | 0.61 |
| SmokeTests.chatMemoryTest | 878 | 1.14 |
Notes:
- This model carried the project for a long time and remains an excellent reference point.
- However, it showed occasional multilingual missteps under real-world conversational pressure.
-
All tested models meet low-latency IVR requirements
Most conversational paths consistently complete in ~700–2200 ms, well within acceptable limits for voice interactions. -
Nova 2 Lite stands out for consistency, not just speed
While raw timings are comparable across providers, behavioral correctness (especially language fidelity) is where Nova 2 Lite clearly wins. -
4.1-nano and Meta models failed edge-case language tests
Emitting English responses in non-English sessions is unacceptable for production IVR, regardless of latency. -
Bedrock provides tighter p50 stability
Especially important for voice systems where perceived responsiveness matters more than peak throughput.
The move from gpt-4.1-nano to Amazon Bedrock Nova 2 Lite was not driven by hype or vendor preference—it was driven by measured performance, consistency, and real IVR correctness. The numbers show that all models are fast enough; the behavior shows that Nova 2 Lite is the safest and most reliable choice today.
This system is designed to remain provider-agnostic, and continued benchmarking is part of the roadmap—but for now, the data backs the decision.