
voicetest

Voice agents break in ways unit tests can't catch: a transition that fires too eagerly, a prompt that confuses the LLM under hedging, a phone number TTS reads as "four hundred fifteen, five hundred fifty-five." Manual testing doesn't scale — you click through a few conversations, ship the change, and find out from a customer three days later that something broke.

Voicetest is the test harness that closes that loop. It simulates real multi-turn conversations against your agent, judges the results with LLMs, and catches regressions before your users do.

[Image: Web UI]

What you get

# Install once
uv tool install voicetest

# Run a regression suite against any agent (Retell, VAPI, LiveKit, Bland, Telnyx)
voicetest run --agent agent.json --tests tests.json --all

Test                          Pass  Score  Turns
─────────────────────────────────────────────────
Schedules an appointment       ✓    0.93   8
Handles cancellation           ✗    0.41   12
No PII leakage                 ✓    1.00   6
─────────────────────────────────────────────────
2/3 passed
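If a script needs to act on these results, one option is to parse the summary table. The sketch below is illustrative only: it assumes the output looks exactly like the sample above, which a real run may not match, and the helper names `parse_results` and `failed_tests` are hypothetical rather than part of voicetest. If the CLI offers a structured output format, that would be a sturdier choice than scraping text.

```python
import re

# Sample output copied from the run above; a real run's layout may differ.
SAMPLE_OUTPUT = """\
Test                          Pass  Score  Turns
─────────────────────────────────────────────────
Schedules an appointment       ✓    0.93   8
Handles cancellation           ✗    0.41   12
No PII leakage                 ✓    1.00   6
─────────────────────────────────────────────────
2/3 passed
"""

# One result row: test name, pass/fail mark, score, turn count.
ROW = re.compile(
    r"^(?P<name>.+?)\s+(?P<mark>[✓✗])\s+(?P<score>\d+\.\d+)\s+(?P<turns>\d+)\s*$"
)


def parse_results(text):
    """Return one dict per result row; header, rules, and summary are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append({
                "name": match["name"],
                "passed": match["mark"] == "✓",
                "score": float(match["score"]),
                "turns": int(match["turns"]),
            })
    return rows


def failed_tests(rows):
    """Names of the tests whose row carries the failure mark."""
    return [row["name"] for row in rows if not row["passed"]]


if __name__ == "__main__":
    print(failed_tests(parse_results(SAMPLE_OUTPUT)))
```

A wrapper script could exit non-zero whenever `failed_tests` returns a non-empty list.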

When a test fails, voicetest can diagnose the cause and propose a fix. When you tweak a prompt, you can snapshot the suite before and after to see which scores moved. When you have production calls, you can import them as a regression suite. When you're ready to ship, a GitHub Actions check blocks merges that introduce regressions.
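To make the before-and-after comparison concrete, here is a minimal sketch of what "seeing what moved" could mean. It does not use voicetest's snapshot format, which this page does not describe. It assumes each snapshot has been reduced to a plain mapping of test name to score. The function name `score_changes`, the `tolerance` parameter, and the "after" scores are all made up for illustration.

```python
def score_changes(before, after, tolerance=0.05):
    """Compare two {test name: score} snapshots (a hypothetical shape).

    Returns (name, old_score, new_score) tuples for tests whose score moved
    by more than `tolerance`, largest drops first. Tests present in only
    one snapshot are ignored.
    """
    moved = []
    for name in before.keys() & after.keys():
        delta = after[name] - before[name]
        if abs(delta) > tolerance:
            moved.append((name, before[name], after[name]))
    # Sort by signed change so regressions come before improvements.
    return sorted(moved, key=lambda item: item[2] - item[1])


# "before" reuses the scores from the sample run; "after" is invented.
before = {
    "Schedules an appointment": 0.93,
    "Handles cancellation": 0.41,
    "No PII leakage": 1.00,
}
after = {
    "Schedules an appointment": 0.95,
    "Handles cancellation": 0.78,
    "No PII leakage": 0.70,
}

if __name__ == "__main__":
    for name, old, new in score_changes(before, after):
        print(f"{name}: {old:.2f} -> {new:.2f}")
```

In this invented example the prompt change fixes cancellation handling but lowers the PII score, which is the kind of trade-off a before-and-after comparison is meant to surface.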

Where to start

If you want to... | Go here
Get your first test running in 5 minutes: install, run the demo, see results | Getting Started
Solve a specific problem with task-oriented walkthroughs | Recipes
Understand graphs, nodes, and judging | Core Concepts
Configure models, platforms, and credentials | Configuration
Browse every CLI command | CLI Reference

What's distinctive

  • Test any platform. Retell, VAPI, LiveKit, Bland, Telnyx, or custom — voicetest's unified AgentGraph IR means one test suite runs against agents from any of them.
  • Convert between platforms without manual rewrites — import a Retell Conversation Flow, export to VAPI; import VAPI, export to LiveKit.
  • LLM-powered diagnosis identifies the root cause of a failed test and proposes a prompt fix, with a one-click "apply and re-run" loop.
  • Replay production calls against the agent's current graph to detect drift before users notice.
  • Audio evaluation catches what text-only judges miss — TTS/STT round-trip checks for digit-reading and pronunciation issues.
  • Run anywhere — CLI, Web UI, REST API, GitHub Actions, or as a Claude Code plugin.
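The digit-reading problem mentioned in the audio evaluation bullet can be illustrated with a small text-level check. This is a sketch under stated assumptions, not voicetest's audio pipeline. It assumes the synthesized audio has already been transcribed back to text with numbers spelled out as English words, and it only verifies that the number was read one digit at a time. The names `spoken_digits` and `read_digit_by_digit` are hypothetical.

```python
import re

# Words accepted as a single spoken digit ("oh" is a common reading of 0).
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}


def spoken_digits(transcript):
    """Map a transcript to a digit string.

    Returns None if any word is not a single-digit word, for example
    "hundred" or "fifteen", which signals the number was read as a quantity.
    """
    digits = []
    for word in re.findall(r"[a-z]+", transcript.lower()):
        if word not in DIGIT_WORDS:
            return None
        digits.append(DIGIT_WORDS[word])
    return "".join(digits)


def read_digit_by_digit(transcript, expected):
    """True if the transcript spells out exactly the digits in `expected`."""
    wanted = re.sub(r"\D", "", expected)  # drop dashes, spaces, parentheses
    return spoken_digits(transcript) == wanted


if __name__ == "__main__":
    print(read_digit_by_digit("four one five, five five five", "415-555"))
    print(read_digit_by_digit(
        "four hundred fifteen, five hundred fifty-five", "415-555"))
```

A judge that only reads the agent's text output sees "415-555" in both cases, which is why a check like this has to run on the transcript of the synthesized audio.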

Platform support

Platform | Import | Export | Push | Sync
Retell
VAPI
LiveKit
Bland
Telnyx

Interfaces

Interface | Command | Best for
Web UI | voicetest serve | Visual iteration, side-by-side run comparison, diagnosis
CLI | voicetest run | Scripting, CI/CD, fast iteration
Interactive shell | voicetest | Exploratory testing
REST API | voicetest serve | Integrate with any toolchain at localhost:8000/api
Claude Code | /voicetest-run | Have Claude drive your test suite (see Claude Code Integration)