voicetest

Voice agents break in ways unit tests can't catch: a transition that fires too eagerly, a prompt that confuses the LLM when a caller hedges, a phone number that TTS reads as "four hundred fifteen, five hundred fifty-five." Manual testing doesn't scale: you click through a few conversations, ship the change, and find out from a customer three days later that something broke.

Voicetest is the test harness that closes that loop. It simulates real multi-turn conversations against your agent, judges the results with LLMs, and catches regressions before your users do.

[Image: Web UI]

What you get

# Install once
uv tool install voicetest

# Run a regression suite against any agent (Retell, VAPI, LiveKit, Bland, Telnyx)
voicetest run --agent agent.json --tests tests.json --all

Test                          Pass  Score  Turns
─────────────────────────────────────────────────
Schedules an appointment       ✓    0.93   8
Handles cancellation           ✗    0.41   12
No PII leakage                 ✓    1.00   6
─────────────────────────────────────────────────
2/3 passed

When something fails, voicetest can diagnose and propose the fix. When you tweak a prompt, you can snapshot the suite before and after to see what moved. When you have production calls, you can import them as a regression suite. When you're ready to ship, GitHub Actions blocks merges that regress.
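The results table shown earlier is plain text, so a script can gate on it. What follows is a minimal sketch, assuming the column layout of that sample output (test name, ✓ or ✗, score, turns). `CaseResult`, `parse_results`, and `failed_tests` are hypothetical helper names, not part of voicetest, and the real CLI output may differ from the sample.

```python
import re
from typing import List, NamedTuple


class CaseResult(NamedTuple):
    name: str
    passed: bool
    score: float
    turns: int


# Assumed row layout, taken from the sample output: name, mark, score, turns.
ROW = re.compile(
    r"^(?P<name>.+?)\s{2,}(?P<mark>[✓✗])\s+(?P<score>\d+\.\d+)\s+(?P<turns>\d+)\s*$"
)


def parse_results(report: str) -> List[CaseResult]:
    """Return one CaseResult per table row; header, rules, and summary are skipped."""
    results = []
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            results.append(
                CaseResult(
                    name=match["name"].strip(),
                    passed=match["mark"] == "✓",
                    score=float(match["score"]),
                    turns=int(match["turns"]),
                )
            )
    return results


def failed_tests(report: str) -> List[str]:
    """Names of the tests whose row carries a ✗ mark."""
    return [r.name for r in parse_results(report) if not r.passed]


SAMPLE = """\
Test                          Pass  Score  Turns
─────────────────────────────────────────────────
Schedules an appointment       ✓    0.93   8
Handles cancellation           ✗    0.41   12
No PII leakage                 ✓    1.00   6
─────────────────────────────────────────────────
2/3 passed
"""

if __name__ == "__main__":
    print(failed_tests(SAMPLE))  # prints ['Handles cancellation']
```

A CI step built on this sketch could exit non-zero whenever `failed_tests` returns a non-empty list, which is one way to express "block merges that regress" in a script.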

Where to start

If you want to...	Go here
Get your first test running in 5 minutes — install, run the demo, see results	Getting Started
Solve a specific problem — task-oriented walkthroughs	Recipes
Understand graphs, nodes, and judging	Core Concepts
Configure models, platforms, and credentials	Configuration
Browse every CLI command	CLI Reference

What's distinctive

Test any platform. Retell, VAPI, LiveKit, Bland, Telnyx, or custom — voicetest's unified AgentGraph IR means one test suite runs against agents from any of them.
Convert between platforms without manual rewrites — import a Retell Conversation Flow, export to VAPI; import VAPI, export to LiveKit.
LLM-powered diagnosis identifies the root cause of a failed test and proposes a prompt fix, with a one-click "apply and re-run" loop.
Replay production calls against the agent's current graph to detect drift before users notice.
Audio evaluation catches what text-only judges miss — TTS/STT round-trip checks for digit-reading and pronunciation issues.
Run anywhere — CLI, Web UI, REST API, GitHub Actions, or as a Claude Code plugin.
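To make the digit-reading problem from the audio evaluation item concrete, here is a minimal sketch of a round-trip check. Given the number the agent was meant to say and a transcript of what was actually spoken, it verifies that the number was read digit by digit. This is not voicetest's implementation: `spoken_digits` and `read_digit_by_digit` are hypothetical names, and the sketch assumes the transcript contains only the spoken number.

```python
import re
from typing import Optional

# Words that stand for exactly one digit when a number is read digit by digit.
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}


def spoken_digits(transcript: str) -> Optional[str]:
    """Convert a digit-by-digit transcript into a digit string.

    Returns None if any word is not a single-digit word (for example
    "hundred" or "fifteen"), which signals the number was read as a quantity.
    """
    digits = []
    for token in re.findall(r"[a-z]+|\d", transcript.lower()):
        if token.isdigit():
            digits.append(token)
        elif token in DIGIT_WORDS:
            digits.append(DIGIT_WORDS[token])
        else:
            return None
    return "".join(digits)


def read_digit_by_digit(expected: str, transcript: str) -> bool:
    """True if the transcript spells out exactly the digits in `expected`."""
    want = re.sub(r"\D", "", expected)
    return spoken_digits(transcript) == want


if __name__ == "__main__":
    print(read_digit_by_digit("415-555", "four one five, five five five"))
    print(read_digit_by_digit("415-555", "four hundred fifteen, five hundred fifty-five"))
```

The first call reports a correct reading; the second fails because "hundred" is not a single-digit word, which is the failure a text-only judge would not see.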

Platform support

Platform	Import	Export	Push	Sync
Retell	✓	✓	✓	✓
VAPI	✓	✓	✓	✓
LiveKit	✓	✓	✓	✓
Bland	✓	✓	✓	-
Telnyx	✓	✓	✓	✓

Interfaces

Interface	Command	Best for
Web UI	`voicetest serve`	Visual iteration, side-by-side run comparison, diagnosis
CLI	`voicetest run`	Scripting, CI/CD, fast iteration
Interactive shell	`voicetest`	Exploratory testing
REST API	`voicetest serve`	Integrate with any toolchain at `localhost:8000/api`
Claude Code	`/voicetest-run`	Have Claude drive your test suite — see Claude Code Integration