voicetest
Voice agents break in ways unit tests can't catch: a transition that fires too eagerly, a prompt that confuses the LLM when a caller hedges, a phone number TTS reads as "four hundred fifteen, five hundred fifty-five." Manual testing doesn't scale: you click through a few conversations, ship the change, and find out from a customer three days later that something broke.
Voicetest is the test harness that closes that loop. It simulates real multi-turn conversations against your agent, judges the results with LLMs, and catches regressions before your users do.

What you get
```shell
# Install once
uv tool install voicetest

# Run a regression suite against any agent (Retell, VAPI, LiveKit, Bland, Telnyx)
voicetest run --agent agent.json --tests tests.json --all
```

```text
Test                        Pass   Score   Turns
─────────────────────────────────────────────────
Schedules an appointment    ✓      0.93    8
Handles cancellation        ✗      0.41    12
No PII leakage              ✓      1.00    6
─────────────────────────────────────────────────
2/3 passed
```
When something fails, voicetest can diagnose and propose the fix. When you tweak a prompt, you can snapshot the suite before and after to see what moved. When you have production calls, you can import them as a regression suite. When you're ready to ship, GitHub Actions blocks merges that regress.
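As a rough illustration of gating a merge on the suite result, the sketch below checks the `N/M passed` summary line from the sample output and fails unless every test passed. The `all_passed` helper is hypothetical and not part of voicetest, and it assumes the report format shown on this page; the project's own GitHub Actions integration may work differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of voicetest: succeed only when the
# "N/M passed" summary line reports that every test passed.
# Assumes the report format shown in the sample output on this page.
all_passed() {
  awk '
    /^[ \t]*[0-9]+\/[0-9]+ passed/ {
      split($1, counts, "/")          # counts[1] = passed, counts[2] = total
      found = 1
      ok = (counts[1] + 0 == counts[2] + 0 && counts[2] + 0 > 0)
    }
    END {
      status = (found && ok) ? 0 : 1  # a missing summary line also fails
      exit status
    }
  '
}

# Usage (assumes voicetest is installed and both JSON files exist):
#   voicetest run --agent agent.json --tests tests.json --all | all_passed
```

Because the pipeline's exit status is that of `all_passed`, a CI step running it fails whenever the summary is missing or reports fewer passes than tests, regardless of the exit code voicetest itself returns.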
Where to start
| If you want to... | Go here |
|---|---|
| Get your first test running in 5 minutes — install, run the demo, see results | Getting Started |
| Solve a specific problem — task-oriented walkthroughs | Recipes |
| Understand graphs, nodes, and judging | Core Concepts |
| Configure models, platforms, and credentials | Configuration |
| Browse every CLI command | CLI Reference |
What's distinctive
- Test any platform. Retell, VAPI, LiveKit, Bland, Telnyx, or custom — voicetest's unified AgentGraph IR means one test suite runs against agents from any of them.
- Convert between platforms without manual rewrites — import a Retell Conversation Flow, export to VAPI; import VAPI, export to LiveKit.
- LLM-powered diagnosis identifies the root cause of a failed test and proposes a prompt fix, with a one-click "apply and re-run" loop.
- Replay production calls against the agent's current graph to detect drift before users notice.
- Audio evaluation catches what text-only judges miss — TTS/STT round-trip checks for digit-reading and pronunciation issues.
- Run anywhere — CLI, Web UI, REST API, GitHub Actions, or as a Claude Code plugin.
Platform support
| Platform | Import | Export | Push | Sync |
|---|---|---|---|---|
| Retell | ✓ | ✓ | ✓ | ✓ |
| VAPI | ✓ | ✓ | ✓ | ✓ |
| LiveKit | ✓ | ✓ | ✓ | ✓ |
| Bland | ✓ | ✓ | ✓ | |
| Telnyx | ✓ | ✓ | ✓ | ✓ |
Interfaces
| Interface | Command | Best for |
|---|---|---|
| Web UI | voicetest serve | Visual iteration, side-by-side run comparison, diagnosis |
| CLI | voicetest run | Scripting, CI/CD, fast iteration |
| Interactive shell | voicetest | Exploratory testing |
| REST API | voicetest serve | Integrate with any toolchain at localhost:8000/api |
| Claude Code | /voicetest-run | Have Claude drive your test suite (see Claude Code Integration) |