# Models guide
Voicetest uses three distinct LLM roles. Choosing the right model for each is the single biggest lever for run cost, run speed, and result quality.
## The three roles
| Role | What it does | What it needs |
|---|---|---|
| Agent | Plays your voice agent — generates responses node by node, decides when to transition, calls tools when needed | Strong instruction-following, structured output for transitions and tool calls, low latency for streaming |
| Simulator | Plays the user — generates the next user turn given the persona and goal in the test case | Good at sustaining a believable persona, accepting hedging/filler, willing to be uncooperative when the test demands it |
| Judge | Evaluates the finished transcript against each metric, producing a 0–1 score with reasoning | Strong reading comprehension, calibrated scoring, follows scoring rubrics consistently |
Each role can be configured independently in `settings.toml`, in `RunOptions`, or per test case (when `test_model_precedence` is enabled).
## Quick recommendations
| Use case | Agent | Simulator | Judge |
|---|---|---|---|
| Free / local development | `groq/llama-3.1-8b-instant` | `groq/llama-3.1-8b-instant` | `groq/llama-3.1-8b-instant` |
| Highest quality (production CI) | `openai/gpt-4o` | `openai/gpt-4o-mini` | `anthropic/claude-3-5-sonnet` |
| Cost-optimized | `openai/gpt-4o-mini` | `gemini/gemini-1.5-flash` | `anthropic/claude-3-5-haiku` |
| Fully offline (Ollama) | `ollama_chat/qwen2.5:14b` | `ollama_chat/qwen2.5:7b` | `ollama_chat/qwen2.5:14b` |
| Claude Code passthrough | `claudecode/sonnet` | `claudecode/haiku` | `claudecode/sonnet` |
These are starting points. Validate against your own test suite before locking them in — model quality on voice-agent transcripts varies more than on general benchmarks.
## How to choose

### Agent model
The agent runs at every turn of every conversation, so latency and per-call cost compound fast. It also has the hardest job — your real production agent's behavior depends on this LLM. Match what your production deployment uses, when feasible:
- If you ship Retell with `gpt-4o`, test with `openai/gpt-4o`. Same provider tier eliminates surprises.
- If you ship with a custom model, test with the same one.
- For development iteration, downgrade to `gpt-4o-mini` or a free Groq model; you'll catch the obvious regressions without the production cost.
Avoid: tiny models (under 8B parameters) for agents with structured tool calls or many transitions. They'll fumble the JSON.
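For development-versus-CI switching, here is a minimal sketch using the `RunOptions` API shown under Setting models below. The `VOICETEST_ENV` variable is a hypothetical convention of this example, not something voicetest reads, and unset model fields are assumed to fall back to `settings.toml`:

```python
import os

from voicetest.models.test_case import RunOptions

# Hypothetical convention: this example picks the agent tier from an
# environment variable. voicetest itself does not read VOICETEST_ENV.
IS_CI = os.environ.get("VOICETEST_ENV") == "ci"

options = RunOptions(
    # CI matches the deployed model; dev iteration drops a tier to save cost.
    agent_model="openai/gpt-4o" if IS_CI else "openai/gpt-4o-mini",
)
```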
### Simulator model
The simulator's job is "play the user" — produce realistic next turns from a persona and goal. This is easier than the agent's job, so you can usually go a tier cheaper. But there are two caveats:
- Don't go too cheap. A simulator that can't sustain a persona produces unrealistic conversations, and your tests pass for the wrong reasons. Test the simulator quality by reading 5–10 transcripts and asking "would a real caller talk like this?"
- Don't use the same model as the agent. When agent and simulator are the same model, the LLM "knows itself" — the simulator anticipates the agent's next move in ways a real user wouldn't. Pick a different provider or family.
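A minimal sketch of that split, reusing the `RunOptions` API from Setting models below with the cost-optimized picks from the table above:

```python
from voicetest.models.test_case import RunOptions

# Agent and simulator from different providers, so the simulator can't
# anticipate the agent's habits the way the same model would.
options = RunOptions(
    agent_model="openai/gpt-4o-mini",
    simulator_model="gemini/gemini-1.5-flash",
)
```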
### Judge model
The judge reads a finished transcript and scores each metric. This is mostly reading comprehension; raw capability matters less than consistent calibration. Practical tips:
- Pick one judge and stick with it. Switching judges between runs makes scores incomparable.
- Stronger judges are worth it for production CI. If a regression slips past the judge, you ship the bug. The cost is small relative to the per-conversation agent cost.
- Test the judge on known-good transcripts. If a judge consistently mis-scores cases you know should pass, retire it.
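A hedged sketch of that calibration check. This page doesn't document an API for scoring a stored transcript directly, so `score_transcript` is a stand-in stub to wire up to your own harness; the file paths and the 0.8 floor are made-up examples:

```python
def score_transcript(path: str, judge_model: str) -> float:
    """Stand-in: invoke the judge on a stored transcript, return its 0-1 score."""
    raise NotImplementedError("wire this to your harness")

# Transcripts you already know should pass, with the minimum score expected.
KNOWN_GOOD = {
    "transcripts/booking_happy_path.json": 0.8,
    "transcripts/refund_simple.json": 0.8,
}

def miscalibrated(judge_model: str) -> list[str]:
    """Return the known-good transcripts this judge scores below the floor."""
    return [
        path
        for path, floor in KNOWN_GOOD.items()
        if score_transcript(path, judge_model) < floor
    ]
```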
## Setting models
Put it in `settings.toml`:
```toml
[models]
agent = "openai/gpt-4o-mini"
simulator = "gemini/gemini-1.5-flash"
judge = "anthropic/claude-3-5-haiku-20241022"
```
Or per run via `RunOptions`:
```python
from voicetest.models.test_case import RunOptions

options = RunOptions(
    agent_model="openai/gpt-4o",
    simulator_model="openai/gpt-4o-mini",
    judge_model="anthropic/claude-3-5-sonnet-20241022",
)
```
Or per test case (requires `test_model_precedence = true` in run options):
```json
{
  "name": "Hard escalation case",
  "user_prompt": "...",
  "metrics": ["..."],
  "llm_model": "openai/gpt-4o",
  "type": "llm"
}
```
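And the run options that let the per-test model win (a minimal sketch; it assumes `test_model_precedence` is accepted as a `RunOptions` field, matching the `test_model_precedence = true` wording above):

```python
from voicetest.models.test_case import RunOptions

# With this flag on, a test case's "llm_model" takes precedence over the
# run-level model configuration (assumed field name, per the note above).
options = RunOptions(test_model_precedence=True)
```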
## Provider-specific notes
- **LiteLLM format**: voicetest uses LiteLLM for all routing, so any provider LiteLLM supports works. Format: `provider/model-name`.
- **Vertex AI region**: set `VERTEXAI_LOCATION=global` for newer Gemini models. See Configuration: Vertex AI.
- **Claude Code passthrough**: use your existing Claude Code subscription instead of an API key. See Claude Code Integration.
- **Ollama for local**: fully offline, free, and reproducible. Install Ollama, pull the model, and use `ollama_chat/<model>` in voicetest. Bigger models give better quality at the cost of speed.
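For the fully offline route, a minimal sketch pairing the Ollama picks from the recommendations table with the `RunOptions` API above; it assumes the models are already pulled (e.g. `ollama pull qwen2.5:14b`):

```python
from voicetest.models.test_case import RunOptions

# All three roles on local Ollama models: free, offline, reproducible.
options = RunOptions(
    agent_model="ollama_chat/qwen2.5:14b",      # hardest job, bigger model
    simulator_model="ollama_chat/qwen2.5:7b",   # easier job, smaller model
    judge_model="ollama_chat/qwen2.5:14b",
)
```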
## Caching
Voicetest caches LLM responses by default to avoid redundant calls, so re-running the same suite is cheap. Disable it per run with the `--no-cache` flag or `no_cache = true` in run options. See Features: LLM response cache for the disk and S3 backends.
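A minimal sketch of bypassing the cache for a single run, assuming run options accept `no_cache` as a boolean field per the `no_cache = true` wording above:

```python
from voicetest.models.test_case import RunOptions

# Force fresh LLM calls, e.g. when measuring run-to-run model variance.
options = RunOptions(no_cache=True)
```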