# Models guide
Voicetest uses three distinct LLM roles. Choosing the right model for each is the single biggest lever for run cost, run speed, and result quality.
## The three roles
| Role | What it does | What it needs |
|---|---|---|
| Agent | Plays your voice agent — generates responses node by node, decides when to transition, calls tools when needed | Strong instruction-following, structured output for transitions and tool calls, low latency for streaming |
| Simulator | Plays the user — generates the next user turn given the persona and goal in the test case | Good at sustaining a believable persona, accepting hedging/filler, willing to be uncooperative when the test demands it |
| Judge | Evaluates the finished transcript against each metric, producing a 0–1 score with reasoning | Strong reading comprehension, calibrated scoring, follows scoring rubrics consistently |
Each role can be configured independently in `settings.toml`, in `RunOptions`, or per test case (when `test_model_precedence` is enabled).
## Quick recommendations
| Use case | Agent | Simulator | Judge |
|---|---|---|---|
| Free / local development | `groq/llama-3.1-8b-instant` | `groq/llama-3.1-8b-instant` | `groq/llama-3.1-8b-instant` |
| Highest quality (production CI) | `openai/gpt-4o` | `openai/gpt-4o-mini` | `anthropic/claude-3-5-sonnet` |
| Cost-optimized | `openai/gpt-4o-mini` | `gemini/gemini-1.5-flash` | `anthropic/claude-3-5-haiku` |
| Fully offline (Ollama) | `ollama_chat/qwen2.5:14b` | `ollama_chat/qwen2.5:7b` | `ollama_chat/qwen2.5:14b` |
| Claude Code passthrough | `claudecode/sonnet` | `claudecode/haiku` | `claudecode/sonnet` |
These are starting points. Validate against your own test suite before locking them in — model quality on voice-agent transcripts varies more than on general benchmarks.
## How to choose

### Agent model
The agent runs at every turn of every conversation, so latency and per-call cost compound fast. It also has the hardest job — your real production agent's behavior depends on this LLM. Match what your production deployment uses, when feasible:
- If you ship Retell with `gpt-4o`, test with `openai/gpt-4o`. Same provider tier eliminates surprises.
- If you ship with a custom model, test with the same one.
- For development iteration, downgrade to `gpt-4o-mini` or a free Groq model; you'll catch the obvious regressions without the production cost.
Avoid: tiny models (under 8B parameters) for agents with structured tool calls or many transitions. They'll fumble the JSON.
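For development-versus-CI switching, here is a minimal sketch using the `RunOptions` API shown under Setting models below. The `VOICETEST_ENV` variable is a hypothetical convention of this example, not something voicetest reads, and unset model fields are assumed to fall back to `settings.toml`:

```python
import os

from voicetest.models.test_case import RunOptions

# Hypothetical convention: this example picks the agent tier from an
# environment variable. voicetest itself does not read VOICETEST_ENV.
IS_CI = os.environ.get("VOICETEST_ENV") == "ci"

options = RunOptions(
    # CI matches the deployed model; dev iteration drops a tier to save cost.
    agent_model="openai/gpt-4o" if IS_CI else "openai/gpt-4o-mini",
)
```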
### Simulator model
The simulator's job is "play the user" — produce realistic next turns from a persona and goal. This is easier than the agent's job, so you can usually go a tier cheaper. But there are two caveats:
- Don't go too cheap. A simulator that can't sustain a persona produces unrealistic conversations, and your tests pass for the wrong reasons. Test the simulator quality by reading 5–10 transcripts and asking "would a real caller talk like this?"
- Don't use the same model as the agent. When agent and simulator are the same model, the LLM "knows itself" — the simulator anticipates the agent's next move in ways a real user wouldn't. Pick a different provider or family.
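A minimal sketch of that split, reusing the `RunOptions` API from Setting models below with the cost-optimized picks from the table above:

```python
from voicetest.models.test_case import RunOptions

# Agent and simulator from different providers, so the simulator can't
# anticipate the agent's habits the way the same model would.
options = RunOptions(
    agent_model="openai/gpt-4o-mini",
    simulator_model="gemini/gemini-1.5-flash",
)
```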
### Judge model
The judge reads a finished transcript and scores each metric. This is mostly reading comprehension; raw capability matters less than consistent calibration. Practical tips:
- Pick one judge and stick with it. Switching judges between runs makes scores incomparable.
- Stronger judges are worth it for production CI. If a regression slips past the judge, you ship the bug. The cost is small relative to the per-conversation agent cost.
- Test the judge on known-good transcripts. If a judge consistently mis-scores cases you know should pass, retire it.
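A hedged sketch of that calibration check. This page doesn't document an API for scoring a stored transcript directly, so `score_transcript` is a stand-in stub to wire up to your own harness; the file paths and the 0.8 floor are made-up examples:

```python
def score_transcript(path: str, judge_model: str) -> float:
    """Stand-in: invoke the judge on a stored transcript, return its 0-1 score."""
    raise NotImplementedError("wire this to your harness")

# Transcripts you already know should pass, with the minimum score expected.
KNOWN_GOOD = {
    "transcripts/booking_happy_path.json": 0.8,
    "transcripts/refund_simple.json": 0.8,
}

def miscalibrated(judge_model: str) -> list[str]:
    """Return the known-good transcripts this judge scores below the floor."""
    return [
        path
        for path, floor in KNOWN_GOOD.items()
        if score_transcript(path, judge_model) < floor
    ]
```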
## Setting models
Put it in `settings.toml`:
```toml
[models]
agent = "openai/gpt-4o-mini"
simulator = "gemini/gemini-1.5-flash"
judge = "anthropic/claude-3-5-haiku-20241022"
```
Or per run via `RunOptions`:
```python
from voicetest.models.test_case import RunOptions

options = RunOptions(
    agent_model="openai/gpt-4o",
    simulator_model="openai/gpt-4o-mini",
    judge_model="anthropic/claude-3-5-sonnet-20241022",
)
```
Or per test case (requires `test_model_precedence = true` in run options):
```json
{
  "name": "Hard escalation case",
  "user_prompt": "...",
  "metrics": ["..."],
  "llm_model": "openai/gpt-4o",
  "type": "llm"
}
```
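And the run options that let the per-test model win (a minimal sketch; it assumes `test_model_precedence` is accepted as a `RunOptions` field, matching the `test_model_precedence = true` wording above):

```python
from voicetest.models.test_case import RunOptions

# With this flag on, a test case's "llm_model" takes precedence over the
# run-level model configuration (assumed field name, per the note above).
options = RunOptions(test_model_precedence=True)
```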
## Provider-specific notes
- **LiteLLM format**: voicetest uses LiteLLM for all routing, so any provider LiteLLM supports works. Format: `provider/model-name`.
- **Vertex AI region**: set `VERTEXAI_LOCATION=global` for newer Gemini models. See Configuration: Vertex AI.
- **Claude Code passthrough**: use your existing Claude Code subscription instead of an API key. See Claude Code Integration.
- **Ollama for local**: fully offline, free, and reproducible. Install Ollama, pull the model, and use `ollama_chat/<model>` in voicetest. Bigger models give better quality at the cost of speed.
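For the fully offline route, a minimal sketch pairing the Ollama picks from the recommendations table with the `RunOptions` API above; it assumes the models are already pulled (e.g. `ollama pull qwen2.5:14b`):

```python
from voicetest.models.test_case import RunOptions

# All three roles on local Ollama models: free, offline, reproducible.
options = RunOptions(
    agent_model="ollama_chat/qwen2.5:14b",      # hardest job, bigger model
    simulator_model="ollama_chat/qwen2.5:7b",   # easier job, smaller model
    judge_model="ollama_chat/qwen2.5:14b",
)
```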
## Caching
Voicetest caches LLM responses by default to avoid redundant calls, so re-running the same suite is cheap. Disable it per run with the `--no-cache` flag or `no_cache = true` in run options. See Features: LLM response cache for the disk and S3 backends.
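A minimal sketch of bypassing the cache for a single run, assuming run options accept `no_cache` as a boolean field per the `no_cache = true` wording above:

```python
from voicetest.models.test_case import RunOptions

# Force fresh LLM calls, e.g. when measuring run-to-run model variance.
options = RunOptions(no_cache=True)
```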