Features¶

Format conversion¶

voicetest converts between agent formats via its unified AgentGraph representation:

Retell CF ─────┐                   ┌───▶ Retell LLM
               │                   │
Retell LLM ────┤                   ├───▶ Retell CF
               │                   │
VAPI ──────────┤                   ├───▶ VAPI Assistant
               │                   │
Bland ─────────┼───▶ AgentGraph ───┼───▶ VAPI Squad
               │                   │
Telnyx ────────┤                   ├───▶ Bland
               │                   │
LiveKit ───────┤                   ├───▶ Telnyx
               │                   │
XLSForm ───────┤                   ├───▶ LiveKit
               │                   │
Custom ────────┘                   └───▶ Mermaid · Voicetest JSON

Import from any supported format, then export to any other:

# Convert Retell Conversation Flow to Retell LLM format
voicetest export --agent retell-cf-agent.json --format retell-llm > retell-llm-agent.json

# Convert VAPI assistant to Retell LLM format
voicetest export --agent vapi-assistant.json --format retell-llm > retell-agent.json

# Convert Retell LLM to VAPI format
voicetest export --agent retell-llm-agent.json --format vapi-assistant > vapi-agent.json

Platform integration¶

voicetest connects directly to voice platforms to import and push agent configurations.

Platform	Import	Push	Sync	API Key Env Var
Retell	✓	✓	✓	`RETELL_API_KEY`
VAPI	✓	✓	✓	`VAPI_API_KEY`
Bland	✓	✓		`BLAND_API_KEY`
Telnyx	✓	✓	✓	`TELNYX_API_KEY`
LiveKit	✓	✓	✓	`LIVEKIT_API_KEY` + `LIVEKIT_API_SECRET`

In the Web UI, go to the "Platforms" tab to configure credentials, browse remote agents, import, push, and sync.

Prompt snippets¶

Agent prompts often repeat text across nodes — sign-off phrases, compliance disclaimers, tone instructions, etc. Snippets are named, reusable text blocks defined at the agent level and referenced in prompts via {%snippet_name%}.

Snippet Name	Text
`sign_off`	"Thank you for calling. Is there anything else I can help with?"
`hipaa_warn`	"I need to verify your identity before sharing medical info."

Use {%name%} (with percent signs) in any node prompt or general prompt:

Welcome the caller and introduce yourself.

{%hipaa_warn%}

When the conversation ends:
{%sign_off%}

Snippets are expanded before dynamic variables ({{var}}), so you can combine both:

Hello {{caller_name}}, {%greeting%}
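The expansion order can be illustrated with a minimal Python sketch. This is not voicetest's implementation: the function name `expand_prompt`, the regular expressions, and the choice to leave unknown names untouched are all assumptions made for illustration.

```python
import re

# Patterns for the two placeholder styles described above (assumed syntax).
SNIPPET_RE = re.compile(r"\{%(\w+)%\}")
VARIABLE_RE = re.compile(r"\{\{(\w+)\}\}")


def expand_prompt(prompt, snippets, variables):
    """Expand {%snippet%} references first, then {{variable}} placeholders."""
    # Assumption: unknown names are left as-is rather than raising an error.
    expanded = SNIPPET_RE.sub(lambda m: snippets.get(m.group(1), m.group(0)), prompt)
    return VARIABLE_RE.sub(lambda m: variables.get(m.group(1), m.group(0)), expanded)


# Hypothetical snippet and variable values.
snippets = {"greeting": "thanks for calling {{clinic_name}}."}
variables = {"caller_name": "Ada", "clinic_name": "Northside Clinic"}

result = expand_prompt("Hello {{caller_name}}, {%greeting%}", snippets, variables)
print(result)
```

Because snippets are expanded first, a variable that appears inside a snippet (here `{{clinic_name}}`) is still filled in by the second pass.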

Click Analyze DRY to scan all prompts for repeated or near-identical text. Exact matches can be auto-extracted into snippets; fuzzy matches (above 80% similarity) are flagged for review.

When exporting, choose Raw (.vt.json) to preserve {%snippet%} references, or Expanded to resolve them to plain text for platform deployment.

Global metrics¶

Global metrics are compliance-style checks that run on every test for an agent. Configure them in the "Metrics" tab in the Web UI.

Each agent has:

Pass threshold: Default score (0-1) required for metrics to pass (default: 0.7)
Global metrics: List of criteria evaluated on every test run

Each global metric has:

Name: Display name (e.g., "HIPAA Compliance")
Criteria: What the LLM judge evaluates (e.g., "Agent must verify patient identity before sharing medical information")
Threshold override: Optional per-metric threshold (uses agent default if not set)
Enabled: Toggle to skip without deleting
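The relationship between the agent-level pass threshold and per-metric overrides can be sketched as follows. The `GlobalMetric` dataclass, the `evaluate` helper, and the use of `>=` for the pass comparison are assumptions for illustration, not voicetest's actual data model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class GlobalMetric:
    name: str
    criteria: str
    threshold: Optional[float] = None  # per-metric override; None means "use agent default"
    enabled: bool = True


def effective_threshold(metric: GlobalMetric, agent_threshold: float = 0.7) -> float:
    """Use the metric's override when set, otherwise the agent default."""
    return metric.threshold if metric.threshold is not None else agent_threshold


def evaluate(metrics: List[GlobalMetric], scores: Dict[str, float],
             agent_threshold: float = 0.7) -> Dict[str, bool]:
    """Return pass/fail per enabled metric; disabled metrics are skipped."""
    return {
        m.name: scores[m.name] >= effective_threshold(m, agent_threshold)
        for m in metrics
        if m.enabled
    }


metrics = [
    GlobalMetric("HIPAA Compliance",
                 "Agent must verify patient identity before sharing medical information",
                 threshold=0.9),
    GlobalMetric("Brand voice", "Agent keeps a friendly, professional tone"),
    GlobalMetric("Legacy check", "Placeholder criteria", enabled=False),
]
scores = {"HIPAA Compliance": 0.85, "Brand voice": 0.75}  # hypothetical judge scores
outcome = evaluate(metrics, scores)
print(outcome)
```

In this example the HIPAA metric fails against its stricter 0.9 override even though 0.85 would clear the agent default of 0.7.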

Example use cases:

HIPAA compliance checks for healthcare agents
PCI-DSS validation for payment processing
Brand voice consistency across all conversations
Safety guardrails and content policy adherence

Diagnosis & auto-fix¶

When a test fails, voicetest can diagnose the root cause and suggest concrete prompt changes to fix it. Diagnosis is available from both the CLI and the Web UI.

CLI:

# One-shot: print fault location and proposed prompt change
voicetest diagnose --agent agent.json --tests tests.json --test "Schedules an appointment"

# Auto-fix loop: propose, apply, re-run until pass or iteration cap
voicetest diagnose --agent agent.json --tests tests.json --all \
  --auto-fix --max-iterations 5 --save fixed_agent.json

--save writes the fixed graph; the original agent.json is untouched.

Web UI:

Diagnose — Click "Diagnose" on a failed result. The LLM analyzes the graph, transcript, and failed metrics to identify fault locations and root cause.
Review & Edit — Proposed changes are shown as editable textareas. Modify the suggested text before applying.
Apply & Test — Click "Apply & Test" to apply changes to a copy of the graph and rerun the test. A score comparison table shows original vs. new scores with deltas.
Iterate — If not all metrics pass, click "Try Again" to revise the fix based on the latest results.
Save — Click "Save Changes" to persist the fix to the agent graph.

Auto-Fix Mode in the UI runs the same loop without prompting between iterations. Configure the stop condition ("On improvement" or "When all pass") and the maximum number of iterations (1–10, default 3).
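The propose, apply, re-run loop can be sketched in Python. Everything here is hypothetical: `auto_fix`, `run_tests`, and `propose_fix` are illustrative names, and the interpretation of the two stop conditions (stop after the first improving fix versus continue until every score clears the threshold) is an assumption rather than documented behavior.

```python
from typing import Callable, Dict, Tuple

Scores = Dict[str, float]


def auto_fix(agent: str,
             run_tests: Callable[[str], Scores],
             propose_fix: Callable[[str, Scores], str],
             threshold: float = 0.7,
             max_iterations: int = 3,
             stop_on: str = "all_pass") -> Tuple[str, Scores]:
    """Repeatedly propose a fix, apply it to a copy, and re-run the tests."""
    best_agent, best_scores = agent, run_tests(agent)
    for _ in range(max_iterations):
        if all(score >= threshold for score in best_scores.values()):
            break  # every metric passes; nothing left to fix
        candidate = propose_fix(best_agent, best_scores)  # returns a new agent
        candidate_scores = run_tests(candidate)
        if sum(candidate_scores.values()) > sum(best_scores.values()):
            best_agent, best_scores = candidate, candidate_scores
            if stop_on == "improvement":
                break
    return best_agent, best_scores


# Toy stand-ins for the test runner and the LLM-proposed fix.
def run_tests(prompt: str) -> Scores:
    return {
        "identity": 1.0 if "verify" in prompt else 0.2,
        "signoff": 1.0 if "goodbye" in prompt else 0.4,
    }


def propose_fix(prompt: str, scores: Scores) -> str:
    if scores["identity"] < 0.7:
        return prompt + " Always verify identity."
    return prompt + " Say goodbye."


original = "Greet the caller."
fixed_agent, fixed_scores = auto_fix(original, run_tests, propose_fix)
first_agent, first_scores = auto_fix(original, run_tests, propose_fix, stop_on="improvement")
print(fixed_scores, first_scores)
```

The loop only ever works on new values, so the original agent is left unchanged, mirroring how `--save` writes a separate file.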

For a full walkthrough including what diagnose is good and bad at, see the Diagnose a failing test recipe.

Audio evaluation¶

Text-only evaluation has a blind spot: when an agent produces "415-555-1234", an LLM judge sees correct digits and passes. But TTS might speak it as "four hundred fifteen, five hundred fifty-five..." — which a caller can't use. Audio evaluation catches these issues by round-tripping agent messages through TTS/STT and judging what would actually be heard.

Conversation runs normally (text-only)
    ↓
Judges evaluate raw text → metric_results
    ↓
Agent messages → TTS → audio → STT → "heard" text
    ↓
Judges evaluate heard text → audio_metric_results

Both sets of results are stored. The original message text is preserved alongside what was heard, with a word-level diff shown in the UI.
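A word-level diff between the original and "heard" text can be approximated with Python's standard `difflib` module. This is a sketch under the assumption that both strings are split on whitespace; it is not how voicetest computes or renders its diff, and `word_diff` is a hypothetical helper.

```python
import difflib


def word_diff(original: str, heard: str):
    """Return (operation, original_words, heard_words) for each differing span."""
    a, b = original.split(), heard.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


# Hypothetical agent message and its TTS/STT round-trip.
changes = word_diff(
    "Call us at 415-555-1234 today",
    "Call us at four hundred fifteen today",
)
print(changes)
```

Here the only differing span is the phone number, which is exactly the kind of change a text-only judge would never see.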

Enable audio_eval in settings or toggle "Audio evaluation" in the Web UI. On-demand: click "Run audio eval" on any completed result.

Audio evaluation requires the TTS and STT services from voicetest up:

Service	URL	Description
`whisper`	http://localhost:8001	Faster Whisper STT
`kokoro`	http://localhost:8002	Kokoro TTS

Agent decomposition¶

Split a large agent into smaller, focused sub-agents:

voicetest decompose -a agent.json -o output/ [--num-agents N] [--model ID]

The command runs a three-phase LLM pipeline: analyze the graph → refine the decomposition plan → build sub-agent JSON files. It produces one .json file per sub-agent plus a manifest.json containing handoff rules and a sub-agent registry.

Options:

--num-agents N — Target number of sub-agents (default: let the LLM decide)
--model ID — LLM model override (defaults to the judge model from settings)

LLM response cache¶

DSPy LLM responses are cached to avoid redundant API calls. The default backend is local disk.

For shared caching across CI runners or team members, use the S3 backend:

# .voicetest/settings.toml
[cache]
cache_backend = "s3"
s3_bucket = "my-bucket"
s3_prefix = "dspy-cache/"
s3_region = "us-east-1"

Disable caching for a run with no_cache = true in run options or --no-cache on the CLI.

Transcript import & replay¶

Voicetest can ingest real production call transcripts as Runs, alongside the simulated runs the harness generates. Imported transcripts share the same storage and UI surfaces as simulated runs, and can be replayed against the agent's current graph to detect behavioral drift.

Operation	What it does	UI	CLI	REST
Import calls	Parse a platform-specific transcript dump and persist as a Run with `status="imported"` Results	"Import Calls…" button on the agent page	`voicetest import-call --agent <id> --transcript file.json`	`POST /api/agents/{id}/import-call` (multipart)
Replay	Drive a fresh conversation against the agent's current graph using a source Run's user turns as a script	"Replay" button on the run detail page	`voicetest replay <run-id>`	`POST /api/runs/{id}/replay`
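As an illustration of the REST column, the replay endpoint could be called from Python's standard library roughly as follows. The base URL is the one `voicetest serve` uses by default; the run ID is a placeholder, and the assumption that the endpoint needs no request body is not confirmed by this page, so check the API documentation before relying on it.

```python
import urllib.parse
import urllib.request


def build_replay_request(run_id: str,
                         base_url: str = "http://localhost:8000/api") -> urllib.request.Request:
    """Build (but do not send) a POST request for the replay endpoint."""
    # Assumption: the endpoint takes no request body.
    path = f"/runs/{urllib.parse.quote(run_id, safe='')}/replay"
    return urllib.request.Request(base_url + path, method="POST")


req = build_replay_request("run_123")  # "run_123" is a placeholder run ID
print(req.get_method(), req.full_url)

# To actually send it while the server is running:
# with urllib.request.urlopen(req) as response:
#     print(response.status)
```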

Supported formats¶

Retell — accepts the call object as returned by GET /v2/get-call/{call_id}, the post-call webhook envelope ({"event": ..., "call": {...}}), or arrays of either:

{
  "call_id": "call_abc123",
  "transcript_object": [
    {"role": "agent", "content": "Hi, how can I help?"},
    {"role": "user", "content": "I need to cancel my order."}
  ],
  "duration_ms": 60000,
  "start_timestamp": 1700000000000,
  "end_timestamp": 1700000060000
}

The adapter maps Retell's role: "agent" → role: "assistant" (voicetest convention) and ignores word-level timing details.
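The three accepted shapes and the role mapping can be sketched as below. This is an illustrative approximation, not the adapter's actual code: `iter_calls` and `to_messages` are hypothetical names, and detecting a webhook envelope by the presence of both `event` and `call` keys is an assumption.

```python
def iter_calls(payload):
    """Yield call objects from a bare call, a webhook envelope, or a list of either."""
    items = payload if isinstance(payload, list) else [payload]
    for item in items:
        # Assumption: an envelope is recognized by having both "event" and "call".
        if "event" in item and "call" in item:
            yield item["call"]
        else:
            yield item


def to_messages(call):
    """Map Retell's role "agent" to "assistant" and keep only role and content."""
    return [
        {"role": "assistant" if turn["role"] == "agent" else turn["role"],
         "content": turn["content"]}
        for turn in call.get("transcript_object", [])
    ]


bare_call = {
    "call_id": "call_abc123",
    "transcript_object": [
        {"role": "agent", "content": "Hi, how can I help?"},
        {"role": "user", "content": "I need to cancel my order."},
    ],
}
envelope = {"event": "call_ended", "call": dict(bare_call, call_id="call_def456")}

calls = list(iter_calls([bare_call, envelope]))
messages = to_messages(calls[0])
print([c["call_id"] for c in calls], messages)
```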

Other platforms (VAPI, LiveKit, Telnyx, Bland) are not yet supported. The --format option is parameterized so adapters can be added without breaking changes.

Data model¶

Imported run — a Run whose Results have status="imported", test_case_id=null, call_id=null. Each Result holds one call's transcript.
Replay run — a Run produced by replaying a source Run. Results have status="pass" (replay results are passive captures of live behavior; judging happens later when metrics are configured).

Both kinds render in the existing runs UI alongside simulated runs. The runs list shows an "imported" badge for runs whose Results are all imported.

Replay semantics¶

ScriptedUserSimulator yields the source's recorded user turns in order. The live agent's responses replace the recorded ones; the source's agent turns are not used. If the live agent diverges from the recorded conversation, the next recorded user turn may not fit perfectly — the replay continues anyway, since the conversation as a whole still produces a transcript you can judge.

Replay is best-effort: there is no LLM-based divergence handling in v1.
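The replay behavior described above can be sketched with a small scripted-user class. This is a simplified stand-in, not the real ScriptedUserSimulator: the class name `ScriptedUser`, the `agent_reply` callback, and the assumption that the user speaks first in each exchange are all hypothetical.

```python
class ScriptedUser:
    """Replays the recorded user turns in order, ignoring recorded agent turns."""

    def __init__(self, source_transcript):
        self._turns = [m["content"] for m in source_transcript if m["role"] == "user"]
        self._index = 0

    def next_turn(self):
        """Return the next recorded user turn, or None when the script is exhausted."""
        if self._index >= len(self._turns):
            return None
        turn = self._turns[self._index]
        self._index += 1
        return turn


def replay(source_transcript, agent_reply):
    """Drive a fresh conversation; agent_reply(transcript) stands in for the live agent."""
    user = ScriptedUser(source_transcript)
    transcript = []
    while (turn := user.next_turn()) is not None:
        transcript.append({"role": "user", "content": turn})
        # No divergence handling: the next scripted turn is used regardless of the reply.
        transcript.append({"role": "assistant", "content": agent_reply(transcript)})
    return transcript


source = [
    {"role": "assistant", "content": "Hi, how can I help?"},
    {"role": "user", "content": "I need to cancel my order."},
    {"role": "assistant", "content": "Sure, what's the order number?"},
    {"role": "user", "content": "It's 1234."},
]


def fake_agent(transcript):
    # Toy live agent: numbers its replies by how many user turns it has seen.
    return f"(live reply #{sum(1 for m in transcript if m['role'] == 'user')})"


replayed = replay(source, fake_agent)
print([m["content"] for m in replayed])
```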

Limitations¶

Single-platform support (Retell only).
No PII redaction at import time — clients with sensitive data should redact before ingesting.
No diff view between source and replay yet; they're separate runs in the UI.
No batch import via UI — large dumps are easier via CLI.

For the workflow walkthrough, see the Import call history recipe.

Web UI¶

voicetest serve starts a local server at http://localhost:8000 with visual surfaces for every feature on this page — graph visualization, test management, streaming transcripts, run history, side-by-side run comparison, diagnosis, audio evaluation, and settings.

The REST API lives at http://localhost:8000/api. Full API documentation: voicetest.dev/api.

Data is persisted to .voicetest/data.duckdb (override with VOICETEST_DB_PATH).