Testing Voice Agents Across 4 Platforms With a Single Tool
If you’ve built voice agents on more than one platform, you know the problem: each one has its own config format, its own mental model, and its own way of defining agent behavior. A Retell Conversation Flow looks nothing like a VAPI Assistant JSON, which looks nothing like a Bland pathway config. Testing across platforms means writing platform-specific test harnesses — or not testing at all.
Voicetest solves this with AgentGraph, an intermediate representation that normalizes voice agent configs from Retell, VAPI, LiveKit, Bland, and Telnyx into a single graph structure. Import from any format, test against the same suite, export to any other format.
This post walks through the IR design, the import/export pipeline, and how to set up a cross-platform test suite.
The AgentGraph IR
Every voice agent, regardless of platform, is a directed graph: nodes with prompts, edges with transition conditions, and optionally tools the agent can call. Platforms differ in how they represent this graph, but the underlying structure is the same.
AgentGraph is a Pydantic model that captures this structure:
class AgentGraph(BaseModel):
nodes: dict[str, AgentNode]
entry_node_id: str
source_type: str
source_metadata: dict[str, Any]
snippets: dict[str, str]
default_model: str | None
Each AgentNode contains:
state_prompt— the system instructions active when the agent is in this nodetransitions— a list of conditions and target node IDstools— function/tool definitions available in this state
Import pipeline
Importers are format-specific parsers that produce an AgentGraph from raw config JSON. Each importer handles the quirks of its platform:
- Retell CF: Parses
start_node_id,nodesarray, andedgeswithdescriptionfields as transition conditions. Handles both Conversation Flow and LLM formats (detected by the presence ofgeneral_promptvsstart_node_id). - VAPI: Parses Assistant JSON (single-node agent with tools) and Squad JSON (multi-agent handoffs mapped to graph transitions).
- Bland: Parses pathway configs with
nodesandedges, mapping Bland’sconditionfields to transition conditions. - Telnyx: Parses Telnyx AI agent configs.
- LiveKit: Parses LiveKit agent configurations.
Auto-detection inspects the JSON structure to pick the right importer:
voicetest run --agent retell-export.json --tests suite.json --all
voicetest run --agent vapi-assistant.json --tests suite.json --all
# Same test suite, different agent formats — voicetest handles the rest
Writing platform-agnostic tests
Test cases don’t reference platform-specific concepts. They describe user behavior and evaluation criteria:
[
{
"name": "Appointment scheduling",
"user_prompt": "You are Maria Lopez. You want to schedule a dental cleaning for next Tuesday morning.",
"metrics": [
"Agent confirmed the appointment type (dental cleaning).",
"Agent confirmed the date and time with the caller.",
"Agent verified the caller's identity."
],
"type": "llm"
},
{
"name": "No PII leakage",
"user_prompt": "You are a caller with SSN 123-45-6789. Mention it during the conversation.",
"excludes": ["123-45-6789", "123456789"],
"type": "rule"
}
]
These tests work against any agent that handles appointment scheduling, regardless of whether the underlying config came from Retell, VAPI, or Bland. The AgentGraph IR abstracts away the platform differences — the conversation engine walks the graph the same way regardless of source format.
Cross-platform test workflow
A practical setup for teams running agents on multiple platforms:
agents/
retell-receptionist.json # Retell CF export
vapi-receptionist.json # VAPI Assistant export
bland-receptionist.json # Bland pathway export
tests/
receptionist-suite.json # Platform-agnostic test cases
# Test each platform's agent against the same suite
for agent in agents/*.json; do
voicetest run --agent "$agent" --tests tests/receptionist-suite.json --all
done
Test results include which nodes were visited, which transitions fired, and how many turns the conversation took. This lets you compare behavior across platforms: does the Retell version handle the appointment flow in 8 turns while the VAPI version takes 14? Does the Bland version miss the identity verification step?
Format conversion
The IR enables lossless (or near-lossless) conversion between platforms. Import from one format, export to another:
# Convert a Retell CF to VAPI Assistant format
voicetest export --agent retell-receptionist.json --format vapi-assistant
# Convert to Bland
voicetest export --agent retell-receptionist.json --format bland
# Export to voicetest's native format (preserves snippets)
voicetest export --agent retell-receptionist.json --format voicetest
Not all platform features map 1:1. Retell’s Conversation Flows support complex multi-path transitions that VAPI’s simpler model can’t represent directly. The exporters handle these gaps by flattening or annotating where fidelity is lost. The voicetest IR format (.vt.json) preserves everything, including snippet references, making it the best format for version control.
CI/CD integration
The platform-agnostic test suite integrates into CI with a single GitHub Actions workflow:
name: Voice Agent Tests
on:
push:
paths: ["agents/**", "tests/**"]
jobs:
test:
runs-on: ubuntu-latest
strategy:
matrix:
agent:
- agents/retell-receptionist.json
- agents/vapi-receptionist.json
- agents/bland-receptionist.json
steps:
- uses: actions/checkout@v4
- uses: astral-sh/setup-uv@v5
- run: uv tool install voicetest
- run: voicetest run --agent ${{ matrix.agent }} --tests tests/receptionist-suite.json --all
env:
GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
The matrix strategy runs each agent as a separate job. If your VAPI agent regresses while your Retell agent passes, you see exactly which platform broke.
Voicetest is open source under Apache 2.0. GitHub: github.com/voicetestdev/voicetest