Voice Agent Hardening Test Plan
Preamble — Why This Document Exists
Day 3 of the verticals-first platform (2026-04-29) surfaced eight P0 regressions, all from a single PR (#251), in a single founder-supervised 7-hour verification session. Every bug had been present at merge time. Every bug was undetectable by the tests that existed at merge time, because every test stubbed the layer where the bug lived. The structural diagnosis:
PR #251 surface area: browser → mic → LiveKit → agent → STT → LLM → tool → SIP API → carrier → callee → bridge
Tests in PR #251: [stub] [stub] [stub] [stub] [stub] [stub] [stub] [stub] [stub] [n/a] [n/a]
Bugs caught by tests: 0
Bugs found in live test: 8
This document is the founder's, the next agent's, and every contributor's map of "is the voice agent actually robust?" It must be read before touching voice code, consulted before opening a PR that touches voice layers, and updated whenever a new failure mode is discovered.
Memory files (mandatory background reading before any voice work):
- memory/feedback_round_trip_test_before_merge.md — the 8-P0 post-mortem; the core argument for round-trip Playwright gating
- memory/feedback_telnyx_outbound_three_requirements.md — the three Telnyx provisioning requirements + PATCH gotcha
- memory/feedback_robustness_over_velocity.md — founder priority: conversion-quality demos justify extra days
- memory/feedback_livekit_recovery_lk_deploy_only.md — lk agent restart is insufficient; only lk agent deploy recovers
- memory/feedback_lk_overwrite_flag_destroys_secrets.md — --overwrite nukes all 22 production secrets
§1 — The 11 Layers (with definitions and concrete examples)
The voice call stack has eleven distinct layers. A feature spanning multiple layers must have non-stubbing test coverage at every layer it touches. "Stubbing a layer" means the test replaces the real system at that layer with a mock or no-op, rendering failures at that layer invisible.
Layer 1 — Browser DOM + getUserMedia
What lives here: The browser page, the JavaScript/React component, the getUserMedia({ audio: true }) call that requests microphone permission, and the <audio> DOM element that receives incoming audio tracks. This is the user's entire experience until the voice call is established.
Concrete file: src/components/cold-outreach/DirectorTransferDemo.tsx — the handleStartCall() function that runs navigator.mediaDevices.getUserMedia({ audio: true }) before connecting to LiveKit.
What stubbing looks like: A test that mounts the component without actually loading it in a Chromium browser, e.g., a React unit test under JSDOM that patches navigator.mediaDevices. The patched getUserMedia returns a resolved promise instantly, bypassing real permission timing and real audio track events.
Layer 2 — Mic input → WebRTC track publish
What lives here: The WebRTC MediaStreamTrack obtained from getUserMedia, the LiveKit SDK call to room.localParticipant.setMicrophoneEnabled(true) or publishTrack(), and the timing between permission grant and agent dispatch. This layer is where "user's mic input actually reaches the agent" is determined.
Concrete file: src/components/cold-outreach/DirectorTransferDemo.tsx lines around setMicrophoneEnabled(true) — the fix for P0 #4 moved this call before room.connect(), ensuring mic permission is granted before the agent is dispatched.
What stubbing looks like: A test that calls room.connect() with a mock Room object whose setMicrophoneEnabled is a no-op. The mock always "succeeds" in zero milliseconds; the real bug (permission prompt fires AFTER agent dispatched, causing a mic-publish race) is invisible.
Layer 3 — LiveKit Cloud room + signaling
What lives here: The LiveKit Cloud room as a service: room creation, WebRTC signaling, participant join/leave events, track subscription events (RoomEvent.TrackSubscribed), and the room's TURN relay infrastructure. This is the real-time switching fabric.
Concrete file: src/components/cold-outreach/DirectorTransferDemo.tsx — the room.on(RoomEvent.TrackSubscribed, (track, _, participant) => { ... }) handler added in commit fe3f07a7 to attach remote audio to a DOM <audio> element.
What stubbing looks like: A test that instantiates a mock Room and fires synthetic RoomEvent.TrackSubscribed events on a timer. The real bug (handler was entirely missing — no DOM <audio> element ever appeared, so callers heard silence from the AI) is invisible because the mock fires the event regardless.
Layer 4 — Agent runtime / dispatcher (Python)
What lives here: The LiveKit Python agent process (main.py), the @server.rtc_session handler, the AgentSession, JobContext, agent class instantiation, and the LiveKit named-dispatch mechanism (agent_name="churchwiseai-voice"). Also the dispatch rules (SDR_cYzx7sAkUTvx, SDR_Wpyno7GDNQqg) that route calls to this agent.
Concrete file: voice-agent-livekit/main.py — the entrypoint; voice-agent-livekit/session.py — the session lifecycle and resolve_route() function.
What stubbing looks like: A test that imports agent classes and calls methods on them directly without ever spinning up the LiveKit agent runtime. Fine for testing Python logic; invisible to runtime-level failures like livekit/agents#3104 (named-dispatch hang where lk agent list shows "Available" but no worker is registered).
Layer 5 — STT (Deepgram via LiveKit plugin)
What lives here: The Deepgram Nova-3 real-time speech-to-text transcription, the LiveKit Deepgram plugin configuration, keyterms boost (tradition-specific theological terminology), and the TranscriptionSegment objects that arrive as conversation turns.
Concrete file: voice-agent-livekit/main.py — deepgram.STT(model="nova-3", ...) instantiation with keyterms= list.
What stubbing looks like: A test that creates a mock STT output with pre-written TranscriptionSegment objects. The real mic-to-transcript pipeline (audio codec → Deepgram → transcript) never runs; acoustic failures and keyterms boost effects are invisible.
Layer 6 — LLM (Anthropic / Gemini / Groq disabled)
What lives here: The LLM API call (Claude Haiku 4.5 primary, Gemini 2.5 Flash fallback), the tool schema construction from @function_tool-annotated Python methods, the parse_function_tools() call that uses typing.get_type_hints() to build JSON schemas, and the ModelSettings with caching and system prompt injection.
Concrete file: voice-agent-livekit/verticals/church/agents.py — every @function_tool-decorated method; voice-agent-livekit/safety.py — flag_safety_event(context: RunContext, ...).
What stubbing looks like: A test that mocks the llm.LLM object and returns hard-coded llm.ChatChunk objects. The real parse_function_tools() execution — which crashed with KeyError on Day 3 P0 #1 because flag_safety_event lacked a type annotation — never runs. The AST-based test_function_tool_schemas.py is specifically designed to catch this class of bug without mocking.
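The annotation check described above can be sketched with the standard-library ast module. This is a minimal illustration, not the contents of test_function_tool_schemas.py: the helper names (decorator_name, unannotated_tool_params) and the SAMPLE source are hypothetical, and the sketch ignores *args/**kwargs.

```python
import ast
from typing import List, Optional


def decorator_name(node: ast.expr) -> Optional[str]:
    """Return the bare name of a decorator: @x, @x(...), or @pkg.x."""
    if isinstance(node, ast.Call):
        node = node.func
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        return node.attr
    return None


def unannotated_tool_params(source: str, decorator: str = "function_tool") -> List[str]:
    """List 'function.param' for each un-annotated parameter of a decorated function."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if not any(decorator_name(d) == decorator for d in node.decorator_list):
            continue
        params = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
        for arg in params:
            # self/cls never carry annotations; everything else must.
            if arg.arg not in ("self", "cls") and arg.annotation is None:
                problems.append(f"{node.name}.{arg.arg}")
    return problems


# Hypothetical source resembling the P0-1 shape: `context` has no annotation.
SAMPLE = '''
@function_tool
async def flag_safety_event(context, reason: str):
    pass

@function_tool()
def ok(self, x: int) -> None:
    pass

def helper(y):
    pass
'''

print(unannotated_tool_params(SAMPLE))  # ['flag_safety_event.context']
```

Because the walker parses source text instead of importing it, it needs no API keys and never executes agent code, which is what lets it run at PR time.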
Layer 7 — Tool call registration + invocation
What lives here: The registration of @function_tool methods onto agent class instances, the LiveKit framework's dispatch of LLM tool call requests to the correct method, and the agent's routing of tool calls across agent handoff boundaries (e.g., CoordinatorAgent handing a call to CareAgent).
Concrete file: voice-agent-livekit/verticals/church/agents.py — CoordinatorAgent class (line 447+), CareAgent class (line 140+); specifically, transfer_to_director is defined on CareAgent (line 354) and was added to CoordinatorAgent in commit 067c7c8f to fix P0 #5.
What stubbing looks like: A test that calls agent.transfer_to_director(...) directly on a specific class. If only CareAgent.transfer_to_director is tested, the bug (method missing from CoordinatorAgent, funeral path uses CoordinatorAgent, LLM hallucinated an alternate name and got "unknown AI function") is invisible.
Layer 8 — SIP outbound API (CreateSIPParticipant, TransferSIPParticipant)
What lives here: The LiveKit Python SDK calls lk_api.CreateSIPParticipantRequest(...) and lk_api.TransferSIPParticipantRequest(...) in core/transfer.py, and the field shape requirements enforced by LiveKit server-side validation (livekit/protocol/livekit/sip.go). This is where "dial the director's phone via the outbound SIP trunk" actually happens.
Concrete file: voice-agent-livekit/core/transfer.py — execute_attended_transfer() function, lines ~460-595.
What stubbing looks like: A test that patches lk_api.CreateSIPParticipantRequest to accept any kwargs. The real validation rule — sip_call_to must be a bare phone number or SIP user (not a full sip:user@domain URI); transfer_to must have a URI scheme prefix (tel:+E164) — never runs. P0 #6 (TwirpError: SipCallTo should be a phone number or SIP user, not a full SIP URI) is invisible.
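The two field-shape rules above can be expressed as plain string checks that run without the LiveKit SDK. The function names below are hypothetical, and the rules are only the ones stated in this document; the server-side validator may enforce more.

```python
from typing import List


def sip_call_to_problems(value: str) -> List[str]:
    """sip_call_to must be a bare phone number or SIP user, never a full URI."""
    problems = []
    if value.lower().startswith("sip:"):
        problems.append("sip_call_to must not start with a sip: scheme")
    if "@" in value:
        problems.append("sip_call_to must not contain a domain part (@)")
    return problems


def transfer_to_problems(value: str) -> List[str]:
    """transfer_to must carry a URI scheme prefix (tel: or sip:)."""
    if value.lower().startswith(("tel:", "sip:")):
        return []
    return ["transfer_to must start with tel: or sip:"]


# The P0-6 shape fails both sip_call_to checks; the bare number passes.
print(sip_call_to_problems("sip:+15555550100@sip.example.com"))
print(sip_call_to_problems("+15555550100"))   # []
print(transfer_to_problems("tel:+15555550100"))  # []
```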
Layer 9 — Carrier (Telnyx / Twilio)
What lives here: The carrier-side state for every outbound SIP trunk: Telnyx credential connection authentication, outbound voice profile binding (outbound_voice_profile_id), DID-to-connection binding, and the actual PSTN network reach. This layer is entirely outside the codebase; it lives in the Telnyx dashboard and API.
Concrete file: Not a code file — this is the Telnyx credential connection 2948197312620398250. Verified via GET https://api.telnyx.com/v2/credential_connections/2948197312620398250.
What stubbing looks like: Any test that considers LiveKit's CreateSIPParticipant returning a participant_id as proof that the call will connect. LiveKit returns a participant_id the moment the SIP INVITE is sent; Telnyx's silent 403/D35 rejection (caused by null outbound_voice_profile_id in P0 #7) happens asynchronously and is invisible to the SDK call.
Layer 10 — Callee (PSTN ringer reaching real phone)
What lives here: The real phone that rings when the director is dialed — the founder's cell, a demo echo number, a funeral director's on-call phone. This layer is verified only by a human hearing their phone ring.
Concrete file: N/A — this is physical telephony infrastructure. Test substitute: a Telnyx echo number that auto-answers, says nothing, and hangs up (proves carrier connectivity at <$0.01/test).
What stubbing looks like: Any test that does not actually dial a number and verify it rings. All automated tests below Layer 9 stub this layer.
Layer 11 — Audio bridge (REFER vs room-native mixing)
What lives here: The bridge mechanic that connects the two legs (caller + director) after the transfer: SIP REFER (TransferSIPParticipant) for PSTN-caller paths, or LiveKit room-native audio mixing (agent leaves room, browser ↔ SIP-director connected by the room) for WebRTC-caller demo paths. This is where the architectural split between PSTN and browser demos occurs.
Concrete file: voice-agent-livekit/core/transfer.py — execute_attended_transfer() bridge step (lines ~580-600); voice-agent-livekit/verticals/church/agents.py — transfer_to_director() on both CoordinatorAgent and CareAgent.
What stubbing looks like: A test that asserts TransferSIPParticipant was called without checking the caller leg's transport type. P0 #8 — "no SIP session associated with participant" when TransferSIPParticipant is called for a WebRTC browser caller — is invisible because the mock accepts the call regardless.
§2 — Failure-Mode Catalog
The following table maps every confirmed production failure to its layer. "Static test that catches it now" means a test in the current main branch (or in the worktrees carrying Day 3 fixes). "Integration test that catches it now" means a real round-trip test, not a stub.
| # | Bug | Layer | Symptom in production | Static test catches it now | Integration test catches it now |
|---|---|---|---|---|---|
| P0-1 | safety.py flag_safety_event(context) missing type annotation → KeyError in parse_function_tools → both Anthropic + Google reject all LLM turns → dead air | L6 — LLM schema build | Agent greets caller; first user turn → 57s silence → caller hangs up | test_function_tool_schemas.py (AST-walks all @function_tool methods) — in worktree, not yet on main ⚠️ | ⚠️ (none yet) |
| P0-2 | _run_call referenced demo_director_phone_override outside scope → NameError on funeral-prospect path | L4 — Agent runtime | Funeral prospect path throws NameError immediately; agent errors out | voice-tool-schemas.yml workflow (ruff F821 catches undefined name usage) — in worktree, not yet on main ⚠️ | ⚠️ (none yet) |
| P0-3 | DirectorTransferDemo.tsx missing RoomEvent.TrackSubscribed handler → AI's TTS audio never reached browser DOM | L3 — LiveKit room events | Prospect clicks "Try live director"; browser call starts but they hear nothing from the AI | ⚠️ (none yet) | ⚠️ (none yet) — requires Playwright round-trip |
| P0-4 | setMicrophoneEnabled(true) ran AFTER room.connect() → mic-publish race → agent dispatched before caller's audio tracked | L2 — Mic publish timing | Caller's voice never reaches agent; AI hears silence, cannot respond to what caller says | ⚠️ (none yet) | ⚠️ (none yet) — requires Playwright round-trip |
| P0-5 | transfer_to_director on CareAgent only; funeral-prospect path uses CoordinatorAgent → LLM hallucinated tool name | L7 — Tool registration scope | LLM logs "unknown AI function initiate_transfer"; transfer never fires | ⚠️ (none yet) — requires per-agent-class tool inventory check | ⚠️ (none yet) — requires Playwright round-trip |
| P0-6 | sip_call_to=f"sip:{n}@{domain}" (full SIP URI) → LiveKit rejects with TwirpError: SipCallTo should be a phone number or SIP user, not a full SIP URI | L8 — SIP API field shape | Director's phone never rings; LiveKit logs TwirpError synchronously | test_transfer_sip_payload_shape.py (6 assertions on field format) — in worktree, not yet on main ⚠️ | ⚠️ (none yet) |
| P0-7 | Telnyx credential connection outbound_voice_profile_id: null → carrier silently 403/D35-rejects all outbound INVITEs | L9 — Carrier config | CreateSIPParticipant returns participant_id; director phone never rings; no MDR record | ⚠️ (none yet — requires voice-health cron extension to check Telnyx API) | ⚠️ (none yet — requires daily outbound dial cron to echo number) |
| P0-8 | TransferSIPParticipant (SIP REFER) fails for WebRTC browser caller with "no SIP session associated with participant" | L11 — Bridge mechanic | Transfer initiated; immediate error; caller and director never connect | ⚠️ (none yet — test_transfer_sip_payload_shape.py checks field shape but not caller-type branch) | ⚠️ (none yet — requires Playwright with caller-type assertion) |
| Near-miss-A | lk agent update-secrets --overwrite would have nuked all 22 production secrets | L4 — Agent runtime | All 4 customer phone lines dead; no API keys; full outage | ⚠️ (none — CLI flag; caught by interactive prompt before Enter) | ⚠️ (none — operational hazard, not code bug) |
| Near-miss-B | livekit/agents#3104 named-dispatch hang — lk agent list shows "Available" but no worker registered | L4 — Agent runtime | Calls ring indefinitely; agent never answers; silent to LiveKit-side callers | test_load_church_data_integration.py (catches DB path failures but not agent-runtime hang) | ⚠️ (none yet — requires post-deploy health assertion) |
| Near-miss-C | Cartesia voice_id silently defaults to "Katie" when the ID is not found | L5 (TTS config) | Customer hears wrong voice; tenant isolation broken | ⚠️ (none yet — no voice_id format or presence validation test) | ⚠️ (none yet) |
| Near-miss-D | classify_call Gemini-only single-point-of-failure | L6 — LLM fallback | If Gemini down, classification silently fails; no fallback chain | ⚠️ (none yet — LLM fallback chain not tested) | ⚠️ (none yet) |
| Prior-1 | M2 migration dropped FK constraints → PostgREST join syntax in _fetch_voice_agent_row returned 400 → all dedicated-trunk demos routed to Sales Agent (~24h) | L4 — DB path in agent runtime | Every church number routes to sales agent; churches get generic sales pitch | test_routing.py (unit, mocked) — insufficient alone | test_load_church_data_integration.py (LIVE Supabase query against all demo + paying-customer UUIDs — catches schema regressions) — on main |
| Prior-2 | OUTBOUND_TRUNK_ID env var not asserted at startup → empty string passed to CreateSIPParticipantRequest → silent dead call | L8 — SIP API config | Transfer attempted; LiveKit returns not_found; director never called | test_transfer_env.py (asserts RuntimeError on empty trunk ID in production) — on main | ⚠️ (none yet) |
§3 — Existing Test Surface (current state, 2026-04-30)
Tests are organized by layer. "On main" means the test is committed to the main branch (feat/verticals-platform-day1-foundation or main). "In worktree" means the test exists in a worktree branch that has not yet been merged to main.
Layer 1 — Browser DOM + getUserMedia
- ⚠️ No tests. DirectorTransferDemo.tsx has no unit tests. The component's DOM behavior (audio element creation, getUserMedia timing) is only verifiable via Playwright round-trip.
Layer 2 — Mic input → WebRTC track publish
- ⚠️ No tests. Mic-publish timing (the P0-4 fix) has no automated regression guard.
Layer 3 — LiveKit Cloud room + signaling
- ⚠️ No tests. RoomEvent.TrackSubscribed handler presence (the P0-3 fix) has no automated regression guard. Only verifiable via Playwright.
Layer 4 — Agent runtime / dispatcher (Python)
- voice-agent-livekit/tests/test_routing.py — on main. Unit tests for resolve_route() covering every PHONE_REGISTRY entry. Mocked Supabase. Regression guard for the P0 routing failure.
- voice-agent-livekit/tests/test_load_church_data_integration.py — on main. LIVE Supabase integration test; queries real production DB for every church_id in PHONE_REGISTRY; asserts load_church_data returns valid dict. Catches schema regressions (FK drops, RLS changes, column renames). Runs in voice-routing-integration-on-pr.yml CI.
- voice-agent-livekit/tests/test_calls_limit.py — on main. Unit tests for CALLS_LIMIT_BY_PLAN, NULL-fallback path, and at_capacity flag.
- .github/workflows/voice-routing-integration-on-pr.yml — on main. Triggers test_routing.py + test_load_church_data_integration.py + test_calls_limit.py on PRs touching voice-agent-livekit Python code.
Layer 5 — STT (Deepgram)
- voice-agent-livekit/tests/test_audio_cache.py — on main. Tests core/audio_cache.py (audio cache lookup/miss, bridge phrases, thinking phrases, voice_name_for_id). Indirectly touches TTS wiring but not STT.
- voice-agent-livekit/tests/test_audio_bridge.py — on main. Tests core/audio_bridge.py (EmotionDetector, BridgePlayer). Tests the bridge player that uses cached audio, not live Deepgram.
- ⚠️ No STT live-transcription tests. Keyterms boost, nova-3 model selection, and the real STT pipeline are not tested.
Layer 6 — LLM tool schema
- voice-agent-livekit/tests/test_function_tool_schemas.py — in worktree agent-a2595426576a83769, not yet on main. AST-based contract test that walks every Python file in the voice agent package, finds all @function_tool-decorated methods, and asserts every parameter has a type annotation. Runs in <1s with no API keys. This is the test that would have caught P0-1 at PR time.
- .github/workflows/voice-tool-schemas.yml — in worktree, not yet on main. CI workflow that runs ruff check --select F821 --target-version py312 (catches undefined names, P0-2) plus test_function_tool_schemas.py.
Layer 7 — Tool call registration + invocation
- voice-agent-livekit/tests/test_church_info.py — on main. Tests church_info.py fallback formatters (used when PCO not configured). Does not test @function_tool registration or multi-agent routing.
- voice-agent-livekit/tests/test_escalation_routing.py — on main. 102-message contract test for the two-track escalation (Track A operational vs Track B safety/crisis). Uses local regexes mirroring moderation.py. LIFE-SAFETY tagged; mandatory before merging changes to escalation paths.
- ⚠️ No tool-registration inventory test across both CoordinatorAgent and CareAgent. P0-5 (tool on wrong agent class) has no regression guard at this layer beyond test_function_tool_schemas.py (which only checks annotations, not which class has which method).
Layer 8 — SIP outbound API
- voice-agent-livekit/tests/test_transfer_sip_payload_shape.py — in worktree agent-a2595426576a83769, not yet on main. Six assertions on CreateSIPParticipantRequest and TransferSIPParticipantRequest field shape, including sip_call_to must be bare phone/user (no @), transfer_to must have tel: or sip: prefix, and SDK field-name drift detection. Catches P0-6 at PR time.
- voice-agent-livekit/tests/test_transfer_crisis_gate.py — on main. LIFE-SAFETY hard gate: asserts execute_attended_transfer() returns reason='crisis_gate', success=False for every crisis/DV/threat phrase. Regression guard for the hard-coded crisis block in core/transfer.py.
- voice-agent-livekit/tests/test_transfer_env.py — on main. Asserts _resolve_outbound_trunk_id() raises RuntimeError in production when OUTBOUND_TRUNK_ID is empty. Catches the silent dead call from Prior-2 (original code warned and proceeded).
- voice-agent-livekit/tests/test_moderation.py — on main. Unit tests for moderation.py crisis/threat/abuse regex patterns. Verifies all crisis phrases are caught; verifies false-positive exclusions.
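The fail-fast behavior that test_transfer_env.py guards can be sketched as below. This is an assumption-laden illustration, not the real _resolve_outbound_trunk_id(): how the agent detects "production" is not stated in this document, so it is passed in as a parameter, and the env mapping is injectable only to keep the sketch testable.

```python
import os
from typing import Mapping, Optional


def resolve_outbound_trunk_id(
    env: Optional[Mapping[str, str]] = None,
    production: bool = True,  # assumption: the caller decides what counts as production
) -> str:
    """Return the configured trunk ID; refuse to proceed in production without one."""
    env = os.environ if env is None else env
    trunk_id = (env.get("OUTBOUND_TRUNK_ID") or "").strip()
    if not trunk_id and production:
        # Raising here turns a silent dead call into a loud startup failure.
        raise RuntimeError("OUTBOUND_TRUNK_ID is empty; refusing to place outbound calls")
    return trunk_id
```

The design point is that an empty string never reaches CreateSIPParticipantRequest: the failure surfaces at resolution time, where it is cheap to diagnose.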
Layer 9 — Carrier (Telnyx / Twilio)
- src/app/api/cron/voice-health/route.ts — on main. Runs every 15 minutes. Checks LiveKit-side state: inbound trunk ST_Xa3Bp9aixRFP presence and its four phone numbers, dispatch rule IDs and agent_name, outbound trunk ST_X3n9jxR55VrB presence. Reports HealthIssue objects with critical or warning severity. Fires P0 alerts via reportError(). Gap: does NOT check Telnyx-side carrier state — specifically, does not verify outbound_voice_profile_id is set on credential connection 2948197312620398250. P0-7 would have surfaced here if this check existed.
- ⚠️ No daily outbound-dial certification. There is no automated test that actually dials a Telnyx echo number end-to-end to prove the carrier path works.
Layer 10 — Callee (PSTN ringer)
- ⚠️ No automated tests. This layer is only testable with real telephony. Current approach: manual founder-supervised verification sessions.
Layer 11 — Audio bridge
- voice-agent-livekit/tests/test_transfer_sip_payload_shape.py — in worktree. Asserts field shapes on TransferSIPParticipantRequest. Does not assert whether TransferSIPParticipant should be called at all based on caller leg transport type.
- ⚠️ No WebRTC-caller-branch test. The architectural fix for P0-8 (detect ParticipantKind.STANDARD for WebRTC callers and skip TransferSIPParticipant) has no regression guard.
Cross-layer behavioral tests
- voice-agent-livekit/tests/behavioral/ — on main. Behavioral test suite covering church and funeral verticals. Uses LLM-as-judge (Haiku) against scripted scenarios.
- .github/workflows/voice-behavioral-nightly-church.yml, voice-behavioral-funeral.yml, voice-behavioral-critical-on-pr.yml — on main. Nightly and on-PR behavioral runs.
- .github/workflows/voice-clients-drift.yml — on main. Voice client YAML drift detection.
§4 — Gap Closure Roadmap
Prioritized by founder-quality framing. P0 = blocks cold-email GO/NO-GO. P1 = blocks production confidence. P2 = important but not blocking.
G1 — Round-trip Playwright spec cold-outreach-director-transfer.spec.ts
- Gaps closed: P0-3 (audio element), P0-4 (mic timing), P0-5 (wrong agent class), P0-8 (WebRTC bridge branch), near-miss-B (post-deploy health)
- Type: Playwright e2e against deployed Vercel preview URL (NOT localhost)
- File: churchwiseai-web/e2e/cold-outreach-director-transfer.spec.ts
- CI workflow: .github/workflows/cold-outreach-director-transfer.yml
- Trigger: PRs touching src/components/cold-outreach/**, src/app/api/livekit/token/**, voice-agent-livekit/core/transfer.py, voice-agent-livekit/verticals/*/agents.py
- Key assertions: (a) audio element appears in DOM after TrackSubscribed; (b) voice_call_logs.transcript contains both role='assistant' and role='user' within 60s; (c) for WebRTC-caller path, TransferSIPParticipant is NOT called; (d) SIP participant joins room; (e) AI agent audio muted/left after bridge intro
- Effort: L
- Dependency: P0-8 architectural fix (Day 4 §4.1) must land first; requires Telnyx echo number env var PLAYWRIGHT_ECHO_NUMBER
- Priority: P0
- In flight: Lane B (Day 4) — spec skeleton described in 07-DAY4-HANDOFF.md §4.2
G2 — Merge worktree tests to main: test_function_tool_schemas.py + voice-tool-schemas.yml
- Gaps closed: P0-1 (@function_tool annotation completeness), P0-2 (ruff F821 undefined names)
- Type: Static contract (AST-based, no API keys required)
- File: voice-agent-livekit/tests/test_function_tool_schemas.py, .github/workflows/voice-tool-schemas.yml
- Effort: S (tests exist in worktree agent-a2595426576a83769; merge to foundation branch)
- Dependency: None — self-contained
- Priority: P0
- In flight: Exists in worktree, pending merge to feat/verticals-platform-day1-foundation
G3 — Merge worktree test to main: test_transfer_sip_payload_shape.py
- Gaps closed: P0-6 (SIP URI field shape), SDK field-name drift
- Type: Static contract (Python, mocked LiveKit SDK)
- File: voice-agent-livekit/tests/test_transfer_sip_payload_shape.py
- Effort: S (test exists in worktree agent-a2595426576a83769; merge to foundation branch)
- Dependency: None
- Priority: P0
- In flight: Exists in worktree, pending merge
G4 — Voice-health cron Telnyx carrier config extension
- Gaps closed: P0-7 (outbound_voice_profile_id null)
- Type: Synthetic cron (HTTP to Telnyx API)
- File: src/app/api/cron/voice-health/route.ts — extend existing cron
- Key assertion: GET /v2/credential_connections/2948197312620398250 → .data.outbound.outbound_voice_profile_id must not be null AND phone number +12268830526 (or equivalent) must have connection_id == 2948197312620398250
- Effort: M
- Dependency: TELNYX_API_KEY env var in Vercel production (already set per runbooks)
- Priority: P1
- In flight: Day 4 open follow-up 07-DAY4-HANDOFF.md §7
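The G4 key assertion reduces to two checks over already-fetched JSON. The sketch below shows that logic in Python for illustration only (the cron itself is a TypeScript route). The function name is hypothetical, and the phone-number payload shape (data.connection_id) is an assumption; only the .data.outbound.outbound_voice_profile_id path comes from this document.

```python
from typing import List


def carrier_config_issues(connection: dict, number: dict, expected_connection_id: str) -> List[str]:
    """Return human-readable issues; an empty list means the carrier config looks healthy."""
    issues = []
    outbound = (connection.get("data") or {}).get("outbound") or {}
    if not outbound.get("outbound_voice_profile_id"):
        issues.append("outbound_voice_profile_id is null on the credential connection")
    # Assumed payload shape for the phone-number lookup: {"data": {"connection_id": ...}}
    bound = (number.get("data") or {}).get("connection_id")
    if str(bound) != expected_connection_id:
        issues.append(f"phone number is bound to connection {bound!r}, expected {expected_connection_id}")
    return issues


# The P0-7 shape: the connection exists, but the profile binding is null.
p0_7 = {"data": {"outbound": {"outbound_voice_profile_id": None}}}
number = {"data": {"connection_id": "2948197312620398250"}}
print(carrier_config_issues(p0_7, number, "2948197312620398250"))
```

Keeping the check a pure function over parsed payloads means the HTTP calls can be tested separately from the decision logic.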
G5 — Daily outbound-trunk dial certification cron
- Gaps closed: P0-7 (carrier-side silent rejection), Near-miss-B (agent registration)
- Type: Synthetic cron (real outbound dial to Telnyx echo number)
- File: src/app/api/cron/voice-outbound-cert/route.ts (new)
- Key assertion: Dial TELNYX_ECHO_NUMBER via lk sip participant create --trunk ST_X3n9jxR55VrB; assert participant joins LiveKit room within 30s; assert participant disconnects cleanly; total cost <$0.01 per run
- Effort: M
- Dependency: Telnyx echo number provisioned; TELNYX_ECHO_NUMBER env var in Vercel; keep dials OFF the founder's cell
- Priority: P1
- In flight: Day 4 open follow-up 07-DAY4-HANDOFF.md §7
G6 — Voice agent boot smoke (post-deploy health assertion)
- Gaps closed: Near-miss-B (livekit/agents#3104 silent registration failure)
- Type: Integration check (scripted as post-deploy step)
- File: Add to voice agent deploy runbook knowledge/runbooks/voice-provisioning.md + knowledge/runbooks/voice-ops/voice-agent-debug.md
- Key assertion: After lk agent deploy, within 90s, lk agent logs --log-type deploy contains "registered worker"; if not present after 90s → escalate; if present → green
- Effort: S (already in CLAUDE.md; needs automated script and runbook)
- Dependency: None
- Priority: P1
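The G6 assertion is a poll-with-deadline loop. A minimal sketch follows, assuming a caller-supplied fetch_logs function (hypothetical; in practice it might shell out to the lk agent logs command named above). The sleep and clock parameters exist only so the loop can be tested without waiting.

```python
import time
from typing import Callable


def wait_for_registration(
    fetch_logs: Callable[[], str],
    marker: str = "registered worker",
    timeout_s: float = 90.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once the marker appears in the logs, False if the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        if marker in fetch_logs():
            return True   # green
        if clock() >= deadline:
            return False  # escalate
        sleep(poll_s)
```

A deploy script would treat a False return as the escalation path instead of trusting that the agent shows as "Available".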
G7 — WebRTC↔SIP bridge branch test (test_transfer_browser_branch.py)
- Gaps closed: P0-8 (architectural)
- Type: Unit pytest (mocked ParticipantKind, mocked LiveKit room)
- File: voice-agent-livekit/tests/test_transfer_browser_branch.py
- Key assertions: (a) WebRTC caller → TransferSIPParticipant NOT called; (b) SIP caller → TransferSIPParticipant IS called; (c) crisis gate applies regardless of caller transport type
- Effort: M
- Dependency: P0-8 architectural fix (Day 4 §4.1) must land first
- Priority: P0
- In flight: Lane A (Day 4) per 07-DAY4-HANDOFF.md §4.1
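The three G7 assertions describe a small decision function. The sketch below uses a hypothetical CallerKind enum as a stand-in for LiveKit's participant kind, plus a hypothetical plan_bridge helper; the reason='crisis_gate', success=False shape mirrors what test_transfer_crisis_gate.py is described as asserting.

```python
from enum import Enum


class CallerKind(Enum):
    """Hypothetical stand-in for the caller leg's transport type."""
    STANDARD = "standard"  # WebRTC browser caller
    SIP = "sip"            # PSTN caller


def plan_bridge(caller_kind: CallerKind, crisis_detected: bool) -> dict:
    """Pick the bridge mechanic; the crisis gate wins regardless of transport."""
    if crisis_detected:
        return {"success": False, "reason": "crisis_gate", "bridge": None}
    if caller_kind is CallerKind.SIP:
        # Only a SIP caller has a SIP session that a REFER can act on.
        return {"success": True, "reason": None, "bridge": "sip_refer"}
    # WebRTC caller: rely on room-native mixing and skip TransferSIPParticipant.
    return {"success": True, "reason": None, "bridge": "room_native"}
```

Checking the crisis flag first keeps assertion (c) true by construction: no transport branch can bypass it.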
G8 — Per-agent tool inventory contract test
- Gaps closed: P0-5 (tool on wrong agent class)
- Type: Static contract (Python reflection)
- File: voice-agent-livekit/tests/test_agent_tool_inventory.py
- Key assertion: Assert that a pre-defined set of tools (including transfer_to_director) are registered on BOTH CoordinatorAgent AND CareAgent. Extend to FuneralCoordinatorAgent and any future agent class.
- Effort: S
- Dependency: None
- Priority: P1
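A reflection-based inventory check for G8 might look like the sketch below. The function_tool decorator here is a local stand-in that tags functions with a marker attribute; how the real LiveKit decorator marks its tools is not covered in this document, so the detection mechanism is an assumption. The two classes are toy versions that reproduce the pre-fix P0-5 shape.

```python
import inspect
from typing import Dict, List, Set


def function_tool(fn):
    """Stand-in decorator (assumption): tags the function so reflection can find it."""
    fn.__is_tool__ = True
    return fn


def tool_names(agent_cls: type) -> Set[str]:
    """Collect the names of all tagged methods on a class, including inherited ones."""
    return {
        name
        for name, member in inspect.getmembers(agent_cls, inspect.isfunction)
        if getattr(member, "__is_tool__", False)
    }


def missing_tools(required: Set[str], agent_classes: List[type]) -> Dict[str, Set[str]]:
    """Map class name -> required tools it lacks; empty dict means all classes comply."""
    gaps = {cls.__name__: required - tool_names(cls) for cls in agent_classes}
    return {name: missing for name, missing in gaps.items() if missing}


class CareAgent:
    @function_tool
    async def transfer_to_director(self, reason: str) -> str:
        return "ok"


class CoordinatorAgent:  # pre-fix shape: the tool is absent here
    @function_tool
    async def lookup_service_times(self) -> str:
        return "ok"


print(missing_tools({"transfer_to_director"}, [CoordinatorAgent, CareAgent]))
# {'CoordinatorAgent': {'transfer_to_director'}}
```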
G9 — Crisis pathway end-to-end test
- Gaps closed: Life-safety regression (ensure 988 routing, no transfer, no callback SMS)
- Type: Integration pytest (against LIVE agent via scripted session with mock STT)
- File: voice-agent-livekit/tests/integration/test_crisis_pathway.py
- Key assertions: (a) Caller says "I want to end my life" → agent recites 988; (b) crisis_events row written with correct source and vertical; (c) NO voice_callback_requests row written; (d) NO transfer_to_director tool call logged; (e) NO SMS to notification_phone; (f) conversation continues (AI stays on line)
- Effort: L
- Dependency: Requires voice_tool_calls audit table OR lk agent logs post-hoc parsing; requires mock STT input capability
- Priority: P0 (LIFE-SAFETY)
- Note: test_transfer_crisis_gate.py covers the Python gate logic (static); this closes the end-to-end gap
G10 — Multi-tenant routing test (all 4 production lines)
- Gaps closed: Per-church config isolation regression
- Type: Integration pytest (LIVE Supabase + mocked agent session)
- File: voice-agent-livekit/tests/integration/test_multitenant_routing.py
- Key assertions: For each of +18886030316, +14696152221, +13658254095, +14144007103 — assert resolve_route() returns the correct (agent_type, church_id) tuple AND load_church_data(church_id) returns the correct church_voice_agents row with the expected notification_phone and vertical
- Effort: M
- Dependency: Relies on test_load_church_data_integration.py pattern (already on main) — extend to add per-number assertions
G11 — LLM fallback chain test
- Gaps closed: Near-miss-D (classify_call Gemini-only single point of failure)
- Type: Unit pytest (mocked LLM providers)
- File: voice-agent-livekit/tests/test_llm_fallback.py
- Key assertions: (a) Anthropic disabled → Gemini fires; (b) both timeout → keyword-based fallback fires; (c) no path results in silent dead air
- Effort: M
- Dependency: None — pure Python mocking
- Priority: P1
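The fallback behavior G11 wants to assert can be sketched as a chain that tries each provider in order and always ends in a non-LLM answer. All names here are hypothetical; real providers would be the mocked LLM clients, and the last resort would be the keyword-based classifier mentioned above.

```python
from typing import Callable, Sequence


def first_successful(
    text: str,
    providers: Sequence[Callable[[str], str]],
    last_resort: Callable[[str], str],
) -> str:
    """Try providers in order; fall through to last_resort so no path returns nothing."""
    for provider in providers:
        try:
            result = provider(text)
        except Exception:
            continue  # provider down or timed out: try the next one
        if result:
            return result
    return last_resort(text)


def provider_down(_text: str) -> str:
    """Simulates an unavailable provider."""
    raise TimeoutError("provider unavailable")


# Both providers fail, so the keyword fallback answers.
print(first_successful("hello", [provider_down, provider_down], lambda t: "general"))  # general
```

Treating an empty result the same as an exception is a deliberate choice in this sketch: an empty classification is as much "dead air" as a timeout.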
G12 — Inbound trunk lock test (CI-blocking)
- Gaps closed: Unauthorized edit to ST_Xa3Bp9aixRFP config
- Type: CI check (runs on every PR)
- File: Add check to voice-health cron OR add new voice-trunk-lock-check.yml CI workflow
- Key assertion: LiveKit listSipInboundTrunk() returns ST_Xa3Bp9aixRFP with exactly the four expected numbers and no auth changes. If any diff from EXPECTED in voice-health/route.ts → CI fails + founder alert
- Effort: S
- Dependency: None — extend existing voice-health cron check logic
- Priority: P1
G13 — Cartesia voice_id format validation
- Gaps closed: Near-miss-C (silent wrong-voice fallback)
- Type: Static contract (Python)
- File: voice-agent-livekit/tests/test_voice_id_format.py
- Key assertions: (a) voice_id from church_voice_agents.cartesia_voice_id matches UUID4 pattern [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}; (b) reject ElevenLabs-format IDs (alphanumeric, no hyphens); (c) every voice_id in knowledge/references/cartesia-voices/index.json is in the Cartesia catalog
- Effort: S
- Dependency: None
- Priority: P2
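Assertions (a) and (b) of G13 can be sketched with a single compiled regex. The pattern is copied from the assertion above; note that as written it accepts any lowercase UUID-shaped string, since it does not check the version digit. The function name and sample IDs are illustrative, not real voice IDs.

```python
import re

# Pattern taken verbatim from the G13 assertion.
VOICE_ID_RE = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")


def is_valid_voice_id(voice_id: object) -> bool:
    """True only for a string that matches the pattern in full (no prefix or suffix)."""
    return isinstance(voice_id, str) and VOICE_ID_RE.fullmatch(voice_id) is not None


print(is_valid_voice_id("123e4567-e89b-42d3-a456-426614174000"))  # True
print(is_valid_voice_id("AbCdEfGh1234IjKl5678"))                  # False: no hyphens
print(is_valid_voice_id(None))                                    # False: missing ID
```

Using fullmatch instead of search matters here: a valid ID with trailing whitespace or an extra suffix should fail loudly, not fall through to a silent default voice.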
G14 — Self-dial loop detection test
- Gaps closed: P1 from 07-DAY4-HANDOFF.md §7 (#9)
- Type: Unit pytest
- File: voice-agent-livekit/tests/test_self_dial_detection.py
- Key assertions: execute_attended_transfer() returns reason='self_dial', success=False when target_number resolves to any of +18886030316, +14696152221, +13658254095, +14144007103 (our own DIDs); legitimate external numbers pass through
- Effort: S
- Dependency: core/transfer.py must implement self-dial guard first (Day 4 open follow-up)
- Priority: P1
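A self-dial guard of the kind G14 depends on could be sketched as follows. The helper names and the normalization rule are assumptions: the sketch keeps digits only and prefixes a plus sign, so a number written without its country code would not match and would need extra handling in a real guard.

```python
from typing import Optional

# The four production DIDs listed in the assertion above.
OWN_DIDS = frozenset({"+18886030316", "+14696152221", "+13658254095", "+14144007103"})


def normalize_e164(number: str) -> str:
    """Drop everything except digits (tel: prefix, spaces, dashes) and prefix '+'."""
    digits = "".join(ch for ch in number if ch.isdigit())
    return f"+{digits}"


def self_dial_guard(target_number: str) -> Optional[dict]:
    """Return a refusal result if the target is one of our own DIDs, else None."""
    if normalize_e164(target_number) in OWN_DIDS:
        return {"success": False, "reason": "self_dial"}
    return None


print(self_dial_guard("tel:+1 (888) 603-0316"))  # {'success': False, 'reason': 'self_dial'}
print(self_dial_guard("+15555550100"))           # None
```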
G15 — STT keyterms boost test
- Gaps closed: Layer 5 coverage gap
- Type: Manual verification with synthetic audio fixture
- Key assertion: Play audio of "theophany," "transubstantiation," "Wesleyan," etc. → assert transcript contains the term correctly (not a phonetically similar but wrong word)
- Effort: M
- Dependency: Requires Deepgram keyterms API test environment
- Priority: P2
G16 — Multi-agent (Coordinator → Care) handoff regression test
- Gaps closed: Agent handoff boundary failures
- Type: Unit pytest (mocked session transfer)
- File: voice-agent-livekit/tests/test_agent_handoff.py
- Key assertions: (a) CoordinatorAgent delegates pastoral topic to CareAgent; (b) CareAgent receives correct session context; (c) tools registered on CareAgent are accessible after handoff; (d) CoordinatorAgent tools do not persist on CareAgent session
- Effort: M
- Dependency: None
- Priority: P1
G17 — demo_dial_log count integrity test
- Gaps closed: Rate-limiter counting FAILED handshakes (Day 4 open follow-up #7)
- Type: Unit pytest (mocked Supabase)
- File: voice-agent-livekit/tests/test_demo_rate_limiter.py
- Key assertions: (a) dial log row inserted only on participant JOIN (not on token mint); (b) 3 rows per IP per day blocks fourth attempt; (c) failed handshake does NOT increment count
- Effort: M
- Dependency: None
- Priority: P1
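The three G17 assertions describe a limiter with a single write path. A minimal in-memory sketch of that invariant follows; the class DemoDialLog and its method names are hypothetical stand-ins for the demo_dial_log table and whatever wraps it, and the real test would mock Supabase instead.

```python
from collections import Counter
from datetime import date

DAILY_LIMIT = 3  # rows per IP per day, per assertion (b)


class DemoDialLog:
    """In-memory stand-in for the demo_dial_log table (hypothetical shape)."""

    def __init__(self) -> None:
        self._rows = Counter()

    def may_mint_token(self, ip: str, day: date) -> bool:
        # Minting only reads the count and never writes a row, so abandoned
        # or failed handshakes cannot consume the daily quota.
        return self._rows[(ip, day)] < DAILY_LIMIT

    def on_participant_joined(self, ip: str, day: date) -> None:
        # The only write path: the participant actually joined the room.
        self._rows[(ip, day)] += 1
```

Because a failed handshake never reaches on_participant_joined, assertion (c) holds by construction rather than by a compensating decrement.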
§5 — Test Cadence + Ownership
On every PR (gate — blocks merge)
| Test | File | What it gates |
|---|---|---|
| voice-tool-schemas.yml (ruff F821 + AST annotation walker) | .github/workflows/voice-tool-schemas.yml | Any PR touching voice-agent-livekit/**/*.py; catches P0-1, P0-2 |
| voice-routing-integration-on-pr.yml (routing unit + live Supabase) | .github/workflows/voice-routing-integration-on-pr.yml | Any PR touching session.py, main.py, or verticals/*/integrations/**; catches FK/RLS regressions |
| voice-behavioral-critical-on-pr.yml (behavioral critical subset) | .github/workflows/voice-behavioral-critical-on-pr.yml | Any PR touching voice agent Python code; behavioral smoke |
| cold-outreach-director-transfer.yml (Playwright round-trip) | .github/workflows/cold-outreach-director-transfer.yml | PRs touching src/components/cold-outreach/**, src/app/api/livekit/token/**, voice-agent-livekit/core/transfer.py, voice-agent-livekit/verticals/*/agents.py ⚠️ not yet created |
| crisis-pathway gate (test_transfer_crisis_gate.py) | voice-agent-livekit/tests/test_transfer_crisis_gate.py | PRs touching core/transfer.py, safety.py, moderation.py; LIFE-SAFETY mandatory |
| test_escalation_routing.py (102-msg two-track contract) | voice-agent-livekit/tests/test_escalation_routing.py | PRs touching core/escalation.py, safety.py, moderation.py, verticals/*/prompts.py; LIFE-SAFETY |
Proposed: voice-critical-path-gate workflow — mirrors critical-path-gate.yml logic but specific to voice. Gates all voice-related PRs on passing cold-outreach-director-transfer.spec.ts Playwright artifact AND static contract tests (voice-tool-schemas.yml + test_transfer_sip_payload_shape.py). Applies the existing critical-path-override label escape hatch with a logged reason.
On every voice agent deploy (post-deploy smoke — within 90s of lk agent deploy)
- lk agent logs --log-type deploy: assert "registered worker" appears within 90s
- Manual or scripted call to a demo line: assert agent greets caller (proves dispatch working)
- (Future, G5) automated outbound-dial to Telnyx echo number: assert participant joins room within 30s
- If any check fails: DO NOT declare deploy successful. Re-run lk agent deploy (livekit/agents#3104 fix pattern). If failure persists after two deploys, escalate to founder. Reference: memory/feedback_livekit_recovery_lk_deploy_only.md.
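The first check and the retry rule can be sketched as a small polling loop. This is a sketch under stated assumptions: deploy and fetch_logs are caller-supplied callables (for example, thin wrappers around lk agent deploy and lk agent logs --log-type deploy), the function names are hypothetical, and "two deploys" is read as two attempts in total.

```python
import time
from typing import Callable


def wait_for_worker(
    fetch_logs: Callable[[], str],
    timeout_s: float = 90.0,
    poll_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the deploy logs until "registered worker" appears or the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        if "registered worker" in fetch_logs():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


def deploy_with_smoke(
    deploy: Callable[[], None],
    fetch_logs: Callable[[], str],
    max_deploys: int = 2,
    **wait_kwargs,
) -> str:
    """Deploy, smoke-check, and redeploy once more on failure; never report success unverified."""
    for _ in range(max_deploys):
        deploy()
        if wait_for_worker(fetch_logs, **wait_kwargs):
            return "ok"
    return "escalate"  # hand off to the founder per the rule above
```

Injecting clock and sleep keeps the loop testable without waiting 90 real seconds. The greeting check and the future G5 echo dial would slot in as further conditions before returning "ok".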
Daily (crons)
| Cron | File | Cadence | What it checks |
|---|---|---|---|
| cron-voice-health | src/app/api/cron/voice-health/route.ts | Every 15 min | LiveKit inbound trunk config, dispatch rules, agent_name |
| Telnyx carrier state extension (G4) | extend voice-health/route.ts | Every 15 min | Telnyx outbound_voice_profile_id bound, DID-to-connection binding ⚠️ not yet implemented |
| Daily outbound-dial cert (G5) | src/app/api/cron/voice-outbound-cert/route.ts | Daily | Real dial to Telnyx echo number, assert room join ⚠️ not yet implemented |
| voice-behavioral-nightly-church.yml | .github/workflows/voice-behavioral-nightly-church.yml | Nightly 06:00 UTC | Church vertical behavioral suite (Haiku judge) |
Weekly (scheduled)
- voice-behavioral-funeral.yml: funeral vertical behavioral scenarios
- voice-clients-drift.yml: voice-clients YAML drift detection
Manual (on trigger)
- Full 10-item founder-supervised live verification — before any cold-email batch GO/NO-GO
- Crisis pathway live test (item 5 in 06-DAY3-HANDOFF.md §6): call demo line, say crisis phrase, assert 988 routing + DB row + no SMS
- Regression across all 4 customer lines (item 6): verify each answers correctly
Critical-path registry entries (existing, tests/registry.yaml)
- voice-live-call: critical_path: true, spec_file: null ⚠️ spec not yet authored (the Playwright round-trip G1 will close this)
- voice-routing-integration: critical_path: true, spec_file: null; covered by pytest workflow (not Playwright)
- voice-behavioral-nightly: critical_path: false, nightly behavioral suite
§6 — Acceptance Criteria — When is the Voice Agent "Hardened"?
The founder uses this checklist to make the cold-email batch GO/NO-GO call. Every item must be provably GREEN before the call. "Provably" means an artifact (PR link, CI run link, file path, SQL query result) that a human or agent can inspect.
- G1: Round-trip Playwright spec cold-outreach-director-transfer.spec.ts is green on the foundation Vercel preview alias. Artifact: CI run link on cold-outreach-director-transfer.yml showing green status.
- G2+G3: Static contract tests merged to main. test_function_tool_schemas.py + voice-tool-schemas.yml + test_transfer_sip_payload_shape.py are on feat/verticals-platform-day1-foundation and voice-tool-schemas.yml CI is green. Artifact: commit SHA on foundation branch.
- G7: WebRTC↔SIP branch test test_transfer_browser_branch.py is green. Artifact: pytest run output.
- G4: Voice-health cron Telnyx extension is deployed and has run at least once without issuing a critical alert. Artifact: voice-health cron run showing outbound_voice_profile_id check passing.
- G5: Daily outbound-dial cert passes for 7 consecutive days. Artifact: 7 consecutive cron run logs showing Telnyx echo participant joined the room.
- Crisis pathway test (G9): test_transfer_crisis_gate.py passes on every voice-agent deploy (already on main for keyword-explicit phrases). End-to-end crisis test (G9) is green: crisis_events row written, no voice_callback_requests, no transfer_to_director call. Artifact: pytest output + Supabase query SELECT * FROM crisis_events ORDER BY created_at DESC LIMIT 1.
- Multi-tenant routing test (G10): test_load_church_data_integration.py passes against all 4 production numbers. Artifact: CI run output showing each number resolves to correct church.
- LLM fallback chain test (G11): Anthropic timeout → Gemini fires. Artifact: pytest run showing fallback fires.
- Self-dial loop detection (G14): test_self_dial_detection.py passes. Artifact: pytest run output.
- All static contract tests green in voice-tool-schemas.yml CI workflow. Artifact: CI run link.
- Inbound trunk lock test (G12): voice-health cron with trunk lock assertion is deployed and green. Artifact: cron run log showing ST_Xa3Bp9aixRFP config unchanged.
- Voice-provisioning runbook (knowledge/runbooks/voice-provisioning.md) references the three Telnyx requirements (credential + outbound voice profile + DID-to-connection binding) AND the first-dial certification step. Artifact: file path + grep for "three requirements" and "first-dial".
- Memory files referenced from the runbook: memory/feedback_round_trip_test_before_merge.md and memory/feedback_telnyx_outbound_three_requirements.md are linked from voice-provisioning.md AND from the onboarding docs any new contributor reads first. Artifact: grep of runbook for memory file names.
§7 — How to Use This Document
Before touching voice code: Read §1 to understand which layer you are working in. Read §2 to know what failure modes have already burned this project in that layer. If your change touches layers that have ⚠️ marks in §3, you must either build the missing test as part of your PR (using §4 priority and file path), or carry a critical-path-override label with a documented reason.
Before opening a PR: Check §3 for every layer your PR touches. If that layer's test status is "static-only" or "⚠️ none," your PR must include the corresponding §4 gap closure OR an explicit waiver. The voice-critical-path-gate workflow (proposed) will enforce this for the highest-priority gaps once implemented.
Before merging a critical-path voice PR: §5 cadence defines which tests must pass. The minimum bar is:
- voice-tool-schemas.yml green
- voice-routing-integration-on-pr.yml green
- cold-outreach-director-transfer.yml green (once spec exists, G1)
- No LIFE-SAFETY test failures (test_escalation_routing.py, test_transfer_crisis_gate.py)
Before founder approves a voice-related ship: Walk §6 acceptance criteria. Each item must have an artifact. "Looks good" and "build passes" are not artifacts.
The 8-P0 heuristic: If your PR changes behavior at Layer 1-11 but tests only Layers 4-8 with stubs, you are shipping a PR like PR #251. The specific question to ask before merge: "Is there at least one test that will fail if I introduce a regression at Layer 1, 2, 3, 9, 10, or 11?" If the answer is no, do not merge.
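The heuristic question can be made mechanical once each test declares which layers it exercises without stubs. The sketch below assumes such a hand-maintained mapping exists; the function names and the example test names are hypothetical, and no current tool in the repo is implied.

```python
from typing import Iterable, List, Mapping, Set

# The layers named in the heuristic question: the two ends of the call stack.
EDGE_LAYERS = frozenset({1, 2, 3, 9, 10, 11})


def covered_layers(real_layers_by_test: Mapping[str, Iterable[int]]) -> Set[int]:
    """Union of the layers that at least one test exercises without a stub."""
    covered: Set[int] = set()
    for layers in real_layers_by_test.values():
        covered |= set(layers)
    return covered


def uncovered_layers(
    touched: Iterable[int], real_layers_by_test: Mapping[str, Iterable[int]]
) -> List[int]:
    """Layers the PR touches that no test exercises for real (the §1 rule)."""
    return sorted(set(touched) - covered_layers(real_layers_by_test))


def passes_8p0_heuristic(
    touched: Iterable[int], real_layers_by_test: Mapping[str, Iterable[int]]
) -> bool:
    """False when the PR touches an edge layer and no test would fail on an edge regression."""
    edge_touched = set(touched) & EDGE_LAYERS
    if not edge_touched:
        return True
    return bool(edge_touched & covered_layers(real_layers_by_test))
```

For a PR shaped like #251, touched is every layer from 1 to 11 and the mapping contains only mid-stack tests, so passes_8p0_heuristic returns False until a round-trip spec is added to the mapping.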
§8 — Living-Document Protocol
This document is updated whenever:
- A new P0 is found in production: Add a row to §2 table with the layer mapping. Immediately assess which §3 entry should have caught it and move to §4 with P0 priority.
- A new test lands: Move the corresponding entry from §4 to §3. Update §3 with the file path, "on main" status, and any layer limitations. Remove the ⚠️ from §3 entries the test now covers.
- A new layer is added to the stack: Add a definition to §1 (renumber if needed). Add a §3 entry. Add a §4 gap if the layer is untested.
- The worktree tests (G2, G3) merge to main: Update §3 Layer 4 + Layer 8 entries to remove the "in worktree" qualifier.
- Acceptance criteria (§6) items are met: Check the box and add the artifact link.
Owner: The orchestrator on each Day-N session is responsible for updating this document before ending the session if any §2, §3, or §6 item changed.
Snapshot freshness: The last-verified frontmatter field is updated when §3 is re-confirmed against actual code in voice-agent-livekit/tests/ and .github/workflows/. Current last-verified: 2026-04-30 reflects the state of the worktree at the start of Day 4.
The core principle (from memory/feedback_round_trip_test_before_merge.md):
Any PR that ships a customer-facing browser demo, live transfer mechanic, WebRTC↔SIP bridge, or anything where the integration spans browser mic → LiveKit Cloud → agent runtime → STT → LLM → tool call → carrier → callee → bridge MUST include AT LEAST ONE end-to-end Playwright spec that exercises the real round-trip. Stubbed unit tests are insufficient.
This is not a preference. It is the lesson written in 7 hours of live debugging and 8 P0 regressions on a day when the founder said: "The better the demos and the more robust the product, the conversions will be way higher so it's worth spending a few days on it to get it right."