How the Voice Agent Works
Last updated: 2026-03-28
The Body Analogy
Think of the voice agent as a person answering the phone. Each service is a body part:
┌──────────────────────────────┐
│ THE AGENT │
│ │
Phone Call ──► │ 👂 EARS (STT) │
(caller speaks) │ Deepgram Nova-3 │
│ Hears speech → text │
│ │
│ 🧠 BRAIN (LLM) │
│ Gemini 2.5 Flash / │
│ Claude Haiku 4.5 │
│ Thinks → decides response │
│ │
│ 📚 MEMORY (RAG) │
│ Church KB + Theology │
│ Recalls facts on demand │
│ │
│ 🛡️ CONSCIENCE (Moderation) │
│ Crisis / Threat / Abuse │
│ Filters before brain acts │
│ │
│ 👄 VOICE (TTS) │
│ Cartesia Sonic 3 │
│ Text → natural speech │
│ │
│ 🙉 FOCUS (Noise Filter) │
│ Drops "um", "uh", fillers │
│ Passes real words through │
│ │
│ 🤝 SOCIAL SENSE (Turn Detect) │
│ Knows when to listen vs │
│ when to speak │
│ │
│ 🖐️ HANDS (Tools) │
│ Books appointments │
│ Sends texts │
│ Submits prayer requests │
│ Captures visitor info │
│ │
Phone Call ◄── │ ☎️ PHONE LINE (SIP) │
(agent speaks) │ LiveKit ↔ Telnyx/Twilio │
│ Carries the call │
└──────────────────────────────┘
What Happens When Someone Calls
- Phone rings → Telnyx or Twilio receives the call from the PSTN
- SIP routing → Call forwarded to LiveKit Cloud via SIP trunk
- LiveKit matches → Finds our trunk by phone number, dispatches agent
- Agent starts → Loads church config, RAG context, product knowledge
- Greeting plays → LLM generates welcome, TTS speaks it ("Thank you for calling...")
Then for each thing the caller says:
- Ears hear → Deepgram transcribes speech to text (STT)
- Focus filters → Drops filler words ("um", "uh"), passes real speech
- Conscience checks → Moderation scans for crisis/threat/abuse BEFORE brain sees it
- Brain thinks → LLM processes the text with church context + RAG knowledge
- Hands act → If needed, tools fire (prayer request, callback, text link)
- Voice speaks → Cartesia converts LLM response to natural speech (TTS)
- Caller hears → Audio sent back through SIP to caller's phone
When the call ends:
- Farewell → Agent says goodbye, calls end_call tool
- Room closes → LiveKit disconnects after an 8s delay, so the farewell TTS can finish playing
- Transcript saved → Conversation written to DB
- Classification → Gemini Flash analyzes: summary, sentiment, topics, urgency
- Notifications → Email/SMS sent to church staff (if not in testing mode)
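The end-of-call steps above can be sketched as a single coroutine. This is a minimal illustration of the ordering only: every callable passed in (`say`, `disconnect`, `save_transcript`, `classify`, `notify`) is a hypothetical stand-in, not the agent's real API, and the farewell text is invented.

```python
import asyncio

FAREWELL_DELAY_S = 8.0  # from the steps above: room closes 8s after the farewell


async def finish_call(say, disconnect, save_transcript, classify, notify,
                      testing_mode=False, delay_s=FAREWELL_DELAY_S):
    """Run the end-of-call steps in the documented order.

    All five callables are hypothetical async stand-ins supplied by the caller.
    """
    await say("Goodbye, and thank you for calling.")
    await asyncio.sleep(delay_s)            # give TTS time to finish speaking
    await disconnect()
    transcript_id = await save_transcript()
    result = await classify(transcript_id)  # summary, sentiment, topics, urgency
    if not testing_mode:                    # staff are only notified outside testing mode
        await notify(result)
    return result
```

Passing `delay_s=0` makes the sequence easy to exercise in a unit test without waiting out the full delay.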
Service Map — Every Component
Telephony (Phone Lines)
| Service | Role | When Used |
|---|---|---|
| Telnyx | SIP provider for NEW customer numbers | All new churches get Telnyx numbers |
| Twilio | SIP provider for LEGACY numbers | Demo lines, sales line, toll-free |
| LiveKit Cloud SIP | SIP gateway — bridges phone calls to WebRTC | Every call |
Call path: Caller → PSTN → Telnyx/Twilio → SIP INVITE → LiveKit SIP Gateway → Agent
Key config:
- LiveKit SIP URL: 5u9xu5ysoly.sip.livekit.cloud (project ID, NOT project name)
- Telnyx FQDN connection: 2925216093662349036 → points to LiveKit SIP URL
- Main trunk: ST_Xa3Bp9aixRFP, holds all phone numbers (LOCKED)
- Dispatch rules route trunk → agent name churchwiseai-voice
Speech-to-Text (STT) — The Ears
| Service | Model | Role | Latency |
|---|---|---|---|
| Deepgram | Nova-3 | Primary STT | ~200ms |
Deepgram Nova-3 is the primary (and currently only) STT. It was chosen for phone audio: it handles background noise, accents, and telephony codecs such as G.722 well.
Configuration: stt="deepgram/nova-3" in AgentSession
Fallback: No STT fallback currently configured. If Deepgram goes down, calls will connect but the agent won't understand speech. TODO: Add Whisper or Google STT as fallback.
Large Language Model (LLM) — The Brain
| Service | Model | Role | Speed | Cost |
|---|---|---|---|---|
| Google | Gemini 2.5 Flash | Coordinator, Sales, Demo agents | Very fast | Low |
| Anthropic | Claude Haiku 4.5 | Care Agent (pastoral/emotional) | Fast | Medium |
Why two brains?
- Gemini Flash is fast and cheap — great for factual Q&A (service times, directions, events)
- Claude Haiku has better empathy — handles grief, prayer, crisis with more nuance
Configuration:
COORDINATOR_MODEL = "google/gemini-2.5-flash" # Fast, factual
CARE_MODEL = "anthropic/claude-haiku-4-5-20251001" # Empathetic, careful
Fallback: No automatic fallback yet. The intent is that if one LLM fails, the system falls back to the other. TODO: implement LLM fallback chain.
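One way the missing fallback could look is a loop that tries each model in order. This is a sketch under stated assumptions, not the project's implementation: `call_model` is a hypothetical callable that takes `(model, prompt)` and raises on failure, and the broad `except` would be narrowed in real code.

```python
from typing import Callable, Sequence, Tuple

COORDINATOR_MODEL = "google/gemini-2.5-flash"
CARE_MODEL = "anthropic/claude-haiku-4-5-20251001"


def complete_with_fallback(prompt: str, models: Sequence[str],
                           call_model: Callable[[str, str], str]) -> Tuple[str, str]:
    """Try each model in order and return (model_used, reply).

    call_model is a hypothetical stand-in for the real LLM client.
    """
    errors = []
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # a real version would catch narrower errors
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

The Coordinator would pass `[COORDINATOR_MODEL, CARE_MODEL]` and the Care Agent the reverse order. LiveKit Agents also documents fallback adapters for STT, LLM, and TTS, which may fit better than a hand-rolled loop; check them against the version in use.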
Text-to-Speech (TTS) — The Voice
| Service | Model | Role | Latency (TTFB) |
|---|---|---|---|
| Cartesia | Sonic 3 | Primary TTS | ~200ms |
Cartesia Sonic 3 produces the most natural-sounding voice for phone calls. Supports custom voices per church.
Configuration: tts="cartesia/sonic-3:{voice_id}" in AgentSession
Default voices:
- Male: Carson (86e30c1d-714b-4074-a1f2-1cb6b552fb49)
- Female: Cindy (1242fb95-7ddd-44ac-8a05-9e8a22a6137d)
- Default for new churches: Cindy (female)
Per-church custom voice: Set voice_id in church_voice_agents table. Must be a valid Cartesia UUID. ElevenLabs IDs will NOT work (caused dead air for Zewdei — fixed 2026-03-28).
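A shape check on voice_id before building the TTS string could catch a non-Cartesia ID before it causes dead air. The sketch below is an assumption about how such a guard might look. The function names are hypothetical, and falling back to the default voice (rather than rejecting the config) is a design choice the source does not specify.

```python
import uuid

DEFAULT_VOICE_ID = "1242fb95-7ddd-44ac-8a05-9e8a22a6137d"  # Cindy, the documented default


def is_canonical_uuid(value) -> bool:
    """True only for the hyphenated 8-4-4-4-12 UUID form."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


def tts_descriptor(voice_id) -> str:
    """Build the tts string, using the default voice when the ID is not a UUID."""
    if not (isinstance(voice_id, str) and is_canonical_uuid(voice_id)):
        voice_id = DEFAULT_VOICE_ID
    return f"cartesia/sonic-3:{voice_id}"
```

This only validates the format. It does not confirm that the voice actually exists in Cartesia, so a well-formed but unknown UUID would still pass.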
Fallback: No TTS fallback currently configured. If Cartesia goes down, calls will connect but the agent will be silent. TODO: Add LiveKit TTS or Google TTS as fallback.
Voice Activity Detection (VAD) — Hearing Attention
| Service | Model | Role |
|---|---|---|
| Silero | VAD v5 | Detects when someone is speaking vs silence |
Pre-warmed at agent startup (loaded once per worker process, not per call). Combined with the multilingual turn detector for end-of-utterance detection.
Turn Detection — Social Awareness
| Service | Model | Role |
|---|---|---|
| LiveKit | Multilingual Turn Detector | Knows when caller has finished speaking |
End-of-utterance delay: ~600ms. This is the pause after the caller stops speaking before the agent starts responding. Too short = agent interrupts. Too long = awkward silence.
RAG (Retrieval-Augmented Generation) — Memory
| Component | Source | When Used |
|---|---|---|
| Church KB | church_knowledge_base table | Per-turn (500ms timeout) |
| Theological content | unified_rag_content + sai_theological_lenses | Session start (one-time) |
| Product knowledge | product_knowledge table | Session start (one-time) |
| Repeat caller history | voice_call_logs by phone | Session start (one-time) |
Embeddings: OpenAI text-embedding-3-small
RPCs: search_church_knowledge, search_unified_rag_content
Theological lenses: 17 denominations mapped to lens IDs (Baptist→14, Catholic→7, etc.)
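The 500ms per-turn budget on the Church KB lookup can be enforced with `asyncio.wait_for`. In this sketch `search` is a hypothetical async wrapper around the search_church_knowledge RPC, and returning an empty list on timeout (so the turn proceeds without KB context) is an assumption about the intended behaviour.

```python
import asyncio

KB_TIMEOUT_S = 0.5  # per-turn budget from the table above


async def kb_context(search, query: str, timeout_s: float = KB_TIMEOUT_S) -> list:
    """Return KB hits, or an empty list if the search exceeds the budget.

    search is a hypothetical async callable: search(query) -> list of hits.
    """
    try:
        return await asyncio.wait_for(search(query), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Assumed behaviour: answer without KB context rather than stall the caller.
        return []
```

The timeout cancels the pending search, so a slow lookup cannot hold up the caller's turn.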
Moderation — Conscience
| Check | What It Catches | Action |
|---|---|---|
| Threat | Violence, weapons, bomb threats | End call immediately |
| Crisis | Suicidal ideation, self-harm, coded language | Inject 988 Lifeline into LLM context |
| Abuse | Profanity, harassment | 1st: warning. 2nd+: end call |
Processing order: Moderation runs BEFORE the LLM sees the text. A crisis caller gets help resources injected into the response. A threat caller gets disconnected immediately.
Crisis detection includes: Direct statements ("kill myself"), coded language (elderly: "tired of living", religious: "ready to go home to be with the Lord", farewell: "giving away my things"), C-SSRS Q1, burden language.
Context-aware: "Ready to go to church" does NOT trigger crisis. "Ready to go" (standalone) DOES.
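The pre-LLM checks and the context-aware "ready to go" rule can be illustrated with a few regular expressions. The patterns below are a tiny illustrative subset, not the real lists in moderation.py. The threat and abuse terms are placeholders, the action names are hypothetical, and checking threat before crisis before abuse is an assumption.

```python
import re

# Illustrative patterns only; the production lists are far more extensive.
THREAT = [re.compile(p, re.I) for p in (r"\bbomb\b",)]
CRISIS = [re.compile(p, re.I) for p in (
    r"\bkill myself\b",
    r"\btired of living\b",
    r"\bready to go home to be with the lord\b",
    r"\bgiving away my things\b",
    r"\bready to go[\s.!?]*$",  # standalone only: nothing may follow the phrase
)]
ABUSE = [re.compile(p, re.I) for p in (r"\bidiot\b",)]  # placeholder term


def moderate(text: str, abuse_strikes: int = 0):
    """Return (action, abuse_strikes) before any text reaches the LLM."""
    if any(p.search(text) for p in THREAT):
        return "end_call", abuse_strikes
    if any(p.search(text) for p in CRISIS):
        return "inject_988", abuse_strikes
    if any(p.search(text) for p in ABUSE):
        abuse_strikes += 1
        # First offence gets a warning; any later one ends the call.
        return ("warn" if abuse_strikes == 1 else "end_call"), abuse_strikes
    return "pass", abuse_strikes
```

Anchoring the standalone pattern to the end of the utterance is what lets "ready to go to church" pass while a bare "ready to go" is flagged.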
Noise Filtering — Focus
| Category | Examples | Action |
|---|---|---|
| Pure noise | um, uh, hmm, ah, er | Always dropped |
| Backchannels | uh huh, mm hmm, i see | Always dropped |
| Context-dependent | okay, yeah, good, perfect | Dropped if agent didn't ask a question |
| Floor-takes | wait, stop, no, hold on | Always passed (barge-in) |
| Meaningful | thanks, bye, goodbye | Always passed |
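The table above maps naturally onto a few lookup sets. This sketch assumes the filter matches the whole normalized utterance against each category and lets anything unlisted through. The real logic in call_handler.py may differ, and `should_pass` and `normalize` are hypothetical names.

```python
import re

PURE_NOISE = {"um", "uh", "hmm", "ah", "er"}
BACKCHANNELS = {"uh huh", "mm hmm", "i see"}
CONTEXT_DEPENDENT = {"okay", "yeah", "good", "perfect"}
FLOOR_TAKES = {"wait", "stop", "no", "hold on"}
MEANINGFUL = {"thanks", "bye", "goodbye"}


def normalize(text: str) -> str:
    """Lowercase, turn punctuation into spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z]+", " ", text.lower()).split())


def should_pass(text: str, agent_asked_question: bool) -> bool:
    """Decide whether a transcribed utterance is forwarded to the LLM."""
    phrase = normalize(text)
    if not phrase or phrase in PURE_NOISE or phrase in BACKCHANNELS:
        return False
    if phrase in FLOOR_TAKES or phrase in MEANINGFUL:
        return True
    if phrase in CONTEXT_DEPENDENT:
        return agent_asked_question  # "okay" only counts as an answer to a question
    return True  # anything else is treated as real speech
```

For example, `should_pass("Okay", agent_asked_question=False)` drops the utterance, while the same word after a question is passed through.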
Tools — Hands
| Tool | Agent | What It Does |
|---|---|---|
| capture_visitor_info | Coordinator | Saves visitor contact to DB |
| send_directions_link | Coordinator | Texts Google Maps link |
| register_for_event | Coordinator | Registers for church event |
| send_giving_link | Coordinator | Texts giving/donation URL |
| check_availability | Coordinator | Checks Cal.com calendar |
| book_appointment | Coordinator | Books via Cal.com |
| submit_prayer_request | Care | Saves prayer to DB + notifies team |
| request_callback | Care | Saves callback request + notifies pastor |
| send_sms_link | All | Texts any URL to caller |
| end_call | All | Says farewell, waits 8s, disconnects |
| schedule_demo | Sales | Captures demo request |
| search_churches | Sales | Searches PewSearch directory |
| capture_support | Sales | Logs tech support request |
All tools are conditionally enabled based on church config (Cal.com keys, PCO credentials, giving_enabled flag, etc.). See church_voice_agents table.
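Config-driven gating can be sketched as a map from tool name to the config field that switches it on. Only giving_enabled is named in the source. The other field names (`calcom_api_key`, `pco_credentials`) and the pairing of register_for_event with PCO credentials are assumptions for illustration, and only a subset of the tools is shown.

```python
# tool name -> config field that must be truthy for the tool to be offered.
# Field names other than giving_enabled are hypothetical.
GATED = {
    "check_availability": "calcom_api_key",
    "book_appointment": "calcom_api_key",
    "register_for_event": "pco_credentials",  # assumed pairing
    "send_giving_link": "giving_enabled",
}


def enabled_tools(config: dict) -> set:
    """Return the gated tool names that a church's config switches on."""
    return {name for name, field in GATED.items() if config.get(field)}
```

A church with only `{"giving_enabled": True}` would get send_giving_link and none of the Cal.com or event tools.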
Fallback Chain — Full Picture
PLATFORM LEVEL:
Primary: LiveKit Cloud (Agents v1.5, Python)
Fallback: None (voice-agent-livekit// is legacy, no longer maintained)
Trigger: LiveKit Cloud outage > 1 hour
TELEPHONY:
New customers: Telnyx (FQDN connection → LiveKit SIP)
Legacy numbers: Twilio (SIP trunk → LiveKit SIP)
If Telnyx FQDN fails: TeXML webhook bridge (churchwiseai.com/api/telnyx/voice-webhook)
STT (Ears):
Primary: Deepgram Nova-3
Fallback: NONE CONFIGURED ⚠️
TODO: Add Google STT or Whisper as fallback
LLM (Brain):
Coordinator: Gemini 2.5 Flash → (no auto-fallback) → Claude Haiku 4.5
Care Agent: Claude Haiku 4.5 → (no auto-fallback) → Gemini 2.5 Flash
TODO: Implement automatic LLM fallback
TTS (Voice):
Primary: Cartesia Sonic 3
Fallback: NONE CONFIGURED ⚠️
TODO: Add Google TTS or LiveKit TTS as fallback
VAD:
Primary: Silero v5 (pre-warmed)
Fallback: Built into LiveKit (basic energy detection)
RECORDING:
Status: NOT IMPLEMENTED
Plan: LiveKit Egress (audio-only MP3) → S3/R2 bucket
Cost: ~$0.004/min ($12/mo at 1000 calls, which implies ~3 min per call)
POST-CALL:
Transcript: conversation_item_added event → saved to voice_call_logs
Classification: Gemini 2.5 Flash → summary, sentiment, topics, urgency
Notifications: Resend (email) + Twilio (SMS), fire-and-forget
Key Files
| File | What It Does |
|---|---|
| voice-agent-livekit/main.py | Entry point: SIP routing, session setup, transcript capture |
| voice-agent-livekit/session.py | Phone registry, Supabase client, call logs, classify_call() |
| voice-agent-livekit/safety.py | SafeAgent base class: pre-LLM moderation via llm_node |
| voice-agent-livekit/moderation.py | Threat/crisis/abuse regex patterns |
| voice-agent-livekit/call_handler.py | Noise filtering, farewell detection |
| voice-agent-livekit/core/rag.py | Embeddings + Supabase RPC search |
| voice-agent-livekit/core/notifications.py | Email/SMS fan-out, testing mode redirect |
| voice-agent-livekit/core/prompt_fragments.py | HEAR protocol, crisis protocol, guardrails |
| voice-agent-livekit/core/tools.py | SMS link sender, directions sender |
| voice-agent-livekit/verticals/church/agents.py | CoordinatorAgent + CareAgent classes |
| voice-agent-livekit/verticals/church/prompts.py | Per-church prompt builder |
| voice-agent-livekit/verticals/church/tools.py | Prayer, callback, visitor, event tools |
| voice-agent-livekit/verticals/sales/agents.py | SalesAgent, DemoRouterAgent, DemoAgent |
| voice-agent-livekit/verticals/sales/prompts.py | Sales prompt builder |
| voice-agent-livekit/verticals/church/config.py | Tier gating, default voice IDs |
Documentation Sources
See knowledge/references/voice-agent-sources.md for the full list of LiveKit and Telnyx documentation, GitHub repos, community channels, and Context7 MCP library IDs.