How the Voice Agent Works

Last updated: 2026-03-28

The Body Analogy

Think of the voice agent as a person answering the phone. Each service is a body part:

                    ┌──────────────────────────────┐
                    │         THE AGENT             │
                    │                               │
   Phone Call ──►   │  👂 EARS (STT)               │
   (caller speaks)  │     Deepgram Nova-3           │
                    │     Hears speech → text        │
                    │                               │
                    │  🧠 BRAIN (LLM)              │
                    │     Gemini 2.5 Flash /         │
                    │     Claude Haiku 4.5           │
                    │     Thinks → decides response  │
                    │                               │
                    │  📚 MEMORY (RAG)             │
                    │     Church KB + Theology       │
                    │     Recalls facts on demand    │
                    │                               │
                    │  🛡️ CONSCIENCE (Moderation)   │
                    │     Crisis / Threat / Abuse    │
                    │     Filters before brain acts  │
                    │                               │
                    │  👄 VOICE (TTS)              │
                    │     Cartesia Sonic 3           │
                    │     Text → natural speech      │
                    │                               │
                    │  🙉 FOCUS (Noise Filter)      │
                    │     Drops "um", "uh", fillers  │
                    │     Passes real words through   │
                    │                               │
                    │  🤝 SOCIAL SENSE (Turn Detect) │
                    │     Knows when to listen vs    │
                    │     when to speak              │
                    │                               │
                    │  🖐️ HANDS (Tools)            │
                    │     Books appointments          │
                    │     Sends texts                 │
                    │     Submits prayer requests     │
                    │     Captures visitor info       │
                    │                               │
   Phone Call ◄──   │  ☎️ PHONE LINE (SIP)         │
   (agent speaks)   │     LiveKit ↔ Telnyx/Twilio    │
                    │     Carries the call            │
                    └──────────────────────────────┘

What Happens When Someone Calls

Phone rings → Telnyx or Twilio receives the call from the PSTN
SIP routing → Call forwarded to LiveKit Cloud via SIP trunk
LiveKit matches → Finds our trunk by phone number, dispatches agent
Agent starts → Loads church config, RAG context, product knowledge
Greeting plays → LLM generates welcome, TTS speaks it ("Thank you for calling...")

Then for each thing the caller says:

Ears hear → Deepgram transcribes speech to text (STT)
Focus filters → Drops filler words ("um", "uh"), passes real speech
Conscience checks → Moderation scans for crisis/threat/abuse BEFORE brain sees it
Brain thinks → LLM processes the text with church context + RAG knowledge
Hands act → If needed, tools fire (prayer request, callback, text link)
Voice speaks → Cartesia converts LLM response to natural speech (TTS)
Caller hears → Audio sent back through SIP to caller's phone
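
The ordering of the first stages above can be sketched as a small function. This is an illustration only, not the agent's actual code: `handle_turn`, `TurnResult`, and the three injected callables (`is_noise`, `moderate`, `think`) are hypothetical names, and tools and TTS are left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TurnResult:
    spoken_reply: Optional[str]  # text handed to TTS, or None if nothing is said
    end_call: bool = False


def handle_turn(
    transcript: str,
    is_noise: Callable[[str], bool],
    moderate: Callable[[str], Optional[str]],
    think: Callable[[str], str],
) -> TurnResult:
    """Run one caller utterance through the stages in the documented order."""
    # Focus: filler-only utterances never reach moderation or the LLM.
    if is_noise(transcript):
        return TurnResult(spoken_reply=None)
    # Conscience: runs before the brain; a "threat" verdict ends the call.
    if moderate(transcript) == "threat":
        return TurnResult(spoken_reply=None, end_call=True)
    # Brain: the LLM only sees text that passed the two stages above.
    return TurnResult(spoken_reply=think(transcript))
```

Passing the stages in as callables keeps the ordering testable without any network services.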

When the call ends:

Farewell → Agent says goodbye, calls end_call tool
Room closes → LiveKit disconnects after an 8s delay (so TTS can finish the farewell)
Transcript saved → Conversation written to DB
Classification → Gemini Flash analyzes: summary, sentiment, topics, urgency
Notifications → Email/SMS sent to church staff (if not in testing mode)

Service Map — Every Component

Telephony (Phone Lines)

Service	Role	When Used
Telnyx	SIP provider for NEW customer numbers	All new churches get Telnyx numbers
Twilio	SIP provider for LEGACY numbers	Demo lines, sales line, toll-free
LiveKit Cloud SIP	SIP gateway — bridges phone calls to WebRTC	Every call

Call path: Caller → PSTN → Telnyx/Twilio → SIP INVITE → LiveKit SIP Gateway → Agent

Key config:

LiveKit SIP URL: 5u9xu5ysoly.sip.livekit.cloud (project ID, NOT project name)
Telnyx FQDN connection: 2925216093662349036 → points to LiveKit SIP URL
Main trunk: ST_Xa3Bp9aixRFP — holds all phone numbers (LOCKED)
Dispatch rules route trunk → agent name churchwiseai-voice

Speech-to-Text (STT) — The Ears

Service	Model	Role	Latency
Deepgram	Nova-3	Primary STT	~200ms

Deepgram Nova-3 is the primary (and currently only) STT. It was chosen for phone audio: it handles background noise, accents, and telephony codecs such as G.722 well.

Configuration: stt="deepgram/nova-3" in AgentSession

Fallback: No STT fallback currently configured. If Deepgram goes down, calls will connect but the agent won't understand speech. TODO: Add Whisper or Google STT as fallback.

Large Language Model (LLM) — The Brain

Service	Model	Role	Speed	Cost
Google	Gemini 2.5 Flash	Coordinator, Sales, Demo agents	Very fast	Low
Anthropic	Claude Haiku 4.5	Care Agent (pastoral/emotional)	Fast	Medium

Why two brains?

Gemini Flash is fast and cheap — great for factual Q&A (service times, directions, events)
Claude Haiku has better empathy — handles grief, prayer, crisis with more nuance

Configuration:

COORDINATOR_MODEL = "google/gemini-2.5-flash"    # Fast, factual
CARE_MODEL = "anthropic/claude-haiku-4-5-20251001"  # Empathetic, careful

Fallback: No automatic LLM fallback exists yet. The intended behavior is that if one LLM fails, the agent falls back to the other. TODO: implement LLM fallback chain.
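
One possible shape for that fallback chain, as a sketch only. `complete_with_fallback` and the `call_model` adapter are hypothetical; the model names are the two from the configuration above.

```python
from typing import Callable, Sequence, Tuple

COORDINATOR_CHAIN = ("google/gemini-2.5-flash", "anthropic/claude-haiku-4-5-20251001")
CARE_CHAIN = ("anthropic/claude-haiku-4-5-20251001", "google/gemini-2.5-flash")


def complete_with_fallback(
    prompt: str,
    chain: Sequence[str],
    call_model: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try each model in order and return (model_used, reply).

    call_model is a hypothetical adapter: (model_name, prompt) -> reply text.
    It is expected to raise an exception when the provider fails.
    """
    errors = []
    for model in chain:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # broad on purpose: any provider failure moves on
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Each agent keeps its preferred model first, so the Care Agent would only drop to Gemini when Claude is unavailable.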

Text-to-Speech (TTS) — The Voice

Service	Model	Role	Latency (TTFB)
Cartesia	Sonic 3	Primary TTS	~200ms

Cartesia Sonic 3 produces the most natural-sounding voice for phone calls. Supports custom voices per church.

Configuration: tts="cartesia/sonic-3:{voice_id}" in AgentSession

Default voices:

Male: Carson (86e30c1d-714b-4074-a1f2-1cb6b552fb49)
Female: Cindy (1242fb95-7ddd-44ac-8a05-9e8a22a6137d)
Default for new churches: Cindy (female)

Per-church custom voice: Set voice_id in church_voice_agents table. Must be a valid Cartesia UUID. ElevenLabs IDs will NOT work (caused dead air for Zewdei — fixed 2026-03-28).
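
A guard against non-UUID voice IDs could look like the sketch below. It is not the fix that was actually shipped: `is_uuid` and `tts_descriptor` are hypothetical helpers, and falling back to the default voice on a bad ID is an assumption.

```python
import uuid
from typing import Optional

DEFAULT_VOICE_ID = "1242fb95-7ddd-44ac-8a05-9e8a22a6137d"  # Cindy, from the list above


def is_uuid(value: str) -> bool:
    """True only for the canonical hyphenated 8-4-4-4-12 UUID form."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


def tts_descriptor(voice_id: Optional[str]) -> str:
    """Build the tts string, using the default voice when the ID is missing or invalid."""
    if not voice_id or not is_uuid(voice_id):
        voice_id = DEFAULT_VOICE_ID
    return f"cartesia/sonic-3:{voice_id}"
```

Validating at session setup means a bad row in church_voice_agents degrades to the default voice instead of dead air.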

Fallback: No TTS fallback currently configured. If Cartesia goes down, calls will connect but the agent will be silent. TODO: Add LiveKit TTS or Google TTS as fallback.

Voice Activity Detection (VAD) — Hearing Attention

Service	Model	Role
Silero	VAD v5	Detects when someone is speaking vs silence

Pre-warmed at agent startup (loaded once per worker process, not per call). Combined with the multilingual turn detector for end-of-utterance detection.

Turn Detection — Social Awareness

Service	Model	Role
LiveKit	Multilingual Turn Detector	Knows when caller has finished speaking

End-of-utterance delay: ~600ms. This is the pause after the caller stops speaking before the agent starts responding. Too short = agent interrupts. Too long = awkward silence.

RAG (Retrieval-Augmented Generation) — Memory

Component	Source	When Used
Church KB	`church_knowledge_base` table	Per-turn (500ms timeout)
Theological content	`unified_rag_content` + `sai_theological_lenses`	Session start (one-time)
Product knowledge	`product_knowledge` table	Session start (one-time)
Repeat caller history	`voice_call_logs` by phone	Session start (one-time)

Embeddings: OpenAI text-embedding-3-small
RPCs: search_church_knowledge, search_unified_rag_content
Theological lenses: 17 denominations mapped to lens IDs (Baptist→14, Catholic→7, etc.)
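
The 500ms per-turn budget for the Church KB lookup can be enforced with `asyncio.wait_for`. A minimal sketch: `kb_context` and the injected `search` coroutine are hypothetical, and returning an empty list on timeout is an assumption about the desired behavior.

```python
import asyncio
from typing import Awaitable, Callable, List

KB_TIMEOUT_S = 0.5  # the 500ms per-turn budget from the table above


async def kb_context(
    query: str,
    search: Callable[[str], Awaitable[List[str]]],
    timeout_s: float = KB_TIMEOUT_S,
) -> List[str]:
    """Return KB snippets, or an empty list if the search is too slow."""
    try:
        return await asyncio.wait_for(search(query), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Answer without KB context rather than keep the caller waiting.
        return []
```

`wait_for` also cancels the slow search, so a stalled lookup does not keep running in the background.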

Moderation — Conscience

Check	What It Catches	Action
Threat	Violence, weapons, bomb threats	End call immediately
Crisis	Suicidal ideation, self-harm, coded language	Inject 988 Lifeline into LLM context
Abuse	Profanity, harassment	1st: warning. 2nd+: end call

Processing order: Moderation runs BEFORE the LLM sees the text. A crisis caller gets help resources injected into the response. A threat caller gets disconnected immediately.
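
The action column can be read as a small dispatch function. The sketch below is illustrative: `apply_verdict`, `ModerationState`, the returned dict shape, and the wording of the notes are all hypothetical; only the actions themselves come from the table.

```python
from dataclasses import dataclass
from typing import Optional

CRISIS_NOTE = "Caller may be in crisis. Share the 988 Suicide & Crisis Lifeline."
ABUSE_WARNING = "Warn the caller about abusive language."


@dataclass
class ModerationState:
    abuse_strikes: int = 0  # counted per call


def apply_verdict(verdict: Optional[str], state: ModerationState) -> dict:
    """Map a verdict ("threat", "crisis", "abuse", or None) to an action."""
    if verdict == "threat":
        return {"end_call": True, "llm_note": None}
    if verdict == "crisis":
        # The call continues; the note is injected into the LLM context.
        return {"end_call": False, "llm_note": CRISIS_NOTE}
    if verdict == "abuse":
        state.abuse_strikes += 1
        if state.abuse_strikes >= 2:
            return {"end_call": True, "llm_note": None}
        return {"end_call": False, "llm_note": ABUSE_WARNING}
    return {"end_call": False, "llm_note": None}
```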

Crisis detection includes: direct statements ("kill myself"), coded language (elderly: "tired of living", religious: "ready to go home to be with the Lord", farewell: "giving away my things"), C-SSRS Q1 (the Columbia Suicide Severity Rating Scale question about wishing to be dead), and burden language.

Context-aware: "Ready to go to church" does NOT trigger crisis. "Ready to go" (standalone) DOES.
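
The "ready to go" rule can be expressed with a negative lookahead. These patterns are a sketch built from the example phrases in this section, not the contents of moderation.py; `is_crisis` and the pattern names are hypothetical.

```python
import re

# Direct statement quoted above.
DIRECT = re.compile(r"\bkill myself\b", re.IGNORECASE)
# Coded phrases quoted above.
CODED = re.compile(
    r"\b(tired of living|ready to go home to be with the lord|giving away my things)\b",
    re.IGNORECASE,
)
# "ready to go" counts only when no further word follows (no destination).
READY_TO_GO = re.compile(r"\bready to go\b(?!\s+\w)", re.IGNORECASE)


def is_crisis(text: str) -> bool:
    """True if any crisis pattern matches the utterance."""
    return any(p.search(text) for p in (DIRECT, CODED, READY_TO_GO))
```

With these patterns, "I'm ready to go to church" passes because "to church" follows the phrase, while a bare "I'm just ready to go" is flagged.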

Noise Filtering — Focus

Category	Examples	Action
Pure noise	um, uh, hmm, ah, er	Always dropped
Backchannels	uh huh, mm hmm, i see	Always dropped
Context-dependent	okay, yeah, good, perfect	Dropped if agent didn't ask a question
Floor-takes	wait, stop, no, hold on	Always passed (barge-in)
Meaningful	thanks, bye, goodbye	Always passed
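
The table above maps to a short classifier. This is a sketch, not the logic in call_handler.py: `should_drop` is a hypothetical name, the word sets contain only the examples listed, and dropping empty input is an assumption.

```python
import re

PURE_NOISE = {"um", "uh", "hmm", "ah", "er"}
BACKCHANNELS = {"uh huh", "mm hmm", "i see"}
CONTEXT_DEPENDENT = {"okay", "yeah", "good", "perfect"}


def should_drop(utterance: str, agent_asked_question: bool) -> bool:
    """Return True if the utterance should be discarded before the LLM."""
    # Normalize: lowercase, punctuation to spaces, collapse whitespace.
    text = re.sub(r"[^a-z\s]", " ", utterance.lower())
    text = " ".join(text.split())
    if not text:
        return True  # assumption: empty transcripts are dropped
    if text in BACKCHANNELS:
        return True
    if all(word in PURE_NOISE for word in text.split()):
        return True
    if text in CONTEXT_DEPENDENT:
        # "okay" is an answer only if the agent just asked something.
        return not agent_asked_question
    # Floor-takes, farewells, and all other speech pass through.
    return False
```

Floor-takes and farewells need no explicit list here because anything not matched as noise is passed.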

Tools — Hands

Tool	Agent	What It Does
`capture_visitor_info`	Coordinator	Saves visitor contact to DB
`send_directions_link`	Coordinator	Texts Google Maps link
`register_for_event`	Coordinator	Registers for church event
`send_giving_link`	Coordinator	Texts giving/donation URL
`check_availability`	Coordinator	Checks Cal.com calendar
`book_appointment`	Coordinator	Books via Cal.com
`submit_prayer_request`	Care	Saves prayer to DB + notifies team
`request_callback`	Care	Saves callback request + notifies pastor
`send_sms_link`	All	Texts any URL to caller
`end_call`	All	Says farewell, waits 8s, disconnects
`schedule_demo`	Sales	Captures demo request
`search_churches`	Sales	Searches PewSearch directory
`capture_support`	Sales	Logs tech support request

All tools are conditionally enabled based on church config (Cal.com keys, PCO credentials, giving_enabled flag, etc.). See church_voice_agents table.
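
Conditional enabling can be modeled as a table of requirements per tool. The sketch below is hypothetical throughout: `TOOL_REQUIREMENTS`, `enabled_tools`, and the `calcom_api_key` key are invented for illustration (only `giving_enabled` is named above), and treating `send_sms_link` and `end_call` as requirement-free is an assumption.

```python
from typing import Any, Dict, List

# Tool name -> config keys that must all be truthy for the tool to load.
TOOL_REQUIREMENTS: Dict[str, List[str]] = {
    "send_sms_link": [],
    "end_call": [],
    "send_giving_link": ["giving_enabled"],
    "check_availability": ["calcom_api_key"],
    "book_appointment": ["calcom_api_key"],
}


def enabled_tools(church_config: Dict[str, Any]) -> List[str]:
    """Return the tools whose required config keys are all set."""
    return [
        tool
        for tool, required in TOOL_REQUIREMENTS.items()
        if all(church_config.get(key) for key in required)
    ]
```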

Fallback Chain — Full Picture

PLATFORM LEVEL:
  Primary: LiveKit Cloud (Agents v1.5, Python)
  Fallback: None (voice-agent-livekit// is legacy, no longer maintained)
  Trigger: LiveKit Cloud outage > 1 hour

TELEPHONY:
  New customers: Telnyx (FQDN connection → LiveKit SIP)
  Legacy numbers: Twilio (SIP trunk → LiveKit SIP)
  If Telnyx FQDN fails: TeXML webhook bridge (churchwiseai.com/api/telnyx/voice-webhook)

STT (Ears):
  Primary: Deepgram Nova-3
  Fallback: NONE CONFIGURED ⚠️
  TODO: Add Google STT or Whisper as fallback

LLM (Brain):
  Coordinator: Gemini 2.5 Flash → (no auto-fallback) → Claude Haiku 4.5
  Care Agent: Claude Haiku 4.5 → (no auto-fallback) → Gemini 2.5 Flash
  TODO: Implement automatic LLM fallback

TTS (Voice):
  Primary: Cartesia Sonic 3
  Fallback: NONE CONFIGURED ⚠️
  TODO: Add Google TTS or LiveKit TTS as fallback

VAD:
  Primary: Silero v5 (pre-warmed)
  Fallback: Built into LiveKit (basic energy detection)

RECORDING:
  Status: NOT IMPLEMENTED
  Plan: LiveKit Egress (audio-only MP3) → S3/R2 bucket
  Cost: ~$0.004/min ($12/mo at 1000 calls, assuming ~3 min per call)

POST-CALL:
  Transcript: conversation_item_added event → saved to voice_call_logs
  Classification: Gemini 2.5 Flash → summary, sentiment, topics, urgency
  Notifications: Resend (email) + Twilio (SMS), fire-and-forget

Key Files

File	What It Does
`voice-agent-livekit/main.py`	Entry point — SIP routing, session setup, transcript capture
`voice-agent-livekit/session.py`	Phone registry, Supabase client, call logs, classify_call()
`voice-agent-livekit/safety.py`	SafeAgent base class — pre-LLM moderation via llm_node
`voice-agent-livekit/moderation.py`	Threat/crisis/abuse regex patterns
`voice-agent-livekit/call_handler.py`	Noise filtering, farewell detection
`voice-agent-livekit/core/rag.py`	Embeddings + Supabase RPC search
`voice-agent-livekit/core/notifications.py`	Email/SMS fan-out, testing mode redirect
`voice-agent-livekit/core/prompt_fragments.py`	HEAR protocol, crisis protocol, guardrails
`voice-agent-livekit/core/tools.py`	SMS link sender, directions sender
`voice-agent-livekit/verticals/church/agents.py`	CoordinatorAgent + CareAgent classes
`voice-agent-livekit/verticals/church/prompts.py`	Per-church prompt builder
`voice-agent-livekit/verticals/church/tools.py`	Prayer, callback, visitor, event tools
`voice-agent-livekit/verticals/sales/agents.py`	SalesAgent, DemoRouterAgent, DemoAgent
`voice-agent-livekit/verticals/sales/prompts.py`	Sales prompt builder
`voice-agent-livekit/verticals/church/config.py`	Tier gating, default voice IDs

Documentation Sources

See knowledge/references/voice-agent-sources.md for the full list of LiveKit and Telnyx documentation, GitHub repos, community channels, and Context7 MCP library IDs.
