LiveKit or Cartesia Outage Response
Respond to a platform outage that renders the voice agent unavailable. Two components can fail independently: LiveKit Cloud (SIP gateway + room management) and Cartesia (TTS). This runbook covers detection, distinguishing the failure point, communication, and recovery.
Background
The voice agent relies on:
- LiveKit Cloud (cwa-voice-9x077mph): handles the Twilio SIP trunk gateway and dispatches jobs to the Railway agent worker. If LiveKit is down, calls won't connect to the SIP gateway at all; Twilio returns a busy signal or error.
- Cartesia TTS: handles text-to-speech output via livekit-plugins-cartesia. If Cartesia TTS is down, the agent starts and can hear the caller (Deepgram STT still works) but produces silent or error audio responses.
- Railway: hosts the Python agent worker. If Railway is down, LiveKit Cloud cannot dispatch jobs; calls won't be answered.
Prerequisites
- Access to LiveKit Cloud status: check the LiveKit Cloud dashboard for project cwa-voice-9x077mph
- Access to Cartesia status page: status.cartesia.ai
- Railway dashboard or railway logs access
- Supabase MCP access
- Founder contact (for immediate notification — this is a P0 incident)
Steps
Detection
- Confirm the outage. Do not assume until verified:
  a. Call a church phone number forwarded through the LiveKit SIP gateway.
  b. If the call does not connect at all (busy, error): suspect a LiveKit Cloud or Railway outage.
  c. If the call connects but the agent is silent or plays error audio: suspect a Cartesia TTS outage.
  d. Check the LiveKit Cloud dashboard for SIP gateway health.
  e. Check status.cartesia.ai for TTS service status.
  f. Check the Railway dashboard for worker health.
- Distinguish outage type:

  | Symptom | Likely Cause |
  | --- | --- |
  | Calls don't connect (Twilio returns busy/error, no voice_call_logs rows) | LiveKit Cloud SIP gateway is down, OR Railway worker is not running |
  | Calls connect but agent is silent / no TTS audio | Cartesia TTS is down (CARTESIA_API_KEY issues or Cartesia platform outage) |
  | Agent starts but crashes before greeting | Railway worker error: check Railway logs for Python exceptions |
  | voice_call_logs rows exist with error_message | App-level error in the agent worker: check Railway logs |

  Check the most recent call log rows:

      SELECT id, created_at, error_message
      FROM voice_call_logs
      ORDER BY created_at DESC
      LIMIT 5;

  If no rows have been created since the outage began: the issue is before the agent (LiveKit SIP or Railway). If recent rows exist with a non-null error_message: the agent is receiving calls but crashing.
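The triage logic in this section can be read as a small decision function. The sketch below is illustrative only: `classify_outage`, the `Suspect` enum, and the three boolean inputs are hypothetical names, not part of the agent codebase. The inputs are whatever the operator observed from the test call and the voice_call_logs query.

```python
from enum import Enum


class Suspect(Enum):
    """Hypothetical labels for the component most likely at fault."""
    LIVEKIT_OR_RAILWAY = "LiveKit Cloud SIP gateway or Railway worker"
    AGENT_WORKER = "Agent worker error (check Railway logs)"
    CARTESIA_TTS = "Cartesia TTS"
    NONE = "No outage detected"


def classify_outage(call_connected: bool,
                    agent_audible: bool,
                    recent_error_rows: bool) -> Suspect:
    """Map test-call observations to the most likely failing component."""
    if not call_connected:
        # The call never reached the agent, so no log rows are written.
        return Suspect.LIVEKIT_OR_RAILWAY
    if recent_error_rows:
        # The agent received the call but logged an error_message.
        return Suspect.AGENT_WORKER
    if not agent_audible:
        # Connected, no logged error, but no speech: TTS is the suspect.
        return Suspect.CARTESIA_TTS
    return Suspect.NONE
```

The function only narrows the search; the LiveKit, Cartesia, and Railway dashboards remain the source of truth for confirming which platform is actually down.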
Immediate Response
- Notify the founder immediately: this is a P0 (all voice customers affected).
- Log the incident in the database:

      INSERT INTO ops_error_reports (severity, component, summary, started_at, status)
      VALUES ('P0', 'voice-agent', 'LiveKit/Cartesia outage - voice agent unavailable for all churches', now(), 'investigating')
      RETURNING id;

  Note the returned report ID for later updates.
Diagnosis by Failure Type
If LiveKit Cloud is down
- Monitor the LiveKit Cloud status page and dashboard for updates.
- Do NOT attempt redeploys — they will not fix LiveKit Cloud infrastructure issues.
- Caller experience: callers who dial a church number will hear Twilio's default behavior (voicemail, busy signal, or generic message) because the SIP trunk cannot reach the gateway.
- Fallback option: update the Twilio trunk to forward calls to a voicemail/message service if the outage exceeds 2 hours. Reverse this when LiveKit recovers.
If Railway worker is down (but LiveKit is up)
- Check the Railway dashboard for the service status. Common causes:
  - Deploy failure: check Railway deploy logs for Python errors.
  - Memory/CPU limit exceeded: check Railway metrics.
  - Missing environment variable: check that all Railway env vars are set (see troubleshooting.md for the full list).
- If a redeploy is needed:

      git push origin main  # Railway auto-deploys

  Note that git push triggers a deploy only when there are new commits to push. To restart the worker without a code change, redeploy from the Railway dashboard instead.
If Cartesia TTS is down (but calls connect)
- Monitor status.cartesia.ai for TTS service updates.
- The agent will start and can hear callers (Deepgram STT still works) but cannot produce audible responses.
- No code-level workaround. Wait for Cartesia TTS recovery.
During Outage
- Consider notifying affected churches if the outage lasts more than 2 hours:
      SELECT pc.admin_email, c.name
      FROM premium_churches pc
      JOIN churches c ON c.id = pc.church_id
      JOIN church_voice_agents cva ON cva.church_id = pc.church_id
      WHERE cva.is_active = true;
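The 2-hour threshold used in this runbook (for the Twilio fallback and for notifying churches) is easy to misjudge under pressure. A minimal sketch of the check follows; `should_escalate` and `ESCALATION_THRESHOLD` are hypothetical names, and `started_at` is assumed to be the timezone-aware started_at value recorded in ops_error_reports.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Threshold taken from this runbook: act once the outage exceeds 2 hours.
ESCALATION_THRESHOLD = timedelta(hours=2)


def should_escalate(started_at: datetime,
                    now: Optional[datetime] = None) -> bool:
    """Return True once the outage has lasted more than the threshold."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Strictly greater than: exactly 2 hours does not yet count as "more than".
    return now - started_at > ESCALATION_THRESHOLD
```

Both datetimes must be timezone-aware (or both naive); Python raises a TypeError when subtracting a naive datetime from an aware one.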
Recovery
- When the platform reports resolution, test immediately:
  - Place a call to the demo or test church number.
  - Confirm the agent answers with the correct greeting.
  - Check voice_call_logs for a new successful entry.
- If recovery requires a redeploy (e.g., Railway worker needs a restart):

      git push origin main  # Railway auto-deploys; LiveKit connects automatically

- Update the incident log:

      UPDATE ops_error_reports
      SET status = 'resolved',
          resolved_at = now(),
          resolution_notes = 'Platform recovered. Voice agent verified working at [time].'
      WHERE id = '[report-uuid]';

- Notify the founder that service is restored.
- Post-incident review: check voice_call_logs for calls that failed during the outage:

      SELECT COUNT(*), MIN(created_at), MAX(created_at)
      FROM voice_call_logs
      WHERE error_message IS NOT NULL
        AND created_at BETWEEN '[outage_start]' AND '[outage_end]';

  This counts only calls that reached the agent and then failed. Calls that never connected (LiveKit or Railway down) leave no rows in voice_call_logs, so they do not appear in this count.
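If the log rows have already been exported (for example, as a list of dicts from a query tool), the same post-incident summary can be computed offline. This is a sketch under assumptions: `summarize_failed_calls` is a hypothetical helper, and each row is assumed to carry `created_at` (a datetime) and `error_message` keys matching the voice_call_logs columns used in this runbook.

```python
from datetime import datetime
from typing import Any, Dict, Iterable, Mapping


def summarize_failed_calls(rows: Iterable[Mapping[str, Any]],
                           outage_start: datetime,
                           outage_end: datetime) -> Dict[str, Any]:
    """Count errored rows inside the outage window (bounds inclusive,
    matching SQL BETWEEN) and report the first and last timestamps."""
    failed = [
        row["created_at"]
        for row in rows
        if row.get("error_message") is not None
        and outage_start <= row["created_at"] <= outage_end
    ]
    return {
        "count": len(failed),
        "first": min(failed, default=None),  # None when nothing failed
        "last": max(failed, default=None),
    }
```

A count of zero does not mean no callers were affected: calls that never reached the agent produce no rows at all.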
Verification
- Test call completes successfully with correct church greeting.
- voice_call_logs shows new entries with no error_message.
- LiveKit Cloud dashboard shows SIP gateway operational.
- status.cartesia.ai shows TTS systems operational.
- Railway service is healthy with no restart loops.
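The verification items can be tracked as an explicit checklist so that the incident is not closed with one item unchecked. The sketch below is illustrative: the check names and `failing_checks` are hypothetical, and the operator fills in each result by hand after performing the corresponding check.

```python
from typing import List, Mapping

# One entry per verification item in this runbook (names are hypothetical).
VERIFICATION_CHECKS = (
    "test_call_greeting_ok",
    "new_log_rows_without_error",
    "livekit_sip_operational",
    "cartesia_tts_operational",
    "railway_healthy",
)


def failing_checks(results: Mapping[str, bool]) -> List[str]:
    """Return the checks that failed or were never recorded, in order."""
    # A missing entry counts as failing: unverified is not verified.
    return [name for name in VERIFICATION_CHECKS
            if not results.get(name, False)]
```

The incident is ready to be marked resolved only when `failing_checks(...)` returns an empty list.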
See Also
- voice-agent-debug.md — if issues persist after platforms report recovery
- voice-agent-update.md — redeploy procedure