LiveKit or Cartesia Outage Response

Respond to a platform outage that makes the voice agent unavailable. Two platform dependencies can fail independently: LiveKit Cloud (SIP gateway + room management) and Cartesia (TTS). A failure of the Railway-hosted agent worker produces similar symptoms and is covered here as well. This runbook covers detection, isolating the failure point, communication, and recovery.

Background

The voice agent relies on:

  • LiveKit Cloud (cwa-voice-9x077mph) — handles the Twilio SIP trunk gateway and dispatches jobs to the Railway agent worker. If LiveKit is down, calls won't connect to the SIP gateway at all; Twilio returns a busy signal or error.
  • Cartesia TTS — handles text-to-speech output via livekit-plugins-cartesia. If Cartesia TTS is down, the agent starts and can hear the caller (Deepgram STT still works) but produces silent or error audio responses.
  • Railway — hosts the Python agent worker. If Railway is down, LiveKit Cloud cannot dispatch jobs; calls won't be answered.

Prerequisites

  • Access to LiveKit Cloud status: check the LiveKit Cloud dashboard for project cwa-voice-9x077mph
  • Access to Cartesia status page: status.cartesia.ai
  • Railway dashboard or railway logs access
  • Supabase MCP access
  • Founder contact (for immediate notification — this is a P0 incident)

Steps

Detection

  1. Confirm the outage — do not assume until verified:

    a. Call a church phone number forwarded through the LiveKit SIP gateway.
    b. If the call does not connect at all (busy, error): suspect LiveKit Cloud or Railway outage.
    c. If the call connects but the agent is silent or plays error audio: suspect Cartesia TTS outage.
    d. Check LiveKit Cloud dashboard for SIP gateway health.
    e. Check status.cartesia.ai for TTS service status.
    f. Check Railway dashboard for worker health.
  2. Distinguish outage type:

    Symptom | Likely Cause
    Calls don't connect (Twilio returns busy/error, no voice_call_logs rows) | LiveKit Cloud SIP gateway is down, OR Railway worker is not running
    Calls connect but agent is silent / no TTS audio | Cartesia TTS is down (CARTESIA_API_KEY issues or Cartesia platform outage)
    Agent starts but crashes before greeting | Railway worker error: check Railway logs for Python exceptions
    voice_call_logs rows exist with error_message | App-level error in the agent worker: check Railway logs

    To tell these cases apart, check the most recent call log rows:

    SELECT id, created_at, error_message
    FROM voice_call_logs
    ORDER BY created_at DESC
    LIMIT 5;

    If there is no row for your test call (the newest rows predate it, or the table is empty), the failure is upstream of the agent: LiveKit SIP or Railway. If recent rows exist with error_message set, the agent is receiving calls but crashing.
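
The triage logic above can be sketched as a small decision function. This is only an illustration of the symptom table, not part of the agent worker; the names `Observation`, `Suspect`, and `classify` are hypothetical, and the inputs are whatever you observed from the test call and the voice_call_logs query.

```python
from dataclasses import dataclass
from enum import Enum


class Suspect(Enum):
    LIVEKIT_OR_RAILWAY = "LiveKit Cloud SIP gateway or Railway worker"
    AGENT_WORKER = "Agent worker application error (check Railway logs)"
    CARTESIA = "Cartesia TTS"
    NONE = "No fault detected"


@dataclass
class Observation:
    call_connected: bool  # did the test call connect at all?
    agent_audible: bool   # did the caller hear the agent speak?
    log_has_error: bool   # does a recent voice_call_logs row carry error_message?


def classify(obs: Observation) -> Suspect:
    """Map test-call observations to the likely failure point."""
    if not obs.call_connected:
        # Nothing reached the agent: SIP gateway or worker dispatch failed.
        return Suspect.LIVEKIT_OR_RAILWAY
    if obs.log_has_error:
        # The agent received the call and logged a failure.
        return Suspect.AGENT_WORKER
    if not obs.agent_audible:
        # Call connected, no logged error, but no speech: suspect TTS.
        return Suspect.CARTESIA
    return Suspect.NONE
```

A silent call with no logged error is classified as Cartesia here, but a worker that crashes before it can write a log row would look the same, so the Railway logs are still worth a check in that case.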

Immediate Response

  1. Notify the founder immediately — this is a P0 (all voice customers affected).

  2. Log the incident in the database:

    INSERT INTO ops_error_reports (
    severity, component, summary, started_at, status
    ) VALUES (
    'P0',
    'voice-agent',
    'LiveKit/Cartesia outage — voice agent unavailable for all churches',
    now(),
    'investigating'
    ) RETURNING id;

    Note the returned report ID for later updates.

Diagnosis by Failure Type

If LiveKit Cloud is down

  1. Monitor the LiveKit Cloud status page and dashboard for updates.
    • Do NOT attempt redeploys — they will not fix LiveKit Cloud infrastructure issues.
    • Caller experience: callers who dial a church number get Twilio's default handling (voicemail, busy signal, or generic message) because the SIP trunk cannot reach the gateway.
    • Fallback option: update the Twilio trunk to forward calls to a voicemail/message service if the outage exceeds 2 hours. Reverse this when LiveKit recovers.
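
The 2-hour fallback rule can be checked mechanically against the incident's started_at value. The following is a minimal sketch; `outage_exceeds_threshold` and `FALLBACK_THRESHOLD` are hypothetical names, and timestamps are assumed to be timezone-aware.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the runbook: act once the outage exceeds 2 hours.
FALLBACK_THRESHOLD = timedelta(hours=2)


def outage_exceeds_threshold(started_at, now=None, threshold=FALLBACK_THRESHOLD):
    """Return True once the outage has lasted strictly longer than the threshold.

    started_at and now must both be timezone-aware datetimes.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - started_at > threshold
```

For example, `outage_exceeds_threshold(started_at)` with the started_at value from the incident row tells you whether it is time to switch the Twilio trunk to the fallback.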

If Railway worker is down (but LiveKit is up)

  1. Check Railway dashboard for the service status. Common causes:

    • Deploy failure: check Railway deploy logs for Python errors.
    • Memory/CPU limit exceeded: check Railway metrics.
    • Missing environment variable: check Railway env vars are all set (see troubleshooting.md for the full list).

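    If a redeploy is needed, push to main. A push with no new commits does nothing, so add an empty commit first if there is no code change to ship:

    git commit --allow-empty -m "Trigger Railway redeploy"
    git push origin main # Railway auto-deploys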

If Cartesia TTS is down (but calls connect)

  1. Monitor status.cartesia.ai for TTS service updates.
    • The agent will start and can hear callers (Deepgram STT still works) but cannot produce audible responses.
    • No code-level workaround. Wait for Cartesia TTS recovery.

During Outage

  1. Consider notifying affected churches if the outage lasts more than 2 hours:
    SELECT pc.admin_email, c.name
    FROM premium_churches pc
    JOIN churches c ON c.id = pc.church_id
    JOIN church_voice_agents cva ON cva.church_id = pc.church_id
    WHERE cva.is_active = true;
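
Before sending notifications, it may help to clean the query result. The sketch below assumes the rows arrive as (admin_email, name) tuples and that the join can return the same church more than once (for example, if a church has several active voice agents); `build_recipients` is a hypothetical helper, not existing tooling.

```python
def build_recipients(rows):
    """Return unique (admin_email, church_name) pairs, skipping missing emails.

    rows: iterable of (admin_email, church_name) tuples from the query above.
    """
    seen = set()
    recipients = []
    for admin_email, church_name in rows:
        if not admin_email or not admin_email.strip():
            continue  # no contact address on file for this church
        email = admin_email.strip()
        key = (email.lower(), church_name)
        if key in seen:
            continue  # duplicate row for the same church and address
        seen.add(key)
        recipients.append((email, church_name))
    return recipients
```

Deduplication is per (email, church) pair, so an administrator who manages two churches still gets one entry per church.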

Recovery

  1. When the platform reports resolution, test immediately:

    • Place a call to the demo or test church number.
    • Confirm the agent answers with the correct greeting.
    • Check voice_call_logs for a new successful entry.
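  2. If recovery requires a redeploy (e.g., Railway worker needs a restart), push to main. Add an empty commit first if there is nothing new to push:

    git commit --allow-empty -m "Trigger Railway redeploy"
    git push origin main # Railway auto-deploys; LiveKit connects automatically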
  3. Update the incident log:

    UPDATE ops_error_reports
    SET status = 'resolved',
    resolved_at = now(),
    resolution_notes = 'Platform recovered. Voice agent verified working at [time].'
    WHERE id = '[report-uuid]';
  4. Notify the founder that service is restored.

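  5. Post-incident review: check voice_call_logs for calls that failed during the outage. Calls that never reached the agent (for example, during a LiveKit SIP outage) leave no rows, so this count understates the number of missed calls: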

    SELECT COUNT(*), MIN(created_at), MAX(created_at)
    FROM voice_call_logs
    WHERE error_message IS NOT NULL
    AND created_at BETWEEN '[outage_start]' AND '[outage_end]';
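
For the incident log and review, the outage window and failed-call count can be folded into one resolution note. This is a sketch only: `resolution_note` is a hypothetical helper, its wording extends the template used in the incident update, and the timestamps are assumed to be timezone-aware.

```python
from datetime import datetime, timezone


def resolution_note(outage_start, outage_end, verified_at, failed_calls):
    """Build a resolution_notes string from the outage window and query count."""
    if outage_end < outage_start:
        raise ValueError("outage_end must not be earlier than outage_start")
    # Whole minutes of downtime, rounded down.
    minutes = int((outage_end - outage_start).total_seconds() // 60)
    return (
        f"Platform recovered. Outage lasted {minutes} min "
        f"({outage_start.isoformat()} to {outage_end.isoformat()}); "
        f"{failed_calls} call(s) logged with error_message. "
        f"Voice agent verified working at {verified_at.isoformat()}."
    )
```

The returned string can be pasted into resolution_notes in the UPDATE statement in place of the bracketed placeholder.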

Verification

  • Test call completes successfully with correct church greeting.
  • voice_call_logs shows new entries with no error_message.
  • LiveKit Cloud dashboard shows SIP gateway operational.
  • status.cartesia.ai shows TTS systems operational.
  • Railway service is healthy with no restart loops.
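
The checklist above can be tracked as a simple pass/fail map so that nothing is skipped before the incident is closed. The check names and the `failing_checks` helper below are hypothetical labels for the five items, filled in by hand after each manual check.

```python
# One key per verification item, in the order listed in the runbook.
VERIFICATION_CHECKS = (
    "test_call_ok",              # test call completes with the correct greeting
    "call_log_clean",            # new voice_call_logs rows have no error_message
    "livekit_sip_operational",   # LiveKit Cloud dashboard shows SIP gateway up
    "cartesia_tts_operational",  # status.cartesia.ai shows TTS operational
    "railway_healthy",           # Railway service healthy, no restart loops
)


def failing_checks(results):
    """Return the checks that are missing or False in the results dict."""
    return [name for name in VERIFICATION_CHECKS if not results.get(name, False)]
```

An empty list means every item was confirmed; a check that was never recorded counts as failing.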

See Also