
Voice FK-join regression (2026-04-22) — what happened and how we prevent it

What broke

From 2026-04-21 through 2026-04-22 (~24 hours), every inbound call to a dedicated-trunk church demo number (Covenant, New Life, Grace, St. Joseph, Maple Street, Christ the King) silently routed to the ChurchWiseAI Sales Agent instead of the church's own Coordinator.

Symptom observed by founder: dialed +14125300800 (Covenant Presbyterian), heard "Thank you for calling ChurchWiseAI. How can I help you today?" instead of Covenant's greeting.

Root cause

In voice-agent-livekit/verticals/church/integrations/supabase_church.py, _fetch_voice_agent_row() used PostgREST FK-join syntax:

.select("*, churches!church_voice_agents_church_id_fkey (id, name, ...)")

This syntax requires a named FOREIGN KEY constraint in pg_constraint. The M2 migration on 2026-04-21 (tenant_id rename) dropped all FK constraints on church_voice_agents. PostgREST then returned HTTP 400 "Could not find a relationship", which the Supabase Python SDK raised as an exception. The top-level except Exception in load_church_data caught it and returned None. _build_church_path saw that church was None and fell back to _build_sales_path.
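The failure chain above can be reduced to a small simulation. This is a sketch only: the function bodies, the RelationshipNotFound class, and the returned row shape are hypothetical stand-ins, not the real code in supabase_church.py. It shows how a broad except turns a schema error into the same None that means "no such church", which then triggers the silent fallback.

```python
from typing import Optional


class RelationshipNotFound(Exception):
    """Hypothetical stand-in for the SDK error raised on PostgREST's HTTP 400."""


def fetch_voice_agent_row(fk_constraint_exists: bool) -> dict:
    # Simulates the FK-join query: it only succeeds while the named
    # constraint still exists in the database.
    if not fk_constraint_exists:
        raise RelationshipNotFound("Could not find a relationship")
    return {"id": "agent-1", "churches": {"id": "church-1", "name": "Covenant"}}


def load_church_data(fk_constraint_exists: bool) -> Optional[dict]:
    try:
        return fetch_voice_agent_row(fk_constraint_exists)
    except Exception:
        # Broad catch: a schema error is now indistinguishable from
        # "this church does not exist".
        return None


def build_path(fk_constraint_exists: bool) -> str:
    church = load_church_data(fk_constraint_exists)
    if church is None:
        return "sales"  # silent fallback, nothing surfaces to the caller
    return "church"
```

With the constraint present, build_path(True) yields "church"; once it is dropped, build_path(False) yields "sales" without raising, which matches the symptom the founder heard.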

Every demo-trunk call failed the same way. The shared ST_Xa3Bp9aixRFP "ChurchWiseAI All Lines" trunk, which includes +14144007103 (Medhanialem, a paying customer), was served by a different code path (cached state, different route), so its calls worked and masked the regression.

Why our test suite missed it

  • Unit tests mocked Supabase. The mocks returned whatever the test author configured, so the real PostgREST error path was never exercised.
  • Behavioral tests (voice-behavioral-critical-on-pr.yml) run the LIVE LLM + agent flow but against mocked Supabase state.
  • No integration test called load_church_data() against the actual production database.

Also contributing: an earlier revert that wasn't heeded

  • d2f56d07 (voice-p01-fix agent direct commit, 10:37) added calls_limit + at-capacity TTS. Founder reverted at 10:47 (6c34ff1f) because testing surfaced the sales-fallback behavior.
  • Orchestrator (Claude) didn't notice the revert and merged PR #133 (e0a24776, 11:42) containing the same code. Re-introduced the visible symptom.
  • Once the FK-join root cause was diagnosed, PR #136 fixed the underlying bug. PR #133's code was a red herring: the FK regression predates it and affects the same call path regardless of whether at-capacity logic is present.

This is the second time the M2 migration has bitten us in the voice path. See memory/feedback_never_migrate_before_audit.md — the rule there ("grep every repo for callers before any ALTER TABLE") exists because of this class of failure.
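One way to make the "grep for callers" rule concrete for this class of failure is to scan source text for PostgREST embed hints of the form table!hint(...), since those are the call sites that break when a constraint is dropped. The helper below is a hypothetical sketch, not an existing tool in the repo, and its regex is a heuristic that assumes the select string sits on a single line.

```python
import re
from typing import List, Tuple

# Matches "table!hint(" or "table!hint (" inside a select string.
FK_HINT = re.compile(r"(\w+)!(\w+)\s*\(")

# PostgREST join-type modifiers that use the same "!" syntax but do not
# name a constraint.
JOIN_MODIFIERS = {"inner", "left"}


def find_fk_join_hints(source: str) -> List[Tuple[int, str, str]]:
    """Return (line number, table, hint) for each embed hint found."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for table, hint in FK_HINT.findall(line):
            if hint in JOIN_MODIFIERS:
                continue
            hits.append((lineno, table, hint))
    return hits
```

Run over each repo's Python files before a migration, any hit naming a constraint on the table being altered marks a caller to audit first.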

Fix

PR #136: _fetch_voice_agent_row() now uses a plain SELECT * with no FK joins. A new _fetch_church_row() helper queries churches directly by id — same resilient pattern already used for _fetch_premium_full().
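A minimal sketch of the two-query pattern follows. It is not the code from PR #136: the filter column "id" on church_voice_agents, the church_key parameter, and the load_agent_with_church wrapper are assumptions made for illustration. Only the table names and the plain SELECT * approach come from the description above. The client is expected to behave like a supabase-py client (table, select, eq, limit, execute, with rows on the response's data attribute).

```python
from typing import Any, Optional


def fetch_voice_agent_row(client: Any, agent_id: str) -> Optional[dict]:
    # Plain select with no embedded relationship, so no FK constraint
    # is needed for the query to succeed.
    resp = (
        client.table("church_voice_agents")
        .select("*")
        .eq("id", agent_id)  # filter column is an assumption
        .limit(1)
        .execute()
    )
    return resp.data[0] if resp.data else None


def fetch_church_row(client: Any, church_id: str) -> Optional[dict]:
    # Second, independent query against churches by primary key.
    resp = client.table("churches").select("*").eq("id", church_id).limit(1).execute()
    return resp.data[0] if resp.data else None


def load_agent_with_church(
    client: Any, agent_id: str, church_key: str = "church_id"
) -> Optional[dict]:
    # church_key names the linking column; its real name is an assumption.
    agent = fetch_voice_agent_row(client, agent_id)
    if agent is None:
        return None
    church = fetch_church_row(client, agent[church_key])
    if church is None:
        return None
    # Reassemble the shape the FK join used to return.
    return {**agent, "churches": church}
```

The trade-off is one extra round trip per load in exchange for a query that depends only on column values, not on constraint metadata.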

Prevention going forward

  1. New CI workflow: .github/workflows/voice-routing-integration-on-pr.yml runs on every PR touching voice-agent-livekit Python code. Executes:
    • tests/test_routing.py (unit: resolve_route over all PHONE_REGISTRY entries)
    • tests/test_load_church_data_integration.py (integration: LIVE Supabase call for every demo + paying-customer church_id, asserts non-None dict)
    • tests/test_calls_limit.py (when re-added)
  2. Registry entry: knowledge/tests/registry.yaml adds voice-routing-integration with critical_path: true. Any future DB regression in the voice routing path now trips a blocking gate.
  3. Preview voice agent plan — tracked separately; second LiveKit agent with dedicated Telnyx number, deploy feature branches there first, founder validates before promoting to prod. Design doc: knowledge/decisions/2026-04-22-voice-preview-agent.md (pending).
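The core assertion of the integration test in item 1 can be sketched as a small helper. This is an illustration under assumptions, not the contents of tests/test_load_church_data_integration.py: it assumes load_church_data can be called with a single church_id, and find_unloadable_churches is a hypothetical name. Unlike the production loader, it treats an exception and a None result as the same failure so that neither can pass silently.

```python
from typing import Callable, Iterable, List, Optional


def find_unloadable_churches(
    load_church_data: Callable[[str], Optional[dict]],
    church_ids: Iterable[str],
) -> List[str]:
    """Return the ids for which the loader did not produce a dict."""
    failures = []
    for church_id in church_ids:
        try:
            result = load_church_data(church_id)
        except Exception:
            # A raised error counts as a failure, same as None.
            result = None
        if not isinstance(result, dict):
            failures.append(church_id)
    return failures
```

In a test against the live database, the assertion would be that find_unloadable_churches(load_church_data, all_church_ids) returns an empty list, where all_church_ids covers every demo and paying-customer church.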

Cost of the incident

  • ~24h of silently broken demo routing
  • ~45min of founder / orchestrator time to diagnose, revert, and fix
  • One double-revert cycle (PR #133 merge → revert → PR #136)
  • Customer impact: unknown but likely minimal. Calls to dedicated-trunk numbers during this window went to the Sales Agent, which would have been confusing but caused no data loss.

References

  • PR #136: fix _fetch_voice_agent_row FK join
  • Reverted commits: d2f56d07, e0a24776, 194372be
  • M2 migration: 2026-04-21 tenant_id rename (details in memory/feedback_never_migrate_before_audit.md)
  • Previous incident: voice lines down ~1h on 2026-04-21 from the same class of migration failure