Adopt webhook inbox pattern for Stripe events
Status
DECIDED
Context
On 2026-04-14, an agent shipped a Stripe harness refactor. The webhook handler started returning HTTP 200 on silent provisioning failures — Stripe received a success response, stopped retrying, and a live paying customer's church was never provisioned. The founder lost a live payment and spent the evening diagnosing instead of onboarding a customer.
The prior architecture was inline provisioning: the webhook handler verified the Stripe signature, ran the full provisioning logic (church row, voice trunk, chatbot config, welcome email), and returned 200 when everything succeeded. Any exception in the provisioning chain — whether a Supabase timeout, a Telnyx API error, or a welcome email failure — could either (a) bubble up and force Stripe to retry the wrong handler state, or (b) be caught as "non-fatal" and silently succeed from Stripe's perspective while the church remained unprovisioned.
Retries were available (Stripe retries on non-2xx for 72 hours) but the inline pattern made it impossible to distinguish "succeeded" from "failed silently."
Decision
Adopt the webhook inbox pattern. The handler at /api/stripe/webhook becomes
a thin ack-and-enqueue:
- Verify Stripe signature
- Insert raw event into stripe_webhook_inbox
- Return 200 immediately
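The three steps above can be sketched as follows. This is a minimal illustration, not the code in route.ts: the in-memory Map stands in for the stripe_webhook_inbox table, the InboxRow fields and the handleWebhook name are hypothetical, and de-duplicating on the event id is an assumption this ADR does not state. The signature check follows Stripe's documented scheme (HMAC-SHA256 over "<timestamp>.<raw body>"); a real handler would more likely call stripe.webhooks.constructEvent from the official library, which also enforces a timestamp tolerance.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical row shape; the real stripe_webhook_inbox columns are not given in this ADR.
interface InboxRow {
  eventId: string;
  payload: string; // raw event body, stored untouched
  status: "pending";
  attempts: number;
}

// In-memory stand-in for the stripe_webhook_inbox table (assumption for the sketch).
const inbox = new Map<string, InboxRow>();

// Checks a Stripe-Signature header of the form "t=<unix seconds>,v1=<hex digest>".
// If the header carries several v1 entries, this sketch only checks the last one,
// and it does not enforce a timestamp tolerance.
function verifySignature(rawBody: string, header: string, secret: string): boolean {
  const parts = new Map<string, string>();
  for (const item of header.split(",")) {
    const idx = item.indexOf("=");
    if (idx > 0) parts.set(item.slice(0, idx).trim(), item.slice(idx + 1).trim());
  }
  const t = parts.get("t");
  const v1 = parts.get("v1");
  if (!t || !v1) return false;

  const expected = createHmac("sha256", secret).update(`${t}.${rawBody}`).digest("hex");
  const a = Buffer.from(expected, "utf8");
  const b = Buffer.from(v1, "utf8");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}

// Ack-and-enqueue: verify, store the raw event, return 200. No provisioning here.
function handleWebhook(rawBody: string, signatureHeader: string, secret: string): { status: number } {
  if (!verifySignature(rawBody, signatureHeader, secret)) return { status: 400 };

  const event = JSON.parse(rawBody) as { id: string };
  // Assumption: the same event may be delivered more than once, so key on its id.
  if (!inbox.has(event.id)) {
    inbox.set(event.id, { eventId: event.id, payload: rawBody, status: "pending", attempts: 0 });
  }
  return { status: 200 };
}
```

The handler never inspects the event type or calls provisioning code, so the only ways it can fail are a bad signature or a failed insert, and in both cases Stripe sees a non-2xx response and retries.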
Real processing happens in /api/cron/process-stripe-webhooks (every minute),
which calls processStripeEvent() with exponential backoff and explicit retry
tracking. A P0 alert fires if a row is abandoned (exceeded max retries without
success). Visibility UI at /founder/[token]/webhook-inbox.
No provisioning logic runs inline in the webhook handler, ever.
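The retry bookkeeping described above might look like the sketch below. MAX_ATTEMPTS, BASE_DELAY_MS, the WorkerRow fields, and every function name other than processStripeEvent are assumptions for illustration; this ADR does not state the real limits or schema.

```typescript
type RowStatus = "pending" | "succeeded" | "abandoned";

// Hypothetical view of an inbox row as the worker sees it.
interface WorkerRow {
  eventId: string;
  status: RowStatus;
  attempts: number;
  nextAttemptAt: number; // epoch milliseconds
  lastError: string | null;
}

const MAX_ATTEMPTS = 5; // assumed limit
const BASE_DELAY_MS = 60_000; // assumed base delay: one cron interval

// Exponential backoff: 1x, 2x, 4x, ... the base delay after each failed attempt.
function backoffMs(attempts: number): number {
  return BASE_DELAY_MS * 2 ** (attempts - 1);
}

// Records the outcome of one attempt. Returns true when the row has just been
// abandoned, which is the moment the P0 alert should fire.
function recordAttempt(row: WorkerRow, now: number, error: Error | null): boolean {
  row.attempts += 1;
  if (error === null) {
    row.status = "succeeded";
    row.lastError = null;
    return false;
  }
  row.lastError = error.message;
  if (row.attempts >= MAX_ATTEMPTS) {
    row.status = "abandoned";
    return true;
  }
  row.nextAttemptAt = now + backoffMs(row.attempts);
  return false;
}

// One cron tick: process every pending row whose backoff window has elapsed.
async function runDueRows(
  rows: WorkerRow[],
  now: number,
  processStripeEvent: (row: WorkerRow) => Promise<void>,
  alertP0: (row: WorkerRow) => void,
): Promise<void> {
  for (const row of rows) {
    if (row.status !== "pending" || row.nextAttemptAt > now) continue;
    let error: Error | null = null;
    try {
      await processStripeEvent(row);
    } catch (err) {
      error = err instanceof Error ? err : new Error(String(err));
    }
    if (recordAttempt(row, now, error)) alertP0(row);
  }
}
```

Keeping the state transition in a synchronous function separate from the I/O loop makes the retry policy easy to unit test without a database or a Stripe event.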
Rationale
- Durability: Stripe's 200 now only means "we received this event." The event row is the durable record. Processing can be retried, debugged, and monitored independently of the webhook delivery window.
- Visibility: The inbox UI surfaces every event, its processing status, and any error. A row whose processing throws stays in a failed state, visible to the founder, so a failure the worker sees can no longer pass silently.
- Separation of concerns: Webhook delivery (Stripe → our endpoint) is decoupled from provisioning (our code → Supabase, Telnyx, LiveKit). A provisioning bug can be fixed and rows replayed without involving Stripe.
- P0 alert: If a row exceeds max retries, the founder gets an alert before the customer notices. The previous architecture had no such signal.
Consequences
- Good: Live payment loss incident cannot recur. Every Stripe event is auditable. Failed provisioning is retryable without customer impact.
- Bad: Processing now has up to 60-second latency (cron interval). For checkout completion, this means the customer may wait up to a minute for their welcome email and dashboard access. Acceptable tradeoff — the alternative is silent loss.
- Reversible? Yes — returning to inline provisioning is a one-PR change. Not recommended.
- Known remaining gap: Inner try/catches inside provisionNewChurch (identity/voice/email/chatbot sub-steps) still swallow individual step errors. A welcome-email failure logs CRITICAL, but the worker still marks the row succeeded. Follow-up: thread per-step outcomes into provisioning_summary and re-throw on critical steps.
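One way the follow-up for the known gap could look is sketched below. It is not the current provisionNewChurch code: the StepOutcome shape, the ProvisioningError class, and both function names are hypothetical, and which steps count as critical is left as a parameter because this ADR does not say.

```typescript
type StepName = "identity" | "voice" | "email" | "chatbot";

// One entry per sub-step; the array as a whole plays the role of provisioning_summary.
interface StepOutcome {
  step: StepName;
  ok: boolean;
  error: string | null;
}

// Carries the full summary so the worker can store it on the failed row.
class ProvisioningError extends Error {
  summary: StepOutcome[];
  constructor(summary: StepOutcome[], failed: StepName[]) {
    super(`critical provisioning steps failed: ${failed.join(", ")}`);
    this.name = "ProvisioningError";
    this.summary = summary;
  }
}

// Runs every step and records its outcome instead of swallowing the error.
async function runSteps(steps: Array<[StepName, () => Promise<void>]>): Promise<StepOutcome[]> {
  const summary: StepOutcome[] = [];
  for (const [step, run] of steps) {
    try {
      await run();
      summary.push({ step, ok: true, error: null });
    } catch (err) {
      summary.push({ step, ok: false, error: err instanceof Error ? err.message : String(err) });
    }
  }
  return summary;
}

// Re-throws when any step in the critical set failed, so the worker marks the
// row failed and retries it instead of marking it succeeded.
function throwIfCriticalFailed(summary: StepOutcome[], critical: ReadonlySet<StepName>): void {
  const failed = summary.filter((o) => !o.ok && critical.has(o.step)).map((o) => o.step);
  if (failed.length > 0) throw new ProvisioningError(summary, failed);
}
```

This sketch runs all steps before deciding whether to throw, which keeps the summary complete. Stopping at the first critical failure is an equally valid choice if later steps depend on earlier ones.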
Alternatives considered
- Fix the inline handler more carefully — rejected; the pattern itself is the problem. No amount of error handling in an inline webhook prevents Stripe from trusting a 200 that was returned before provisioning completed.
- Queue-based processing (SQS/Upstash) — considered; adds another vendor and operational surface. The cron-over-DB pattern achieves the same durability with infrastructure already present (Supabase + Vercel cron).
Links
- DECISION_LOG entry: ## 2026-04-14 (continued — Stripe webhook inbox pattern)
- Memory: ~/.claude/projects/C--dev/memory/project_stripe_webhook_inbox.md
- Related decision: 2026-04-14-critical-path-gate
- Code: churchwiseai-web/src/app/api/stripe/webhook/route.ts
- Code: churchwiseai-web/src/app/api/cron/process-stripe-webhooks/route.ts