Chatbot Excellence Blueprint
Purpose: Define the permanent architectural principles that make a world-class agentic chatbot. Specific LLMs, providers, and tools are pluggable — they change. The principles don't.
Rule: When evaluating any chatbot change, check this document first. When a new model launches, update the vendor slots — not the architecture.
Response Cascade Pipeline
1. Response Timing Thresholds (Permanent)
These are rooted in human cognition (Nielsen, 1993) and validated against 2025-2026 AI chatbot research. They don't change with technology.
| Threshold | Human Perception | Chatbot Requirement |
|---|---|---|
| < 200ms | Brain labels it "instant" | Visual receipt of user message + typing indicator |
| < 500ms | Flow of thought uninterrupted | Meaningful acknowledgment micro-copy (HEAR "Hear") |
| < 1s | Noticeable but acceptable | Time-to-first-token (TTFT) for streaming |
| < 2s | Edge of patience for text chat | First substantive content visible |
| < 5s | Danger zone — 59% expect response by now | Complete response for simple queries |
| > 10s | Attention breaks entirely | Must have rich progress indicators |
| > 2min | 58% abandon completely | System is broken |
Max acceptable dead silence (no visual feedback): 2 seconds.
2. The 5-Layer Response Pattern (Permanent)
Every chatbot response should flow through these layers. Each can be optimized independently.
Layer 1: Visual Receipt (< 200ms)
- Optimistic UI — user message appears instantly before server confirms
- Typing indicator or avatar + animation in bot area
- No network dependency — purely client-side
Layer 2: Empathetic Acknowledgment (< 500ms)
- NOT just a typing indicator — meaningful micro-copy that proves understanding
- Pattern-match emotional content client-side or via fast classifier
- Examples: "I hear that this is weighing on you..." / "Let me check that for you..."
- For pastoral context: this is HEAR "Hear" — the most critical moment
- Context-sensitive: detect urgency vs emotional distress. Urgent = skip to solution. Emotional = acknowledge first.
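The urgent-versus-emotional branch in Layer 2 can be sketched as a client-side pattern match. This is a minimal illustration, not the production classifier: the keyword lists, the `detectSignal` and `acknowledgment` names, and the `Signal` type are all assumptions made for the example, and crisis/safety detection is deliberately out of scope here.

```typescript
// Hypothetical signal categories; the keyword lists below are illustrative only.
type Signal = "urgent" | "emotional" | "neutral";

const URGENT = /\b(urgent|asap|right now|emergency|immediately)\b/i;
const EMOTIONAL = /\b(grief|grieving|lost my|passed away|lonely|afraid|anxious|struggling|hurting)\b/i;

function detectSignal(message: string): Signal {
  // Urgency is checked first: urgent users should skip straight to the solution.
  if (URGENT.test(message)) return "urgent";
  if (EMOTIONAL.test(message)) return "emotional";
  return "neutral";
}

// Returns the micro-copy to show within the 500ms budget, or null to show none.
function acknowledgment(message: string): string | null {
  switch (detectSignal(message)) {
    case "urgent":
      return null; // no acknowledgment copy; go directly to the answer
    case "emotional":
      return "I hear that this is weighing on you...";
    default:
      return "Let me check that for you...";
  }
}
```

Because the match runs entirely in the browser, it adds no network round trip; a fast server-side classifier could later replace `detectSignal` without changing the calling code.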
Layer 3: Streaming First Tokens (500ms - 2s)
- Token-by-token streaming via SSE
- Status: submitted → streaming → ready
- Users see reasoning unfold, which feels "instant, alive, and trustworthy"
Layer 4: Progressive Enrichment (2s - 10s)
- Show what the agent is doing: "Searching knowledge base..." "Checking service times..."
- Surface partial results as they arrive
- Collapsible reasoning blocks for complex queries
Layer 5: Complete Response + Follow-up
- Full response with source attribution
- Proactive next-step suggestions (HEAR "Advance")
- Quick-reply buttons steering toward supported actions
3. The 5-Tier Response Cascade (Permanent)
Exit as early as possible. Each tier is faster and cheaper than the next.
| Tier | Mechanism | Target Latency | When to Use |
|---|---|---|---|
| 1. Exact Match | Intent detection → hardcoded response | < 50ms | Known questions with precise required answers |
| 2. Semantic Cache | Embed query → compare to cached embeddings | < 100ms | Repeated or near-identical questions |
| 3. Direct Retrieval | Hybrid search (keyword + vector) → return top chunk | 100-300ms | FAQ-style questions with high-confidence match |
| 4. LLM with Context | Standard RAG: retrieve → inject → generate | 1-3s | Novel questions requiring synthesis |
| 5. Agentic Multi-Step | Multi-round tool use + reasoning | 3-15s | Actions, complex multi-hop queries |
Principle: Most church questions ("What time are services?") are Tier 1-3. They should NEVER hit an LLM.
4. Multi-Stage Pipeline (Permanent Architecture)
World-class = multi-stage pipeline. Mediocre = single LLM call.
User Message
│
├─ [< 1ms] Intent Classification (embeddings + cosine, NOT an LLM)
│
├─ [< 50ms] Tier 1: Exact Match Check
│ └─ Hit? → Return immediately
│
├─ [< 100ms] Tier 2: Semantic Cache Check
│ └─ Hit? → Return cached response
│
├─ [100-300ms] Tier 3: Hybrid Retrieval (BM25 + Vector)
│ ├─ High confidence (> 0.90)? → Return directly (light formatting)
│ └─ Medium confidence (0.60-0.90)? → Feed to LLM as context
│
├─ [1-3s] Tier 4: LLM Generation with Retrieved Context
│ ├─ Route to fastest model for simple queries
│ └─ Route to smartest model for pastoral/complex
│
└─ [3-15s] Tier 5: Agentic Tool Loop
    ├─ Tool dispatch via fast model
    ├─ Parallel tool execution
    └─ Final text via quality model
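The early-exit flow in the diagram above can be sketched as an ordered list of tiers, each of which either answers or declines. This is a sketch under stated assumptions: the `Tier` type, `runCascade`, and the stub tiers are hypothetical names, and the canned service-time answer is made-up sample data.

```typescript
// A tier returns an answer, or null to fall through to the next (slower) tier.
type Tier = { name: string; handle: (query: string) => Promise<string | null> };

async function runCascade(
  query: string,
  tiers: Tier[],
): Promise<{ tier: string; answer: string } | null> {
  for (const tier of tiers) {
    const answer = await tier.handle(query);
    if (answer !== null) return { tier: tier.name, answer }; // exit as early as possible
  }
  return null; // every tier declined; the caller chooses the abstention copy
}

// Stub tiers for illustration; real ones would call a cache, a search index, or an LLM.
const exactAnswers: Record<string, string> = {
  "what time are services?": "Sundays at 9am and 11am.", // made-up sample data
};

const demoTiers: Tier[] = [
  {
    name: "exact-match",
    handle: async (q) => {
      const key = q.trim().toLowerCase();
      return Object.prototype.hasOwnProperty.call(exactAnswers, key) ? exactAnswers[key] : null;
    },
  },
  { name: "llm-with-context", handle: async (q) => `LLM answer for: ${q}` },
];
```

Keeping tiers as data means a new tier (for example the semantic cache) can be added by inserting one entry at the right position, without touching the loop.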
5. Model Routing Strategy (Permanent Principle, Pluggable Vendors)
Principle: Different tasks need different models. No single model is best at everything.
Routing Categories (Permanent)
| Category | What It Needs | Selection Criteria |
|---|---|---|
| Intent Classification | Speed, accuracy on short inputs | Fastest TTFT, lowest cost |
| Tool Dispatch | Speed, tool-selection accuracy | Fastest TTFT, good function-calling |
| Simple Factual Response | Speed, accuracy | Fast, cheap, good at short answers |
| Empathetic Pastoral Response | Empathy, nuance, warmth | Best at tone, longer generation OK |
| Complex Reasoning | Depth, multi-step logic | Smartest model, cost secondary |
| Crisis/Safety Detection | Recall (never miss), speed | Pattern match first, LLM as backup |
| Post-Conversation Summary | Accuracy, low cost | Can be async, cheapest capable model |
Current Vendor Slots (Re-evaluate quarterly)
Last evaluated: 2026-03-31
| Slot | Current Best | TTFT | Output Speed | Cost (input/output per 1M) | Notes |
|---|---|---|---|---|---|
| Fastest (dispatch/classify) | Gemini 2.5 Flash-Lite | 0.32s | 275+ t/s | $0.15 / $1.25 | Purpose-built for routing |
| Fast + capable | Gemini 2.5 Flash | 0.70s | 232 t/s | $0.30 / $2.50 | Good all-rounder |
| Empathetic text | Claude Haiku 4.5 | 0.69s | 86 t/s | $1.00 / $5.00 | Best pastoral tone |
| Deep reasoning | Claude Sonnet 4.6 | ~1.5s | ~80 t/s | $3.00 / $15.00 | Complex/sensitive |
| Budget fallback | GPT-5.4 Nano | ~0.3s | ~200 t/s | $0.20 / $1.25 | Brand new (Mar 2026) |
| Embeddings | text-embedding-3-small | N/A | N/A | ~$0.02 / 1M tokens | OpenAI |
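One way to keep the permanent routing categories separate from the pluggable vendor slots is two lookup tables. The sketch below is illustrative: the category-to-slot assignments are assumptions inferred from the two tables above, the strings are display names from the vendor table rather than API model IDs, and `pickModel` is a hypothetical helper.

```typescript
type Category =
  | "intent-classification"
  | "tool-dispatch"
  | "simple-factual"
  | "empathetic-pastoral"
  | "complex-reasoning"
  | "crisis-detection"
  | "post-conversation-summary";

type Slot = "fastest" | "fast-capable" | "empathetic" | "deep-reasoning" | "budget";

// Permanent: which kind of model each category needs (assignments are assumptions).
const CATEGORY_TO_SLOT: Record<Category, Slot> = {
  "intent-classification": "fastest",
  "tool-dispatch": "fastest",
  "simple-factual": "fast-capable",
  "empathetic-pastoral": "empathetic",
  "complex-reasoning": "deep-reasoning",
  "crisis-detection": "fastest", // LLM is the backup only; pattern matching runs first
  "post-conversation-summary": "budget",
};

// Pluggable: edit only this table at the quarterly re-evaluation.
// Values are display names, not provider API model IDs.
const SLOT_TO_MODEL: Record<Slot, string> = {
  fastest: "Gemini 2.5 Flash-Lite",
  "fast-capable": "Gemini 2.5 Flash",
  empathetic: "Claude Haiku 4.5",
  "deep-reasoning": "Claude Sonnet 4.6",
  budget: "GPT-5.4 Nano",
};

function pickModel(category: Category): string {
  return SLOT_TO_MODEL[CATEGORY_TO_SLOT[category]];
}
```

When a new model launches, only `SLOT_TO_MODEL` changes; the compiler flags any category or slot that is left unmapped.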
Reliability (as of 2026-03-31)
- Anthropic: 99.04% uptime, frequent short outages
- Google: Fewer outages but longer when they happen (median 44h)
- Verdict: Multi-provider fallback is mandatory, not optional
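Multi-provider fallback can be sketched as an ordered list of providers tried in turn, each with a timeout. The `Provider` type, `withTimeout`, `generateWithFallback`, and the 5-second default are assumptions for illustration; real provider clients would sit behind `generate`.

```typescript
type Provider = { name: string; generate: (prompt: string) => Promise<string> };

// Rejects if the promise has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Tries providers in order and returns the first successful result.
async function generateWithFallback(
  prompt: string,
  providers: Provider[],
  timeoutMs = 5000,
): Promise<{ provider: string; text: string }> {
  const failures: string[] = [];
  for (const provider of providers) {
    try {
      const text = await withTimeout(provider.generate(prompt), timeoutMs);
      return { provider: provider.name, text };
    } catch (error) {
      const reason = error instanceof Error ? error.message : String(error);
      failures.push(`${provider.name}: ${reason}`);
    }
  }
  throw new Error(`all providers failed: ${failures.join("; ")}`);
}
```

Note that this timeout only stops waiting; it does not cancel the underlying request. A production version would likely pass an `AbortSignal` to each provider call as well.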
6. Caching Strategy (Permanent Principles)
Prompt/Prefix Caching
- Principle: Static content (system prompt, tool definitions, few-shot examples) goes at the TOP. Variable content (user message, history) goes at the BOTTOM.
- Major providers discount cached input tokens heavily, up to ~90% on cache hits; exact discounts vary by provider and model
- Cache lifetime varies by provider (Anthropic: 5 min, refreshed on use)
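The static-at-top, variable-at-bottom ordering can be sketched as a prompt builder. This is an assumption-laden illustration: `Message`, `StaticParts`, and `buildPrompt` are hypothetical names, and real provider APIs usually take tool definitions as a separate parameter and may require explicit cache markers.

```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };

type StaticParts = {
  systemPrompt: string;
  toolDefinitions: string;
  fewShotExamples: string;
};

function buildPrompt(parts: StaticParts, priorTurns: Message[], userMessage: string): Message[] {
  // Static prefix: must be byte-identical across requests, or the cache misses.
  // Keep timestamps, user names, and other per-request values out of it.
  const prefix = [parts.systemPrompt, parts.toolDefinitions, parts.fewShotExamples].join("\n\n");
  return [
    { role: "system", content: prefix }, // cacheable
    ...priorTurns,                       // grows per conversation
    { role: "user", content: userMessage }, // changes every request
  ];
}
```

Two requests built this way share an identical leading message, which is the property prefix caching depends on.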
Semantic Response Caching
- Principle: ~31% of chatbot queries are semantically similar. Cache them.
- Embed the query → compare to cached query embeddings → if similarity > threshold, return cached response
- Expected hit rate: 50-70% for a church chatbot (FAQ-heavy workload)
- Cache hit latency: < 100ms vs 1-5s for fresh generation
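The embed, compare, and threshold steps above can be sketched with a linear scan over cached embeddings. This is a minimal in-memory illustration: `CacheEntry`, `cosineSimilarity`, and `lookupCached` are hypothetical names, the 0.95 default mirrors the semantic-cache threshold used elsewhere in this document, and a real deployment would likely use a vector index instead of a scan.

```typescript
type CacheEntry = { embedding: number[]; response: string };

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as no similarity
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the cached response for the most similar query, or null on a miss.
function lookupCached(
  queryEmbedding: number[],
  entries: CacheEntry[],
  threshold = 0.95,
): string | null {
  let bestScore = -Infinity;
  let bestResponse: string | null = null;
  for (const entry of entries) {
    const score = cosineSimilarity(queryEmbedding, entry.embedding);
    if (score > bestScore) {
      bestScore = score;
      bestResponse = entry.response;
    }
  }
  return bestScore > threshold ? bestResponse : null;
}
```

The threshold should be tuned per embedding model: too low and users receive answers to a different question, too high and the hit rate collapses.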
Cache-Augmented Generation (CAG)
- Principle: For bounded knowledge bases (< 100K tokens), preloading ALL content into context may beat RAG
- Church-sized knowledge (few hundred entries) fits easily
- Eliminates retrieval step entirely
- Trade-off: higher per-call token cost vs zero retrieval latency
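Whether a knowledge base is small enough for CAG can be checked with a rough size estimate. The sketch below assumes about 4 characters per token, a coarse rule of thumb for English text; `estimateTokens` and `chooseKnowledgeStrategy` are hypothetical helpers, and a real check should use the target model's tokenizer.

```typescript
// Rough heuristic (assumption): about 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Returns "cag" when the whole knowledge base fits under the token budget.
function chooseKnowledgeStrategy(entries: string[], budgetTokens = 100000): "cag" | "rag" {
  const total = entries.reduce((sum, entry) => sum + estimateTokens(entry), 0);
  return total < budgetTokens ? "cag" : "rag";
}
```

For example, 300 entries of roughly 400 characters each come to about 30,000 estimated tokens, comfortably inside the 100K bound.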
7. Hybrid Retrieval (Permanent Architecture)
Pure vector search misses exact terms. Pure keyword search misses semantics. Always use both.
| Component | What It Catches | What It Misses |
|---|---|---|
| Vector Search | Semantic similarity ("ML models" = "machine learning") | Exact identifiers (phone numbers, names, Bible verses) |
| BM25 Keyword | Exact terms, proper nouns, IDs | Synonyms, context, paraphrases |
| Combined (RRF) | Both | Covers both gaps: 15-20% precision improvement over either alone |
Fusion: Reciprocal Rank Fusion (RRF) with k=60. Starting weights: Vector 0.7, BM25 0.3.
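A weighted form of RRF scores each document as the sum over rankers of weight / (k + rank), with rank counted from 1. The sketch below is one common way to apply the starting weights; `weightedRrf` and the sample document IDs are hypothetical.

```typescript
type Ranking = { ids: string[]; weight: number }; // ids ordered best-first

function weightedRrf(rankings: Ranking[], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const { ids, weight } of rankings) {
    ids.forEach((id, index) => {
      const rank = index + 1; // RRF ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Usage: fuse a vector ranking (weight 0.7) with a BM25 ranking (weight 0.3).
const fused = weightedRrf([
  { ids: ["a", "b", "c"], weight: 0.7 },
  { ids: ["b", "d"], weight: 0.3 },
]);
```

In this example document `b` ranks first after fusion because it appears in both lists, even though `a` led the vector ranking. RRF uses only rank positions, so the raw vector and BM25 scores never need to be normalized against each other.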
Confidence Thresholds (tune per embedding model):
| Decision | Threshold | Action |
|---|---|---|
| Semantic cache hit | > 0.95 | Return cached response |
| Direct retrieval answer | > 0.90 | Return top chunk (light formatting) |
| Standard RAG | 0.60-0.90 | Retrieve top-k, send to LLM |
| Query transformation | 0.30-0.60 | Apply HyDE/multi-query before retrieval |
| Abstention | < 0.30 | "I don't have that information" |
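The retrieval rows of the table can be sketched as a single decision function over the top retrieval score. This reading assumes that scores below 0.30 abstain and scores from 0.30 up to 0.60 trigger query transformation; `Decision` and `decide` are hypothetical names. The semantic-cache row is a separate query-to-query comparison and is not handled here.

```typescript
type Decision = "return-top-chunk" | "standard-rag" | "transform-query" | "abstain";

// Maps the top retrieval confidence score to the action in the table above.
function decide(topScore: number): Decision {
  if (topScore > 0.9) return "return-top-chunk"; // direct answer, light formatting
  if (topScore >= 0.6) return "standard-rag";    // retrieve top-k, send to the LLM
  if (topScore >= 0.3) return "transform-query"; // HyDE / multi-query, then retry
  return "abstain";                              // "I don't have that information"
}
```

Keeping the cut-offs in one function makes them easy to retune when the embedding model changes.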
8. Streaming Architecture (Permanent)
Principle: Token-by-token streaming is the standard. Full JSON POST → wait → display is a generation behind.
What Streaming Enables
- TTFT becomes the UX metric, not total response time
- Users tolerate 10s total if first token arrives in 500ms
- Tool execution progress visible in real-time
- Status states: submitted → streaming → ready
Implementation Pattern (Next.js)
- Server: streamText() → toUIMessageStreamResponse() (SSE)
- Client: useChat() hook with status-driven UI
- Multi-step tool loops stream automatically
- prepareStep callback enables per-step model switching
9. HEAR Mapping to Chatbot Layers (Permanent for ChurchWiseAI)
| HEAR Step | Implementation | Timing |
|---|---|---|
| Hear | Lightweight classifier + empathetic micro-copy | < 500ms |
| Empathize | Tone-matched streaming tokens, adapted to detected emotion | 500ms-2s |
| Advance | Quick-reply buttons + proactive suggestions, move conversation forward | With complete response |
| Respond | Tool calls to connect to resources, invite next steps, capture what matters | 2-5s |
10. Testing & Monitoring Benchmarks (Permanent)
Latency SLAs
| Metric | Target |
|---|---|
| TTFT P50 | < 500ms |
| TTFT P95 | < 1,500ms |
| Total response P50 | < 2s (simple), < 5s (complex) |
| Total response P95 | < 5s (simple), < 10s (complex) |
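Checking the TTFT targets requires a percentile over recorded samples. The sketch below uses the nearest-rank method, which is one of several percentile definitions; `percentile` and `meetsTtftSla` are hypothetical helpers.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samples.slice().sort((a, b) => a - b); // numeric sort, input untouched
  const rank = Math.ceil((p / 100) * sorted.length);    // 1-based rank
  return sorted[rank - 1];
}

// TTFT targets from the table above: P50 < 500ms and P95 < 1,500ms.
function meetsTtftSla(ttftMs: number[]): boolean {
  return percentile(ttftMs, 50) < 500 && percentile(ttftMs, 95) < 1500;
}
```

With few samples the P95 is effectively the maximum, so a single slow request fails the check; this is one reason to evaluate SLAs over a reasonably large window.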
Quality Benchmarks
| Metric | Mediocre | Good | World-Class |
|---|---|---|---|
| Resolution rate | 20-40% | 50-65% | 80-95% |
| First response latency | 3-8s | 1-3s | < 1s (streaming) |
| Knowledge accuracy | 70-80% | 85-90% | 95-99.9% |
| Human escalation rate | 60-80% | 35-50% | 5-20% |
Critical Alerts (Page immediately)
- Error rate > 5%
- TTFT P95 > 3s
- Empty responses > 2%
- Tool failure rate > 10%
- Hallucination score < 0.7 (a faithfulness-style score where higher is better)
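The paging rules above can be sketched as a pure function over a metrics snapshot. The `Metrics` shape, the alert names, and `criticalAlerts` are assumptions for illustration; rates are expressed as fractions between 0 and 1, and the hallucination score is assumed to be higher-is-better.

```typescript
type Metrics = {
  errorRate: number;          // fraction of requests, 0-1
  ttftP95Ms: number;          // milliseconds
  emptyResponseRate: number;  // fraction of responses, 0-1
  toolFailureRate: number;    // fraction of tool calls, 0-1
  hallucinationScore: number; // assumed: higher is better
};

// Returns the names of every critical alert that the snapshot triggers.
function criticalAlerts(m: Metrics): string[] {
  const alerts: string[] = [];
  if (m.errorRate > 0.05) alerts.push("error-rate");
  if (m.ttftP95Ms > 3000) alerts.push("ttft-p95");
  if (m.emptyResponseRate > 0.02) alerts.push("empty-responses");
  if (m.toolFailureRate > 0.1) alerts.push("tool-failures");
  if (m.hallucinationScore < 0.7) alerts.push("hallucination-score");
  return alerts;
}
```

A non-empty result would page the on-call person; an empty array means no critical threshold is breached.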
Golden Test Set Structure
- Core functionality (service times, directions, staff) — ~30 cases
- Tool invocation (prayer, contacts, callbacks) — ~20 cases
- Empathy / pastoral care (grief, crisis) — ~15 cases
- Guardrails (theology, off-topic, jailbreak) — ~15 cases
- Multi-turn coherence — ~10 cases
- Edge cases (typos, multilingual, ambiguous) — ~10 cases
11. Competitive Landscape (Re-evaluate quarterly)
Last evaluated: 2026-03-31
Top Agentic Chatbot Architectures
| Company | Key Innovation | Resolution Rate |
|---|---|---|
| Intercom Fin | Custom retrieval/reranking models, 7-phase pipeline | 66% avg, 86% for some |
| Sierra | Constellation of 15+ specialized models | 70%+ |
| Decagon | Agent Operating Procedures (natural language → deterministic logic) | 70-83% |
| Ada | Reasoning Engine + Playbooks | 80%+ |
| Cognigy | Nexus Engine + MCP interop | Enterprise focus |
What Separates World-Class from Mediocre
- Multi-stage pipeline, not single LLM call
- Sub-millisecond intent classification BEFORE the LLM
- Parallel tool execution + speculative pre-fetching
- Semantic caching (30%+ LLM call elimination)
- Knowledge-first, tools-second
- Streaming from first token
- Continuous self-improvement via automated quality scoring
12. Recommended Tooling Stack (Pluggable)
Last evaluated: 2026-03-31
| Layer | Current Recommendation | Why | Alternatives |
|---|---|---|---|
| SDK | Vercel AI SDK 6 | Native streaming, multi-provider, tool loops | LangChain, custom |
| Tracing | Langfuse (open source) | Free, deep tracing, cost tracking, A/B testing | LangSmith, Datadog |
| Evaluation | DeepEval | 60+ metrics, pytest integration, CI/CD gates | Promptfoo, Braintrust |
| Red Teaming | Promptfoo | Open source, used by OpenAI/Anthropic | Microsoft AI Red Team |
| Semantic Cache | Redis LangCache or pgvector-based | Sub-ms cache hits | GPTCache |
| Vector DB | Supabase pgvector (already in use) | Already deployed | Pinecone, Weaviate |
Appendix: Research Sources
Research compiled 2026-03-31 from 8 parallel research agents covering:
- Streaming/SSE architecture
- Instant acknowledgment UX patterns
- RAG direct-answer patterns
- Model latency benchmarks (Gemini vs Claude vs GPT)
- Agentic chatbot architecture (Intercom, Sierra, Decagon, Ada)
- Testing and monitoring best practices
- Church chatbot market landscape
- Vercel AI SDK patterns
Full research outputs are archived in the session. Key sources include Artificial Analysis benchmarks, Vercel AI SDK docs, the LangChain State of Agent Engineering survey (1,340 respondents), arXiv papers on SR-RAG/CRAG/CAG, and published architectures from Intercom Fin, Sierra, and Decagon.
13. Competitive Landscape — Church Chatbot Market (2026-03-31)
Nobody Combines Chat + Voice + Agentic Tools
| Competitor | Chat | Voice/Phone | Agentic Tools | Theological Awareness | ChMS Integration |
|---|---|---|---|---|---|
| Gloo + Faith Assistant | Yes | No | No | Trained on church sermons | No |
| AgentiveAIQ | Yes | No | No | No | No |
| Pastors.ai | Yes | No | No | No | No |
| ChurchBot.chat | Yes | SMS relay | Forms only | No | No |
| OnlineGiving.org | Yes | Voice giving | Sign-ups only | No | Own platform |
| My AI Front Desk | No | Yes | Basic | No | No |
| Zanus AI | Yes | No | Scheduling, volunteers | No | Planning Center |
| ChurchWiseAI | Yes | Yes | 39 tools, 8 categories | 17 traditions | Planned |
What Church Leaders Complain About (Barna, Christianity Today)
- "Feels like a search engine, not a person" — no empathy
- "Doesn't know MY church" — generic, not church-specific
- "I don't trust the theology" — no source transparency
- "Nobody answers our phone" — 55% of churches unreachable by phone
- "Data privacy scares me" — confession-level data shared with AI
- "Replaces human connection instead of enhancing it" — bridge to pastor, not replacement
- "Onboarding is too complex" — pastors are not technical
- "It can't take action" — answers questions but doesn't DO anything
ChurchWiseAI's Unrealized Competitive Moat
ChurchWiseAI is the ONLY product that combines: chat + voice + 39 agentic tools + 17 theological traditions + per-church doctrinal config + HEAR empathetic protocol + crisis/safety protocol. No competitor has more than 2 of these. The gap is in EXECUTION (latency, streaming, caching) — not features.
Pricing Context
Market clusters at $30-130/mo for mid-tier church chatbots. Nobody bundles voice. ChurchWiseAI's pricing ($14.95-$99.95/mo) is competitive-to-aggressive.
14. Streaming Architecture Details (Permanent)
Protocol: SSE over HTTP POST
- SSE is the 2025-2026 standard for LLM streaming (OpenAI, Anthropic, Google all use it)
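- Browser's native EventSource only supports GET, so use fetch() + a ReadableStream reader
- Anthropic uses named events: message_start, content_block_start, content_block_delta, content_block_stop, message_delta, message_stop (tool calls arrive as content blocks of type tool_use, not as a separate event)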
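Reading SSE over POST with `fetch()` means parsing the event stream by hand. The sketch below is a simplified parser that handles only the `event:` and `data:` fields (it ignores `id:`, `retry:`, and comment lines); `createSseParser`, `streamChat`, and the `SseEvent` type are hypothetical names, while `fetch`, `ReadableStream`, and `TextDecoder` are standard web APIs.

```typescript
type SseEvent = { event: string; data: string };

// Incremental parser: feed() accepts arbitrary chunks and returns completed events.
function createSseParser(): { feed: (chunk: string) => SseEvent[] } {
  let buffer = "";
  return {
    feed(chunk: string): SseEvent[] {
      // Normalize the whole buffer so a CRLF split across chunks is still handled.
      buffer = (buffer + chunk).replace(/\r\n/g, "\n");
      const events: SseEvent[] = [];
      let boundary = buffer.indexOf("\n\n"); // a blank line terminates an event
      while (boundary !== -1) {
        const block = buffer.slice(0, boundary);
        buffer = buffer.slice(boundary + 2);
        let event = "message"; // default event name when none is given
        const data: string[] = [];
        for (const line of block.split("\n")) {
          if (line.startsWith("event:")) event = line.slice(6).trim();
          else if (line.startsWith("data:")) data.push(line.slice(5).replace(/^ /, ""));
        }
        if (data.length > 0) events.push({ event, data: data.join("\n") });
        boundary = buffer.indexOf("\n\n");
      }
      return events;
    },
  };
}

// POSTs a JSON body and invokes onEvent for each SSE event as it arrives.
async function streamChat(
  url: string,
  body: unknown,
  onEvent: (e: SseEvent) => void,
): Promise<void> {
  const response = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json", Accept: "text/event-stream" },
    body: JSON.stringify(body),
  });
  if (!response.ok || !response.body) throw new Error(`stream failed: ${response.status}`);
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  const parser = createSseParser();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    // { stream: true } keeps multi-byte characters intact across chunk boundaries.
    parser.feed(decoder.decode(value, { stream: true })).forEach(onEvent);
  }
}
```

The parser is kept separate from the network code so it can be unit-tested with plain strings; network chunks rarely align with event boundaries, which is why it buffers until a blank line appears.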
Vercel AI SDK 6 Standard Architecture
Server: streamText() → toUIMessageStreamResponse() (SSE)
Client: useChat() hook with status-driven UI
Status states: submitted → streaming → ready
Tool states: input-streaming → input-available → output-available
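Status-driven UI amounts to mapping the current chat status and tool state to the indicator shown. The sketch below covers only the states listed above; `indicatorFor` and the indicator copy are hypothetical, and a real client would also need to handle an error state.

```typescript
type ChatStatus = "submitted" | "streaming" | "ready";
type ToolState = "input-streaming" | "input-available" | "output-available";

// Returns the indicator text to display, or null when no indicator is needed.
function indicatorFor(status: ChatStatus, toolState?: ToolState): string | null {
  if (status === "ready") return null; // response complete: hide all indicators
  if (toolState === "input-streaming" || toolState === "input-available") {
    return "Working on it..."; // a tool call is being prepared or is running
  }
  if (status === "submitted") return "Thinking..."; // sent, no tokens yet
  return null; // tokens are streaming: the text itself is the feedback
}
```

Driving the indicator from state rather than timers keeps the UI from going silent while a tool call is in flight.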
Per-Step Model Switching (prepareStep)
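// fast() and smart() are assumed to be model factories defined elsewhere.
prepareStep: async ({ stepNumber }) => {
  // Step 0 only classifies and dispatches tools, so use the fastest model.
  if (stepNumber === 0) return { model: fast('flash-lite') };
  // Later steps write user-facing text, so use the higher-quality model.
  return { model: smart('claude-haiku') };
},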
AG-UI Protocol (Emerging Standard)
Backed by CopilotKit, LangChain, Microsoft, Oracle. Event flow:
RUN_STARTED → STEP_STARTED → TEXT_MESSAGE_START → TEXT_MESSAGE_CONTENT (tokens)
→ TOOL_CALL_START → TOOL_CALL_ARGS → TOOL_CALL_RESULT → STEP_FINISHED → RUN_FINISHED