Chatbot Excellence Blueprint

Purpose: Define the permanent architectural principles that make a world-class agentic chatbot. Specific LLMs, providers, and tools are pluggable — they change. The principles don't.

Rule: When evaluating any chatbot change, check this document first. When a new model launches, update the vendor slots — not the architecture.

Response Cascade Pipeline

1. Response Timing Thresholds (Permanent)

These are rooted in human cognition (Nielsen, 1993) and validated against 2025-2026 AI chatbot research. They don't change with technology.

Threshold	Human Perception	Chatbot Requirement
< 200ms	Brain labels it "instant"	Visual receipt of user message + typing indicator
< 500ms	Flow of thought uninterrupted	Meaningful acknowledgment micro-copy (HEAR "Hear")
< 1s	Noticeable but acceptable	Time-to-first-token (TTFT) for streaming
< 2s	Edge of patience for text chat	First substantive content visible
< 5s	Danger zone — 59% expect response by now	Complete response for simple queries
> 10s	Attention breaks entirely	Must have rich progress indicators
> 2min	58% abandon completely	System is broken

Max acceptable dead silence (no visual feedback): 2 seconds.

2. The 5-Layer Response Pattern (Permanent)

Every chatbot response should flow through these layers. Each can be optimized independently.

Layer 1: Visual Receipt (< 200ms)

Optimistic UI — user message appears instantly before server confirms
Typing indicator or avatar + animation in bot area
No network dependency — purely client-side
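
One way to picture the optimistic step is as a pure state update that runs before any request is sent. The sketch below is illustrative only: the `ChatMessage` shape and the `addOptimistic` / `settle` helpers are assumed names, not part of this blueprint or of any SDK.

```typescript
type MessageStatus = "pending" | "confirmed" | "failed";

interface ChatMessage {
  id: string;
  role: "user" | "bot";
  text: string;
  status: MessageStatus;
}

// Append the user's message locally, before any network call is made.
function addOptimistic(messages: ChatMessage[], id: string, text: string): ChatMessage[] {
  const pending: ChatMessage = { id, role: "user", text, status: "pending" };
  return messages.concat([pending]);
}

// Once the server answers (or fails), update only the matching message.
function settle(messages: ChatMessage[], id: string, ok: boolean): ChatMessage[] {
  return messages.map((m): ChatMessage =>
    m.id === id ? { ...m, status: ok ? "confirmed" : "failed" } : m,
  );
}
```

Because both helpers are synchronous and client-side, rendering the message does not wait on the network, which is what keeps this layer inside the 200ms budget.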

Layer 2: Empathetic Acknowledgment (< 500ms)

NOT just a typing indicator — meaningful micro-copy that proves understanding
Pattern-match emotional content client-side or via fast classifier
Examples: "I hear that this is weighing on you..." / "Let me check that for you..."
For pastoral context: this is HEAR "Hear" — the most critical moment
Context-sensitive: detect urgency vs emotional distress. Urgent = skip to solution. Emotional = acknowledge first.
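
The urgency-versus-distress branch could be sketched client-side roughly as follows. The keyword patterns and the `classifyTone` / `acknowledgment` names are assumptions made for illustration; a real deployment would likely pair something like this with the fast classifier mentioned above.

```typescript
type Tone = "urgent" | "emotional" | "neutral";

// Illustrative patterns only; a real list would be tuned on actual conversations.
const URGENT_PATTERN = /\b(urgent|asap|right now|emergency|immediately)\b/i;
const EMOTIONAL_PATTERN = /\b(grief|grieving|lost my|passed away|struggling|lonely|afraid|hurting)\b/i;

function classifyTone(message: string): Tone {
  // Urgency is checked first: urgent messages skip straight to the solution.
  if (URGENT_PATTERN.test(message)) return "urgent";
  if (EMOTIONAL_PATTERN.test(message)) return "emotional";
  return "neutral";
}

// Returns the micro-copy to show, or null when the bot should answer immediately.
function acknowledgment(message: string): string | null {
  switch (classifyTone(message)) {
    case "urgent":
      return null;
    case "emotional":
      return "I hear that this is weighing on you...";
    default:
      return "Let me check that for you...";
  }
}
```

Regex matching runs in well under a millisecond, so the 500ms budget is spent on rendering rather than classification.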

Layer 3: Streaming First Tokens (500ms - 2s)

Token-by-token streaming via SSE
Status: submitted → streaming → ready
Users see reasoning unfold — feels "instant, alive, and trustworthy"

Layer 4: Progressive Enrichment (2s - 10s)

Show what the agent is doing: "Searching knowledge base..." "Checking service times..."
Surface partial results as they arrive
Collapsible reasoning blocks for complex queries

Layer 5: Complete Response + Follow-up

Full response with source attribution
Proactive next-step suggestions (HEAR "Advance")
Quick-reply buttons steering toward supported actions

3. The 5-Tier Response Cascade (Permanent)

Exit as early as possible. Each tier is faster and cheaper than the next.

Tier	Mechanism	Target Latency	When to Use
1. Exact Match	Intent detection → hardcoded response	< 50ms	Known questions with precise required answers
2. Semantic Cache	Embed query → compare to cached embeddings	< 100ms	Repeated or near-identical questions
3. Direct Retrieval	Hybrid search (keyword + vector) → return top chunk	100-300ms	FAQ-style questions with high-confidence match
4. LLM with Context	Standard RAG: retrieve → inject → generate	1-3s	Novel questions requiring synthesis
5. Agentic Multi-Step	Multi-round tool use + reasoning	3-15s	Actions, complex multi-hop queries

Principle: Most church questions ("What time are services?") are Tier 1-3. They should NEVER hit an LLM.

4. Multi-Stage Pipeline (Permanent Architecture)

World-class = multi-stage pipeline. Mediocre = single LLM call.

User Message
    │
    ├─ [< 1ms] Intent Classification (embeddings + cosine, NOT an LLM)
    │
    ├─ [< 50ms] Tier 1: Exact Match Check
    │   └─ Hit? → Return immediately
    │
    ├─ [< 100ms] Tier 2: Semantic Cache Check
    │   └─ Hit? → Return cached response
    │
    ├─ [100-300ms] Tier 3: Hybrid Retrieval (BM25 + Vector)
    │   ├─ High confidence (> 0.90)? → Return directly (light formatting)
    │   └─ Medium confidence (0.60-0.90)? → Feed to LLM as context
    │
    ├─ [1-3s] Tier 4: LLM Generation with Retrieved Context
    │   ├─ Route to fastest model for simple queries
    │   └─ Route to smartest model for pastoral/complex
    │
    └─ [3-15s] Tier 5: Agentic Tool Loop
        ├─ Tool dispatch via fast model
        ├─ Parallel tool execution
        └─ Final text via quality model
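
The early-exit flow in the diagram can be sketched as an ordered list of tiers, each returning an answer or `null` to fall through. This is a minimal sketch under stated assumptions: the `Tier` shape, `runCascade`, and the demo tiers (including the sample service times) are hypothetical stand-ins, not the production pipeline.

```typescript
interface TierResult {
  tier: string;
  answer: string;
}

// A tier answers the query or returns null to hand off to the next, slower tier.
type Tier = { name: string; run: (query: string) => Promise<string | null> };

async function runCascade(query: string, tiers: Tier[]): Promise<TierResult | null> {
  for (const tier of tiers) {
    const answer = await tier.run(query);
    if (answer !== null) return { tier: tier.name, answer }; // exit as early as possible
  }
  return null; // no tier could answer; the caller decides how to abstain
}

// Hypothetical demo data and tiers, for illustration only.
const exactMatches = new Map<string, string>([
  ["what time are services?", "Sundays at 9:00 and 11:00 AM."],
]);

const demoTiers: Tier[] = [
  { name: "exact-match", run: async (q) => exactMatches.get(q.trim().toLowerCase()) ?? null },
  { name: "semantic-cache", run: async () => null }, // always a miss in this demo
  { name: "llm", run: async (q) => `(generated answer for: ${q})` },
];
```

Keeping every tier behind the same signature means a tier can be added, removed, or reordered without touching the loop.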

5. Model Routing Strategy (Permanent Principle, Pluggable Vendors)

Principle: Different tasks need different models. No single model is best at everything.

Routing Categories (Permanent)

Category	What It Needs	Selection Criteria
Intent Classification	Speed, accuracy on short inputs	Fastest TTFT, lowest cost
Tool Dispatch	Speed, tool-selection accuracy	Fastest TTFT, good function-calling
Simple Factual Response	Speed, accuracy	Fast, cheap, good at short answers
Empathetic Pastoral Response	Empathy, nuance, warmth	Best at tone, longer generation OK
Complex Reasoning	Depth, multi-step logic	Smartest model, cost secondary
Crisis/Safety Detection	Recall (never miss), speed	Pattern match first, LLM as backup
Post-Conversation Summary	Accuracy, low cost	Can be async, cheapest capable model

Current Vendor Slots (Re-evaluate quarterly)

Last evaluated: 2026-03-31

Slot	Current Best	TTFT	Output Speed	Cost (input/output per 1M)	Notes
Fastest (dispatch/classify)	Gemini 2.5 Flash-Lite	0.32s	275+ t/s	$0.15 / $1.25	Purpose-built for routing
Fast + capable	Gemini 2.5 Flash	0.70s	232 t/s	$0.30 / $2.50	Good all-rounder
Empathetic text	Claude Haiku 4.5	0.69s	86 t/s	$1.00 / $5.00	Best pastoral tone
Deep reasoning	Claude Sonnet 4.6	~1.5s	~80 t/s	$3.00 / $15.00	Complex/sensitive
Budget fallback	GPT-5.4 Nano	~0.3s	~200 t/s	$0.20 / $1.25	Brand new (Mar 2026)
Embeddings	text-embedding-3-small	N/A	N/A	~$0.02 / 1M tokens	OpenAI

Reliability (as of 2026-03-31)

Anthropic: 99.04% uptime, frequent short outages
Google: Fewer outages but longer when they happen (median 44h)
Verdict: Multi-provider fallback is mandatory, not optional
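
A multi-provider fallback can be as small as trying providers in order with a per-call timeout. The sketch below assumes a hypothetical `Provider` shape (any async function from prompt to text); it does not call a real vendor SDK, and the default timeout is an arbitrary placeholder.

```typescript
// Hypothetical provider shape; wrap each vendor SDK call to fit it.
type Provider = { name: string; generate: (prompt: string) => Promise<string> };

// Rejects if the work does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Tries each provider in order and returns the first successful answer.
async function generateWithFallback(
  prompt: string,
  providers: Provider[],
  timeoutMs = 5000,
): Promise<{ provider: string; text: string }> {
  const errors: string[] = [];
  for (const p of providers) {
    try {
      const text = await withTimeout(p.generate(prompt), timeoutMs);
      return { provider: p.name, text };
    } catch (err) {
      errors.push(`${p.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join("; ")}`);
}
```

Ordering the list by preference keeps the primary vendor in use whenever it is healthy, and the collected error strings make a total outage easier to diagnose.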

6. Caching Strategy (Permanent Principles)

Prompt/Prefix Caching

Principle: Static content (system prompt, tool definitions, few-shot examples) goes at the TOP. Variable content (user message, history) goes at the BOTTOM.
Major providers discount cached input tokens, by up to ~90% on cache hits; the exact discount varies by provider and model
Cache lifetime varies by provider (Anthropic: 5 min, refreshed on use)
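
The ordering rule can be illustrated with a plain string builder. Real provider APIs take structured messages rather than one string, so treat this as a sketch of the ordering only; `buildPrompt` and `ChatTurn` are assumed names.

```typescript
interface ChatTurn {
  role: "user" | "assistant";
  content: string;
}

// Static, cacheable content first; per-request content last.
function buildPrompt(
  systemPrompt: string,
  toolDefinitions: string[],
  fewShotExamples: string[],
  turns: ChatTurn[],
  userMessage: string,
): string {
  // Identical across requests, so a provider-side prefix cache can reuse it.
  const staticPrefix = [systemPrompt, ...toolDefinitions, ...fewShotExamples].join("\n\n");
  // Changes on every request, so it must come after the cacheable prefix.
  const variableSuffix = turns
    .map((t) => `${t.role}: ${t.content}`)
    .concat([`user: ${userMessage}`])
    .join("\n");
  return `${staticPrefix}\n\n${variableSuffix}`;
}
```

Any per-request value placed inside the static section (a timestamp, a user name) changes the prefix and defeats the cache, which is why the split is worth enforcing in code.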

Semantic Response Caching

Principle: ~31% of chatbot queries are semantically similar to earlier queries. Cache them.
Embed the query → compare to cached query embeddings → if similarity > threshold, return cached response
Expected hit rate: 50-70% for a church chatbot (FAQ-heavy workload)
Cache hit latency: < 100ms vs 1-5s for fresh generation
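
The lookup described above can be sketched with an in-memory store and cosine similarity. This is a minimal sketch: `SemanticCache` is an assumed name, the query is taken to be embedded already, and the linear scan stands in for whatever vector index the real cache uses.

```typescript
type Vector = number[];

function cosineSimilarity(a: Vector, b: Vector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

class SemanticCache {
  private entries: { embedding: Vector; response: string }[] = [];
  private threshold: number;

  constructor(threshold = 0.95) {
    this.threshold = threshold;
  }

  set(embedding: Vector, response: string): void {
    this.entries.push({ embedding, response });
  }

  // Returns the most similar cached response above the threshold, else null.
  get(queryEmbedding: Vector): string | null {
    let bestScore = this.threshold;
    let bestResponse: string | null = null;
    for (const entry of this.entries) {
      const score = cosineSimilarity(queryEmbedding, entry.embedding);
      if (score > bestScore) {
        bestScore = score;
        bestResponse = entry.response;
      }
    }
    return bestResponse;
  }
}
```

The default threshold of 0.95 is a starting point, not a constant: too low and users receive answers to a different question, too high and the hit rate collapses.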

Cache-Augmented Generation (CAG)

Principle: For bounded knowledge bases (< 100K tokens), preloading ALL content into context may beat RAG
Church-sized knowledge (few hundred entries) fits easily
Eliminates retrieval step entirely
Trade-off: higher per-call token cost vs zero retrieval latency

7. Hybrid Retrieval (Permanent Architecture)

Pure vector search misses exact terms. Pure keyword search misses semantics. Always use both.

Component	What It Catches	What It Misses
Vector Search	Semantic similarity ("ML models" = "machine learning")	Exact identifiers (phone numbers, names, Bible verses)
BM25 Keyword	Exact terms, proper nouns, IDs	Synonyms, context, paraphrases
Combined (RRF)	Both	15-20% precision improvement over either alone

Fusion: Reciprocal Rank Fusion (RRF) with k=60. Starting weights: Vector 0.7, BM25 0.3.
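
As a rough illustration, weighted RRF adds `weight / (k + rank)` for every list a document appears in. Classic RRF is unweighted; the per-list weights follow the starting weights given above, and the `weightedRrf` name and input shape are assumptions of this sketch.

```typescript
// Each ranking lists document IDs best-first, with the weight of its retriever.
function weightedRrf(
  rankings: { ids: string[]; weight: number }[],
  k = 60,
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.ids.forEach((id, index) => {
      const rank = index + 1; // RRF uses 1-based ranks
      scores.set(id, (scores.get(id) ?? 0) + ranking.weight / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Example: vector search ranks a, b, c; BM25 ranks c, a.
const fused = weightedRrf([
  { ids: ["a", "b", "c"], weight: 0.7 },
  { ids: ["c", "a"], weight: 0.3 },
]);
// fused order: a, c, b
```

Fused scores are small by construction (a top rank with k=60 contributes weight/61), so they are useful for ordering results but cannot be compared against 0-to-1 similarity thresholds.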

Confidence Thresholds (similarity scores on a 0-1 scale, not fused RRF scores; tune per embedding model):

Decision	Threshold	Action
Semantic cache hit	> 0.95	Return cached response
Direct retrieval answer	> 0.90	Return top chunk (light formatting)
Standard RAG	0.60-0.90	Retrieve top-k, send to LLM
Query transformation	0.30-0.60	Apply HyDE/multi-query before retrieval
Abstention	< 0.30	"I don't have that information"
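
The retrieval rows of the table can be read as non-overlapping bands on the top match's similarity score, with query transformation covering 0.30 up to 0.60 and abstention below 0.30. A minimal sketch under that reading follows; `chooseAction` is an assumed name, and the semantic-cache row is left out because it compares query to query rather than query to chunk.

```typescript
type RetrievalAction =
  | "direct-answer"
  | "standard-rag"
  | "query-transformation"
  | "abstain";

// Maps the top retrieval similarity (0-1 scale) to the next pipeline step.
function chooseAction(topSimilarity: number): RetrievalAction {
  if (topSimilarity > 0.9) return "direct-answer";         // return top chunk, light formatting
  if (topSimilarity >= 0.6) return "standard-rag";         // retrieve top-k, send to LLM
  if (topSimilarity >= 0.3) return "query-transformation"; // HyDE / multi-query, then retry
  return "abstain";                                        // "I don't have that information"
}
```

The cut-offs are the starting values from the table and would need re-tuning whenever the embedding model changes.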

8. Streaming Architecture (Permanent)

Principle: Token-by-token streaming is the standard. Full JSON POST → wait → display is a generation behind.

What Streaming Enables

TTFT becomes the UX metric, not total response time
Users tolerate 10s total if first token arrives in 500ms
Tool execution progress visible in real-time
Status states: submitted → streaming → ready

Implementation Pattern (Next.js)

Server: streamText() → toUIMessageStreamResponse() (SSE)
Client: useChat() hook with status-driven UI
Multi-step tool loops stream automatically
prepareStep callback enables per-step model switching

9. HEAR Mapping to Chatbot Layers (Permanent for ChurchWiseAI)

HEAR Step	Implementation	Timing
Hear	Lightweight classifier + empathetic micro-copy	< 500ms
Empathize	Tone-matched streaming tokens, adapted to detected emotion	500ms-2s
Advance	Quick-reply buttons + proactive suggestions, move conversation forward	With complete response
Respond	Tool calls to connect to resources, invite next steps, capture what matters	2-5s

10. Testing & Monitoring Benchmarks (Permanent)

Latency SLAs

Metric	Target
TTFT P50	< 500ms
TTFT P95	< 1,500ms
Total response P50	< 2s (simple), < 5s (complex)
Total response P95	< 5s (simple), < 10s (complex)
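
Checking the TTFT targets against collected samples might look like the sketch below. It uses the nearest-rank percentile definition, which is one of several common conventions; `percentile` and `meetsTtftSla` are assumed names.

```typescript
// Nearest-rank percentile over latency samples in milliseconds.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.max(Math.ceil((p / 100) * sorted.length), 1);
  return sorted[rank - 1] as number;
}

// Compares measured TTFT against the P50 < 500ms and P95 < 1,500ms targets.
function meetsTtftSla(ttftMs: number[]): { p50: number; p95: number; ok: boolean } {
  const p50 = percentile(ttftMs, 50);
  const p95 = percentile(ttftMs, 95);
  return { p50, p95, ok: p50 < 500 && p95 < 1500 };
}
```

With only a handful of samples P95 is effectively the maximum, so the check becomes meaningful once a window holds a few hundred requests.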

Quality Benchmarks

Metric	Mediocre	Good	World-Class
Resolution rate	20-40%	50-65%	80-95%
First response latency	3-8s	1-3s	< 1s (streaming)
Knowledge accuracy	70-80%	85-90%	95-99.9%
Human escalation rate	60-80%	35-50%	5-20%

Critical Alerts (Page immediately)

Error rate > 5%
TTFT P95 > 3s
Empty responses > 2%
Tool failure rate > 10%
Hallucination score < 0.7 (scored so that higher means better grounded)
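
These paging rules can be expressed as one pure function over a metrics window. The `WindowMetrics` shape and alert names are assumptions, rates are fractions rather than percentages, and the hallucination score is assumed to be scaled so that higher means better grounded.

```typescript
interface WindowMetrics {
  errorRate: number;          // fraction of requests, 0..1
  ttftP95Ms: number;          // milliseconds
  emptyResponseRate: number;  // fraction of responses, 0..1
  toolFailureRate: number;    // fraction of tool calls, 0..1
  hallucinationScore: number; // assumed: higher = better grounded
}

// Returns the names of every critical alert that should page for this window.
function criticalAlerts(m: WindowMetrics): string[] {
  const alerts: string[] = [];
  if (m.errorRate > 0.05) alerts.push("error-rate");
  if (m.ttftP95Ms > 3000) alerts.push("ttft-p95");
  if (m.emptyResponseRate > 0.02) alerts.push("empty-responses");
  if (m.toolFailureRate > 0.1) alerts.push("tool-failures");
  if (m.hallucinationScore < 0.7) alerts.push("hallucination-score");
  return alerts;
}
```

Keeping the rules in a single function makes the thresholds easy to unit-test and to review alongside this list.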

Golden Test Set Structure

Core functionality (service times, directions, staff) — ~30 cases
Tool invocation (prayer, contacts, callbacks) — ~20 cases
Empathy / pastoral care (grief, crisis) — ~15 cases
Guardrails (theology, off-topic, jailbreak) — ~15 cases
Multi-turn coherence — ~10 cases
Edge cases (typos, multilingual, ambiguous) — ~10 cases

11. Competitive Landscape (Re-evaluate quarterly)

Last evaluated: 2026-03-31

Top Agentic Chatbot Architectures

Company	Key Innovation	Resolution Rate
Intercom Fin	Custom retrieval/reranking models, 7-phase pipeline	66% avg, 86% for some
Sierra	Constellation of 15+ specialized models	70%+
Decagon	Agent Operating Procedures (natural language → deterministic logic)	70-83%
Ada	Reasoning Engine + Playbooks	80%+
Cognigy	Nexus Engine + MCP interop	Enterprise focus

What Separates World-Class from Mediocre

Multi-stage pipeline, not single LLM call
Sub-millisecond intent classification BEFORE the LLM
Parallel tool execution + speculative pre-fetching
Semantic caching (30%+ LLM call elimination)
Knowledge-first, tools-second
Streaming from first token
Continuous self-improvement via automated quality scoring

12. Recommended Tooling Stack (Pluggable)

Last evaluated: 2026-03-31

Layer	Current Recommendation	Why	Alternatives
SDK	Vercel AI SDK 6	Native streaming, multi-provider, tool loops	LangChain, custom
Tracing	Langfuse (open source)	Free, deep tracing, cost tracking, A/B testing	LangSmith, Datadog
Evaluation	DeepEval	60+ metrics, pytest integration, CI/CD gates	Promptfoo, Braintrust
Red Teaming	Promptfoo	Open source, used by OpenAI/Anthropic	Microsoft AI Red Team
Semantic Cache	Redis LangCache or pgvector-based	Sub-ms cache hits	GPTCache
Vector DB	Supabase pgvector (already in use)	Already deployed	Pinecone, Weaviate

Appendix: Research Sources

Research compiled 2026-03-31 from 8 parallel research agents covering:

Streaming/SSE architecture
Instant acknowledgment UX patterns
RAG direct-answer patterns
Model latency benchmarks (Gemini vs Claude vs GPT)
Agentic chatbot architecture (Intercom, Sierra, Decagon, Ada)
Testing and monitoring best practices
Church chatbot market landscape
Vercel AI SDK patterns

Full research outputs archived in session. Key sources include Artificial Analysis benchmarks, Vercel AI SDK docs, LangChain State of Agent Engineering survey (1,340 respondents), arXiv papers on SR-RAG/CRAG/CAG, and published architectures from Intercom Fin, Sierra, and Decagon.

13. Competitive Landscape — Church Chatbot Market (2026-03-31)

No Competitor Combines Chat + Voice + Agentic Tools

Competitor	Chat	Voice/Phone	Agentic Tools	Theological Awareness	ChMS Integration
Gloo + Faith Assistant	Yes	No	No	Trained on church sermons	No
AgentiveAIQ	Yes	No	No	No	No
Pastors.ai	Yes	No	No	No	No
ChurchBot.chat	Yes	SMS relay	Forms only	No	No
OnlineGiving.org	Yes	Voice giving	Sign-ups only	No	Own platform
My AI Front Desk	No	Yes	Basic	No	No
Zanus AI	Yes	No	Scheduling, volunteers	No	Planning Center
ChurchWiseAI	Yes	Yes	39 tools, 8 categories	17 traditions	Planned

What Church Leaders Complain About (Barna, Christianity Today)

"Feels like a search engine, not a person" — no empathy
"Doesn't know MY church" — generic, not church-specific
"I don't trust the theology" — no source transparency
"Nobody answers our phone" — 55% of churches unreachable by phone
"Data privacy scares me" — confession-level data shared with AI
"Replaces human connection instead of enhancing it" — bridge to pastor, not replacement
"Onboarding is too complex" — pastors are not technical
"It can't take action" — answers questions but doesn't DO anything

ChurchWiseAI's Unfilled Competitive Moat

ChurchWiseAI is the ONLY product that combines: chat + voice + 39 agentic tools + 17 theological traditions + per-church doctrinal config + HEAR empathetic protocol + crisis/safety protocol. No competitor has more than 2 of these. The gap is in EXECUTION (latency, streaming, caching) — not features.

Pricing Context

Market clusters at $30-130/mo for mid-tier church chatbots. Nobody bundles voice. ChurchWiseAI's pricing ($14.95-$99.95/mo) is competitive-to-aggressive.

14. Streaming Architecture Details (Permanent)

Protocol: SSE over HTTP POST

SSE is the 2025-2026 standard for LLM streaming (OpenAI, Anthropic, Google all use it)
Browser's native EventSource only supports GET — use fetch() + ReadableStream reader
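Anthropic uses named events: message_start, content_block_start, content_block_delta, content_block_stop, message_delta, message_stop (tool calls arrive as content blocks of type tool_use, not as a separate event)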
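
Reading SSE from a POST response with fetch() and a stream reader could look like the following sketch. It handles only `event:` and `data:` fields and LF or CRLF line endings, so it is a simplification of the full SSE format; `parseSse` and `streamSse` are assumed names, and the URL and request body are whatever the endpoint expects.

```typescript
interface SseEvent {
  event: string;
  data: string;
}

// Splits buffered text into complete events; the unfinished tail comes back as `rest`.
function parseSse(buffer: string): { events: SseEvent[]; rest: string } {
  const blocks = buffer.replace(/\r\n/g, "\n").split("\n\n");
  const rest = blocks.pop() ?? ""; // incomplete until a blank line arrives
  const events: SseEvent[] = [];
  for (const block of blocks) {
    let eventName = "message"; // default event type when no event: field is sent
    const dataLines: string[] = [];
    for (const line of block.split("\n")) {
      if (line.indexOf("event:") === 0) eventName = line.slice(6).trim();
      else if (line.indexOf("data:") === 0) dataLines.push(line.slice(5).replace(/^ /, ""));
    }
    if (dataLines.length > 0) events.push({ event: eventName, data: dataLines.join("\n") });
  }
  return { events, rest };
}

// POSTs the request and feeds each complete event to `onEvent` as chunks arrive.
async function streamSse(
  url: string,
  body: unknown,
  onEvent: (e: SseEvent) => void,
): Promise<void> {
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json", Accept: "text/event-stream" },
    body: JSON.stringify(body),
  });
  if (!res.ok || !res.body) throw new Error(`stream failed: ${res.status}`);
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const parsed = parseSse(buffer);
    buffer = parsed.rest; // keep the partial event for the next chunk
    parsed.events.forEach(onEvent);
  }
}
```

Splitting the parser from the network loop keeps the parsing logic testable without a server, and the carried-over `rest` handles events that straddle two network chunks.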

Vercel AI SDK 6 Standard Architecture

Server: streamText() → toUIMessageStreamResponse() (SSE)
Client: useChat() hook with status-driven UI

Status states: submitted → streaming → ready
Tool states: input-streaming → input-available → output-available

Per-Step Model Switching (prepareStep)

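// Passed as an option to streamText(). fast() and smart() are placeholder model
// factories that this document does not define.
prepareStep: async ({ stepNumber }) => {
  if (stepNumber === 0) return { model: fast('flash-lite') };  // step 0: classify / dispatch
  return { model: smart('claude-haiku') };                      // later steps: generate the reply
}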

AG-UI Protocol (Emerging Standard)

Backed by CopilotKit, LangChain, Microsoft, Oracle. Event flow:

RUN_STARTED → STEP_STARTED → TEXT_MESSAGE_START → TEXT_MESSAGE_CONTENT (tokens)
→ TOOL_CALL_START → TOOL_CALL_ARGS → TOOL_CALL_RESULT → STEP_FINISHED → RUN_FINISHED
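
On the client, an event flow like this is typically folded into UI state one event at a time. The sketch below uses deliberately simplified event shapes: only the `type` names come from the flow above, while the payload fields (`delta`, `toolCallName`, `content`) and the `RunView` state are assumptions for illustration, not the full AG-UI schema.

```typescript
// Simplified event shapes: only the fields this sketch reads (assumed).
type AgUiEvent =
  | { type: "RUN_STARTED" | "STEP_STARTED" | "TEXT_MESSAGE_START" | "STEP_FINISHED" | "RUN_FINISHED" }
  | { type: "TEXT_MESSAGE_CONTENT"; delta: string }
  | { type: "TOOL_CALL_START"; toolCallName: string }
  | { type: "TOOL_CALL_ARGS"; delta: string }
  | { type: "TOOL_CALL_RESULT"; content: string };

interface RunView {
  phase: "idle" | "running" | "done";
  text: string;              // assistant text assembled from token deltas
  activeTool: string | null; // tool currently shown in the progress indicator
}

const initialView: RunView = { phase: "idle", text: "", activeTool: null };

// Pure reducer: previous view + one event = next view.
function reduceEvent(view: RunView, ev: AgUiEvent): RunView {
  switch (ev.type) {
    case "RUN_STARTED":
      return { phase: "running", text: "", activeTool: null };
    case "TEXT_MESSAGE_CONTENT":
      return { ...view, text: view.text + ev.delta };
    case "TOOL_CALL_START":
      return { ...view, activeTool: ev.toolCallName };
    case "TOOL_CALL_RESULT":
      return { ...view, activeTool: null };
    case "RUN_FINISHED":
      return { ...view, phase: "done" };
    default:
      return view; // step and argument events do not change this view
  }
}
```

A reducer of this kind lets the same UI render streamed tokens and tool progress without knowing which backend produced the events.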
