
Manual Testing Retrospective: 40+ Issues Found in One Walk-Through

Date: 2026-03-30
Tester: CEO (John), manually walking the Starter Chat signup-to-dashboard journey
Persona used: "Pastor Dave" -- non-technical Protestant pastor, 47, suburban Ohio, 150-member church
Duration: Single session
Issues found: 40+
Issues previously caught by automated testing: 0 of these 40+


Executive Summary

On March 30, 2026, the CEO manually tested the Starter Chat product by walking through the entire customer journey as a non-technical pastor. He found 40+ issues across 5 categories that weeks of automated testing (24 Playwright agents, 25 personas, 10 journey YAMLs, 62-touchpoint acceptance specs, code resilience audits) had completely missed.

This is not a failure of the testing tools. It is a failure of what the testing tools were asked to check. Every automated test asked "does the code work?" The CEO asked "would Pastor Dave succeed?" These are fundamentally different questions, and the gap between them is where 40+ bugs lived undetected.


Part 1: Root Cause Analysis

1.1 The Five Root Causes

Every issue found falls into one of five root causes. Understanding these is more important than fixing any individual bug.

Root Cause A: Marketing Copy Drift (10 issues)

What happened: Multiple agents edited marketing pages, pricing cards, FAQ sections, and email templates independently over weeks. Each change was locally correct at the time it was made. But no agent checked whether their change was consistent with all OTHER pages that reference the same data.

Why automated tests missed it: Automated tests verify what a single page shows. They do not cross-reference claims across pages. When an agent changed the tool count from 33 to 39, they updated the homepage but not the pricing page FAQ. When the agent architecture changed from 2 visible to 4 visible agents for Pro/Suite, marketing cards were not updated. Each page passed its own tests.

Specific issues caused:

  • Agent count wrong on pricing cards (2 shown, should be 4 for Pro/Suite)
  • Tool count stale in descriptions ("33 tools" should be "39 tools")
  • Church size language pigeonholing plans ("50-200 member churches")
  • Wrong demo phone numbers in FAQ (sales line, not demo lines)
  • Crisis FAQ implying non-existent system integration ("triggers" vs "shares")
  • Starter Kit email promising FAQ management (not available in Starter)
  • Pro Website upsell on Starter pricing card (conversion leak for entry-level buyers)
  • "Book a Strategy Call" prominent for $14.95 plan (founder time wasted on low-value leads)
  • Founder pricing badge repeated confusingly
  • "Voice + Chatbot" badge on This Week panel for chat-only plan

The pattern: Every one of these is a CONSISTENCY problem, not a CORRECTNESS problem. The data was correct somewhere in the system. It just was not propagated to every place that displays it.

Root Cause B: Payment Flow Architecture (5 issues)

What happened: The onboarding flow creates database records (premium_churches, churches, identities, organization_settings) BEFORE Stripe confirms payment. This is an architectural decision, not a bug -- but it creates downstream failures when checkout is abandoned or when the welcome email fires before payment.

Why automated tests missed it: Tests check the happy path: submit form, complete payment, verify dashboard loads. No test checked what happens when someone submits the form and then abandons Stripe checkout. No test checked the database state BETWEEN form submission and payment completion. No test verified the ORDER of operations (DB write vs payment vs email).

Specific issues caused:

  • DB records created before Stripe payment (3 broken flows when checkout abandoned)
  • Welcome email sent before payment (creates expectation of access before payment succeeds)
  • No trial notice in welcome email (customer does not know they have 14 days free)
  • Stripe showing CAD instead of USD (Adaptive Pricing not overridden)
  • No founder notification on new sale/trial (founder has no visibility into pipeline)

The pattern: These are SEQUENCE and LIFECYCLE bugs. They exist in the gaps between systems (form -> DB -> Stripe -> email -> dashboard), not within any single system.

Root Cause C: Tier-Gating Leakage (8 issues)

What happened: Dashboard components were built with features that span multiple tiers. Tier-gating was applied to major features (FAQ management, document upload, analytics) but not to every individual UI element within visible components. Voice-related UI elements leaked into chat-only plans. Getting Started steps referenced features the tier does not have.

Why automated tests missed it: The acceptance spec (starter-chat.md) defines 62 touchpoints with "Should See" and "Should NOT See" lists. But the spec operates at the SECTION level, not the ELEMENT level. It says "Agents tab: Care + Coordinator visible, Discipleship + Stewardship hidden." It does NOT say "Each agent card must NOT show a Voice badge." The granularity of the spec was too coarse.

Specific issues caused:

  • Voice badge showing on agent cards for chat-only plans
  • Voice greeting counted in training progress for chat-only plans
  • Document upload visible with lock icon but no upgrade message
  • SMS phone field visible for chat-only plans
  • Getting Started steps untrackable ("Customize agents" marked done on visit)
  • Suggested questions not loading saved values
  • Sharing links scattered across 3 tabs
  • Compliance checklist scaring customers ("Insurance provider notified")

The pattern: These are GRANULARITY bugs. The spec was correct at the macro level but did not drill down to every individual UI element. A pastor sees every element -- not just the sections the spec documents.

Root Cause D: Jargon and Pastor-Hostile UX (12 issues)

What happened: The dashboard was built by engineers for engineers. Technical terminology that is obvious to a developer is meaningless or frightening to a non-technical pastor. No automated test can detect "this word will confuse Pastor Dave" because confusion is a human judgment, not a computable property.

Why automated tests missed it: Automated tests check for the PRESENCE of content, not its COMPREHENSIBILITY. A test can verify that the text "2 personas" renders correctly. It cannot determine that a pastor has no idea what "personas" means in this context. Similarly, "believers_baptism_only" appearing as a raw variable name passes every rendering test -- the text is there, it is correct, and it is unintelligible.

Specific issues caused:

  • "Ministry tools" with no tooltip explaining what a "tool" is
  • "Care Agent" / "Coordinator Agent" with no explanation of what they do
  • "2 personas" meaningless to pastors (now showing specialization areas instead)
  • Doctrinal positions showing raw variable names (believers_baptism_only)
  • Handoff rules implying pastor needs to configure something complex
  • "Hero Photo URL" -- pastors do not know what this means
  • Human escalation settings buried in agent personality panels
  • Custom practice examples all showing same baptism text
  • Sermon section not denomination-aware (sermon vs homily for Catholics)
  • Safety Guide framed as legal requirement, not helpful resource
  • FAQ columns with uneven gaps
  • Bold/emphasis missing on key marketing copy

The pattern: These are EMPATHY bugs. They require understanding the mental model of a non-technical pastor, not the mental model of a developer. No amount of code testing detects them because the code is working perfectly.

Root Cause E: Email Content Mismatch (5 issues)

What happened: Email templates were written generically and not validated against each tier's actual feature set. The welcome email promises content (like FAQ management) that a Starter customer cannot access. The AI Starter Kit email references features that do not exist at the Starter tier. No test verified that email copy matches the tier's actual capabilities.

Why automated tests missed it: Email tests check delivery (was it sent?), formatting (does it render?), and links (do they work?). No test reads the email body and cross-references every claim against the tier's feature set. "Your AI Starter Kit includes FAQ management tips" passes every mechanical test but is a lie to a Starter customer.

Specific issues caused:

  • Duplicate "3 things" section in welcome email
  • AI Starter Kit email referencing FAQ management (Starter cannot do this)
  • No PDF download link in Starter Kit email
  • Magic link (/auth/magic) returning 505 error
  • No fallback plain-text URLs in emails

The pattern: These are PROMISE-vs-REALITY bugs. The email makes a promise. The product does not deliver. No test checks the relationship between the promise and the delivery.


1.2 Why the Existing Testing Infrastructure Missed All 40+

The ChurchWiseAI testing infrastructure as of March 30 includes:

| Layer | What It Tests | What It Misses |
|-------|---------------|----------------|
| Playwright specs (159 files) | Page loads, element presence, link integrity, API responses | Content accuracy, cross-page consistency, comprehensibility |
| 5-Question Framework | Page-level goal evaluation with persona empathy | Only as good as the questions asked; never run by a human against production |
| 25 Personas (YAML) | Diverse user types with concerns and goals | Personas are defined but tests run as AGENTS, not as confused humans |
| 10 Journey YAMLs | Step-by-step journey definitions | Steps define URLs and expected elements, not experiential quality |
| 62-touchpoint acceptance spec | What each tier should/should not see | Operates at section granularity, not element granularity |
| Code resilience audit | Anti-patterns, security, error handling | Code-level only, no UX or content analysis |
| QA Checklist (10 sections) | Build, security, SEO, DB, content accuracy | Content accuracy section checks canonical numbers but not copy drift |

The fundamental gap: Every layer tests the system from the INSIDE OUT. "Does this component render the right props?" "Does this API return the right data?" "Does this page have the right elements?" None of them test from the OUTSIDE IN: "Would a real pastor, sitting at a real computer, with no knowledge of our codebase, actually succeed?"

The 5-Question Framework was designed to bridge this gap (Q3: "If I were this persona, would I know what to do next?"). But it has only ever been run by AI agents reading page content -- never by a human walking through production with fresh eyes. An AI agent reading a page does not experience confusion the way a pastor does. The agent knows what "personas" means. The agent can parse "believers_baptism_only" as a variable name. The agent does not feel scared by a compliance checklist.


Part 2: The Persona-Based Testing Gap

2.1 What the CEO Did Differently

The CEO walked through the product as "Pastor Dave" -- not as an engineer, not as an AI agent, not as someone who knows the codebase. He:

  1. Started from Google (or the homepage), not from a specific URL in a test file
  2. Read every word on every page as someone who has never heard of ChurchWiseAI
  3. Did not skip anything because "that is tested elsewhere"
  4. Asked "do I understand this?" at every element, not "does this render?"
  5. Checked emails as a customer, reading the promises and comparing them to what was available
  6. Noticed inconsistencies between pages because he saw them in sequence, not in isolation
  7. Felt confused by jargon and noted it, rather than parsing it as a test assertion
  8. Tried to USE the product, not just verify it loads

2.2 Why AI Agents Cannot Fully Replace This

AI agents are excellent at:

  • Checking element presence/absence (Q1, Q2)
  • Cross-referencing specs (Q2)
  • Identifying obvious UX issues (Q3, Q4)
  • Tracking goal progress (Q5)
  • Running at scale across many pages and journeys

AI agents struggle with:

  • Emotional confusion -- "this compliance checklist scares me" is a human reaction
  • Cumulative frustration -- seeing the same jargon on page after page compounds
  • Expectation gaps -- an email promises X, the dashboard delivers Y, the dissonance is felt, not computed
  • Visual hierarchy as experienced -- an agent reads all text equally; a human sees what is bold, large, or above the fold
  • Fresh eyes -- agents have read the codebase; they cannot truly pretend they have not
  • Sequence effects -- seeing the pricing page AFTER the homepage changes what you notice; agents test pages in isolation

2.3 The Real Gap: Layer B Has Never Been Run By a Human

The 5-Question Framework defines three layers:

  • Layer A (Mechanical/Playwright) -- runs in CI, automated
  • Layer B (AI Goal-Based) -- designed to be run weekly by AI agents
  • Layer C (Outcome Verification) -- database/email/API checks after journey

Layer B was conceived correctly but has a critical blind spot: it assumes AI agents can simulate human confusion. They cannot. Layer B needs a Layer B-Prime: periodic human walk-throughs using the same 5 Questions but with actual human perception.


Part 3: New Testing Methodology -- Filling the Gaps

3.1 Marketing Consistency Checks (Root Cause A)

Problem: Claims about agent counts, tool counts, pricing, features, and product behavior appear on 20+ pages. When one changes, the others drift.

New test: Cross-Page Claim Consistency Scanner

Create a canonical claims registry and scan all marketing pages against it.

# knowledge/tests/claims-registry.yaml
claims:
  tool_count:
    canonical_value: "39"
    source: knowledge/data/features.yaml
    pages_that_reference:
      - /pricing (PricingGrid.tsx)
      - / (homepage stats bar)
      - /chatbot (feature section)
      - /voice (feature section)
      - /ai-for/[denomination] (stats)
    patterns_to_search:
      - '\d+ tools'
      - '\d+ ministry tools'
      - '\d+ AI tools'

  agent_count_starter:
    canonical_value: "2"
    source: knowledge/data/features.yaml
    pages_that_reference:
      - /pricing (Starter card)
      - /onboard (plan description)
    patterns_to_search:
      - '\d+ agents'
      - '\d+ AI agents'

  agent_count_pro:
    canonical_value: "4"
    source: knowledge/data/features.yaml
    pages_that_reference:
      - /pricing (Pro card)
      - /chatbot (Pro features)

  demo_phone_number:
    canonical_value: "+14145551234" # actual demo line
    source: CLAUDE.md (voice agent section)
    pages_that_reference:
      - /pricing (FAQ section)
      - /demo
      - /voice
    patterns_to_search:
      - '\+1\d{10}'
      - '\(\d{3}\) \d{3}-\d{4}'

Implementation: Add a Playwright spec or script that:

  1. Loads the claims registry
  2. For each claim, visits every listed page
  3. Searches for the pattern
  4. Compares found values to canonical value
  5. Reports any mismatch as SPEC VIOLATION

Frequency: Every deploy (add to CI).
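
The comparison step of the scanner can be sketched as a pure function, kept separate from whatever fetches the page text. This is a minimal sketch under stated assumptions: the NumericClaim and ClaimMismatch shapes and the findClaimMismatches name are hypothetical, and it only covers numeric claims such as tool and agent counts (phone numbers would need a different comparison).

```typescript
// Hypothetical shapes: field names mirror the registry sketch, not a confirmed schema.
interface NumericClaim {
  name: string;
  canonicalValue: string; // e.g. "39"
  patterns: string[];     // regex sources, e.g. "\\d+ tools"
}

interface ClaimMismatch {
  claim: string;
  page: string;
  found: string;    // full matched text, e.g. "33 tools"
  expected: string;
}

// Return every pattern match on one page whose number differs from the canonical value.
function findClaimMismatches(claim: NumericClaim, page: string, pageText: string): ClaimMismatch[] {
  const mismatches: ClaimMismatch[] = [];
  for (const source of claim.patterns) {
    const regex = new RegExp(source, "gi");
    let match: RegExpExecArray | null;
    while ((match = regex.exec(pageText)) !== null) {
      if (match[0].length === 0) { regex.lastIndex++; continue; } // guard against empty matches
      const digits = /\d+/.exec(match[0]);
      if (digits !== null && digits[0] !== claim.canonicalValue) {
        mismatches.push({ claim: claim.name, page: page, found: match[0], expected: claim.canonicalValue });
      }
    }
  }
  return mismatches;
}

// Usage: a stale FAQ sentence checked against the canonical tool count.
const toolCount: NumericClaim = {
  name: "tool_count",
  canonicalValue: "39",
  patterns: ["\\d+ tools", "\\d+ ministry tools", "\\d+ AI tools"],
};
const pricingFaq = "Every plan includes 33 tools for your ministry.";
console.log(findClaimMismatches(toolCount, "/pricing", pricingFaq));
```

Keeping the comparison free of browser calls means the same function could run against rendered page text in a Playwright spec or against source files in a pre-commit hook.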

3.2 Tier-Gating Element-Level Verification (Root Cause C)

Problem: The acceptance spec checks at section granularity. Individual UI elements within visible sections leak features from other tiers.

New test: Element-Level Tier Audit

For each tier, enumerate EVERY UI element that varies by tier -- not just tabs and sections, but individual badges, labels, form fields, progress indicators, and CTAs.

Add to each acceptance spec a new section: "Element-Level Gating" with entries like:

### Element-Level Gating (Starter Chat)

| Component | Element | Expected | Actual Check |
|-----------|---------|----------|--------------|
| AgentCard | Voice badge | HIDDEN | data-testid="voice-badge" should not exist |
| TrainingProgress | Voice greeting step | HIDDEN | text "voice greeting" should not appear |
| TrainingProgress | Total steps denominator | Exclude voice steps | count should match chat-only steps |
| AgentCard | Voice greeting input | HIDDEN | input[name="voice_greeting"] should not exist |
| OverviewTab | "This Week" panel badge | "Chatbot" only | text should NOT contain "Voice" |
| GettingStarted | Step: Customize agents | Completion = non-trivial | should NOT mark done on tab visit |
| SettingsTab | SMS phone field | HIDDEN | input[name="sms_phone"] should not exist |
| DocumentUpload | Lock icon + upgrade CTA | Upgrade message present | text should contain "Upgrade to Pro" |
| SharingLinks | All share links | Single location | all share CTAs in one section |
| ComplianceChecklist | Legal items | Church-appropriate language | no "Insurance provider notified" for Starter |

Implementation: Generate Playwright assertions from this table. Each row becomes one expect() call. Add data-testid attributes to components where they do not exist.

Frequency: Every deploy (add to CI).
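
One way to turn rows of the gating table into checks is to express each row as data and evaluate it against a snapshot of the rendered page. The sketch below is illustrative only: GatingRule, PageSnapshot, and auditSnapshot are hypothetical names, and the snapshot is assumed to be filled by a browser test (for example from Playwright's locator(selector).count() and locator("body").innerText()).

```typescript
// Hypothetical rule shape derived from the gating table; not an existing project type.
type GatingRule =
  | { element: string; kind: "selector-absent"; selector: string }
  | { element: string; kind: "text-absent"; text: string }
  | { element: string; kind: "text-present"; text: string };

// What a browser test would capture for one tier's dashboard page.
interface PageSnapshot {
  selectorCounts: { [selector: string]: number };
  bodyText: string;
}

// Evaluate every rule and return one human-readable failure per violated rule.
function auditSnapshot(rules: GatingRule[], snapshot: PageSnapshot): string[] {
  const failures: string[] = [];
  const text = snapshot.bodyText.toLowerCase();
  for (const rule of rules) {
    if (rule.kind === "selector-absent") {
      const count = snapshot.selectorCounts[rule.selector] || 0;
      if (count > 0) failures.push(rule.element + ": " + rule.selector + " should not exist");
    } else if (rule.kind === "text-absent") {
      if (text.indexOf(rule.text.toLowerCase()) !== -1) {
        failures.push(rule.element + ': text "' + rule.text + '" should not appear');
      }
    } else {
      if (text.indexOf(rule.text.toLowerCase()) === -1) {
        failures.push(rule.element + ': text "' + rule.text + '" is missing');
      }
    }
  }
  return failures;
}

// A few Starter Chat rows from the table, expressed as rules.
const starterChatRules: GatingRule[] = [
  { element: "AgentCard voice badge", kind: "selector-absent", selector: '[data-testid="voice-badge"]' },
  { element: "TrainingProgress voice greeting step", kind: "text-absent", text: "voice greeting" },
  { element: "SettingsTab SMS phone field", kind: "selector-absent", selector: 'input[name="sms_phone"]' },
  { element: "DocumentUpload upgrade CTA", kind: "text-present", text: "Upgrade to Pro" },
];
```

Rows that describe behavior rather than presence, such as "should NOT mark done on tab visit", do not fit this shape and would still need a hand-written interaction test.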

3.3 Email Content vs Feature Validation (Root Cause E)

Problem: Emails promise features the tier does not have.

New test: Email-Feature Cross-Reference

For each email template, extract every feature claim and verify it against the tier's feature set.

# knowledge/tests/email-feature-validation.yaml
emails:
  welcome_email:
    template: src/lib/emails/welcome-email.ts
    tiers_that_receive: [starter_chat, starter_voice, starter_both, pro_chat, pro_both, suite_chat, suite_both]
    claims_to_verify:
      - claim: "14-day free trial"
        condition: chat plans only
        tiers_true: [starter_chat, pro_chat, suite_chat]
        tiers_false: [starter_voice, starter_both, pro_both, suite_both]
      - claim: "FAQ management"
        tiers_true: [pro_chat, pro_both, suite_chat, suite_both]
        tiers_false: [starter_chat, starter_voice, starter_both]
      - claim: "Magic link to dashboard"
        all_tiers: true
        verify: /auth/magic route returns 200

  starter_kit_email:
    template: src/lib/emails/starter-kit-email.ts
    tiers_that_receive: [starter_chat, starter_both]
    claims_to_verify:
      - claim: "FAQ management tips"
        tiers_true: [] # Starter does NOT have FAQ management
        tiers_false: [starter_chat, starter_both]
        finding: "SPEC VIOLATION: email promises feature tier does not have"
      - claim: "PDF download link"
        all_tiers: true
        verify: link href returns 200

Implementation: Parse email templates at build time. For each tier, verify every claim is accurate. Flag any claim that references a feature the tier does not have. Also verify every link in every email returns 200.

Frequency: Every deploy that touches email templates, plus weekly sweep.
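
The cross-reference itself is a small check: for a given tier, flag any feature the email body mentions that the tier does not include. The following is a rough sketch, assuming the rendered email body is available as a string; FeatureClaim and findFalsePromises are hypothetical names, and the tier lists are copied from the YAML above.

```typescript
// Hypothetical claim shape: a feature, a regex that detects it in copy, and the tiers that have it.
interface FeatureClaim {
  feature: string;
  pattern: string;
  tiersWithFeature: string[];
}

// Return one violation per feature that the email mentions but the tier lacks.
function findFalsePromises(emailBody: string, tier: string, claims: FeatureClaim[]): string[] {
  const violations: string[] = [];
  for (const claim of claims) {
    const mentioned = new RegExp(claim.pattern, "i").test(emailBody);
    const available = claim.tiersWithFeature.indexOf(tier) !== -1;
    if (mentioned && !available) {
      violations.push('SPEC VIOLATION: email promises "' + claim.feature + '" but ' + tier + " does not have it");
    }
  }
  return violations;
}

const emailClaims: FeatureClaim[] = [
  {
    feature: "FAQ management",
    pattern: "FAQ management",
    tiersWithFeature: ["pro_chat", "pro_both", "suite_chat", "suite_both"],
  },
];
const starterKitBody = "Your AI Starter Kit includes FAQ management tips";
console.log(findFalsePromises(starterKitBody, "starter_chat", emailClaims));
```

A pattern match only proves the words appear; a claim phrased differently ("manage your frequently asked questions") would slip through, so the pattern list needs the same upkeep as the copy.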

3.4 Jargon Detection (Root Cause D)

Problem: Technical terminology in pastor-facing UI causes confusion. No automated test detects "this word will confuse a non-technical user."

New test: Jargon Scanner

Maintain a dictionary of terms that are meaningful to developers but not to pastors. Scan all customer-facing pages for these terms.

# knowledge/tests/jargon-dictionary.yaml
terms:
  # Terms that should NEVER appear in customer-facing UI
  forbidden:
    - pattern: 'persona[s]?'
      replacement: "'specialization area' or 'ministry focus'"
    - pattern: 'RAG'
      replacement: 'knowledge base'
    - pattern: 'LLM'
      replacement: 'AI'
    - pattern: 'endpoint'
      replacement: "'connection' or 'service'"
    - pattern: 'webhook'
      replacement: never show to customer
    - pattern: 'slug'
      replacement: never show to customer
    - pattern: 'token'
      context: only in auth flows
      replacement: 'access link'

  # Terms that need a tooltip or explanation
  needs_explanation:
    - pattern: 'ministry tools?'
      explanation: "AI-powered actions like prayer request capture, visitor logging, appointment scheduling"
    - pattern: 'Care Agent'
      explanation: "Your AI assistant that handles pastoral care conversations -- prayer requests, counseling referrals, crisis support"
    - pattern: 'Coordinator Agent'
      explanation: "Your AI assistant that handles logistics -- service times, directions, event info, staff routing"
    - pattern: 'handoff rules?'
      explanation: "When and how the AI transfers a conversation to a real person"
    - pattern: 'theological lens'
      explanation: "Your church's tradition (Baptist, Catholic, Lutheran, etc.) that shapes how the AI responds"

  # Variable names that should NEVER render as-is in UI
  raw_variable_patterns:
    - 'believers_baptism_only'
    - 'infant_baptism'
    - 'both_baptism'
    - '_enabled$'
    - '_config$'
    - 'snake_case_anything'
    - '^[a-z]+_[a-z]+' # any snake_case string

Implementation: Three layers:

  1. Build-time scan: Grep all .tsx files in customer-facing routes for forbidden terms. Fail the build if found.
  2. Runtime tooltips: For "needs explanation" terms, verify that a tooltip or info icon exists adjacent to the term. Playwright can check for title attributes, aria-describedby, or adjacent help icons.
  3. Variable name leak detection: Scan rendered page content for snake_case strings. Any snake_case text visible to the user is a rendering bug.

Frequency: Every deploy (build-time scan in CI). Weekly for tooltip verification.
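
The forbidden-term scan and the variable-name leak detection can share one pass over rendered text. This is a minimal sketch, not the project's scanner: scanForJargon and JargonFinding are hypothetical names, the snake_case regex is an assumption chosen for scanning free text, and all-uppercase patterns such as RAG are matched case-sensitively to avoid flagging ordinary words.

```typescript
interface JargonFinding {
  kind: "forbidden" | "raw-variable";
  match: string;
}

// Scan visible page text for forbidden terms and for snake_case strings that leaked into the UI.
function scanForJargon(renderedText: string, forbiddenPatterns: string[]): JargonFinding[] {
  const findings: JargonFinding[] = [];
  for (const source of forbiddenPatterns) {
    // Acronyms (all-uppercase patterns) stay case-sensitive; everything else ignores case.
    const flags = source === source.toUpperCase() ? "g" : "gi";
    const regex = new RegExp("\\b(?:" + source + ")\\b", flags);
    let m: RegExpExecArray | null;
    while ((m = regex.exec(renderedText)) !== null) {
      if (m[0].length === 0) { regex.lastIndex++; continue; } // guard against empty matches
      findings.push({ kind: "forbidden", match: m[0] });
    }
  }
  // Assumed leak pattern: lowercase words joined by underscores, e.g. believers_baptism_only.
  const snakeCase = /\b[a-z]+(?:_[a-z0-9]+)+\b/g;
  let leak: RegExpExecArray | null;
  while ((leak = snakeCase.exec(renderedText)) !== null) {
    findings.push({ kind: "raw-variable", match: leak[0] });
  }
  return findings;
}

const dashboardText = "Your 2 personas follow believers_baptism_only.";
console.log(scanForJargon(dashboardText, ["persona[s]?", "RAG", "LLM"]));
```

Scanning rendered text rather than .tsx source avoids false positives on prop names and identifiers that never reach the screen, at the cost of needing a running page.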

3.5 Customer Journey Simulation -- Human Protocol (Root Cause: Layer B Gap)

Problem: AI agents test journeys by reading page content, not by experiencing them. The 5-Question Framework needs a human complement.

New process: Monthly CEO Walk-Through Protocol

Once per month, the CEO (or designated tester) walks through one complete customer journey using the following protocol:

MONTHLY HUMAN JOURNEY TEST
===========================
Date: ___________
Journey: ___________
Persona: ___________
Browser: Incognito, no extensions
Device: ___________

RULES:
1. Do NOT look at the codebase before or during the test
2. Do NOT use direct URLs -- start from Google or the homepage
3. Read EVERY word on EVERY page as if you have never seen it
4. Note EVERY moment of confusion, even if brief
5. Check EVERY email within 60 seconds of receiving it
6. Compare email promises to actual dashboard features
7. Try to USE the product, not just look at it
8. Time yourself -- if any step takes more than 2 minutes, note it

FOR EACH PAGE, ANSWER:
- Do I understand every word on this page? (Y/N, list confusing terms)
- Do I know what to do next? (Y/N, what is unclear)
- Is this consistent with what I saw on the previous page? (Y/N, what changed)
- Would I trust this company based on this page? (Y/N, what feels off)
- Does anything scare me or make me want to leave? (Y/N, what)

AFTER COMPLETING THE JOURNEY:
- How many pages did I visit total?
- How many times was I confused?
- How many broken links did I find?
- How many email/product mismatches did I find?
- Would I recommend this to another pastor? (Y/N, why)
- What was the single biggest friction point?

Frequency: Monthly, rotating through journeys. Priority order:

  1. Starter Chat (highest volume, lowest friction expected)
  2. Pro Chat (most features to verify)
  3. Voice Starter (telephony adds complexity)
  4. PewSearch Premium (cross-product)
  5. Suite Both (full feature surface)

3.6 Payment Flow Sequence Testing (Root Cause B)

Problem: Tests check the happy path end-state but not the intermediate states or failure paths in the payment flow.

New test: Payment Flow State Machine Test

States: Form Submitted | Checkout Started | Checkout Abandoned | Payment Succeeded | Payment Failed | Webhook Received

Test every state transition:

1. Form Submitted -> Checkout Abandoned
   VERIFY: No premium_churches record exists
   VERIFY: No welcome email sent
   VERIFY: No MailerLite subscriber added
   VERIFY: No organization_settings record

2. Form Submitted -> Payment Succeeded -> Webhook Received
   VERIFY: premium_churches created AFTER webhook (not before)
   VERIFY: Welcome email sent AFTER webhook (not before)
   VERIFY: Email mentions 14-day trial (for chat plans)
   VERIFY: Currency is USD (not localized)
   VERIFY: Founder notification sent (email or Slack)

3. Payment Succeeded -> Webhook Delayed (30s+)
   VERIFY: Return page shows spinner, not error
   VERIFY: Return page polls for record
   VERIFY: Dashboard accessible after webhook arrives

4. Duplicate Webhook Received
   VERIFY: No duplicate records created
   VERIFY: No duplicate emails sent
   VERIFY: webhook_events table prevents reprocessing

Implementation: Run the Stripe CLI in test mode and fire stripe trigger checkout.session.completed, then verify database state at each step.

Frequency: After any change to onboarding, checkout, or webhook handlers.
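
The transitions above can be modeled in memory to make the intended ordering explicit before wiring up real Stripe events. The sketch below is an illustrative model, not the actual webhook handler: OnboardingEvent, OnboardingState, and applyEvent are hypothetical names, and arrays stand in for the database tables, email sends, and the webhook_events dedupe table.

```typescript
type OnboardingEvent =
  | { type: "form_submitted"; sessionId: string }
  | { type: "checkout_abandoned"; sessionId: string }
  | { type: "webhook_received"; eventId: string; sessionId: string };

// Arrays stand in for tables and outbound messages in this model.
interface OnboardingState {
  pendingSessions: string[];
  churchRecords: string[];
  welcomeEmails: string[];
  founderNotifications: string[];
  processedEventIds: string[];
}

function newOnboardingState(): OnboardingState {
  return { pendingSessions: [], churchRecords: [], welcomeEmails: [], founderNotifications: [], processedEventIds: [] };
}

function applyEvent(state: OnboardingState, event: OnboardingEvent): void {
  const sessionId = event.sessionId;
  const withoutSession = (ids: string[]) => ids.filter((id) => id !== sessionId);
  if (event.type === "form_submitted") {
    // Payment-first: hold the form as a pending session only, no church record yet.
    state.pendingSessions.push(sessionId);
  } else if (event.type === "checkout_abandoned") {
    // Nothing durable was written, so abandoning leaves no orphan records.
    state.pendingSessions = withoutSession(state.pendingSessions);
  } else {
    // Idempotency: a webhook event id is processed at most once.
    if (state.processedEventIds.indexOf(event.eventId) !== -1) return;
    state.processedEventIds.push(event.eventId);
    state.pendingSessions = withoutSession(state.pendingSessions);
    state.churchRecords.push(sessionId);
    state.welcomeEmails.push(sessionId);
    state.founderNotifications.push(sessionId);
  }
}
```

Each numbered transition then becomes a short sequence of applyEvent calls followed by assertions on the state, which is the same shape the integration tests would take against the real database.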


Part 4: Self-Annealing Recommendations

These recommendations make the system automatically detect and prevent the types of issues found today, without requiring manual testing.

4.1 Cross-Page Consistency Guard (prevents Root Cause A)

Mechanism: A pre-commit hook or CI step that:

  1. Reads knowledge/data/features.yaml and knowledge/data/pricing.yaml
  2. Scans every .tsx file in marketing routes (/pricing, /, /chatbot, /voice, /ai-for/)
  3. Flags any hardcoded number that does not match the canonical source
  4. Fails the build if a mismatch is found

Scope: Tool counts, agent counts, pricing, tradition counts, church counts, phone numbers.

4.2 Tier-Gating Regression Guard (prevents Root Cause C)

Mechanism: A Playwright test suite that:

  1. Logs into the admin dashboard as each tier (using test accounts)
  2. For each tier, verifies every element in the Element-Level Gating table
  3. Fails if any voice-related element appears for chat-only plans
  4. Fails if any Pro+ element appears for Starter plans

Scope: Every dashboard component with tier-conditional rendering.

4.3 Email Template Lint (prevents Root Cause E)

Mechanism: A build-time check that:

  1. Parses each email template
  2. Extracts feature references (FAQ, document upload, voice, analytics, etc.)
  3. For each tier that receives the email, verifies the feature exists at that tier
  4. Fails the build if an email promises a feature the tier does not have

Scope: All email templates in src/lib/emails/.

4.4 Jargon Lint (prevents Root Cause D)

Mechanism: A custom ESLint rule or build-time scan that:

  1. Reads the jargon dictionary
  2. Scans all customer-facing components for forbidden terms
  3. Warns on "needs explanation" terms without adjacent tooltips
  4. Fails on raw variable names rendered as text (snake_case in UI)

Scope: All components in routes that customers see (marketing pages, dashboard, chat interfaces, emails).

4.5 Payment-First Architecture Enforcement (prevents Root Cause B)

Mechanism: Integration tests that:

  1. Submit the onboard form
  2. Verify zero DB records exist before checkout completion
  3. Complete checkout
  4. Verify records exist only after webhook processing
  5. Verify email sent only after webhook processing

Scope: Every checkout flow (onboard, upgrade, PewSearch claim).

4.6 Drift Detection via Knowledge Derivation

Mechanism: Extend the existing pnpm derive system to:

  1. Read canonical values from knowledge/data/*.yaml
  2. Scan all marketing pages and dashboard components for references
  3. Generate a drift report comparing found values to canonical values
  4. Fail if any drift detected

This builds on the existing derivation system but extends it to UI content, not just documentation.


Part 5: Updated Testing Architecture

Before (as of 2026-03-29)

Layer A: Mechanical (Playwright) -- "Does it load?"
Layer B: AI Goal-Based (5-Question) -- "Would a persona succeed?"
Layer C: Outcome Verification -- "Did the backend work?"
Layer D: Code Resilience -- "Are there anti-patterns?"

After (as of 2026-03-30)

Layer A: Mechanical (Playwright) -- "Does it load?"
Layer B: AI Goal-Based (5-Question) -- "Would a persona succeed?"
Layer B': Human Walk-Through -- "Does a REAL human succeed?" [NEW]
Layer C: Outcome Verification -- "Did the backend work?"
Layer D: Code Resilience -- "Are there anti-patterns?"
Layer E: Cross-Page Consistency -- "Do all pages agree?" [NEW]
Layer F: Tier-Gating Element Audit -- "Does every element respect tiers?" [NEW]
Layer G: Email-Feature Validation -- "Do emails match features?" [NEW]
Layer H: Jargon Detection -- "Would a pastor understand this?" [NEW]
Layer I: Payment Sequence Verification -- "Is the payment flow atomic?" [NEW]

Testing Cadence

| Layer | Frequency | Who/What Runs It |
|-------|-----------|------------------|
| A: Mechanical | Every deploy | CI/CD (Playwright) |
| B: AI Goal-Based | Weekly + before launch | AI agent via /qa goals |
| B': Human Walk-Through | Monthly | CEO or designated tester |
| C: Outcome Verification | Per journey | Automated after Layer B |
| D: Code Resilience | Before launch + monthly | AI agent via /qa resilience |
| E: Cross-Page Consistency | Every deploy | CI/CD (custom scanner) |
| F: Tier-Gating Element Audit | Every deploy | CI/CD (Playwright per-tier) |
| G: Email-Feature Validation | Every email template change | Build-time check |
| H: Jargon Detection | Every deploy | Build-time scan + weekly tooltip check |
| I: Payment Sequence | After checkout/webhook changes | Integration test (Stripe CLI) |

Part 6: Checklist for Future Manual Testing

When the CEO (or any human) does a manual walk-through, use this checklist in addition to the persona protocol in Section 3.5.

Pre-Test Setup

  • Use an incognito/private browser window
  • Use a REAL email address you can check
  • Do NOT look at the codebase or admin tools beforehand
  • Have the persona card printed or visible (name, age, role, tech comfort, key concern)
  • Set a timer for each step

Marketing Pages (Root Cause A checks)

  • Count the number of "tools" mentioned -- is it consistent across pages?
  • Count the number of "agents" mentioned -- is it consistent across pages?
  • Check every phone number -- is it a real demo line or a sales/support line?
  • Read every FAQ answer -- does it match the actual product?
  • Check every badge/label on pricing cards -- are they tier-appropriate?
  • Look for upsells to products above the tier being tested
  • Look for "Book a Call" CTAs -- are they appropriate for this price point?
  • Check church size language -- does it exclude your persona's church?

Payment Flow (Root Cause B checks)

  • Note the exact price shown at every step (page, form, Stripe checkout)
  • Check the currency -- USD, not localized
  • Note whether trial is mentioned and consistent (14 days)
  • ABANDON checkout mid-flow -- check email and DB for orphan records
  • Complete checkout -- verify email arrives AFTER payment, not during form submission
  • Check for founder notification of the new signup

Dashboard (Root Cause C checks)

  • For EVERY visible element, ask: "Is this relevant to my tier?"
  • Look for voice-related content on chat-only plans
  • Check training progress -- are all counted steps achievable at this tier?
  • Check Getting Started -- can each step actually be completed?
  • Look for lock icons -- do they explain how to unlock?
  • Find every sharing/embed link -- are they all in one place?

Comprehensibility (Root Cause D checks)

  • Read every label and heading out loud -- would a pastor understand it?
  • Look for snake_case text, technical variable names, or code artifacts
  • Check every form field label -- would a pastor know what to enter?
  • Look for compliance/legal language -- is it reassuring or scary?
  • Check agent names and descriptions -- do they explain what the agent does?
  • Look for tooltips on technical terms -- are they present and helpful?

Emails (Root Cause E checks)

  • Read every email as a customer, not as an engineer
  • For each feature mentioned in the email, verify it is available at this tier
  • Click every link in every email -- do they all work?
  • Check the "from" address and brand name -- consistent?
  • Look for plain-text fallback URLs

Cross-Page Consistency (catch-all)

  • Compare the pricing page claims to the dashboard reality
  • Compare the email promises to the dashboard features
  • Compare the homepage claims to the pricing page details
  • Note any number that appears differently on different pages

Part 7: Immediate Action Items from Today's Test

Critical (fix before launch)

  1. Payment-first architecture: Move ALL DB writes to webhook handler. No records before Stripe confirms payment.
  2. Magic link fix: /auth/magic returning 505 -- investigate and fix.
  3. Email content per tier: Make welcome email and Starter Kit email tier-aware. Remove feature references that do not apply.
  4. USD currency enforcement: Force currency: 'usd' in all Stripe checkout sessions.
  5. Voice badge removal: Strip all voice-related UI from chat-only plan dashboards.

Important (fix before first customer)

  1. Agent count accuracy: Update pricing cards to show correct agent counts per tier.
  2. Tool count update: Change all "33 tools" references to "39 tools."
  3. Demo phone numbers: Replace sales line in FAQ with actual demo numbers.
  4. Jargon cleanup: Add tooltips for "ministry tools," agent names, "theological lens."
  5. Variable name rendering: Fix doctrinal position display to show human-readable labels.
  6. Founder notifications: Send email to founder on every new trial/sale.
  7. Training progress per tier: Calculate completion based on tier-available steps only.
  8. Getting Started tracking: Implement real completion tracking (not "mark done on visit").
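
Item 7 above (training progress per tier) amounts to changing the denominator. A rough sketch, assuming a hypothetical step catalog: the step names, the tier keys (`starter_chat`, `pro`, `suite`), and the assignment of the voice greeting step to Pro/Suite only are illustrative, not taken from the codebase.

```python
# Hypothetical step catalog: which tiers each training step applies to.
TRAINING_STEPS = {
    "set_church_name": {"starter_chat", "pro", "suite"},
    "add_service_times": {"starter_chat", "pro", "suite"},
    "review_suggested_questions": {"starter_chat", "pro", "suite"},
    "record_voice_greeting": {"pro", "suite"},
}


def training_progress(tier: str, completed: set[str]) -> float:
    """Return percent complete, counting only steps available at this tier."""
    available = {step for step, tiers in TRAINING_STEPS.items() if tier in tiers}
    if not available:
        return 100.0  # nothing to complete at this tier
    return 100.0 * len(completed & available) / len(available)
```

With this catalog, a chat-only customer who finishes the three chat steps reaches 100%, while the same three steps put a Pro customer at 75% because the voice greeting step still counts for that tier.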

Minor (fix in next sprint)

  1. Church size language: Remove or broaden "50-200 member churches" copy.
  2. Strategy Call placement: Remove or deprioritize for Starter tier.
  3. Compliance checklist tone: Reframe as helpful resource, not legal requirement.
  4. Sharing link consolidation: Move all share/embed links to one location.
  5. FAQ column alignment: Fix uneven gaps in FAQ layout.
  6. Bold/emphasis in marketing: Add emphasis to key selling points.
  7. Founder pricing badge: Show once, not repeated on each card.

Part 8: Systemic Lessons

Lesson 1: "Does the code work?" is not "Does the customer succeed?"

This is the CLAUDE.md north star, and we were not living up to it. Every test asked "does the code work?" -- page loads, elements render, API returns data, database writes succeed. None asked "would Pastor Dave, sitting at his desk on a Tuesday afternoon, actually get his chatbot set up and helping his congregation?"

The 5-Question Framework was designed to bridge this gap. It has the right questions. It has the right personas. But it was only ever run by AI agents, who read code and page content with developer eyes. It needs to be run by a human who does not know the code.

Lesson 2: Consistency bugs are the hardest to catch

A tool count that is correct on 19 of 20 pages is nearly impossible to catch with per-page tests. You need cross-page tests that compare values. The claims registry (Section 3.1) and the derive system (Section 4.6) address this, but they must be built and enforced.
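
A cross-page comparison of this kind can be small. The sketch below is an illustration only: it assumes page text is already available as strings keyed by a hypothetical path, and it looks for a single claim pattern ("N tools"). The majority value is only a heuristic for spotting disagreement; the claims registry is what settles which value is right.

```python
import re
from collections import Counter

TOOL_CLAIM = re.compile(r"\b(\d+)\s+tools\b")


def find_outlier_pages(pages: dict[str, str]) -> dict[str, int]:
    """Return pages whose "N tools" claim differs from the most common value."""
    claims = {}
    for path, text in pages.items():
        match = TOOL_CLAIM.search(text)
        if match:
            claims[path] = int(match.group(1))
    if not claims:
        return {}
    majority, _ = Counter(claims.values()).most_common(1)[0]
    return {path: value for path, value in claims.items() if value != majority}
```

Run against 19 pages that say "39 tools" and one that says "33 tools", this returns only the one stale page, which is exactly the case a per-page test cannot see.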

Lesson 3: Granularity of specs matters enormously

The starter-chat.md acceptance spec has 62 touchpoints. It is thorough. But it operates at the section level: "Agents tab: Care + Coordinator visible." It does not say "each agent card must not have a voice badge." The element-level gating table (Section 3.2) adds the missing granularity.
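
To make the difference between section-level and element-level checks concrete, here is a minimal sketch. The table contents, element identifiers, and card structure are hypothetical; the real gating table is the one described in Section 3.2.

```python
# Hypothetical gating table: elements that must NOT render for each tier.
FORBIDDEN_ELEMENTS = {
    "starter_chat": {"voice_badge", "voice_greeting_step", "sms_phone_field"},
}


def gating_violations(tier: str, cards: list[dict]) -> list[str]:
    """List every forbidden element found on every card for this tier."""
    forbidden = FORBIDDEN_ELEMENTS.get(tier, set())
    violations = []
    for card in cards:
        for element in sorted(forbidden & set(card["elements"])):
            violations.append(f"{card['name']}: {element}")
    return violations
```

A section-level check would pass as long as the Care and Coordinator cards are visible; this check fails the moment either card carries a voice badge on a chat-only plan.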

Lesson 4: Email is the most under-tested touchpoint

Emails were the most neglected part of the testing infrastructure. They are also one of the most impactful customer touchpoints -- an email with broken links or false promises creates immediate distrust. Email templates need the same rigor as dashboard components.
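
Two of the email checks (features that do not apply to the tier, and missing plain-text fallback URLs) are mechanical enough to lint. The following is a sketch under stated assumptions: the feature-to-tier map is hypothetical, features are matched by simple phrase lookup, and links are pulled from the HTML with a basic regular expression rather than a full HTML parser.

```python
import re

# Hypothetical map of feature phrases to the tiers that include them.
FEATURE_TIERS = {
    "voice greeting": {"pro", "suite"},
    "document upload": {"pro", "suite"},
    "chat widget": {"starter_chat", "pro", "suite"},
}

LINK = re.compile(r'<a\s[^>]*href="([^"]+)"', re.IGNORECASE)


def lint_email(tier: str, html: str, plain_text: str) -> list[str]:
    """Return problems: features outside the tier, links missing from plain text."""
    problems = []
    body = html.lower()
    for phrase, tiers in FEATURE_TIERS.items():
        if phrase in body and tier not in tiers:
            problems.append(f"mentions '{phrase}' not available on {tier}")
    for url in LINK.findall(html):
        if url not in plain_text:
            problems.append(f"no plain-text fallback for {url}")
    return problems
```

A lint like this does not check whether a link actually works or whether the email reads well to a customer; those still need a click-through and a human read.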

Lesson 5: One human walk-through found more customer-facing issues than 24 automated agents

This is not an argument against automated testing. Automated testing catches hundreds of real bugs. But it IS an argument for regular human walk-throughs. The CEO should test one journey per month, using the protocol in Section 3.5, and the findings should be treated as high-priority issues.


Appendix: Issue-to-Root-Cause Mapping

| # | Issue | Category | Root Cause |
|---|-------|----------|------------|
| 1 | Agent count wrong on pricing cards | Marketing Drift | A |
| 2 | Tool count stale ("33" not "39") | Marketing Drift | A |
| 3 | Church size language pigeonholing | Marketing Drift | A |
| 4 | Wrong demo phone numbers in FAQ | Marketing Drift | A |
| 5 | Crisis FAQ implying integration | Marketing Drift | A |
| 6 | Starter Kit email wrong content | Marketing Drift | A + E |
| 7 | Pro Website upsell on Starter card | Marketing Drift | A |
| 8 | Strategy Call for $14.95 plan | Marketing Drift | A |
| 9 | Founder pricing badge repeated | Marketing Drift | A |
| 10 | Voice badge on chat-only panel | Marketing Drift | A + C |
| 11 | DB records before Stripe payment | Payment Flow | B |
| 12 | Welcome email before payment | Payment Flow | B |
| 13 | No trial notice in welcome email | Payment Flow | B + E |
| 14 | Stripe showing CAD not USD | Payment Flow | B |
| 15 | No founder notification on sale | Payment Flow | B |
| 16 | Voice badge on agent cards (chat-only) | Tier-Gating | C |
| 17 | Voice greeting in training progress | Tier-Gating | C |
| 18 | Document upload visible with lock | Tier-Gating | C |
| 19 | SMS phone field visible (chat-only) | Tier-Gating | C |
| 20 | Getting Started steps untrackable | Tier-Gating | C |
| 21 | Suggested questions not loading | Tier-Gating | C |
| 22 | Sharing links scattered | Tier-Gating | C |
| 23 | Compliance checklist scaring users | Tier-Gating | C + D |
| 24 | "Ministry tools" no tooltip | UX/Copy | D |
| 25 | Agent names unexplained | UX/Copy | D |
| 26 | "2 personas" meaningless | UX/Copy | D |
| 27 | Raw variable names in doctrinal positions | UX/Copy | D |
| 28 | Handoff rules implying config needed | UX/Copy | D |
| 29 | "Hero Photo URL" jargon | UX/Copy | D |
| 30 | Human escalation buried | UX/Copy | D |
| 31 | Same baptism example everywhere | UX/Copy | D |
| 32 | Sermon/homily not denomination-aware | UX/Copy | D |
| 33 | Safety Guide framed as legal | UX/Copy | D |
| 34 | FAQ columns uneven | UX/Copy | D |
| 35 | Missing bold/emphasis | UX/Copy | D |
| 36 | Duplicate "3 things" in welcome email | Email | E |
| 37 | Starter Kit email wrong features | Email | E |
| 38 | No PDF link in Starter Kit email | Email | E |
| 39 | Magic link 505 error | Email | E |
| 40 | No plain-text fallback URLs | Email | E |

ADDENDUM: Persona Prompts & TAG-Based Testing (added post-retrospective)

The CEO's Insight

"You said AI agents can't detect confusion. I think you're selling yourself short. Instead of 'You are an expert QA engineer,' try 'You are a tired pastor who confuses easily.' The persona IS the test."

Two New Methodologies

1. Persona Test Prompts — Full library at knowledge/tests/persona-test-prompts.md

Instead of expert prompts, test with:

  • The Tired Pastor — catches jargon, unclear UX, missing guidance
  • The Anxious Board Member — catches scary compliance language, missing safety info
  • The Justice-Minded Fact Checker — catches claim drift (tool counts, pricing, agent counts)
  • The Overwhelmed First-Timer — catches missing onboarding, too many options
  • The Catholic Secretary — catches Protestant assumptions, denomination-specific terminology
  • The Skeptical IT Director — catches vague security claims, missing API docs
  • The Budget Treasurer — catches hidden fees, upsell pressure, unclear pricing

Run 3+ personas per journey. Each catches what the others miss.

2. TAG-Based Consistency Registry — Full registry at knowledge/tests/tag-registry.yaml

Every customer-visible claim gets a TAG with:

  • Canonical value (the correct number/text)
  • Every location it appears in the codebase
  • Whether a tooltip/explanation is required
  • Per-tier variations

Example: #tools_count has canonical value "39", appears in 6 files, per-tier values (12/35/39), requires tooltip.

Before any marketing/UI change, search for the TAG and update ALL occurrences. After any change, verify consistency across all locations.
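
A registry entry and its consistency check might look roughly like this. It is a sketch, not the schema of the actual registry file: the `Tag` and `Location` classes, the file paths, and the mapping of 12/35/39 to specific tiers are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Location:
    path: str
    value: str
    tier: Optional[str] = None  # None means the claim is not tier-specific


@dataclass
class Tag:
    name: str
    canonical: str
    per_tier: dict[str, str]
    requires_tooltip: bool
    locations: list[Location]


def stale_locations(tag: Tag) -> list[str]:
    """Return paths whose value disagrees with the canonical or per-tier value."""
    stale = []
    for loc in tag.locations:
        expected = tag.per_tier[loc.tier] if loc.tier else tag.canonical
        if loc.value != expected:
            stale.append(loc.path)
    return stale
```

For example, a `#tools_count` tag with canonical value "39" would flag a FAQ page still showing "33" while accepting "12" on a card that is explicitly scoped to the lowest tier.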

How These Prevent the 40+ Issues

| Root Cause | Persona That Catches It | TAG That Tracks It |
|------------|-------------------------|--------------------|
| Marketing drift | Justice-Minded Fact Checker | #tools_count, #agent_count, #pricing |
| Tier-gating leakage | Tired Pastor, First-Timer | #tier_features, #channel_gating |
| Jargon | Tired Pastor, First-Timer | #jargon_forbidden |
| Email mismatch | Justice-Minded | #tier_features cross-ref with email content |
| Denomination issues | Catholic Secretary | #denomination_labels |
| Compliance fear | Anxious Board Member | (compliance section audit) |
| Payment flow | Justice-Minded | #pricing, payment state machine test |

Integration with Existing Test Infrastructure

These methodologies slot into the existing testing layers:

  • Layer A (Unit): unchanged
  • Layer B (Integration): unchanged
  • Layer C (E2E Playwright): unchanged
  • Layer D (5-Question AI): unchanged
  • Layer E (Persona Prompts): NEW — 3+ persona agents per journey
  • Layer F (TAG Consistency): NEW — automated cross-page claim verification
  • Layer G (Expected Output): unchanged
  • Layer H (Code Resilience): unchanged
  • Layer I (Monthly CEO Walk): unchanged

Layers E and F are the bridge between "does the code work?" and "does the customer succeed?"