Manual Testing Retrospective: 40+ Issues Found in One Walk-Through

Date: 2026-03-30
Tester: CEO (John), manually walking the Starter Chat signup-to-dashboard journey
Persona used: "Pastor Dave" -- non-technical Protestant pastor, 47, suburban Ohio, 150-member church
Duration: Single session
Issues found: 40+
Issues previously caught by automated testing: 0 of these 40+

Executive Summary

On March 30, 2026, the CEO manually tested the Starter Chat product by walking through the entire customer journey as a non-technical pastor. He found 40+ issues across 5 categories that weeks of automated testing (24 Playwright agents, 25 personas, 10 journey YAMLs, 62-touchpoint acceptance specs, code resilience audits) had completely missed.

This is not a failure of the testing tools. It is a failure of what the testing tools were asked to check. Every automated test asked "does the code work?" The CEO asked "would Pastor Dave succeed?" These are fundamentally different questions, and the gap between them is where 40+ bugs lived undetected.

Part 1: Root Cause Analysis

1.1 The Five Root Causes

Every issue found falls into one of five root causes. Understanding these is more important than fixing any individual bug.

Root Cause A: Marketing Copy Drift (10 issues)

What happened: Multiple agents edited marketing pages, pricing cards, FAQ sections, and email templates independently over weeks. Each change was locally correct at the time it was made. But no agent checked whether its change was consistent with every OTHER page that references the same data.

Why automated tests missed it: Automated tests verify what a single page shows. They do not cross-reference claims across pages. When an agent changed the tool count from 33 to 39, they updated the homepage but not the pricing page FAQ. When the agent architecture changed from 2 visible to 4 visible agents for Pro/Suite, marketing cards were not updated. Each page passed its own tests.

Specific issues caused:

Agent count wrong on pricing cards (2 shown, should be 4 for Pro/Suite)
Tool count stale in descriptions ("33 tools" should be "39 tools")
Church size language pigeonholing plans ("50-200 member churches")
Wrong demo phone numbers in FAQ (sales line, not demo lines)
Crisis FAQ implying non-existent system integration ("triggers" vs "shares")
Starter Kit email promising FAQ management (not available in Starter)
Pro Website upsell on Starter pricing card (conversion leak for entry-level buyers)
"Book a Strategy Call" prominent for $14.95 plan (founder time wasted on low-value leads)
Founder pricing badge repeated confusingly
"Voice + Chatbot" badge on This Week panel for chat-only plan

The pattern: Every one of these is a CONSISTENCY problem, not a CORRECTNESS problem. The data was correct somewhere in the system. It just was not propagated to every place that displays it.

Root Cause B: Payment Flow Architecture (5 issues)

What happened: The onboarding flow creates database records (premium_churches, churches, identities, organization_settings) BEFORE Stripe confirms payment. This is an architectural decision, not a bug -- but it creates downstream failures when checkout is abandoned or when the welcome email fires before payment.

Why automated tests missed it: Tests check the happy path: submit form, complete payment, verify dashboard loads. No test checked what happens when someone submits the form and then abandons Stripe checkout. No test checked the database state BETWEEN form submission and payment completion. No test verified the ORDER of operations (DB write vs payment vs email).

Specific issues caused:

DB records created before Stripe payment (3 broken flows when checkout abandoned)
Welcome email sent before payment (creates expectation of access before payment succeeds)
No trial notice in welcome email (customer does not know they have 14 days free)
Stripe showing CAD instead of USD (Adaptive Pricing not overridden)
No founder notification on new sale/trial (founder has no visibility into pipeline)

The pattern: These are SEQUENCE and LIFECYCLE bugs. They exist in the gaps between systems (form -> DB -> Stripe -> email -> dashboard), not within any single system.

Root Cause C: Tier-Gating Leakage (8 issues)

What happened: Dashboard components were built with features that span multiple tiers. Tier-gating was applied to major features (FAQ management, document upload, analytics) but not to every individual UI element within visible components. Voice-related UI elements leaked into chat-only plans. Getting Started steps referenced features the tier does not have.

Why automated tests missed it: The acceptance spec (starter-chat.md) defines 62 touchpoints with "Should See" and "Should NOT See" lists. But the spec operates at the SECTION level, not the ELEMENT level. It says "Agents tab: Care + Coordinator visible, Discipleship + Stewardship hidden." It does NOT say "Each agent card must NOT show a Voice badge." The granularity of the spec was too coarse.

Specific issues caused:

Voice badge showing on agent cards for chat-only plans
Voice greeting counted in training progress for chat-only plans
Document upload visible with lock icon but no upgrade message
SMS phone field visible for chat-only plans
Getting Started steps untrackable ("Customize agents" marked done on visit)
Suggested questions not loading saved values
Sharing links scattered across 3 tabs
Compliance checklist scaring customers ("Insurance provider notified")

The pattern: These are GRANULARITY bugs. The spec was correct at the macro level but did not drill down to every individual UI element. A pastor sees every element -- not just the sections the spec documents.

Root Cause D: Jargon and Pastor-Hostile UX (12 issues)

What happened: The dashboard was built by engineers for engineers. Technical terminology that is obvious to a developer is meaningless or frightening to a non-technical pastor. No automated test can detect "this word will confuse Pastor Dave" because confusion is a human judgment, not a computable property.

Why automated tests missed it: Automated tests check for the PRESENCE of content, not its COMPREHENSIBILITY. A test can verify that the text "2 personas" renders correctly. It cannot determine that a pastor has no idea what "personas" means in this context. Similarly, "believers_baptism_only" appearing as a raw variable name passes every rendering test -- the text is there, it is correct, and it is unintelligible.

Specific issues caused:

"Ministry tools" with no tooltip explaining what a "tool" is
"Care Agent" / "Coordinator Agent" with no explanation of what they do
"2 personas" meaningless to pastors (now showing specialization areas instead)
Doctrinal positions showing raw variable names (believers_baptism_only)
Handoff rules implying pastor needs to configure something complex
"Hero Photo URL" -- pastors do not know what this means
Human escalation settings buried in agent personality panels
Custom practice examples all showing same baptism text
Sermon section not denomination-aware (sermon vs homily for Catholics)
Safety Guide framed as legal requirement, not helpful resource
FAQ columns with uneven gaps
Bold/emphasis missing on key marketing copy

The pattern: These are EMPATHY bugs. They require understanding the mental model of a non-technical pastor, not the mental model of a developer. No amount of code testing detects them because the code is working perfectly.

Root Cause E: Email Content Mismatch (5 issues)

What happened: Email templates were written generically and not validated against each tier's actual feature set. The welcome email promises content (like FAQ management) that a Starter customer cannot access. The AI Starter Kit email references features that do not exist at the Starter tier. No test verified that email copy matches the tier's actual capabilities.

Why automated tests missed it: Email tests check delivery (was it sent?), formatting (does it render?), and links (do they work?). No test reads the email body and cross-references every claim against the tier's feature set. "Your AI Starter Kit includes FAQ management tips" passes every mechanical test but is a lie to a Starter customer.

Specific issues caused:

Duplicate "3 things" section in welcome email
AI Starter Kit email referencing FAQ management (Starter cannot do this)
No PDF download link in Starter Kit email
Magic link (/auth/magic) returning 505 error
No fallback plain-text URLs in emails

The pattern: These are PROMISE-vs-REALITY bugs. The email makes a promise. The product does not deliver. No test checks the relationship between the promise and the delivery.

1.2 Why the Existing Testing Infrastructure Missed All 40+

The ChurchWiseAI testing infrastructure as of March 30 includes:

| Layer | What It Tests | What It Misses |
|-------|---------------|----------------|
| Playwright specs (159 files) | Page loads, element presence, link integrity, API responses | Content accuracy, cross-page consistency, comprehensibility |
| 5-Question Framework | Page-level goal evaluation with persona empathy | Only as good as the questions asked; never run by a human against production |
| 25 Personas (YAML) | Diverse user types with concerns and goals | Personas are defined but tests run as AGENTS, not as confused humans |
| 10 Journey YAMLs | Step-by-step journey definitions | Steps define URLs and expected elements, not experiential quality |
| 62-touchpoint acceptance spec | What each tier should/should not see | Operates at section granularity, not element granularity |
| Code resilience audit | Anti-patterns, security, error handling | Code-level only, no UX or content analysis |
| QA Checklist (10 sections) | Build, security, SEO, DB, content accuracy | Content accuracy section checks canonical numbers but not copy drift |

The fundamental gap: Every layer tests the system from the INSIDE OUT. "Does this component render the right props?" "Does this API return the right data?" "Does this page have the right elements?" None of them test from the OUTSIDE IN: "Would a real pastor, sitting at a real computer, with no knowledge of our codebase, actually succeed?"

The 5-Question Framework was designed to bridge this gap (Q3: "If I were this persona, would I know what to do next?"). But it has only ever been run by AI agents reading page content -- never by a human walking through production with fresh eyes. An AI agent reading a page does not experience confusion the way a pastor does. The agent knows what "personas" means. The agent can parse "believers_baptism_only" as a variable name. The agent does not feel scared by a compliance checklist.

Part 2: The Persona-Based Testing Gap

2.1 What the CEO Did Differently

The CEO walked through the product as "Pastor Dave" -- not as an engineer, not as an AI agent, not as someone who knows the codebase. He:

Started from Google (or the homepage), not from a specific URL in a test file
Read every word on every page as someone who has never heard of ChurchWiseAI
Did not skip anything because "that is tested elsewhere"
Asked "do I understand this?" at every element, not "does this render?"
Checked emails as a customer, reading the promises and comparing them to what was available
Noticed inconsistencies between pages because he saw them in sequence, not in isolation
Felt confused by jargon and noted it, rather than parsing it as a test assertion
Tried to USE the product, not just verify it loads

2.2 Why AI Agents Cannot Fully Replace This

AI agents are excellent at:

Checking element presence/absence (Q1, Q2)
Cross-referencing specs (Q2)
Identifying obvious UX issues (Q3, Q4)
Tracking goal progress (Q5)
Running at scale across many pages and journeys

AI agents struggle with:

Emotional confusion -- "this compliance checklist scares me" is a human reaction
Cumulative frustration -- seeing the same jargon on page after page compounds
Expectation gaps -- an email promises X, the dashboard delivers Y, the dissonance is felt, not computed
Visual hierarchy as experienced -- an agent reads all text equally; a human sees what is bold, large, or above the fold
Fresh eyes -- agents have read the codebase; they cannot truly pretend they have not
Sequence effects -- seeing the pricing page AFTER the homepage changes what you notice; agents test pages in isolation

2.3 The Real Gap: Layer B Has Never Been Run By a Human

The 5-Question Framework defines three layers:

Layer A (Mechanical/Playwright) -- runs in CI, automated
Layer B (AI Goal-Based) -- designed to be run weekly by AI agents
Layer C (Outcome Verification) -- database/email/API checks after journey

Layer B was conceived correctly but has a critical blind spot: it assumes AI agents can simulate human confusion. They cannot. Layer B needs a companion Layer B' (B-Prime): periodic human walk-throughs using the same 5 Questions but with actual human perception.

Part 3: New Testing Methodology -- Filling the Gaps

3.1 Marketing Consistency Checks (Root Cause A)

Problem: Claims about agent counts, tool counts, pricing, features, and product behavior appear on 20+ pages. When one changes, the others drift.

New test: Cross-Page Claim Consistency Scanner

Create a canonical claims registry and scan all marketing pages against it.

# knowledge/tests/claims-registry.yaml
claims:
  tool_count:
    canonical_value: "39"
    source: knowledge/data/features.yaml
    pages_that_reference:
      - /pricing (PricingGrid.tsx)
      - / (homepage stats bar)
      - /chatbot (feature section)
      - /voice (feature section)
      - /ai-for/[denomination] (stats)
    patterns_to_search:
      - '\d+ tools'
      - '\d+ ministry tools'
      - '\d+ AI tools'

  agent_count_starter:
    canonical_value: "2"
    source: knowledge/data/features.yaml
    pages_that_reference:
      - /pricing (Starter card)
      - /onboard (plan description)
    patterns_to_search:
      - '\d+ agents'
      - '\d+ AI agents'

  agent_count_pro:
    canonical_value: "4"
    source: knowledge/data/features.yaml
    pages_that_reference:
      - /pricing (Pro card)
      - /chatbot (Pro features)
    patterns_to_search:
      - '\d+ agents'
      - '\d+ AI agents'

  demo_phone_number:
    canonical_value: "+14145551234"  # actual demo line
    source: CLAUDE.md (voice agent section)
    pages_that_reference:
      - /pricing (FAQ section)
      - /demo
      - /voice
    patterns_to_search:
      - '\+1\d{10}'
      - '\(\d{3}\) \d{3}-\d{4}'

Implementation: Add a Playwright spec or script that:

Loads the claims registry
For each claim, visits every listed page
Searches for the pattern
Compares found values to canonical value
Reports any mismatch as SPEC VIOLATION

Frequency: Every deploy (add to CI).
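
The scanner steps above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the project's implementation: the `Claim` and `Mismatch` shapes and the `findClaimMismatches` function are hypothetical names, the page text is passed in as plain strings rather than fetched by Playwright, and the comparison only handles numeric claims (it extracts the first run of digits from each pattern match).

```typescript
// Hypothetical shapes; the fields mirror claims-registry.yaml.
interface Claim {
  name: string;
  canonicalValue: string;
  patterns: string[]; // regex sources, as listed under patterns_to_search
}

interface Mismatch {
  claim: string;
  page: string;
  found: string;
  expected: string;
}

// Compare every pattern match on every page against the canonical value.
function findClaimMismatches(claims: Claim[], pages: Record<string, string>): Mismatch[] {
  const mismatches: Mismatch[] = [];
  for (const claim of claims) {
    for (const page of Object.keys(pages)) {
      const text = pages[page];
      for (const source of claim.patterns) {
        const regex = new RegExp(source, "g");
        let match: RegExpExecArray | null;
        while ((match = regex.exec(text)) !== null) {
          // Guard against zero-length matches looping forever.
          if (match[0] === "") { regex.lastIndex++; continue; }
          const digits = /\d+/.exec(match[0]);
          if (digits !== null && digits[0] !== claim.canonicalValue) {
            mismatches.push({ claim: claim.name, page, found: digits[0], expected: claim.canonicalValue });
          }
        }
      }
    }
  }
  return mismatches;
}

// Illustrative input: the homepage was updated, the pricing page was not.
const claims: Claim[] = [
  { name: "tool_count", canonicalValue: "39", patterns: ["\\d+ tools", "\\d+ ministry tools", "\\d+ AI tools"] },
];
const pages: Record<string, string> = {
  "/": "39 ministry tools for your church",
  "/pricing": "Includes 33 tools out of the box",
};
const mismatches = findClaimMismatches(claims, pages);
console.log(mismatches);
```

Keeping the comparison separate from page loading means the same function could run against rendered text in CI or against source files in a pre-commit hook. Non-numeric claims such as phone numbers would need their own normalization step.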

3.2 Tier-Gating Element-Level Verification (Root Cause C)

Problem: The acceptance spec checks at section granularity. Individual UI elements within visible sections leak features from other tiers.

New test: Element-Level Tier Audit

For each tier, enumerate EVERY UI element that varies by tier -- not just tabs and sections, but individual badges, labels, form fields, progress indicators, and CTAs.

Add to each acceptance spec a new section: "Element-Level Gating" with entries like:

### Element-Level Gating (Starter Chat)

| Component | Element | Expected | Actual Check |
|-----------|---------|----------|--------------|
| AgentCard | Voice badge | HIDDEN | data-testid="voice-badge" should not exist |
| TrainingProgress | Voice greeting step | HIDDEN | text "voice greeting" should not appear |
| TrainingProgress | Total steps denominator | Exclude voice steps | count should match chat-only steps |
| AgentCard | Voice greeting input | HIDDEN | input[name="voice_greeting"] should not exist |
| OverviewTab | "This Week" panel badge | "Chatbot" only | text should NOT contain "Voice" |
| GettingStarted | Step: Customize agents | Completion = non-trivial | should NOT mark done on tab visit |
| SettingsTab | SMS phone field | HIDDEN | input[name="sms_phone"] should not exist |
| DocumentUpload | Lock icon + upgrade CTA | Upgrade message present | text should contain "Upgrade to Pro" |
| SharingLinks | All share links | Single location | all share CTAs in one section |
| ComplianceChecklist | Legal items | Church-appropriate language | no "Insurance provider notified" for Starter |

Implementation: Generate Playwright assertions from this table. Each row becomes one expect() call. Add data-testid attributes to components where they do not exist.

Frequency: Every deploy (add to CI).
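
One way to turn table rows into checks is to store each row as data and run a single audit function over it. The sketch below assumes hypothetical names (`GatingRule`, `auditGating`, `countMatches`) and replaces the browser with a lookup table so the rule logic can be exercised on its own; the selectors are the ones proposed in the table and may not exist in the components yet.

```typescript
type Expectation = "absent" | "present";

interface GatingRule {
  component: string;
  selector: string; // data-testid or input name proposed in the gating table
  expected: Expectation;
}

// Three Starter Chat rows transcribed from the Element-Level Gating table.
const starterChatRules: GatingRule[] = [
  { component: "AgentCard", selector: '[data-testid="voice-badge"]', expected: "absent" },
  { component: "AgentCard", selector: 'input[name="voice_greeting"]', expected: "absent" },
  { component: "SettingsTab", selector: 'input[name="sms_phone"]', expected: "absent" },
];

// countMatches abstracts the DOM lookup so the rules can be tested without a browser.
function auditGating(rules: GatingRule[], countMatches: (selector: string) => number): string[] {
  const violations: string[] = [];
  for (const rule of rules) {
    const count = countMatches(rule.selector);
    const ok = rule.expected === "absent" ? count === 0 : count > 0;
    if (!ok) {
      violations.push(`${rule.component}: ${rule.selector} should be ${rule.expected}`);
    }
  }
  return violations;
}

// Stand-in for a rendered Starter Chat dashboard where the voice badge leaked through.
const fakeDom: Record<string, number> = { '[data-testid="voice-badge"]': 2 };
const gatingViolations = auditGating(starterChatRules, (selector) => fakeDom[selector] || 0);
console.log(gatingViolations);
```

In a Playwright spec, `countMatches` could be backed by `page.locator(selector).count()`. That call is asynchronous, so the audit function would need an async variant.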

3.3 Email Content vs Feature Validation (Root Cause E)

Problem: Emails promise features the tier does not have.

New test: Email-Feature Cross-Reference

For each email template, extract every feature claim and verify it against the tier's feature set.

# knowledge/tests/email-feature-validation.yaml
emails:
  welcome_email:
    template: src/lib/emails/welcome-email.ts
    tiers_that_receive: [starter_chat, starter_voice, starter_both, pro_chat, pro_both, suite_chat, suite_both]
    claims_to_verify:
      - claim: "14-day free trial"
        condition: chat plans only
        tiers_true: [starter_chat, pro_chat, suite_chat]
        tiers_false: [starter_voice, starter_both, pro_both, suite_both]
      - claim: "FAQ management"
        tiers_true: [pro_chat, pro_both, suite_chat, suite_both]
        tiers_false: [starter_chat, starter_voice, starter_both]
      - claim: "Magic link to dashboard"
        all_tiers: true
        verify: /auth/magic route returns 200

  starter_kit_email:
    template: src/lib/emails/starter-kit-email.ts
    tiers_that_receive: [starter_chat, starter_both]
    claims_to_verify:
      - claim: "FAQ management tips"
        tiers_true: []  # Starter does NOT have FAQ management
        tiers_false: [starter_chat, starter_both]
        finding: "SPEC VIOLATION: email promises feature tier does not have"
      - claim: "PDF download link"
        all_tiers: true
        verify: link href returns 200

Implementation: Parse email templates at build time. For each tier, verify every claim is accurate. Flag any claim that references a feature the tier does not have. Also verify every link in every email returns 200.

Frequency: Every deploy that touches email templates, plus weekly sweep.
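
The cross-reference itself is a small lookup. The following sketch assumes a hypothetical feature matrix and phrase list (in practice these would be derived from knowledge/data/features.yaml and the validation YAML above); `unsupportedClaims` is an illustrative name, and only two tiers are modeled.

```typescript
type Tier = "starter_chat" | "pro_chat";

// Hypothetical feature matrix for two tiers; the feature keys are illustrative.
const tierFeatures: Record<Tier, string[]> = {
  starter_chat: ["chat_widget", "agents"],
  pro_chat: ["chat_widget", "agents", "faq_management", "document_upload"],
};

// Phrases that imply a feature. The mapping is illustrative, not exhaustive.
const featurePhrases: { phrase: RegExp; feature: string }[] = [
  { phrase: /FAQ management/i, feature: "faq_management" },
  { phrase: /document upload/i, feature: "document_upload" },
];

// Return every feature the email body mentions that the tier does not include.
function unsupportedClaims(emailBody: string, tier: Tier): string[] {
  const available = tierFeatures[tier];
  const problems: string[] = [];
  for (const { phrase, feature } of featurePhrases) {
    if (phrase.test(emailBody) && available.indexOf(feature) === -1) {
      problems.push(feature);
    }
  }
  return problems;
}

const starterKitBody = "Your AI Starter Kit includes FAQ management tips.";
console.log(unsupportedClaims(starterKitBody, "starter_chat"));
console.log(unsupportedClaims(starterKitBody, "pro_chat"));
```

Phrase matching only catches claims that someone has already listed, so a weekly human or AI read of each email remains useful for finding new phrasings.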

3.4 Jargon Detection (Root Cause D)

Problem: Technical terminology in pastor-facing UI causes confusion. No automated test detects "this word will confuse a non-technical user."

New test: Jargon Scanner

Maintain a dictionary of terms that are meaningful to developers but not to pastors. Scan all customer-facing pages for these terms.

# knowledge/tests/jargon-dictionary.yaml
terms:
  # Terms that should NEVER appear in customer-facing UI
  forbidden:
    - pattern: 'persona[s]?'
      replacement: "'specialization area' or 'ministry focus'"
    - pattern: 'RAG'
      replacement: 'knowledge base'
    - pattern: 'LLM'
      replacement: 'AI'
    - pattern: 'endpoint'
      replacement: "'connection' or 'service'"
    - pattern: 'webhook'
      replacement: never show to customer
    - pattern: 'slug'
      replacement: never show to customer
    - pattern: 'token'
      context: only in auth flows
      replacement: 'access link'

  # Terms that need a tooltip or explanation
  needs_explanation:
    - pattern: 'ministry tools?'
      explanation: "AI-powered actions like prayer request capture, visitor logging, appointment scheduling"
    - pattern: 'Care Agent'
      explanation: "Your AI assistant that handles pastoral care conversations -- prayer requests, counseling referrals, crisis support"
    - pattern: 'Coordinator Agent'
      explanation: "Your AI assistant that handles logistics -- service times, directions, event info, staff routing"
    - pattern: 'handoff rules?'
      explanation: "When and how the AI transfers a conversation to a real person"
    - pattern: 'theological lens'
      explanation: "Your church's tradition (Baptist, Catholic, Lutheran, etc.) that shapes how the AI responds"

  # Variable names that should NEVER render as-is in UI
  raw_variable_patterns:
    - 'believers_baptism_only'
    - 'infant_baptism'
    - 'both_baptism'
    - '_enabled$'
    - '_config$'
    - '\b[a-z]+(_[a-z]+)+\b'  # any snake_case string, anywhere in the text

Implementation: Three layers:

Build-time scan: Grep all .tsx files in customer-facing routes for forbidden terms. Fail the build if found.
Runtime tooltips: For "needs explanation" terms, verify that a tooltip or info icon exists adjacent to the term. Playwright can check for title attributes, aria-describedby, or adjacent help icons.
Variable name leak detection: Scan rendered page content for snake_case strings. Any snake_case text visible to the user is a rendering bug.

Frequency: Every deploy (build-time scan in CI). Weekly for tooltip verification.
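
The forbidden-term and variable-leak checks can share one function that runs over visible page text. This is a sketch under assumptions: `scanVisibleText` and `JargonFinding` are hypothetical names, only a few dictionary terms are included, and the snake_case expression is one possible definition (two or more lowercase words joined by underscores).

```typescript
interface JargonFinding {
  kind: "forbidden" | "raw_variable";
  term: string;
}

// A few forbidden terms from the jargon dictionary; word boundaries avoid matching inside other words.
const forbiddenPatterns: RegExp[] = [/\bpersonas?\b/i, /\bRAG\b/, /\bLLM\b/, /\bwebhook\b/i];

// Two or more lowercase words joined by underscores, e.g. believers_baptism_only.
const snakeCase = /\b[a-z]+(?:_[a-z]+)+\b/g;

function scanVisibleText(text: string): JargonFinding[] {
  const findings: JargonFinding[] = [];
  // Report the first occurrence of each forbidden term.
  for (const pattern of forbiddenPatterns) {
    const match = pattern.exec(text);
    if (match !== null) {
      findings.push({ kind: "forbidden", term: match[0] });
    }
  }
  // Report every snake_case string, since each one is a separate rendering bug.
  const leaks = text.match(snakeCase) || [];
  for (const leak of leaks) {
    findings.push({ kind: "raw_variable", term: leak });
  }
  return findings;
}

const findings = scanVisibleText("Your 2 personas follow believers_baptism_only.");
console.log(findings);
```

The function should receive rendered text rather than source code, because source files legitimately contain snake_case identifiers that never reach the customer.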

3.5 Customer Journey Simulation -- Human Protocol (Root Cause: Layer B Gap)

Problem: AI agents test journeys by reading page content, not by experiencing them. The 5-Question Framework needs a human complement.

New process: Monthly CEO Walk-Through Protocol

Once per month, the CEO (or designated tester) walks through one complete customer journey using the following protocol:

MONTHLY HUMAN JOURNEY TEST
===========================
Date: ___________
Journey: ___________
Persona: ___________
Browser: Incognito, no extensions
Device: ___________

RULES:
1. Do NOT look at the codebase before or during the test
2. Do NOT use direct URLs -- start from Google or the homepage
3. Read EVERY word on EVERY page as if you have never seen it
4. Note EVERY moment of confusion, even if brief
5. Check EVERY email within 60 seconds of receiving it
6. Compare email promises to actual dashboard features
7. Try to USE the product, not just look at it
8. Time yourself -- if any step takes more than 2 minutes, note it

FOR EACH PAGE, ANSWER:
- Do I understand every word on this page? (Y/N, list confusing terms)
- Do I know what to do next? (Y/N, what is unclear)
- Is this consistent with what I saw on the previous page? (Y/N, what changed)
- Would I trust this company based on this page? (Y/N, what feels off)
- Does anything scare me or make me want to leave? (Y/N, what)

AFTER COMPLETING THE JOURNEY:
- How many pages did I visit total?
- How many times was I confused?
- How many broken links did I find?
- How many email/product mismatches did I find?
- Would I recommend this to another pastor? (Y/N, why)
- What was the single biggest friction point?

Frequency: Monthly, rotating through journeys. Priority order:

Starter Chat (highest volume, lowest friction expected)
Pro Chat (most features to verify)
Voice Starter (telephony adds complexity)
PewSearch Premium (cross-product)
Suite Both (full feature surface)

3.6 Payment Flow Sequence Testing (Root Cause B)

Problem: Tests check the happy path end-state but not the intermediate states or failure paths in the payment flow.

New test: Payment Flow State Machine Test

States: Form Submitted | Checkout Started | Checkout Abandoned |
        Payment Succeeded | Payment Failed | Webhook Received

Test every state transition:

1. Form Submitted -> Checkout Abandoned
   VERIFY: No premium_churches record exists
   VERIFY: No welcome email sent
   VERIFY: No MailerLite subscriber added
   VERIFY: No organization_settings record

2. Form Submitted -> Payment Succeeded -> Webhook Received
   VERIFY: premium_churches created AFTER webhook (not before)
   VERIFY: Welcome email sent AFTER webhook (not before)
   VERIFY: Email mentions 14-day trial (for chat plans)
   VERIFY: Currency is USD (not localized)
   VERIFY: Founder notification sent (email or Slack)

3. Payment Succeeded -> Webhook Delayed (30s+)
   VERIFY: Return page shows spinner, not error
   VERIFY: Return page polls for record
   VERIFY: Dashboard accessible after webhook arrives

4. Duplicate Webhook Received
   VERIFY: No duplicate records created
   VERIFY: No duplicate emails sent
   VERIFY: webhook_events table prevents reprocessing

Implementation: Stripe CLI test mode with stripe trigger checkout.session.completed. Verify database state at each step.

Frequency: After any change to onboarding, checkout, or webhook handlers.
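
The invariants in transitions 1, 2, and 4 can be stated as a tiny in-memory model before wiring up Stripe. The sketch below is an assumption-laden illustration: `Store`, `handle`, and the event shapes are hypothetical, arrays stand in for the premium_churches table and the email log, and a keyed object stands in for the webhook_events table.

```typescript
type CheckoutEvent =
  | { type: "form_submitted"; sessionId: string }
  | { type: "checkout_abandoned"; sessionId: string }
  | { type: "webhook_received"; sessionId: string; eventId: string };

interface Store {
  churches: string[];                    // stand-in for premium_churches rows
  emailsSent: string[];                  // stand-in for the welcome-email log
  processedEvents: Record<string, true>; // stand-in for the webhook_events table
}

function newStore(): Store {
  return { churches: [], emailsSent: [], processedEvents: {} };
}

// Payment-first handler: only a confirmed webhook may create records or send email.
function handle(store: Store, event: CheckoutEvent): void {
  if (event.type !== "webhook_received") return;    // form submit and abandon write nothing
  if (store.processedEvents[event.eventId]) return; // duplicate delivery: skip reprocessing
  store.processedEvents[event.eventId] = true;
  store.churches.push(event.sessionId);
  store.emailsSent.push(event.sessionId);
}

// Transition 1: form submitted, then checkout abandoned.
const abandoned = newStore();
handle(abandoned, { type: "form_submitted", sessionId: "cs_test_1" });
handle(abandoned, { type: "checkout_abandoned", sessionId: "cs_test_1" });

// Transitions 2 and 4: payment succeeds, and the webhook is delivered twice.
const paid = newStore();
handle(paid, { type: "form_submitted", sessionId: "cs_test_2" });
handle(paid, { type: "webhook_received", sessionId: "cs_test_2", eventId: "evt_1" });
handle(paid, { type: "webhook_received", sessionId: "cs_test_2", eventId: "evt_1" });

console.log(abandoned.churches.length, paid.churches.length, paid.emailsSent.length); // prints: 0 1 1
```

The integration test would assert the same three numbers against the real database after driving the flow with the Stripe CLI.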

Part 4: Self-Annealing Recommendations

These recommendations make the system automatically detect and prevent the types of issues found today, without requiring manual testing.

4.1 Cross-Page Consistency Guard (prevents Root Cause A)

Mechanism: A pre-commit hook or CI step that:

Reads knowledge/data/features.yaml and knowledge/data/pricing.yaml
Scans every .tsx file in marketing routes (/pricing, /, /chatbot, /voice, /ai-for/)
Flags any hardcoded number that does not match the canonical source
Fails the build if a mismatch is found

Scope: Tool counts, agent counts, pricing, tradition counts, church counts, phone numbers.

4.2 Tier-Gating Regression Guard (prevents Root Cause C)

Mechanism: A Playwright test suite that:

Logs into the admin dashboard as each tier (using test accounts)
For each tier, verifies every element in the Element-Level Gating table
Fails if any voice-related element appears for chat-only plans
Fails if any Pro+ element appears for Starter plans

Scope: Every dashboard component with tier-conditional rendering.

4.3 Email Template Lint (prevents Root Cause E)

Mechanism: A build-time check that:

Parses each email template
Extracts feature references (FAQ, document upload, voice, analytics, etc.)
For each tier that receives the email, verifies the feature exists at that tier
Fails the build if an email promises a feature the tier does not have

Scope: All email templates in src/lib/emails/.

4.4 Jargon Lint (prevents Root Cause D)

Mechanism: A custom ESLint rule or build-time scan that:

Reads the jargon dictionary
Scans all customer-facing components for forbidden terms
Warns on "needs explanation" terms without adjacent tooltips
Fails on raw variable names rendered as text (snake_case in UI)

Scope: All components in routes that customers see (marketing pages, dashboard, chat interfaces, emails).
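
Detection works best alongside a rendering helper that makes raw values hard to display by accident. A minimal sketch, assuming a hypothetical `displayLabel` helper and label map; the label wording here is illustrative and would need review by someone who knows the denominations involved.

```typescript
// Hypothetical label map; the keys are the raw values listed in the jargon dictionary.
const doctrineLabels: Record<string, string> = {
  believers_baptism_only: "Believer's baptism only",
  infant_baptism: "Infant baptism",
  both_baptism: "Both infant and believer's baptism",
};

// Prefer a reviewed label; otherwise fall back to a readable form of the raw key.
function displayLabel(raw: string): string {
  if (Object.prototype.hasOwnProperty.call(doctrineLabels, raw)) {
    return doctrineLabels[raw];
  }
  // Fallback: "service_times" becomes "Service times".
  const words = raw.split("_").join(" ");
  return words.charAt(0).toUpperCase() + words.slice(1);
}

console.log(displayLabel("believers_baptism_only"));
console.log(displayLabel("service_times"));
```

The fallback keeps an unmapped value from appearing as snake_case, but a lint rule could still warn on it so that a reviewed label gets added.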

4.5 Payment-First Architecture Enforcement (prevents Root Cause B)

Mechanism: Integration tests that:

Submit the onboard form
Verify zero DB records exist before checkout completion
Complete checkout
Verify records exist only after webhook processing
Verify email sent only after webhook processing

Scope: Every checkout flow (onboard, upgrade, PewSearch claim).

4.6 Drift Detection via Knowledge Derivation

Mechanism: Extend the existing pnpm derive system to:

Read canonical values from knowledge/data/*.yaml
Scan all marketing pages and dashboard components for references
Generate a drift report comparing found values to canonical values
Fail if any drift detected

This builds on the existing derivation system but extends it to UI content, not just documentation.

Part 5: Updated Testing Architecture

Before (as of 2026-03-29)

Layer A: Mechanical (Playwright)        -- "Does it load?"
Layer B: AI Goal-Based (5-Question)     -- "Would a persona succeed?"
Layer C: Outcome Verification           -- "Did the backend work?"
Layer D: Code Resilience                -- "Are there anti-patterns?"

After (as of 2026-03-30)

Layer A: Mechanical (Playwright)        -- "Does it load?"
Layer B: AI Goal-Based (5-Question)     -- "Would a persona succeed?"
  Layer B': Human Walk-Through          -- "Does a REAL human succeed?" [NEW]
Layer C: Outcome Verification           -- "Did the backend work?"
Layer D: Code Resilience                -- "Are there anti-patterns?"
Layer E: Cross-Page Consistency         -- "Do all pages agree?" [NEW]
Layer F: Tier-Gating Element Audit      -- "Does every element respect tiers?" [NEW]
Layer G: Email-Feature Validation       -- "Do emails match features?" [NEW]
Layer H: Jargon Detection              -- "Would a pastor understand this?" [NEW]
Layer I: Payment Sequence Verification  -- "Is the payment flow atomic?" [NEW]

Testing Cadence

| Layer | Frequency | Who/What Runs It |
|-------|-----------|------------------|
| A: Mechanical | Every deploy | CI/CD (Playwright) |
| B: AI Goal-Based | Weekly + before launch | AI agent via `/qa goals` |
| B': Human Walk-Through | Monthly | CEO or designated tester |
| C: Outcome Verification | Per journey | Automated after Layer B |
| D: Code Resilience | Before launch + monthly | AI agent via `/qa resilience` |
| E: Cross-Page Consistency | Every deploy | CI/CD (custom scanner) |
| F: Tier-Gating Element Audit | Every deploy | CI/CD (Playwright per-tier) |
| G: Email-Feature Validation | Every email template change | Build-time check |
| H: Jargon Detection | Every deploy | Build-time scan + weekly tooltip check |
| I: Payment Sequence | After checkout/webhook changes | Integration test (Stripe CLI) |

Part 6: Checklist for Future Manual Testing

When the CEO (or any human) does a manual walk-through, use this checklist in addition to the persona protocol in Section 3.5.

Pre-Test Setup

Use an incognito/private browser window
Use a REAL email address you can check
Do NOT look at the codebase or admin tools beforehand
Have the persona card printed or visible (name, age, role, tech comfort, key concern)
Set a timer for each step

Marketing Pages (Root Cause A checks)

Count the number of "tools" mentioned -- is it consistent across pages?
Count the number of "agents" mentioned -- is it consistent across pages?
Check every phone number -- is it a real demo line or a sales/support line?
Read every FAQ answer -- does it match the actual product?
Check every badge/label on pricing cards -- are they tier-appropriate?
Look for upsells to products above the tier being tested
Look for "Book a Call" CTAs -- are they appropriate for this price point?
Check church size language -- does it exclude your persona's church?

Payment Flow (Root Cause B checks)

Note the exact price shown at every step (page, form, Stripe checkout)
Check the currency -- USD, not localized
Note whether trial is mentioned and consistent (14 days)
ABANDON checkout mid-flow -- check email and DB for orphan records
Complete checkout -- verify email arrives AFTER payment, not during form submission
Check for founder notification of the new signup

Dashboard (Root Cause C checks)

For EVERY visible element, ask: "Is this relevant to my tier?"
Look for voice-related content on chat-only plans
Check training progress -- are all counted steps achievable at this tier?
Check Getting Started -- can each step actually be completed?
Look for lock icons -- do they explain how to unlock?
Find every sharing/embed link -- are they all in one place?

Comprehensibility (Root Cause D checks)

Read every label and heading out loud -- would a pastor understand it?
Look for snake_case text, technical variable names, or code artifacts
Check every form field label -- would a pastor know what to enter?
Look for compliance/legal language -- is it reassuring or scary?
Check agent names and descriptions -- do they explain what the agent does?
Look for tooltips on technical terms -- are they present and helpful?

Emails (Root Cause E checks)

Read every email as a customer, not as an engineer
For each feature mentioned in the email, verify it is available at this tier
Click every link in every email -- do they all work?
Check the "from" address and brand name -- consistent?
Look for plain-text fallback URLs

Cross-Page Consistency (catch-all)

Compare the pricing page claims to the dashboard reality
Compare the email promises to the dashboard features
Compare the homepage claims to the pricing page details
Note any number that appears differently on different pages

Part 7: Immediate Action Items from Today's Test

Critical (fix before launch)

Payment-first architecture: Move ALL DB writes to webhook handler. No records before Stripe confirms payment.
Magic link fix: /auth/magic returning 505 -- investigate and fix.
Email content per tier: Make welcome email and Starter Kit email tier-aware. Remove feature references that do not apply.
USD currency enforcement: Force currency: 'usd' in all Stripe checkout sessions.
Voice badge removal: Strip all voice-related UI from chat-only plan dashboards.

Important (fix before first customer)

Agent count accuracy: Update pricing cards to show correct agent counts per tier.
Tool count update: Change all "33 tools" references to "39 tools."
Demo phone numbers: Replace sales line in FAQ with actual demo numbers.
Jargon cleanup: Add tooltips for "ministry tools," agent names, "theological lens."
Variable name rendering: Fix doctrinal position display to show human-readable labels.
Founder notifications: Send email to founder on every new trial/sale.
Training progress per tier: Calculate completion based on tier-available steps only.
Getting Started tracking: Implement real completion tracking (not "mark done on visit").
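
For the training progress item above, the arithmetic is worth making explicit: a Starter customer who finishes three of three available steps should see 100%, not 75% because a fourth voice-only step is still counted. The step names and tier mapping below are assumptions for illustration.

```python
# Hypothetical training steps and the tiers where each can be completed.
TRAINING_STEPS = {
    "church_profile": {"starter", "pro", "suite"},
    "service_times": {"starter", "pro", "suite"},
    "suggested_questions": {"starter", "pro", "suite"},
    "voice_greeting": {"pro", "suite"},
}


def training_progress(completed: set[str], tier: str) -> int:
    """Percent complete, counting only steps available at `tier`."""
    available = {step for step, tiers in TRAINING_STEPS.items() if tier in tiers}
    if not available:
        return 0
    return round(100 * len(completed & available) / len(available))


done = {"church_profile", "service_times", "suggested_questions"}
print(training_progress(done, "starter"), training_progress(done, "pro"))
```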

Minor (fix in next sprint)

Church size language: Remove or broaden "50-200 member churches" copy.
Strategy Call placement: Remove or deprioritize for Starter tier.
Compliance checklist tone: Reframe as helpful resource, not legal requirement.
Sharing link consolidation: Move all share/embed links to one location.
FAQ column alignment: Fix uneven gaps in FAQ layout.
Bold/emphasis in marketing: Add emphasis to key selling points.
Founder pricing badge: Show once, not repeated on each card.

Part 8: Systemic Lessons

Lesson 1: "Does the code work?" is not "Does the customer succeed?"

This is the CLAUDE.md north star, and we were not living up to it. Every test asked "does the code work?" -- page loads, elements render, API returns data, database writes succeed. None asked "would Pastor Dave, sitting at his desk on a Tuesday afternoon, actually get his chatbot set up and helping his congregation?"

The 5-Question Framework was designed to bridge this gap. It has the right questions. It has the right personas. But it was only ever run by AI agents, who read code and page content with developer eyes. It needs to be run by a human who does not know the code.

Lesson 2: Consistency bugs are the hardest to catch

A tool count that is correct on 19 of 20 pages is nearly impossible to catch with per-page tests, because the one stale page still passes its own checks. You need cross-page tests that compare values. The claims registry (Section 3.1) and the derive system (Section 4.6) address this, but only once they are built and enforced.

Lesson 3: Granularity of specs matters enormously

The starter-chat.md acceptance spec has 62 touchpoints. It is thorough. But it operates at the section level: "Agents tab: Care + Coordinator visible." It does not say "each agent card must not have a voice badge." The element-level gating table (Section 3.2) adds the missing granularity.

Lesson 4: Email is the most under-tested touchpoint

Emails were the most neglected part of the testing infrastructure. They are also one of the most impactful customer touchpoints -- an email with broken links or false promises creates immediate distrust. Email templates need the same rigor as dashboard components.

Lesson 5: One human walk-through found more customer-facing issues than 24 automated agents

This is not an argument against automated testing. Automated testing catches hundreds of real bugs. But it IS an argument for regular human walk-throughs. The CEO should test one journey per month, using the protocol in Section 3.5, and the findings should be treated as high-priority issues.

Appendix: Issue-to-Root-Cause Mapping

#	Issue	Category	Root Cause
1	Agent count wrong on pricing cards	Marketing Drift	A
2	Tool count stale ("33" not "39")	Marketing Drift	A
3	Church size language pigeonholing	Marketing Drift	A
4	Wrong demo phone numbers in FAQ	Marketing Drift	A
5	Crisis FAQ implying integration	Marketing Drift	A
6	Starter Kit email wrong content	Marketing Drift	A + E
7	Pro Website upsell on Starter card	Marketing Drift	A
8	Strategy Call for $14.95 plan	Marketing Drift	A
9	Founder pricing badge repeated	Marketing Drift	A
10	Voice badge on chat-only panel	Marketing Drift	A + C
11	DB records before Stripe payment	Payment Flow	B
12	Welcome email before payment	Payment Flow	B
13	No trial notice in welcome email	Payment Flow	B + E
14	Stripe showing CAD not USD	Payment Flow	B
15	No founder notification on sale	Payment Flow	B
16	Voice badge on agent cards (chat-only)	Tier-Gating	C
17	Voice greeting in training progress	Tier-Gating	C
18	Document upload visible with lock	Tier-Gating	C
19	SMS phone field visible (chat-only)	Tier-Gating	C
20	Getting Started steps untrackable	Tier-Gating	C
21	Suggested questions not loading	Tier-Gating	C
22	Sharing links scattered	Tier-Gating	C
23	Compliance checklist scaring users	Tier-Gating	C + D
24	"Ministry tools" no tooltip	UX/Copy	D
25	Agent names unexplained	UX/Copy	D
26	"2 personas" meaningless	UX/Copy	D
27	Raw variable names in doctrinal positions	UX/Copy	D
28	Handoff rules implying config needed	UX/Copy	D
29	"Hero Photo URL" jargon	UX/Copy	D
30	Human escalation buried	UX/Copy	D
31	Same baptism example everywhere	UX/Copy	D
32	Sermon/homily not denomination-aware	UX/Copy	D
33	Safety Guide framed as legal	UX/Copy	D
34	FAQ columns uneven	UX/Copy	D
35	Missing bold/emphasis	UX/Copy	D
36	Duplicate "3 things" in welcome email	Email	E
37	Starter Kit email wrong features	Email	E
38	No PDF link in Starter Kit email	Email	E
39	Magic link 505 error	Email	E
40	No plain-text fallback URLs	Email	E

ADDENDUM: Persona Prompts & TAG-Based Testing (added post-retrospective)

The CEO's Insight

"You said AI agents can't detect confusion. I think you're selling yourself short. Instead of 'You are an expert QA engineer,' try 'You are a tired pastor who confuses easily.' The persona IS the test."

Two New Methodologies

1. Persona Test Prompts -- Full library at knowledge/tests/persona-test-prompts.md

Instead of expert prompts, test with:

The Tired Pastor -- catches jargon, unclear UX, missing guidance
The Anxious Board Member -- catches scary compliance language, missing safety info
The Justice-Minded Fact Checker -- catches claim drift (tool counts, pricing, agent counts)
The Overwhelmed First-Timer -- catches missing onboarding, too many options
The Catholic Secretary -- catches Protestant assumptions, denomination-specific terminology
The Skeptical IT Director -- catches vague security claims, missing API docs
The Budget Treasurer -- catches hidden fees, upsell pressure, unclear pricing

Run 3+ personas per journey. Each catches what the others miss.

2. TAG-Based Consistency Registry -- Full registry at knowledge/tests/tag-registry.yaml

Every customer-visible claim gets a TAG with:

Canonical value (the correct number/text)
Every location it appears in the codebase
Whether a tooltip/explanation is required
Per-tier variations

Example: #tools_count has canonical value "39", appears in 6 files, per-tier values (12/35/39), requires tooltip.

Before any marketing/UI change, search for the TAG and update ALL occurrences. After any change, verify consistency across all locations.
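
The verification step described above might look like the sketch below. It is not the format of tag-registry.yaml, which is not shown here: the `Tag` fields, the regex-based claim extraction, and the file names are all assumptions. The allowed values reuse the #tools_count example (12/35/39).

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    name: str
    pattern: str             # regex with one capture group for the claimed value
    allowed: frozenset[str]  # canonical value plus any per-tier variations
    locations: tuple[str, ...]


def check_tag(tag: Tag, files: dict[str, str]) -> list[str]:
    """Return one message per location whose claim is missing or not allowed."""
    problems = []
    for path in tag.locations:
        values = re.findall(tag.pattern, files.get(path, ""))
        if not values:
            problems.append(f"{tag.name}: no claim found in {path}")
        for value in values:
            if value not in tag.allowed:
                problems.append(f"{tag.name}: {path} says {value!r}")
    return problems


tools_count = Tag(
    name="#tools_count",
    pattern=r"(\d+) tools",
    allowed=frozenset({"12", "35", "39"}),
    locations=("home.html", "pricing.html"),
)
files = {"home.html": "39 tools included", "pricing.html": "All 33 tools"}
print(check_tag(tools_count, files))
```

Reporting a missing claim as a problem is a deliberate choice in this sketch: if a registered location no longer contains the claim, the registry itself has drifted and needs updating.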

How These Prevent the 40+ Issues

Root Cause	Persona That Catches It	TAG That Tracks It
Marketing drift	Justice-Minded Fact Checker	#tools_count, #agent_count, #pricing
Tier-gating leakage	Tired Pastor, Overwhelmed First-Timer	#tier_features, #channel_gating
Jargon	Tired Pastor, Overwhelmed First-Timer	#jargon_forbidden
Email mismatch	Justice-Minded Fact Checker	#tier_features cross-referenced with email content
Denomination issues	Catholic Secretary	#denomination_labels
Compliance fear	Anxious Board Member	(compliance section audit)
Payment flow	Justice-Minded Fact Checker	#pricing, payment state machine test

Integration with Existing Test Infrastructure

These methodologies slot into the existing testing layers:

Layer A (Unit): unchanged
Layer B (Integration): unchanged
Layer C (E2E Playwright): unchanged
Layer D (5-Question AI): unchanged
Layer E (Persona Prompts): NEW -- 3+ persona agents per journey
Layer F (TAG Consistency): NEW -- automated cross-page claim verification
Layer G (Expected Output): unchanged
Layer H (Code Resilience): unchanged
Layer I (Monthly CEO Walk): unchanged

Layers E and F are the bridge between "does the code work?" and "does the customer succeed?"
