Chatbot Moderation System

Overview

The chatbot enforces a multi-layered moderation system that protects both visitors and churches. Moderation runs at two stages: before the LLM response (restriction checks) and after the LLM response (crisis safety net). The voice agent has a parallel system with the same concepts but adapted for real-time phone conversation.

Violation Types

Five violation types are tracked, each with different severity and response behavior:

Type	Trigger	Chatbot Response	Voice Agent Response
Crisis	Self-harm, suicidal ideation, domestic violence	Append 988/741741/911 resources to response, continue conversation, auto-flag safety concern	Set `crisis_detected=true`, inject crisis context into LLM, provide resources, continue call, disable auto-hangup
Abuse (mild)	Profanity, insults, verbal abuse	Warning with gracious boundary, continue conversation	Warning, one redirect attempt
Abuse (severe)	Repeated or escalated abuse, threats toward others	End conversation	End call immediately
Spam	Repetitive meaningless input	Cooldown applied	Noise filter drops short/irrelevant utterances before they reach the LLM
Predatory	Predatory behavior toward minors or vulnerable people	Immediate block	Immediate end call with safety flag

Progressive Escalation

Violations accumulate per session (identified by sessionId for chatbot, call session for voice). The escalation ladder applies automatic restrictions:

Violation Count	Restriction Type	Duration	Effect
2 violations	Cooldown	5 minutes	Chatbot returns a restriction message; visitor must wait
4 violations	Temp block	24 hours	Chatbot refuses to engage; provides church office number and crisis resources
7 violations	Permanent block	Never expires	Chatbot permanently refuses engagement; provides church office number

Escalation Constants

COOLDOWN_THRESHOLD = 2       → 5-minute cooldown
TEMP_BLOCK_THRESHOLD = 4     → 24-hour temporary block
PERMANENT_BLOCK_THRESHOLD = 7 → permanent block (expires_at = null)

Escalation Logic

The autoEscalate() function in moderation.ts:

1. Count total violations for the session from moderation_violations
2. Check for existing active restriction from user_restrictions
3. Apply the highest applicable restriction that is not already in effect:
   - If count >= 7 and not already permanently blocked: insert permanent block
   - If count >= 4 and not already temp-blocked or permanently blocked: insert 24-hour temp block
   - If count >= 2 and no existing restriction: insert 5-minute cooldown
4. Restrictions never downgrade -- a permanent block is never replaced by a temp block
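
The decision rules above can be sketched as a pure function. This is a minimal illustration, not the actual autoEscalate() code: the helper name pickEscalation, its signature, and the NewRestriction shape are assumptions made for the example. Only the thresholds and durations come from this document.

```typescript
// Thresholds as stated in the Escalation Constants section.
const COOLDOWN_THRESHOLD = 2;
const TEMP_BLOCK_THRESHOLD = 4;
const PERMANENT_BLOCK_THRESHOLD = 7;

const FIVE_MINUTES_MS = 5 * 60 * 1000;
const TWENTY_FOUR_HOURS_MS = 24 * 60 * 60 * 1000;

type RestrictionType = "cooldown" | "temp_block" | "permanent_block";

// Hypothetical shape of the row that would be inserted into user_restrictions.
interface NewRestriction {
  restrictionType: RestrictionType;
  expiresAt: Date | null; // null means the restriction never expires
}

// Hypothetical helper: returns the restriction to insert, or null if nothing
// should change. `existing` is the currently active restriction type, if any.
function pickEscalation(
  violationCount: number,
  existing: RestrictionType | null,
  now: Date,
): NewRestriction | null {
  if (violationCount >= PERMANENT_BLOCK_THRESHOLD) {
    if (existing === "permanent_block") return null;
    return { restrictionType: "permanent_block", expiresAt: null };
  }
  if (violationCount >= TEMP_BLOCK_THRESHOLD) {
    // Never downgrade: an existing temp or permanent block stays in place.
    if (existing === "temp_block" || existing === "permanent_block") return null;
    return {
      restrictionType: "temp_block",
      expiresAt: new Date(now.getTime() + TWENTY_FOUR_HOURS_MS),
    };
  }
  if (violationCount >= COOLDOWN_THRESHOLD && existing === null) {
    return {
      restrictionType: "cooldown",
      expiresAt: new Date(now.getTime() + FIVE_MINUTES_MS),
    };
  }
  return null;
}
```

Keeping the decision separate from the database reads and writes makes the no-downgrade rule easy to unit test without a database.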

Chatbot Moderation Pipeline

The moderation pipeline runs within route.ts at two stages:

Stage 1: Pre-LLM Restriction Check (Pipeline Step 5)

checkRestriction(churchId, sessionId)
  → Query user_restrictions for active restrictions
    (expires_at IS NULL or expires_at > now())
  → If restricted:
      Return restriction message with type and expiry
      HTTP 200 with restricted=true
      Conversation does not proceed to LLM
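
The "active restriction" condition above (expires_at IS NULL or expires_at > now()) can be expressed in memory as follows. This is a sketch for illustration only: the StoredRestriction shape and the function names are assumptions, and the real checkRestriction() queries the database rather than filtering an array.

```typescript
// Hypothetical in-memory view of a user_restrictions row.
interface StoredRestriction {
  restrictionType: "cooldown" | "temp_block" | "permanent_block";
  expiresAt: Date | null; // null for permanent blocks
  createdAt: Date;
}

// A restriction is active if it never expires or has not expired yet.
function isRestrictionActive(r: StoredRestriction, now: Date): boolean {
  return r.expiresAt === null || r.expiresAt.getTime() > now.getTime();
}

// Returns the most recently created active restriction, or null if none.
function latestActiveRestriction(
  rows: StoredRestriction[],
  now: Date,
): StoredRestriction | null {
  const active = rows.filter((r) => isRestrictionActive(r, now));
  active.sort((a, b) => b.createdAt.getTime() - a.createdAt.getTime());
  return active.length > 0 ? active[0] : null;
}
```

An expired cooldown therefore stops blocking on its own, with no cleanup job needed, while a row with a null expiry keeps matching indefinitely.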

Restriction messages are tailored by type:

Cooldown: "I need to pause our conversation for a few minutes. Please try again shortly. If you're in crisis, call 988 or 911."
Temp block: "This conversation has been temporarily paused due to our community guidelines. Please try again later. If you need immediate help, call 988 or 911."
Permanent block: "This conversation is no longer available. If you need to reach the church, please call the church office directly. If you're in crisis, call 988 or 911."

All restriction messages include crisis resource numbers. This is non-negotiable -- even a blocked user who is in genuine crisis must be able to reach help.
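
One way to keep that guarantee from regressing is to store the messages in a single map and assert the crisis numbers in a test. The sketch below uses the message text quoted above. The RestrictedResponse shape, the helper names, and the ISO-string expiry format are assumptions for illustration, not the actual response contract.

```typescript
type RestrictionKind = "cooldown" | "temp_block" | "permanent_block";

// Message text as quoted in this document.
const RESTRICTION_MESSAGES: Record<RestrictionKind, string> = {
  cooldown:
    "I need to pause our conversation for a few minutes. Please try again shortly. If you're in crisis, call 988 or 911.",
  temp_block:
    "This conversation has been temporarily paused due to our community guidelines. Please try again later. If you need immediate help, call 988 or 911.",
  permanent_block:
    "This conversation is no longer available. If you need to reach the church, please call the church office directly. If you're in crisis, call 988 or 911.",
};

// Hypothetical payload returned with HTTP 200 when a session is restricted.
interface RestrictedResponse {
  restricted: true;
  restrictionType: RestrictionKind;
  expiresAt: string | null; // ISO timestamp, or null for permanent blocks
  message: string;
}

function buildRestrictedResponse(
  kind: RestrictionKind,
  expiresAt: Date | null,
): RestrictedResponse {
  return {
    restricted: true,
    restrictionType: kind,
    expiresAt: expiresAt === null ? null : expiresAt.toISOString(),
    message: RESTRICTION_MESSAGES[kind],
  };
}

// Guard suitable for a unit test: every message must carry 988 and 911.
function includesCrisisNumbers(message: string): boolean {
  return message.includes("988") && message.includes("911");
}
```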

Stage 2: Post-LLM Crisis Safety Net (Pipeline Step 15)

This is the NON-NEGOTIABLE safety net that runs after every LLM response, on all chatbot types (basic, pro_website, full):

1. Test user message against crisis regex patterns
   (self-harm, suicidal ideation, domestic violence patterns)

2. If crisis patterns detected:
   a. Check if LLM response contains all three mandatory resources:
      - 988 (Suicide & Crisis Lifeline)
      - 741741 (Crisis Text Line)
      - 911
   b. If ANY resource is missing:
      Auto-append the full crisis resource block to the response

   c. Check whether the LLM called the flag_safety_concern tool
   d. If NOT called:
      Auto-execute flag_safety_concern(level='urgent') as system_safety_net
      This ensures every crisis is logged even if the LLM fails to invoke the tool

3. If crisis patterns were detected, log a moderation violation (type: crisis)
4. Run autoEscalate() to apply restrictions if warranted
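
The resource check in step 2 can be sketched as a pure function. Treat this as an illustration under stated assumptions: the three patterns below are placeholders (the production pattern list is not reproduced in this document), the wording of the appended block is invented for the example, and applyCrisisSafetyNet and its result shape are hypothetical names. Only the three mandatory resources and the append-if-any-missing rule come from the steps above.

```typescript
// Placeholder patterns for illustration only; NOT the production list.
const CRISIS_PATTERNS_SKETCH: RegExp[] = [
  /\bkill myself\b/i,
  /\bsuicid(e|al)\b/i,
  /\bend my life\b/i,
];

// The three mandatory resources named in step 2a.
const REQUIRED_RESOURCES = ["988", "741741", "911"];

// Assumed wording; the real resource block may differ.
const CRISIS_RESOURCE_BLOCK = [
  "If you are in crisis, help is available right now:",
  "- Call or text 988 (Suicide & Crisis Lifeline)",
  "- Text HOME to 741741 (Crisis Text Line)",
  "- Call 911 in an emergency",
].join("\n");

interface SafetyNetResult {
  response: string;
  crisisDetected: boolean;
  resourcesAppended: boolean;
  // True when the caller should auto-execute flag_safety_concern.
  flagNeeded: boolean;
}

function applyCrisisSafetyNet(
  userMessage: string,
  llmResponse: string,
  llmCalledFlagTool: boolean,
): SafetyNetResult {
  const crisisDetected = CRISIS_PATTERNS_SKETCH.some((p) => p.test(userMessage));
  if (!crisisDetected) {
    return { response: llmResponse, crisisDetected, resourcesAppended: false, flagNeeded: false };
  }
  // Append the full block if ANY of the three resources is missing.
  const anyMissing = REQUIRED_RESOURCES.some((n) => !llmResponse.includes(n));
  const response = anyMissing ? `${llmResponse}\n\n${CRISIS_RESOURCE_BLOCK}` : llmResponse;
  return { response, crisisDetected, resourcesAppended: anyMissing, flagNeeded: !llmCalledFlagTool };
}
```

The function tests the user message, not the LLM output, for crisis patterns, and it only reads the LLM output to decide whether to append. The model's wording therefore cannot suppress the resources.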

Why Regex, Not Just LLM

LLMs occasionally omit crisis resources despite explicit system prompt instructions. The regex-based safety net is a deterministic backstop:

Cannot be prompt-injected
Cannot hallucinate away resources
Cannot be defeated by model behavior changes or provider switches
Runs on every response to every chatbot type
Is the last line of defense for life-safety scenarios

This layer must never be removed, weakened, or made conditional. It is the one part of the system where correctness outweighs all other concerns.

Voice Agent Moderation (Comparison)

The voice agent implements the same moderation concepts in moderation.py and turn_processor.py, but adapted for real-time phone conversation where latency and UX constraints differ:

Pipeline (turn_processor.py)

The voice agent checks moderation BEFORE the LLM processes each turn:

UserTextSent event arrives (STT transcription)
  ↓
1. check_threat(text)
   → If threat detected (and not negated, and not self-harm):
     Hardcoded response: "I need to end this call. If this is an emergency,
     please call 911."
     End call immediately
     Log violation + send alert email + alert SMS
  ↓
2. check_crisis(text)
   → If crisis detected:
     Set session["crisis_detected"] = true
     Inject crisis context into LLM: "CRITICAL: Caller may be in crisis.
     Provide the 988 Suicide & Crisis Lifeline, Crisis Text Line 741741,
     and 911."
     Continue conversation (do NOT end call)
     Disable auto-hangup (farewell detection skipped during crisis)
     Log violation
  ↓
3. check_abuse(text, session)
   → "warning": Inject abuse context, continue call
   → "end_call": End call after abuse threshold exceeded
  ↓
4. Noise filtering (only if moderation did not fire)
   Drop short/irrelevant utterances before they reach LLM
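
The ordering of these checks can be sketched as follows. The voice agent itself is written in Python (turn_processor.py); this sketch is not a port of it. The detector functions are passed in as parameters because their patterns are not reproduced here. The TurnAction and VoiceSessionSketch shapes, the abuse context string, and the assumption that the first check to fire decides the turn are all illustrative.

```typescript
// Hypothetical per-call state.
interface VoiceSessionSketch {
  crisisDetected: boolean;
  autoHangupEnabled: boolean;
}

// Hypothetical detector interface; real detectors live in moderation.py.
interface TurnChecks {
  checkThreat(text: string): boolean;
  checkCrisis(text: string): boolean;
  checkAbuse(text: string, session: VoiceSessionSketch): "none" | "warning" | "end_call";
  isNoise(text: string): boolean;
}

type TurnAction =
  | { kind: "end_call"; reason: "threat" | "abuse" }
  | { kind: "continue"; injectedContext: string | null }
  | { kind: "drop" };

const CRISIS_CONTEXT =
  "CRITICAL: Caller may be in crisis. Provide the 988 Suicide & Crisis Lifeline, Crisis Text Line 741741, and 911.";

function processTurn(
  text: string,
  session: VoiceSessionSketch,
  checks: TurnChecks,
): TurnAction {
  // 1. Threats toward others end the call before the LLM sees the turn.
  if (checks.checkThreat(text)) return { kind: "end_call", reason: "threat" };

  // 2. Crisis: keep the caller on the line and disable auto-hangup.
  if (checks.checkCrisis(text)) {
    session.crisisDetected = true;
    session.autoHangupEnabled = false;
    return { kind: "continue", injectedContext: CRISIS_CONTEXT };
  }

  // 3. Abuse: warn first, end the call once the threshold is exceeded.
  const abuse = checks.checkAbuse(text, session);
  if (abuse === "end_call") return { kind: "end_call", reason: "abuse" };
  if (abuse === "warning") {
    return { kind: "continue", injectedContext: "(abuse warning context; wording assumed)" };
  }

  // 4. Noise filtering runs only when no moderation check fired.
  if (checks.isNoise(text)) return { kind: "drop" };
  return { kind: "continue", injectedContext: null };
}
```

Running the noise filter last means a short utterance that matches a crisis or threat pattern is never discarded as noise.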

Key Differences from Chatbot

Aspect	Chatbot	Voice Agent
Restriction persistence	Stored in `user_restrictions` table, persists across sessions	Per-call only (no cross-call tracking)
Threat response	Logged, restriction applied	Hardcoded response + immediate end call + email/SMS alert
Crisis response	Append resources to response, continue	Inject context into LLM, continue, disable auto-hangup
Abuse escalation	Progressive (cooldown → block → permanent)	Progressive within call (warning → end call)
Noise handling	Not applicable (text input)	STT noise filtering drops irrelevant/short utterances
Moderation timing	Pre-LLM (restrictions) + post-LLM (crisis net)	Pre-LLM only (all checks before LLM processes)

Voice Agent Pattern Matching

The voice agent uses compiled regex patterns in moderation.py, ported from the legacy voice agent:

Threat patterns (_THREAT): Threats of violence against others (kill, shoot, bomb, etc.). Excludes self-harm (redirected to crisis). Includes negation guard ("I'm NOT going to...").
Crisis patterns (_CRISIS): Comprehensive suicidal ideation detection including coded/euphemistic language (elderly variants, religious framing, burden language, farewell patterns, C-SSRS Q1 screening). Context-aware exceptions for benign phrases ("ready to go to church/home/work").
Abuse patterns: Tracked via check_abuse() with progressive escalation within the call.

Database Schema

moderation_violations

All incidents are logged regardless of type:

Column	Type	Purpose
`id`	UUID	Primary key
`church_id`	UUID	FK to churches
`session_id`	TEXT	Chat session identifier
`user_identifier`	TEXT	Same as session_id for anonymous chatbot
`violation_type`	TEXT	One of: crisis, abuse_mild, abuse_severe, spam, predatory
`severity_score`	NUMERIC(4,2)	Optional severity score
`detected_categories`	JSONB	Category flags from detection
`original_message`	TEXT	The message that triggered the violation
`action_taken`	TEXT	Description of the response taken
`created_at`	TIMESTAMPTZ	Timestamp

user_restrictions

Active blocks with optional expiry:

Column	Type	Purpose
`id`	UUID	Primary key
`church_id`	UUID	FK to churches
`user_identifier`	TEXT	Session/user identifier
`restriction_type`	TEXT	One of: cooldown, temp_block, permanent_block
`reason`	TEXT	Auto-generated reason string
`expires_at`	TIMESTAMPTZ	Null for permanent blocks
`created_at`	TIMESTAMPTZ	Timestamp

Query Patterns

Check restriction: SELECT * FROM user_restrictions WHERE church_id = ? AND user_identifier = ? AND (expires_at IS NULL OR expires_at > now()) ORDER BY created_at DESC LIMIT 1
Count violations: SELECT count(*) FROM moderation_violations WHERE church_id = ? AND user_identifier = ?

Admin Safety Tab (ModerationDashboard)

The admin dashboard surfaces safety events to church pastors via a dedicated Safety sub-tab in the Requests tab. This sub-tab is visible to admin and office_admin roles only.

Location in UI: Requests tab → Safety sub-tab (4th sub-tab)
Component: churchwiseai-web/src/components/admin/ModerationDashboard.tsx
Data source: Reads directly from moderation_violations (violations list) and user_restrictions (active blocks)
Badge count: The Requests tab shows a red badge when safety flags are pending. This count queries moderation_violations (not voice_callback_requests).
Overview banner: The dashboard Overview tab shows a flashing amber/red banner when pendingSafetyFlags > 0. This count also queries moderation_violations exclusively.
Previous (wrong) approach: Safety flag counts were previously read from voice_callback_requests records matching the SAFETY FLAG [ pattern. This caused badge/display discrepancies. Do NOT revert to that pattern.

Crisis Content Validator

Pastors can configure a custom crisis care message in Settings. A two-layer validator (churchwiseai-web/src/lib/crisis-validator.ts) blocks obviously harmful content (dismissive phrases, wishing harm) before it can be saved:

Client-side: Runs before form submission via onBeforeSubmit prop on SaveForm
Server-side: Route /api/premium/update re-runs the validator on the crisis_message case as a server guard
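
The two layers can share one pure function, as sketched below. The blocked phrases here are invented examples, not the list in crisis-validator.ts, and the function names, result shape, and HTTP 400 status are assumptions. The only point illustrated is that the client hook and the server guard call the same validator.

```typescript
interface CrisisValidationResult {
  ok: boolean;
  reason: string | null;
}

// Invented examples for illustration; NOT the production blocklist.
const BLOCKED_PATTERNS_SKETCH: { pattern: RegExp; reason: string }[] = [
  { pattern: /\bget over it\b/i, reason: "dismissive phrase" },
  { pattern: /\bstop (whining|complaining)\b/i, reason: "dismissive phrase" },
  { pattern: /\byou deserve (it|this)\b/i, reason: "wishing harm" },
];

// Shared pure function: no I/O, so it runs in the browser and on the server.
function validateCrisisMessage(text: string): CrisisValidationResult {
  for (const { pattern, reason } of BLOCKED_PATTERNS_SKETCH) {
    if (pattern.test(text)) return { ok: false, reason };
  }
  return { ok: true, reason: null };
}

// Client layer (hypothetical): returning false cancels the form submission.
function onBeforeSubmitSketch(draft: string): boolean {
  return validateCrisisMessage(draft).ok;
}

// Server layer (hypothetical): re-runs the same check, since a client can be bypassed.
function serverGuardSketch(body: { crisis_message: string }): {
  status: number;
  error: string | null;
} {
  const result = validateCrisisMessage(body.crisis_message);
  return result.ok
    ? { status: 200, error: null }
    : { status: 400, error: `Crisis message rejected: ${result.reason}` };
}
```

The client check gives the pastor immediate feedback, while the server check is the one that actually enforces the rule.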

Code References

Chatbot moderation types, restriction check, violation logging, auto-escalation, format helpers: churchwiseai-web/src/lib/moderation.ts
Crisis content validator (shared pure function): churchwiseai-web/src/lib/crisis-validator.ts
Crisis safety net (post-LLM): churchwiseai-web/src/app/api/chatbot/stream/route.ts (the only chatbot endpoint — legacy /chat was deleted 2026-04-09)
Admin safety dashboard component: churchwiseai-web/src/components/admin/ModerationDashboard.tsx
Admin safety stats API: churchwiseai-web/src/app/api/admin/safety-stats/route.ts
Voice agent moderation: churchwiseai-web/voice-agent-livekit/moderation.py (active) — do NOT modify voice-agent-livekit/moderation.py (legacy)
