Chatbot Moderation System
Overview
The chatbot enforces a multi-layered moderation system that protects both visitors and churches. Moderation runs at two stages: before the LLM response (restriction checks) and after the LLM response (crisis safety net). The voice agent has a parallel system with the same concepts but adapted for real-time phone conversation.
Violation Types
Five violation types are tracked, each with different severity and response behavior:
| Type | Trigger | Chatbot Response | Voice Agent Response |
|---|---|---|---|
| Crisis | Self-harm, suicidal ideation, domestic violence | Append 988/741741/911 resources to response, continue conversation, auto-flag safety concern | Set crisis_detected=true, inject crisis context into LLM, provide resources, continue call, disable auto-hangup |
| Abuse (mild) | Profanity, insults, verbal abuse | Warning with gracious boundary, continue conversation | Warning, one redirect attempt |
| Abuse (severe) | Repeated or escalated abuse, threats toward others | End conversation | End call immediately |
| Spam | Repetitive meaningless input | Cooldown applied | Noise filter drops short/irrelevant utterances before they reach the LLM |
| Predatory | Predatory behavior toward minors or vulnerable people | Immediate block | Immediate end call with safety flag |
Progressive Escalation
Violations accumulate per session (identified by sessionId for chatbot, call session for voice). The escalation ladder applies automatic restrictions:
| Violation Count | Restriction Type | Duration | Effect |
|---|---|---|---|
| 2 violations | Cooldown | 5 minutes | Chatbot returns a restriction message; visitor must wait |
| 4 violations | Temp block | 24 hours | Chatbot refuses to engage; provides church office number and crisis resources |
| 7 violations | Permanent block | Never expires | Chatbot permanently refuses engagement; provides church office number |
Escalation Constants
COOLDOWN_THRESHOLD = 2 → 5-minute cooldown
TEMP_BLOCK_THRESHOLD = 4 → 24-hour temporary block
PERMANENT_BLOCK_THRESHOLD = 7 → permanent block (expires_at = null)
Escalation Logic
The autoEscalate() function in moderation.ts:
- Count total violations for the session from moderation_violations
- Check for an existing active restriction from user_restrictions
- Apply the highest applicable restriction that is not already in effect:
- If count >= 7 and not already permanently blocked: insert permanent block
- If count >= 4 and not already temp-blocked or permanently blocked: insert 24-hour temp block
- If count >= 2 and no existing restriction: insert 5-minute cooldown
- Restrictions never downgrade -- a permanent block is never replaced by a temp block
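The decision rules above can be sketched as a pure function. This is a minimal illustration, not the actual autoEscalate() in moderation.ts: the function name decideEscalation, the NewRestriction shape, and the idea of passing in the current restriction type are assumptions made for the example. Only the thresholds and durations come from this page.

```typescript
// Hypothetical types for illustration; not taken from moderation.ts.
type RestrictionType = "cooldown" | "temp_block" | "permanent_block";

interface NewRestriction {
  type: RestrictionType;
  expiresAt: Date | null; // null means the restriction never expires
}

// Thresholds and durations as documented in "Escalation Constants".
const COOLDOWN_THRESHOLD = 2;
const TEMP_BLOCK_THRESHOLD = 4;
const PERMANENT_BLOCK_THRESHOLD = 7;
const COOLDOWN_MS = 5 * 60 * 1000;
const TEMP_BLOCK_MS = 24 * 60 * 60 * 1000;

// Returns the restriction to insert, or null when nothing new applies.
// `active` is the session's current active restriction type, if any.
function decideEscalation(
  violationCount: number,
  active: RestrictionType | null,
  now: Date,
): NewRestriction | null {
  if (violationCount >= PERMANENT_BLOCK_THRESHOLD) {
    return active === "permanent_block"
      ? null
      : { type: "permanent_block", expiresAt: null };
  }
  if (violationCount >= TEMP_BLOCK_THRESHOLD) {
    // Never downgrade: an existing temp or permanent block stays in place.
    return active === "temp_block" || active === "permanent_block"
      ? null
      : { type: "temp_block", expiresAt: new Date(now.getTime() + TEMP_BLOCK_MS) };
  }
  if (violationCount >= COOLDOWN_THRESHOLD) {
    // A cooldown is only inserted when no restriction exists at all.
    return active === null
      ? { type: "cooldown", expiresAt: new Date(now.getTime() + COOLDOWN_MS) }
      : null;
  }
  return null;
}
```

Keeping the decision separate from the database reads and writes makes the "never downgrade" rule easy to unit test; whether the real function is structured this way is not stated here.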
Chatbot Moderation Pipeline
The moderation pipeline runs within route.ts at two stages:
Stage 1: Pre-LLM Restriction Check (Pipeline Step 5)
checkRestriction(churchId, sessionId)
→ Query user_restrictions for active restrictions
(expires_at IS NULL or expires_at > now())
→ If restricted:
Return restriction message with type and expiry
HTTP 200 with restricted=true
Conversation does not proceed to LLM
Restriction messages are tailored by type:
- Cooldown: "I need to pause our conversation for a few minutes. Please try again shortly. If you're in crisis, call 988 or 911."
- Temp block: "This conversation has been temporarily paused due to our community guidelines. Please try again later. If you need immediate help, call 988 or 911."
- Permanent block: "This conversation is no longer available. If you need to reach the church, please call the church office directly. If you're in crisis, call 988 or 911."
All restriction messages include crisis resource numbers. This is non-negotiable -- even a blocked user who is in genuine crisis must be able to reach help.
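As a rough sketch of Stage 1, the snippet below filters for an active restriction and builds the restricted payload. The message strings are copied from the list above; the function names, the Restriction shape, and every response field other than restricted are hypothetical, and the input array is assumed to be ordered newest first (mirroring ORDER BY created_at DESC).

```typescript
type RestrictionType = "cooldown" | "temp_block" | "permanent_block";

// Hypothetical in-memory shape of a user_restrictions row.
interface Restriction {
  type: RestrictionType;
  expiresAt: Date | null; // null for permanent blocks
}

// Message text as documented above; every message carries crisis numbers.
const RESTRICTION_MESSAGES: Record<RestrictionType, string> = {
  cooldown:
    "I need to pause our conversation for a few minutes. Please try again shortly. If you're in crisis, call 988 or 911.",
  temp_block:
    "This conversation has been temporarily paused due to our community guidelines. Please try again later. If you need immediate help, call 988 or 911.",
  permanent_block:
    "This conversation is no longer available. If you need to reach the church, please call the church office directly. If you're in crisis, call 988 or 911.",
};

// Same condition as the query: expires_at IS NULL OR expires_at > now().
function isActive(r: Restriction, now: Date): boolean {
  return r.expiresAt === null || r.expiresAt.getTime() > now.getTime();
}

// Hypothetical response body; only `restricted` is named in the source.
interface RestrictedResponse {
  restricted: true;
  type: RestrictionType;
  message: string;
  expiresAt: string | null;
}

// Returns the payload to send with HTTP 200, or null to proceed to the LLM.
function buildRestrictedResponse(
  newestFirst: Restriction[],
  now: Date,
): RestrictedResponse | null {
  for (const r of newestFirst) {
    if (isActive(r, now)) {
      return {
        restricted: true,
        type: r.type,
        message: RESTRICTION_MESSAGES[r.type],
        expiresAt: r.expiresAt === null ? null : r.expiresAt.toISOString(),
      };
    }
  }
  return null;
}
```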
Stage 2: Post-LLM Crisis Safety Net (Pipeline Step 15)
This is the NON-NEGOTIABLE safety net that runs after every LLM response, on all chatbot types (basic, pro_website, full):
1. Test user message against crisis regex patterns
(self-harm, suicidal ideation, domestic violence patterns)
2. If crisis patterns detected:
a. Check if LLM response contains all three mandatory resources:
- 988 (Suicide & Crisis Lifeline)
- 741741 (Crisis Text Line)
- 911
b. If ANY resource is missing:
Auto-append the full crisis resource block to the response
c. Check if the LLM called flag_safety_concern tool:
d. If NOT called:
Auto-execute flag_safety_concern(level='urgent') as system_safety_net
This ensures every crisis is logged even if the LLM fails to invoke the tool
3. Log moderation violation (type: crisis)
4. Run autoEscalate() to apply restrictions if warranted
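Steps 1 and 2 above can be sketched as a deterministic post-processing function. This is an illustration under stated assumptions, not the code in route.ts: the three crisis patterns are a tiny hypothetical subset, the wording of the resource block is invented for the example, and the function and field names are made up. The three mandatory resources (988, 741741, 911) are the ones listed above.

```typescript
// Hypothetical, deliberately small pattern set; the production list is larger.
const CRISIS_PATTERNS: RegExp[] = [
  /\b(?:kill|hurt|harm)\s+myself\b/i,
  /\bsuicid(?:e|al)\b/i,
  /\bend\s+my\s+life\b/i,
];

// One check per mandatory resource; all three must appear in the response.
const REQUIRED_RESOURCES: RegExp[] = [/\b988\b/, /\b741741\b/, /\b911\b/];

// Hypothetical wording for the appended block.
const CRISIS_RESOURCE_BLOCK = [
  "If you are in crisis, help is available right now:",
  "- Call or text 988 (Suicide & Crisis Lifeline)",
  "- Text HOME to 741741 (Crisis Text Line)",
  "- Call 911 for emergencies",
].join("\n");

interface SafetyNetResult {
  response: string;           // final text sent to the visitor
  crisisDetected: boolean;
  appendedResources: boolean; // true when the block had to be added
  autoFlagged: boolean;       // true when the system must flag on the LLM's behalf
}

function applyCrisisSafetyNet(
  userMessage: string,
  llmResponse: string,
  llmCalledFlagTool: boolean,
): SafetyNetResult {
  const crisisDetected = CRISIS_PATTERNS.some((p) => p.test(userMessage));
  if (!crisisDetected) {
    return { response: llmResponse, crisisDetected: false, appendedResources: false, autoFlagged: false };
  }
  // If ANY resource is missing, append the full block rather than patching gaps.
  const hasAll = REQUIRED_RESOURCES.every((p) => p.test(llmResponse));
  const response = hasAll ? llmResponse : `${llmResponse}\n\n${CRISIS_RESOURCE_BLOCK}`;
  return {
    response,
    crisisDetected: true,
    appendedResources: !hasAll,
    autoFlagged: !llmCalledFlagTool,
  };
}
```

The caller would still be responsible for steps 3 and 4 (logging the violation and running autoEscalate()); those involve database writes and are left out of the sketch.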
Why Regex, Not Just LLM
LLMs occasionally omit crisis resources despite explicit system prompt instructions. The regex-based safety net is a deterministic backstop:
- Cannot be prompt-injected
- Cannot hallucinate away resources
- Cannot be defeated by model behavior changes or provider switches
- Runs on every response to every chatbot type
- Is the last line of defense for life-safety scenarios
This layer must never be removed, weakened, or made conditional. It is the one part of the system where correctness outweighs all other concerns.
Voice Agent Moderation (Comparison)
The voice agent implements the same moderation concepts in moderation.py and turn_processor.py, but adapted for real-time phone conversation where latency and UX constraints differ:
Pipeline (turn_processor.py)
The voice agent checks moderation BEFORE the LLM processes each turn:
UserTextSent event arrives (STT transcription)
↓
1. check_threat(text)
→ If threat detected (and not negated, and not self-harm):
Hardcoded response: "I need to end this call. If this is an emergency,
please call 911."
End call immediately
Log violation + send alert email + alert SMS
↓
2. check_crisis(text)
→ If crisis detected:
Set session["crisis_detected"] = true
Inject crisis context into LLM: "CRITICAL: Caller may be in crisis.
Provide the 988 Suicide & Crisis Lifeline, Crisis Text Line 741741,
and 911."
Continue conversation (do NOT end call)
Disable auto-hangup (farewell detection skipped during crisis)
Log violation
↓
3. check_abuse(text, session)
→ "warning": Inject abuse context, continue call
→ "end_call": End call after abuse threshold exceeded
↓
4. Noise filtering (only if moderation did not fire)
Drop short/irrelevant utterances before they reach LLM
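The ordering of the four checks can be sketched as follows. The voice agent itself is Python (turn_processor.py); TypeScript is used here only to stay consistent with the other examples on this page. All patterns, the abuse threshold of 2, the minimum utterance length, and the names are hypothetical stand-ins chosen for the illustration.

```typescript
// Hypothetical per-call state.
interface CallSession {
  crisisDetected: boolean;
  abuseCount: number;
}

type TurnDecision =
  | "end_call_threat"
  | "crisis_continue"
  | "abuse_warning"
  | "end_call_abuse"
  | "drop_noise"
  | "pass_to_llm";

// Tiny illustrative patterns; the real ones in moderation.py are far broader.
const THREAT = /\b(?:kill|shoot|bomb)\b/i;
const NEGATION = /\b(?:not|never)\s+(?:going\s+to|gonna)\b/i;
const SELF_HARM = /\b(?:kill|hurt|harm)\s+myself\b/i;
const CRISIS = /\b(?:(?:kill|hurt|harm)\s+myself|suicid(?:e|al)|end\s+my\s+life)\b/i;
const ABUSE = /\b(?:idiot|stupid|shut\s+up)\b/i;

const ABUSE_END_THRESHOLD = 2; // assumption: second abusive turn ends the call
const MIN_UTTERANCE_CHARS = 3; // assumption: shorter transcripts are noise

function processTurn(text: string, session: CallSession): TurnDecision {
  // 1. Threats toward others end the call, unless negated or self-directed.
  if (THREAT.test(text) && !NEGATION.test(text) && !SELF_HARM.test(text)) {
    return "end_call_threat";
  }
  // 2. Crisis keeps the call alive and marks the session.
  if (CRISIS.test(text)) {
    session.crisisDetected = true;
    return "crisis_continue";
  }
  // 3. Abuse escalates within the call.
  if (ABUSE.test(text)) {
    session.abuseCount += 1;
    return session.abuseCount >= ABUSE_END_THRESHOLD ? "end_call_abuse" : "abuse_warning";
  }
  // 4. Noise filtering runs only when no moderation check fired.
  if (text.trim().length < MIN_UTTERANCE_CHARS) {
    return "drop_noise";
  }
  return "pass_to_llm";
}
```

Running the threat check first but excluding self-harm phrasing means "I want to kill myself" falls through to the crisis branch and the call continues, which matches the ordering described above.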
Key Differences from Chatbot
| Aspect | Chatbot | Voice Agent |
|---|---|---|
| Restriction persistence | Stored in user_restrictions table, persists across sessions | Per-call only (no cross-call tracking) |
| Threat response | Logged, restriction applied | Hardcoded response + immediate end call + email/SMS alert |
| Crisis response | Append resources to response, continue | Inject context into LLM, continue, disable auto-hangup |
| Abuse escalation | Progressive (cooldown → block → permanent) | Progressive within call (warning → end call) |
| Noise handling | Not applicable (text input) | STT noise filtering drops irrelevant/short utterances |
| Moderation timing | Pre-LLM (restrictions) + post-LLM (crisis net) | Pre-LLM only (all checks before LLM processes) |
Voice Agent Pattern Matching
The voice agent uses compiled regex patterns in moderation.py, ported from the legacy voice agent:
- Threat patterns (_THREAT): Threats of violence against others (kill, shoot, bomb, etc.). Excludes self-harm (redirected to crisis). Includes negation guard ("I'm NOT going to...").
- Crisis patterns (_CRISIS): Comprehensive suicidal ideation detection including coded/euphemistic language (elderly variants, religious framing, burden language, farewell patterns, C-SSRS Q1 screening). Context-aware exceptions for benign phrases ("ready to go to church/home/work").
- Abuse patterns: Tracked via check_abuse() with progressive escalation within the call.
Database Schema
moderation_violations
All incidents are logged regardless of type:
| Column | Type | Purpose |
|---|---|---|
| id | UUID | Primary key |
| church_id | UUID | FK to churches |
| session_id | TEXT | Chat session identifier |
| user_identifier | TEXT | Same as session_id for anonymous chatbot |
| violation_type | TEXT | One of: crisis, abuse_mild, abuse_severe, spam, predatory |
| severity_score | NUMERIC(4,2) | Optional severity score |
| detected_categories | JSONB | Category flags from detection |
| original_message | TEXT | The message that triggered the violation |
| action_taken | TEXT | Description of the response taken |
| created_at | TIMESTAMPTZ | Timestamp |
user_restrictions
Active blocks with optional expiry:
| Column | Type | Purpose |
|---|---|---|
| id | UUID | Primary key |
| church_id | UUID | FK to churches |
| user_identifier | TEXT | Session/user identifier |
| restriction_type | TEXT | One of: cooldown, temp_block, permanent_block |
| reason | TEXT | Auto-generated reason string |
| expires_at | TIMESTAMPTZ | Null for permanent blocks |
| created_at | TIMESTAMPTZ | Timestamp |
Query Patterns
- Check restriction: SELECT * FROM user_restrictions WHERE church_id = ? AND user_identifier = ? AND (expires_at IS NULL OR expires_at > now()) ORDER BY created_at DESC LIMIT 1
- Count violations: SELECT count(*) FROM moderation_violations WHERE church_id = ? AND user_identifier = ?
Admin Safety Tab (ModerationDashboard)
The admin dashboard surfaces safety events to church pastors via a dedicated Safety sub-tab in the Requests tab. This sub-tab is visible to admin and office_admin roles only.
- Location in UI: Requests tab → Safety sub-tab (4th sub-tab)
- Component: churchwiseai-web/src/components/admin/ModerationDashboard.tsx
- Data source: Reads directly from moderation_violations (violations list) and user_restrictions (active blocks)
- Badge count: The Requests tab shows a red badge when safety flags are pending. This count queries moderation_violations (not voice_callback_requests).
- Overview banner: The dashboard Overview tab shows a flashing amber/red banner when pendingSafetyFlags > 0. This count also queries moderation_violations exclusively.
- Previous (wrong) approach: Safety flag counts were previously read from voice_callback_requests records matching the SAFETY FLAG [ pattern. This caused badge/display discrepancies. Do NOT revert to that pattern.
Crisis Content Validator
Pastors can configure a custom crisis care message in Settings. A two-layer validator (churchwiseai-web/src/lib/crisis-validator.ts) blocks obviously harmful content (dismissive phrases, wishing harm) before it can be saved:
- Client-side: Runs before form submission via the onBeforeSubmit prop on SaveForm
- Server-side: Route /api/premium/update re-runs the validator on the crisis_message case as a server guard
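A shared pure validator of this kind might look like the sketch below. The blocked patterns, reason strings, and the validateCrisisMessage name are all hypothetical; the real rules live in crisis-validator.ts and are not reproduced here.

```typescript
interface ValidationResult {
  ok: boolean;
  reason?: string; // set only when the message is rejected
}

// Hypothetical examples of "obviously harmful" content.
const BLOCKED_PATTERNS: { pattern: RegExp; reason: string }[] = [
  { pattern: /\bget\s+over\s+it\b/i, reason: "dismissive phrase" },
  { pattern: /\bstop\s+(?:whining|complaining)\b/i, reason: "dismissive phrase" },
  { pattern: /\byou\s+deserve\s+(?:it|this)\b/i, reason: "wishes harm" },
];

// Pure function: no I/O, so the same code can run in the browser and on the server.
function validateCrisisMessage(message: string): ValidationResult {
  for (const { pattern, reason } of BLOCKED_PATTERNS) {
    if (pattern.test(message)) {
      return { ok: false, reason };
    }
  }
  return { ok: true };
}
```

Because the function has no side effects, the client-side hook and the server route can both call it and reach the same verdict; the server call is the one that actually guards the save.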
Code References
- Chatbot moderation types, restriction check, violation logging, auto-escalation, format helpers: churchwiseai-web/src/lib/moderation.ts
- Crisis content validator (shared pure function): churchwiseai-web/src/lib/crisis-validator.ts
- Crisis safety net (post-LLM): churchwiseai-web/src/app/api/chatbot/stream/route.ts (the only chatbot endpoint; legacy /chat was deleted 2026-04-09)
- Admin safety dashboard component: churchwiseai-web/src/components/admin/ModerationDashboard.tsx
- Admin safety stats API: churchwiseai-web/src/app/api/admin/safety-stats/route.ts
- Voice agent moderation: churchwiseai-web/voice-agent-livekit/moderation.py (active); do NOT modify voice-agent-livekit/moderation.py (legacy)
See Also
- Architecture -- full request pipeline including moderation stages
- Overview -- summary of the four-level escalation ladder
- Voice Agent Overview -- voice agent moderation in broader context