Chatbot Moderation System

Overview

The chatbot enforces a multi-layered moderation system that protects both visitors and churches. Moderation runs at two stages: before the LLM response (restriction checks) and after the LLM response (crisis safety net). The voice agent has a parallel system with the same concepts but adapted for real-time phone conversation.

Violation Types

Five violation types are tracked, each with different severity and response behavior:

| Type | Trigger | Chatbot Response | Voice Agent Response |
| --- | --- | --- | --- |
| Crisis | Self-harm, suicidal ideation, domestic violence | Append 988/741741/911 resources to response, continue conversation, auto-flag safety concern | Set crisis_detected=true, inject crisis context into LLM, provide resources, continue call, disable auto-hangup |
| Abuse (mild) | Profanity, insults, verbal abuse | Warning with gracious boundary, continue conversation | Warning, one redirect attempt |
| Abuse (severe) | Repeated or escalated abuse, threats toward others | End conversation | End call immediately |
| Spam | Repetitive meaningless input | Cooldown applied | Noise filter drops short/irrelevant utterances before they reach the LLM |
| Predatory | Predatory behavior toward minors or vulnerable people | Immediate block | Immediate end call with safety flag |

Progressive Escalation

Violations accumulate per session (identified by sessionId for chatbot, call session for voice). The escalation ladder applies automatic restrictions:

| Violation Count | Restriction Type | Duration | Effect |
| --- | --- | --- | --- |
| 2 violations | Cooldown | 5 minutes | Chatbot returns a restriction message; visitor must wait |
| 4 violations | Temp block | 24 hours | Chatbot refuses to engage; provides church office number and crisis resources |
| 7 violations | Permanent block | Never expires | Chatbot permanently refuses engagement; provides church office number |

Escalation Constants

COOLDOWN_THRESHOLD = 2 → 5-minute cooldown
TEMP_BLOCK_THRESHOLD = 4 → 24-hour temporary block
PERMANENT_BLOCK_THRESHOLD = 7 → permanent block (expires_at = null)

Escalation Logic

The autoEscalate() function in moderation.ts:

  1. Count total violations for the session from moderation_violations
  2. Check for existing active restriction from user_restrictions
  3. Apply the highest applicable restriction that is not already in effect:
    • If count >= 7 and not already permanently blocked: insert permanent block
    • If count >= 4 and not already temp-blocked or permanently blocked: insert 24-hour temp block
    • If count >= 2 and no existing restriction: insert 5-minute cooldown
  4. Restrictions never downgrade -- a permanent block is never replaced by a temp block
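
The ladder above can be sketched as a pure decision function. This is a minimal illustration, not the code in moderation.ts: the function name decideEscalation, the NewRestriction shape, and the idea of passing in the already-counted violations and the current restriction type are all assumptions made to keep the example self-contained and free of database calls.

```typescript
// Thresholds as listed under "Escalation Constants".
const COOLDOWN_THRESHOLD = 2;
const TEMP_BLOCK_THRESHOLD = 4;
const PERMANENT_BLOCK_THRESHOLD = 7;

type RestrictionType = "cooldown" | "temp_block" | "permanent_block";

// Hypothetical shape for the row that would be inserted.
interface NewRestriction {
  type: RestrictionType;
  expiresAt: Date | null; // null means the restriction never expires
}

// Hypothetical helper: returns the restriction to insert, or null if none.
// `current` is the session's active restriction type, or null if there is none.
function decideEscalation(
  violationCount: number,
  current: RestrictionType | null,
  now: Date = new Date(),
): NewRestriction | null {
  if (violationCount >= PERMANENT_BLOCK_THRESHOLD) {
    // Top rung: insert unless the session is already permanently blocked.
    return current === "permanent_block"
      ? null
      : { type: "permanent_block", expiresAt: null };
  }
  if (violationCount >= TEMP_BLOCK_THRESHOLD) {
    // Never downgrade: an existing temp or permanent block stays as it is.
    const alreadyBlocked =
      current === "temp_block" || current === "permanent_block";
    return alreadyBlocked
      ? null
      : { type: "temp_block", expiresAt: new Date(now.getTime() + 24 * 60 * 60 * 1000) };
  }
  if (violationCount >= COOLDOWN_THRESHOLD) {
    // Cooldown applies only when no restriction of any kind is active.
    return current === null
      ? { type: "cooldown", expiresAt: new Date(now.getTime() + 5 * 60 * 1000) }
      : null;
  }
  return null;
}
```

Checking the thresholds from highest to lowest and returning early is what keeps a lower rung from ever replacing a higher one.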

Chatbot Moderation Pipeline

The moderation pipeline runs within route.ts at two stages:

Stage 1: Pre-LLM Restriction Check (Pipeline Step 5)

checkRestriction(churchId, sessionId)
  → Query user_restrictions for active restrictions
    (expires_at IS NULL or expires_at > now())
  → If restricted:
      Return restriction message with type and expiry
      HTTP 200 with restricted=true
      Conversation does not proceed to LLM

Restriction messages are tailored by type:

  • Cooldown: "I need to pause our conversation for a few minutes. Please try again shortly. If you're in crisis, call 988 or 911."
  • Temp block: "This conversation has been temporarily paused due to our community guidelines. Please try again later. If you need immediate help, call 988 or 911."
  • Permanent block: "This conversation is no longer available. If you need to reach the church, please call the church office directly. If you're in crisis, call 988 or 911."

All restriction messages include crisis resource numbers. This is non-negotiable -- even a blocked user who is in genuine crisis must be able to reach help.
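
As a rough sketch of how the tailored messages and the "always include crisis numbers" rule might be expressed, the example below maps each restriction type to its message and adds a guard that a unit test could run. The message text is copied from the list above; the names RESTRICTION_MESSAGES, buildRestrictedBody, and typesMissingCrisisNumbers, and every response field other than restricted, are hypothetical.

```typescript
type RestrictionType = "cooldown" | "temp_block" | "permanent_block";

// Message text as documented for each restriction type.
const RESTRICTION_MESSAGES: Record<RestrictionType, string> = {
  cooldown:
    "I need to pause our conversation for a few minutes. Please try again shortly. " +
    "If you're in crisis, call 988 or 911.",
  temp_block:
    "This conversation has been temporarily paused due to our community guidelines. " +
    "Please try again later. If you need immediate help, call 988 or 911.",
  permanent_block:
    "This conversation is no longer available. If you need to reach the church, " +
    "please call the church office directly. If you're in crisis, call 988 or 911.",
};

interface ActiveRestriction {
  type: RestrictionType;
  expiresAt: Date | null; // null for permanent blocks
}

// Hypothetical response body; only `restricted=true` is stated in this document.
interface RestrictedBody {
  restricted: true;
  restrictionType: RestrictionType;
  expiresAt: string | null;
  message: string;
}

function buildRestrictedBody(r: ActiveRestriction): RestrictedBody {
  return {
    restricted: true,
    restrictionType: r.type,
    expiresAt: r.expiresAt ? r.expiresAt.toISOString() : null,
    message: RESTRICTION_MESSAGES[r.type],
  };
}

// Guard for a unit test: lists restriction types whose message lacks 988 or 911.
function typesMissingCrisisNumbers(): RestrictionType[] {
  return (Object.keys(RESTRICTION_MESSAGES) as RestrictionType[]).filter(
    (t) =>
      !/\b988\b/.test(RESTRICTION_MESSAGES[t]) ||
      !/\b911\b/.test(RESTRICTION_MESSAGES[t]),
  );
}
```

A test asserting that typesMissingCrisisNumbers() returns an empty array would catch a future message edit that accidentally drops the crisis numbers.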

Stage 2: Post-LLM Crisis Safety Net (Pipeline Step 15)

This is the NON-NEGOTIABLE safety net that runs after every LLM response, on all chatbot types (basic, pro_website, full):

1. Test user message against crisis regex patterns
   (self-harm, suicidal ideation, domestic violence patterns)

2. If crisis patterns detected:
   a. Check if LLM response contains all three mandatory resources:
      - 988 (Suicide & Crisis Lifeline)
      - 741741 (Crisis Text Line)
      - 911
   b. If ANY resource is missing:
      Auto-append the full crisis resource block to the response
   c. Check if the LLM called the flag_safety_concern tool
   d. If NOT called:
      Auto-execute flag_safety_concern(level='urgent') as system_safety_net
      This ensures every crisis is logged even if the LLM fails to invoke the tool

3. Log moderation violation (type: crisis)
4. Run autoEscalate() to apply restrictions if warranted
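
Steps 1 and 2 can be sketched as a pure function over the user message and the LLM response. This is an illustration under stated assumptions, not the code in route.ts: the three patterns are placeholders standing in for a much larger pattern list, the wording of the resource block is invented, and applyCrisisSafetyNet and its result shape are hypothetical names.

```typescript
// The three mandatory resources named in step 2a.
const REQUIRED_RESOURCES = ["988", "741741", "911"] as const;

// Placeholder patterns (assumption): the real list is more comprehensive.
const CRISIS_PATTERNS: RegExp[] = [
  /\bkill myself\b/i,
  /\bsuicid(e|al)\b/i,
  /\bend my life\b/i,
];

// Assumed wording for the appended block; only the three numbers are required.
const CRISIS_RESOURCE_BLOCK = [
  "If you are in crisis, help is available right now:",
  "- Call or text 988 (Suicide & Crisis Lifeline)",
  "- Text HOME to 741741 (Crisis Text Line)",
  "- Call 911 for emergencies",
].join("\n");

interface SafetyNetResult {
  crisisDetected: boolean;
  response: string; // possibly with the resource block appended
  resourcesAppended: boolean;
  mustAutoFlag: boolean; // caller should run flag_safety_concern itself
}

function applyCrisisSafetyNet(
  userMessage: string,
  llmResponse: string,
  llmCalledFlagTool: boolean,
): SafetyNetResult {
  const crisisDetected = CRISIS_PATTERNS.some((p) => p.test(userMessage));
  if (!crisisDetected) {
    return { crisisDetected, response: llmResponse, resourcesAppended: false, mustAutoFlag: false };
  }
  // Step 2a: all three numbers must appear as standalone tokens.
  const hasAll = REQUIRED_RESOURCES.every((n) =>
    new RegExp(`\\b${n}\\b`).test(llmResponse),
  );
  // Step 2b: append the full block if any single resource is missing.
  const response = hasAll ? llmResponse : `${llmResponse}\n\n${CRISIS_RESOURCE_BLOCK}`;
  // Steps 2c and 2d: flag on the LLM's behalf if it did not call the tool.
  return { crisisDetected, response, resourcesAppended: !hasAll, mustAutoFlag: !llmCalledFlagTool };
}
```

Keeping the check free of I/O means it can be unit-tested against a fixed set of messages whenever the model or provider changes.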

Why Regex, Not Just LLM

LLMs occasionally omit crisis resources despite explicit system prompt instructions. The regex-based safety net is a deterministic backstop:

  • Cannot be prompt-injected
  • Cannot hallucinate away resources
  • Cannot be defeated by model behavior changes or provider switches
  • Runs on every response to every chatbot type
  • Is the last line of defense for life-safety scenarios

This layer must never be removed, weakened, or made conditional. It is the one part of the system where correctness outweighs all other concerns.

Voice Agent Moderation (Comparison)

The voice agent implements the same moderation concepts in moderation.py and turn_processor.py, but adapted for real-time phone conversation where latency and UX constraints differ:

Pipeline (turn_processor.py)

The voice agent checks moderation BEFORE the LLM processes each turn:

UserTextSent event arrives (STT transcription)

1. check_threat(text)
   → If threat detected (and not negated, and not self-harm):
       Hardcoded response: "I need to end this call. If this is an emergency,
       please call 911."
       End call immediately
       Log violation + send alert email + alert SMS

2. check_crisis(text)
   → If crisis detected:
       Set session["crisis_detected"] = true
       Inject crisis context into LLM: "CRITICAL: Caller may be in crisis.
       Provide the 988 Suicide & Crisis Lifeline, Crisis Text Line 741741,
       and 911."
       Continue conversation (do NOT end call)
       Disable auto-hangup (farewell detection skipped during crisis)
       Log violation

3. check_abuse(text, session)
   → "warning": Inject abuse context, continue call
   → "end_call": End call after abuse threshold exceeded

4. Noise filtering (only if moderation did not fire)
   Drop short/irrelevant utterances before they reach LLM
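
The ordering above matters more than any single check: threats take precedence over crisis, crisis over abuse, and noise filtering runs only when no moderation check fired. The sketch below illustrates that precedence only. The real implementation is Python in turn_processor.py; this version is written in TypeScript purely for illustration, and the Checks interface, the TurnAction shape, the session field names, and the abuse context string are all hypothetical.

```typescript
type AbuseVerdict = "none" | "warning" | "end_call";

type TurnAction =
  | { kind: "end_call"; reason: "threat" | "abuse" }
  | { kind: "continue"; injectedContext: string | null }
  | { kind: "drop" };

interface Session {
  crisisDetected: boolean;
  autoHangupEnabled: boolean;
}

// Hypothetical seams standing in for check_threat, check_crisis, check_abuse
// and the noise filter.
interface Checks {
  checkThreat(text: string): boolean;
  checkCrisis(text: string): boolean;
  checkAbuse(text: string): AbuseVerdict;
  isNoise(text: string): boolean;
}

const CRISIS_CONTEXT =
  "CRITICAL: Caller may be in crisis. Provide the 988 Suicide & Crisis Lifeline, " +
  "Crisis Text Line 741741, and 911.";

function processTurn(text: string, session: Session, checks: Checks): TurnAction {
  // 1. Threats end the call before anything else is considered.
  if (checks.checkThreat(text)) return { kind: "end_call", reason: "threat" };

  // 2. Crisis keeps the caller on the line and disables auto-hangup.
  if (checks.checkCrisis(text)) {
    session.crisisDetected = true;
    session.autoHangupEnabled = false;
    return { kind: "continue", injectedContext: CRISIS_CONTEXT };
  }

  // 3. Abuse either warns or ends the call.
  const abuse = checks.checkAbuse(text);
  if (abuse === "end_call") return { kind: "end_call", reason: "abuse" };
  if (abuse === "warning") {
    return { kind: "continue", injectedContext: "(abuse warning context; placeholder)" };
  }

  // 4. Noise filtering runs only when no moderation check fired.
  if (checks.isNoise(text)) return { kind: "drop" };
  return { kind: "continue", injectedContext: null };
}
```

Because the crisis check runs before the noise filter, a very short utterance that matches a crisis pattern is never dropped as noise.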

Key Differences from Chatbot

| Aspect | Chatbot | Voice Agent |
| --- | --- | --- |
| Restriction persistence | Stored in user_restrictions table, persists across sessions | Per-call only (no cross-call tracking) |
| Threat response | Logged, restriction applied | Hardcoded response + immediate end call + email/SMS alert |
| Crisis response | Append resources to response, continue | Inject context into LLM, continue, disable auto-hangup |
| Abuse escalation | Progressive (cooldown → block → permanent) | Progressive within call (warning → end call) |
| Noise handling | Not applicable (text input) | STT noise filtering drops irrelevant/short utterances |
| Moderation timing | Pre-LLM (restrictions) + post-LLM (crisis net) | Pre-LLM only (all checks before LLM processes) |

Voice Agent Pattern Matching

The voice agent uses compiled regex patterns in moderation.py, ported from the legacy voice agent:

  • Threat patterns (_THREAT): Threats of violence against others (kill, shoot, bomb, etc.). Excludes self-harm (redirected to crisis). Includes negation guard ("I'm NOT going to...").
  • Crisis patterns (_CRISIS): Comprehensive suicidal ideation detection including coded/euphemistic language (elderly variants, religious framing, burden language, farewell patterns, C-SSRS Q1 screening). Context-aware exceptions for benign phrases ("ready to go to church/home/work").
  • Abuse patterns: Tracked via check_abuse() with progressive escalation within the call.

Database Schema

moderation_violations

All incidents are logged regardless of type:

| Column | Type | Purpose |
| --- | --- | --- |
| id | UUID | Primary key |
| church_id | UUID | FK to churches |
| session_id | TEXT | Chat session identifier |
| user_identifier | TEXT | Same as session_id for anonymous chatbot |
| violation_type | TEXT | One of: crisis, abuse_mild, abuse_severe, spam, predatory |
| severity_score | NUMERIC(4,2) | Optional severity score |
| detected_categories | JSONB | Category flags from detection |
| original_message | TEXT | The message that triggered the violation |
| action_taken | TEXT | Description of the response taken |
| created_at | TIMESTAMPTZ | Timestamp |

user_restrictions

Active blocks with optional expiry:

| Column | Type | Purpose |
| --- | --- | --- |
| id | UUID | Primary key |
| church_id | UUID | FK to churches |
| user_identifier | TEXT | Session/user identifier |
| restriction_type | TEXT | One of: cooldown, temp_block, permanent_block |
| reason | TEXT | Auto-generated reason string |
| expires_at | TIMESTAMPTZ | Null for permanent blocks |
| created_at | TIMESTAMPTZ | Timestamp |

Query Patterns

  • Check restriction: SELECT * FROM user_restrictions WHERE church_id = ? AND user_identifier = ? AND (expires_at IS NULL OR expires_at > now()) ORDER BY created_at DESC LIMIT 1
  • Count violations: SELECT count(*) FROM moderation_violations WHERE church_id = ? AND user_identifier = ?
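
To make the semantics of the two queries concrete, here is an in-memory equivalent over arrays of rows. It is only a sketch for reasoning about and testing the filter logic: the row interfaces carry just the columns the queries touch, and findActiveRestriction and countViolations are hypothetical names, not functions from moderation.ts.

```typescript
interface RestrictionRow {
  church_id: string;
  user_identifier: string;
  restriction_type: "cooldown" | "temp_block" | "permanent_block";
  expires_at: Date | null; // null for permanent blocks
  created_at: Date;
}

interface ViolationRow {
  church_id: string;
  user_identifier: string;
  violation_type: string;
}

// Mirrors: WHERE church_id = ? AND user_identifier = ?
//          AND (expires_at IS NULL OR expires_at > now())
//          ORDER BY created_at DESC LIMIT 1
function findActiveRestriction(
  rows: RestrictionRow[],
  churchId: string,
  userIdentifier: string,
  now: Date,
): RestrictionRow | null {
  const active = rows
    .filter(
      (r) =>
        r.church_id === churchId &&
        r.user_identifier === userIdentifier &&
        (r.expires_at === null || r.expires_at.getTime() > now.getTime()),
    )
    // filter() returned a copy, so sorting here does not mutate the input.
    .sort((a, b) => b.created_at.getTime() - a.created_at.getTime());
  return active[0] ?? null;
}

// Mirrors: SELECT count(*) ... WHERE church_id = ? AND user_identifier = ?
function countViolations(
  rows: ViolationRow[],
  churchId: string,
  userIdentifier: string,
): number {
  return rows.filter(
    (v) => v.church_id === churchId && v.user_identifier === userIdentifier,
  ).length;
}
```

Note that both lookups are scoped by church_id as well as user_identifier, so the same identifier at two churches is tracked independently.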

Admin Safety Tab (ModerationDashboard)

The admin dashboard surfaces safety events to church pastors via a dedicated Safety sub-tab in the Requests tab. This sub-tab is visible to admin and office_admin roles only.

  • Location in UI: Requests tab → Safety sub-tab (4th sub-tab)
  • Component: churchwiseai-web/src/components/admin/ModerationDashboard.tsx
  • Data source: Reads directly from moderation_violations (violations list) and user_restrictions (active blocks)
  • Badge count: The Requests tab shows a red badge when safety flags are pending. This count queries moderation_violations (not voice_callback_requests).
  • Overview banner: The dashboard Overview tab shows a flashing amber/red banner when pendingSafetyFlags > 0. This count also queries moderation_violations exclusively.
  • Previous (wrong) approach: Safety flag counts were previously read from voice_callback_requests records matching the SAFETY FLAG [ pattern. This caused badge/display discrepancies. Do NOT revert to that pattern.

Crisis Content Validator

Pastors can configure a custom crisis care message in Settings. A two-layer validator (churchwiseai-web/src/lib/crisis-validator.ts) blocks obviously harmful content (dismissive phrases, wishing harm) before it can be saved:

  • Client-side: Runs before form submission via onBeforeSubmit prop on SaveForm
  • Server-side: Route /api/premium/update re-runs the validator on the crisis_message case as a server guard
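
The two-layer arrangement works because the validator is a pure function that both the form and the route can call. The sketch below shows that shape only. It is not the contents of crisis-validator.ts: the blocked patterns are invented placeholders for the "dismissive phrases" and "wishing harm" categories, and validateCrisisMessage, serverGuard, and the status codes are hypothetical.

```typescript
interface ValidationResult {
  ok: boolean;
  reason: string | null;
}

// Placeholder patterns (assumption): stand-ins for the real blocked phrases.
const BLOCKED_PATTERNS: { pattern: RegExp; reason: string }[] = [
  { pattern: /\b(get over it|stop whining|not my problem)\b/i, reason: "dismissive phrase" },
  { pattern: /\byou deserve (it|this|to suffer)\b/i, reason: "wishing harm" },
];

// Pure function: no DOM, no network, so it can run in the browser and on the server.
function validateCrisisMessage(message: string): ValidationResult {
  for (const { pattern, reason } of BLOCKED_PATTERNS) {
    if (pattern.test(message)) return { ok: false, reason };
  }
  return { ok: true, reason: null };
}

// Hypothetical server-side guard: re-runs the same validator and rejects
// the save even if the client-side check was bypassed.
function serverGuard(message: string): { status: number; error: string | null } {
  const result = validateCrisisMessage(message);
  return result.ok
    ? { status: 200, error: null }
    : { status: 400, error: `Crisis message rejected: ${result.reason}` };
}
```

The client-side call gives the pastor immediate feedback; the server-side call is the one that actually enforces the rule.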

Code References

  • Chatbot moderation types, restriction check, violation logging, auto-escalation, format helpers: churchwiseai-web/src/lib/moderation.ts
  • Crisis content validator (shared pure function): churchwiseai-web/src/lib/crisis-validator.ts
  • Crisis safety net (post-LLM): churchwiseai-web/src/app/api/chatbot/stream/route.ts (the only chatbot endpoint — legacy /chat was deleted 2026-04-09)
  • Admin safety dashboard component: churchwiseai-web/src/components/admin/ModerationDashboard.tsx
  • Admin safety stats API: churchwiseai-web/src/app/api/admin/safety-stats/route.ts
  • Voice agent moderation: churchwiseai-web/voice-agent-livekit/moderation.py (active) — do NOT modify voice-agent-livekit/moderation.py (legacy)

See Also