Knowledge > Products > ITW Premium > Content Pipeline

ITW Content Pipeline

Pipeline Overview

Content flows through a six-stage pipeline from generation to website display. Two codebases share ownership: ai-sermon-assistant (legacy, reference-only) owns content generation and enrichment, while sermon-illustrations owns visibility assignment, stub regeneration, and website queries.

[1] Content Import/Generation (ai-sermon-assistant)
  |
  v
[2] Enrichment (ai-sermon-assistant)
  |
  v
[3] Embedding Generation (ai-sermon-assistant / sermon-illustrations)
  |
  v
[4] Vetting & Approval (ai-sermon-assistant admin)
  |
  v
[5] Materialized View Refresh (pg_cron, every 15 min)
  |
  v
[6] Website Reads View (sermon-illustrations)

Each stage is described in detail below.

Stage 1: Content Import/Generation

Owner: ai-sermon-assistant (legacy -- read for reference only, never modify)

Content enters unified_rag_content through multiple paths:

Source	Method	Content Types	Volume
AI generation	Claude Max CLI (`claude -p`)	All 26 types	Bulk batches
Web scraping	Custom scrapers with exclusion rules	sermon_illustration, quote, historical_illustration	Periodic
Manual entry	Admin dashboard	Any type	Ad hoc
Partner content	API import	Varies	Rare

AI Generation Rules

ALWAYS use Claude Max CLI (claude -p) for batch content generation -- never Anthropic/OpenAI APIs
The founder pays $200/mo for Claude Max; API costs must be avoided for batch operations
For nested CLI invocation from scripts: delete CLAUDE_CODE_ENTRYPOINT and CLAUDECODE env vars before spawning
Content must meet minimum quality: word_count >= 100, quality_score >= 0.5

Row Structure on Insert

Every new row in unified_rag_content must include:

Field	Required	Description
`content`	Yes	Full illustration text
`content_type`	Yes	One of 26 recognized types
`source_type`	Yes	`'ai_generated'`, `'scraped'`, `'manual'`, `'partner'`
`content_category`	Yes	`'illustration'` for ITW content
`content_format`	Yes	`'text'` for standard content
`title`	Yes	Display title
`summary`	Yes	1-2 sentence summary for cards
`word_count`	Yes	Computed from content
`quality_score`	Yes	0.0-1.0 quality rating
`embedding`	Yes	1536-dim vector (text-embedding-3-small)
`embedding_model`	Yes	`'text-embedding-3-small'`
`slug`	Yes	URL-safe slug (see slugification rules below)
`created_at`	Auto	Timestamp
`updated_at`	Auto	Timestamp

Stage 2: Enrichment

Owner: ai-sermon-assistant (legacy -- reference only)

After initial import, content is enriched with structured metadata:

Field	Type	Description
`topics`	text[]	Subject topics (Faith, Grace, Love, etc.)
`themes`	text[]	Thematic tags
`emotions`	text[]	Emotional register (Joy, Hope, Comfort, etc.)
`human_needs`	text[]	Needs addressed (belonging, meaning, healing)
`keywords`	text[]	Search keywords
`doctrines`	text[]	Theological doctrines referenced
`scripture_references`	text[]	Bible passages (e.g., `['John 3:16', 'Romans 8:28']`)
`theological_lens_id`	UUID	FK to `sai_theological_lenses` (optional)
`is_universal`	boolean	True if applicable across all traditions
`tone`	text	Writing tone (inspirational, reflective, challenging)
`audience_primary`	text	Target audience (general, youth, seniors, etc.)
`audience_suitability`	text[]	Suitable audiences
`sermon_sections`	text[]	Where in a sermon this fits (opening, body, closing)
`application_points`	text[]	Practical takeaways
`spiritual_disciplines`	text[]	Related spiritual practices
`creative_approach`	text	How the content approaches its subject
`source_attribution`	text	Original source credit

Enrichment happens in bulk via AI pipelines in ai-sermon-assistant. The enrichment pipeline is NOT run from sermon-illustrations/.

Stage 3: Embedding Generation

Owner: Both codebases (generation in ai-sermon-assistant, regeneration in sermon-illustrations)

Embeddings use OpenAI text-embedding-3-small (1536 dimensions). The text format for embedding is:

"Scripture: {scripture_references}\n\nAuthor: {primary_author}\n\nSource: {source_attribution}\n\nContent: {content}"

Embedding Rules

Every row MUST have a non-null embedding column
embedding_model MUST be set to 'text-embedding-3-small'
When content is updated (stub regeneration, edits), the embedding MUST be regenerated
Embedding dimension: 1536 (enforced by pgvector column type)
Never insert a row with a NULL embedding -- the materialized view and RAG queries depend on it

Embedding Generation Pseudocode

async function generateEmbedding(row) {
  text = buildEmbeddingText(row)
  // text = "Scripture: John 3:16\n\nAuthor: C.S. Lewis\n\nSource: Mere Christianity\n\nContent: ..."

  response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text
  })

  return response.data[0].embedding  // float[1536]
}

function buildEmbeddingText(row) {
  parts = []
  if (row.scripture_references?.length)
    parts.push(`Scripture: ${row.scripture_references.join(', ')}`)
  if (row.primary_author)
    parts.push(`Author: ${row.primary_author}`)
  if (row.source_attribution)
    parts.push(`Source: ${row.source_attribution}`)
  parts.push(`Content: ${row.content}`)
  return parts.join('\n\n')
}

Stage 4: Vetting & Approval

Owner: ai-sermon-assistant admin dashboard (legacy -- reference only)

Content goes through quality gates before appearing on the website:

Gate	Criteria	Outcome
Minimum word count	word_count >= 30 (lower threshold for stubs)	Below → flagged as stub
Quality score	quality_score >= 0.5	Below → held for review
Content review	Manual spot-check of AI-generated content	Approve or regenerate
Slug check	Valid slug, no duplicates	Auto-generated if missing

Approved content appears in the dir_illustrations view immediately — it reflects live data from unified_rag_content.

Stage 5: View Read Layer

Owner: Supabase

The dir_illustrations SQL view (not a materialized view) is the read layer for the ITW website. It reflects live data from unified_rag_content — no refresh cycle is needed.

What the View Contains

The view filters and shapes unified_rag_content for website display:

Includes only content with content_category = 'illustration'
Excludes rows below quality thresholds
Excludes rows without embeddings
Indexes by content_type, visibility_tier, quality_score for efficient browse queries
Contains ~50,000 rows (subset of ~327K total)

Latency

New or updated content appears on the website immediately — the view is a live SQL view over unified_rag_content, not a materialized snapshot.

Historical note: Earlier docs described a pg_cron-refreshed materialized view with a 15-minute latency. The current production schema uses a regular SQL view with no refresh needed.

Stage 6: Website Reads View

Owner: sermon-illustrations

The ITW website queries dir_illustrations exclusively. It NEVER writes to unified_rag_content directly during normal operation (only scripts like regenerate-stubs.mjs write).

Query functions in src/lib/queries/illustrations.ts handle all reads:

Function	Purpose	Table
`getFeaturedIllustrations()`	Homepage featured content	`dir_illustrations`
`getIllustrationsByTopic()`	Topic browse pages	`dir_illustrations`
`getIllustrationsByEmotion()`	Emotion browse pages	`dir_illustrations`
`getIllustrationsByTradition()`	Tradition browse pages	`dir_illustrations`
`getIllustrationBySlug()`	Detail page	`dir_illustrations`
`getContentTypeCounts()`	Browse-by-type page	`dir_illustrations`
Full-text search	Search results	`dir_illustrations` (search_vector)

Stub Regeneration System

2,030 AI-generated stubs (word_count <= 30) exist in the database. These are placeholder rows created during bulk import that need full content. The regeneration system upgrades them:

Pipeline

[1] Read stubs from unified_rag_content
    WHERE source_type = 'ai_generated' AND word_count <= 30

[2] For each stub:
    a. Generate new full-length content via Claude Max CLI
       (generateWithClaudeMax() from scripts/lib/shared.mjs)
    b. Generate new summary
    c. Compute new word_count
    d. Generate new embedding via text-embedding-3-small
    e. Assign new quality_score

[3] Update unified_rag_content:
    SET content = new_content,
        summary = new_summary,
        word_count = new_word_count,
        quality_score = new_quality_score,
        embedding = new_embedding,
        embedding_model = 'text-embedding-3-small',
        updated_at = now()
    WHERE id = stub.id

Safety Rules for Stub Regeneration

Only update stubs: source_type = 'ai_generated' AND word_count <= 30
Never delete rows -- 327K records are irreplaceable
Always update embedding when content changes
Always update word_count and summary when content changes
Always set updated_at to track when regeneration occurred
Use service role key for database writes (RLS bypass)
Rate limit Claude CLI calls to avoid overwhelming the session

Script Location

sermon-illustrations/scripts/regenerate-stubs.mjs
  |-- imports: scripts/lib/shared.mjs (generateWithClaudeMax, createEmbedding)
  |-- reads:   unified_rag_content (stubs only)
  |-- writes:  unified_rag_content (content, summary, word_count, quality_score,
  |            embedding, embedding_model, updated_at)

Slugification Rules

Illustration slugs follow this pattern:

function generateSlug(title, id) {
  slug = title
    .toLowerCase()
    .replace(/[^a-z0-9\s-]/g, '')  // strip special chars
    .replace(/\s+/g, '-')           // spaces to hyphens
    .replace(/-+/g, '-')            // collapse multiple hyphens
    .trim()

  // Append first 8 chars of UUID for uniqueness
  slug += '-' + id.substring(0, 8)

  return slug
}

// Example: "God's Grace in the Storm" + UUID "a1b2c3d4-..."
// Result:  "gods-grace-in-the-storm-a1b2c3d4"

The 8-character UUID suffix prevents slug collisions when multiple illustrations have similar titles.

Database Write Rules

MUST DO When Writing to unified_rag_content

Update embedding AND embedding_model when content changes
Update word_count when content changes
Update summary when content changes
Set updated_at = now() on every write
Use Supabase service role key (bypasses RLS)
Build embedding text with the proper format (Scripture/Author/Source/Content)
Validate word_count >= 30 (minimum for non-stub content)
Ensure quality_score is between 0.0 and 1.0

MUST NOT Do

Delete rows -- 327K records are irreplaceable
Change metadata owned by enrichment pipeline -- topics, themes, emotions, human_needs, keywords, doctrines (owned by ai-sermon-assistant enrichment)
Change immutable fields -- source_type, content_type, content_category, content_format
Write NULL embeddings -- every row must have a valid 1536-dim vector
Update non-stub rows from regeneration scripts -- only source_type = 'ai_generated' AND word_count <= 30
Insert without required fields -- embedding, summary, word_count, quality_score are all mandatory
Bulk delete -- even for cleanup, archive instead

Content Script Ownership

Script	Owner	Purpose
`assign-visibility-tiers.mjs`	sermon-illustrations	Assigns public/free_signup/premium visibility tiers
`regenerate-stubs.mjs`	sermon-illustrations	Upgrades AI stubs to full content
`generate-*.mjs` variants	sermon-illustrations	Content generation utilities
`audit-content-quality.mjs`	sermon-illustrations	Audits quality scores and flags issues
Content import/generation	ai-sermon-assistant	Bulk content creation (legacy, reference only)
Enrichment pipeline	ai-sermon-assistant	Metadata extraction (legacy, reference only)
Vetting/approval	ai-sermon-assistant	Quality review dashboard (legacy, reference only)

Monitoring & Health Checks

Key Metrics to Watch

Metric	Expected	Query
Total rows	~327K	`SELECT count(*) FROM unified_rag_content`
View rows	~50K	`SELECT count(*) FROM dir_illustrations`
Stubs remaining	~2,030	`SELECT count(*) FROM unified_rag_content WHERE source_type='ai_generated' AND word_count<=30`
NULL embeddings	0	`SELECT count(*) FROM unified_rag_content WHERE embedding IS NULL`
Last view refresh	< 15 min ago	Check pg_cron logs

View Staleness Check

-- dir_illustrations is a live SQL view — content is always current.
-- If content appears missing, verify it's in unified_rag_content:
SELECT count(*) FROM unified_rag_content
WHERE content_category = 'illustration' AND is_active = true;

-- Verify view row count matches expected ~50K:
SELECT count(*) FROM dir_illustrations;

Reference Documents

sermon-illustrations/internal/content-rules.md -- editorial quality standards, tone guidelines, content type definitions
sermon-illustrations/internal/shared-database-contract.md -- table ownership map, write permissions, cross-project rules
sermon-illustrations/internal/content-pipeline.md -- original pipeline documentation (may be outdated, this doc is canonical)

Pipeline Overview​

Stage 1: Content Import/Generation​

AI Generation Rules​

Row Structure on Insert​

Stage 2: Enrichment​

Stage 3: Embedding Generation​

Embedding Rules​

Embedding Generation Pseudocode​

Stage 4: Vetting & Approval​

Stage 5: View Read Layer​

What the View Contains​

Latency​

Stage 6: Website Reads View​

Stub Regeneration System​

Pipeline​

Safety Rules for Stub Regeneration​

Script Location​

Slugification Rules​

Database Write Rules​

MUST DO When Writing to unified_rag_content​

MUST NOT Do​

Content Script Ownership​

Monitoring & Health Checks​

Key Metrics to Watch​

View Staleness Check​

Reference Documents​

See Also​