Content Generation Pipeline

How sermon illustrations are generated, enriched, embedded, and surfaced in the IllustrateTheWord directory (327K+ records in unified_rag_content).


Two Content Paths

Content enters unified_rag_content through two paths: public-domain scraping and AI generation. Both converge on the same insertion and embedding pipeline.


Path A: Public-Domain Scraping (Python)

Historical illustrations are scraped from Archive.org sources (Biblical Illustrator, Spurgeon, Maclaren). Text cleanup runs through the claude -p CLI, NOT API calls.

1. SCRAPER SELECTS SOURCE
scraper = BiblicalIllustratorScraper(source_config)
# source_config defines: author, pub_year, content_type, theological_lens_id

2. PARSE VOLUME
for item in scraper.parse_volume(volume_id):
    # item is a ScrapedItem: raw_text, book, chapter, verse, verse_quote
    yield item

3. AI CLEANUP (Claude CLI — NOT API)
cleaned = scraper._call_claude_cli(prompt)
# Spawns: claude -p "Clean up this OCR text..."
# ENV VARS STRIPPED: CLAUDE_CODE_ENTRYPOINT, CLAUDECODE deleted
# (required for nested CLI invocation from scripts)
# COST: $0 — founder pays $200/mo for Claude Max
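A minimal sketch of the nested CLI call in step 3, assuming the scraper shells out with subprocess. The names clean_env and call_claude_cli and the timeout value are hypothetical, not the actual body of _call_claude_cli; only the claude -p invocation and the two stripped variables come from the notes above.

```python
import os
import subprocess

# Variables that must be removed before a nested `claude -p` call.
STRIPPED_VARS = ("CLAUDE_CODE_ENTRYPOINT", "CLAUDECODE")


def clean_env(env):
    """Return a copy of env without the Claude Code session variables."""
    return {k: v for k, v in env.items() if k not in STRIPPED_VARS}


def call_claude_cli(prompt, timeout=300):
    """Run `claude -p <prompt>` and return its stdout (hypothetical wrapper)."""
    result = subprocess.run(
        ["claude", "-p", prompt],
        env=clean_env(os.environ),  # the parent environment is left untouched
        capture_output=True,
        text=True,
        timeout=timeout,  # assumed value, tune per source
        check=True,       # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.strip()
```

Copying the environment instead of deleting keys from os.environ keeps the parent process unchanged, which matters if the same script later spawns other tools.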

4. BUILD ProcessedItem
item_id = uuid4()
item = ProcessedItem(
    id=item_id,
    slug=slugify(title) + "-" + str(item_id)[:8],
    title=title, content=content, summary=summary, teaser=teaser,
    word_count=len(content.split()),
    content_type="historical_illustration",
    source_type="ai_generated",
    scripture_references=[make_scripture_ref(book, ch, v_start, v_end)],
    theological_lens_id=0,  # Universal (shows in all traditions)
    is_universal=True,
    topics=topics, themes=themes,
    primary_author=primary_author, primary_source=primary_source,
    source_attribution=source_attribution,
    quality_score=simple_quality_score(content, refs),
    visibility_tier="free_signup",
    curation_status="approved",
)

5. QUALITY SCORING
score = simple_quality_score(content, refs)
# Baseline: 0.70
# Penalties: word_count < 80 (-0.25), banned phrases (-0.15),
# God names lowercase (-0.10)
# Bonuses: 150-280 words (+0.05), 5+ proper nouns (+0.05),
# 3+ theological terms (+0.05), has scripture refs (+0.05)
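The scoring rules above can be sketched as follows. This is an illustrative reconstruction, not the real simple_quality_score: the theological term list, the proper-noun heuristic, applying the banned-phrase penalty once per item, and clamping to [0, 1] are all assumptions. The banned phrases and God names are taken from the content rules on this page.

```python
import re

BANNED_PATTERNS = [
    r"consider how .+ speaks to",
    r"in a world where",
    r"a story that demonstrates",
    r"this modern example reminds us",
    r"\[[^\]]+\]",  # leftover [template brackets]
]
GOD_NAMES = ["Yahweh", "Jehovah", "Elohim", "Adonai", "El Shaddai"]
# Hypothetical term list; the real one is not documented here.
THEOLOGICAL_TERMS = {"grace", "atonement", "covenant", "redemption", "sanctification"}


def simple_quality_score(content, refs):
    words = content.split()
    lowered = content.lower()
    score = 0.70  # baseline

    # Penalties
    if len(words) < 80:
        score -= 0.25
    if any(re.search(p, lowered) for p in BANNED_PATTERNS):
        score -= 0.15  # assumed: applied once, not per phrase
    if any(re.search(r"\b" + re.escape(n.lower()) + r"\b", content) for n in GOD_NAMES):
        score -= 0.10  # a God name written entirely in lowercase

    # Bonuses
    if 150 <= len(words) <= 280:
        score += 0.05
    # Rough heuristic: capitalized words that do not start a sentence.
    proper_nouns = set(re.findall(r"(?<=[a-z,;] )[A-Z][a-z]+", content))
    if len(proper_nouns) >= 5:
        score += 0.05
    terms = {w.strip(".,;:!?\"'").lower() for w in words} & THEOLOGICAL_TERMS
    if len(terms) >= 3:
        score += 0.05
    if refs:
        score += 0.05

    return round(max(0.0, min(1.0, score)), 2)  # clamping is an assumption
```

Under these assumptions the maximum reachable score is 0.90 (baseline plus all four bonuses) and a short item with a banned phrase and a lowercase God name bottoms out at 0.20.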

6. DUPLICATE CHECK (two-level)
IF (scripture_ref, primary_source) in session_seen_set:
    SKIP  # In-memory dedup (fast)
IF db.count(scripture_refs=refs, primary_source=source) > 0:
    SKIP  # DB dedup (authoritative)
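A minimal sketch of the two-level check, assuming the database lookup is injected as a callable. DuplicateChecker and db_count are hypothetical names; the real scraper may structure this differently.

```python
class DuplicateChecker:
    """Two-level dedup: in-memory session set first, then the database."""

    def __init__(self, db_count):
        # db_count(scripture_ref, primary_source) -> int is a hypothetical
        # stand-in for the real query against unified_rag_content.
        self._db_count = db_count
        self._seen = set()

    def is_duplicate(self, scripture_ref, primary_source):
        key = (scripture_ref, primary_source)
        if key in self._seen:
            return True  # fast path, no database round trip
        self._seen.add(key)
        # Authoritative check: the row may exist from an earlier run.
        return self._db_count(scripture_ref, primary_source) > 0
```

Recording the key before the database call means each (reference, source) pair costs at most one query per session, whether or not the row already exists.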

7. INSERT INTO unified_rag_content
db_writer.write(item)
# content_category derived from content_type via mapping dict
# Sets created_at and updated_at to now()

Path B: AI Generation (Node.js scripts)

A six-phase pipeline that generates new illustrations. All phases use the claude -p CLI via generateWithClaudeMax() from scripts/lib/shared.mjs.

Phase 1: REGENERATE STUBS (regenerate-stubs.mjs)
Read stubs from unified_rag_content WHERE word_count <= 30
FOR each stub:
    prompt = existing metadata (topics, themes, scripture) as context
    new_content = claude -p "Generate illustration..."
    UPDATE unified_rag_content SET content, summary, teaser,
        word_count, quality_score, embedding, embedding_model

Phase 2: GENERATE BY SCRIPTURE (generate-by-scripture.mjs)
Target: popular passages, lectionary readings, book gaps
Generate new illustrations for underserved scripture references

Phase 3: GENERATE BY TOPIC (generate-by-topic.mjs)
Target: underserved topics x source categories
Fill coverage gaps across topic taxonomy

Phase 4: GENERATE LENS CONTENT (generate-lens-content.mjs)
Target: tradition-specific illustrations for each of 17 lenses
Each illustration tagged with specific theological_lens_id

Phase 6: GENERATE IMAGES (generate-illustration-images.mjs)
DALL-E image generation per illustration
STRICT RULES (from content-rules.md):
- NEVER depict God, Jesus's face, or any deity
- Jesus only from behind, silhouette, or at distance
- No non-Christian religious symbols or architecture
- No nudity, no meditation poses, no text in images
- Always include AI disclosure in alt text

Embedding Generation

Both paths generate embeddings using the same model and format.

MODEL: text-embedding-3-small (OpenAI API)
DIMENSIONS: 1536
TEXT FORMAT: "Scripture: {ref}\n\nAuthor: {author}\n\nSource: {source}\n\nContent: {content}"
COLUMN: embedding (vector(1536))
TRACKING: embedding_model column per row

NOTE: Embeddings still use OpenAI API (no CLI alternative).
Use --skip-embeddings flag to defer embedding generation.

CRITICAL: If embedding model ever changes, ALL embeddings must be
regenerated together. Mixed embedding spaces break vector search.
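The text format and the single-model rule can be sketched as below. The helper names are hypothetical, and the sketch deliberately stops short of calling the OpenAI API; it only builds the input string and checks a returned vector against the values listed above.

```python
EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIMENSIONS = 1536


def build_embedding_text(ref, author, source, content):
    """Assemble the exact string that gets embedded for one row."""
    return (
        f"Scripture: {ref}\n\n"
        f"Author: {author}\n\n"
        f"Source: {source}\n\n"
        f"Content: {content}"
    )


def check_embedding(vector, row_model):
    """Reject vectors that would mix embedding spaces or dimensions."""
    if row_model != EMBEDDING_MODEL:
        raise ValueError(f"row uses {row_model!r}, expected {EMBEDDING_MODEL!r}")
    if len(vector) != EMBEDDING_DIMENSIONS:
        raise ValueError(f"expected {EMBEDDING_DIMENSIONS} dimensions, got {len(vector)}")
    return True
```

A check like this is one way to use the embedding_model column: any row whose model differs from the current one signals that a full regeneration is needed before vector search can be trusted.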

View Read Layer

After content is inserted, it becomes visible in the directory immediately through a live SQL view.

SOURCE TABLE: public.unified_rag_content (327K+ rows)
|
v (live — no refresh needed)
VIEW: dir_illustrations (regular SQL view, ~50K rows)
- Filters: content_category = 'illustration', is_active = true
- Includes 26 content types
- Includes structured data fields

NOTE: dir_illustrations is NOT a materialized view. Content appears immediately.
If rows are missing, check unified_rag_content.is_active and curation_status.
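A small diagnostic sketch for the troubleshooting note above. It assumes a row is available as a dict and that curation_status must be "approved" for the row to surface; only the content_category and is_active filters are documented for the view, so verify the curation_status condition against the actual view definition. The function name is hypothetical.

```python
def missing_from_view_reasons(row):
    """Return the column names that would keep a row out of dir_illustrations."""
    reasons = []
    if row.get("content_category") != "illustration":
        reasons.append("content_category")
    if row.get("is_active") is not True:
        reasons.append("is_active")
    # Assumption: unapproved rows are hidden. Not confirmed by the view filters.
    if row.get("curation_status") != "approved":
        reasons.append("curation_status")
    return reasons
```

An empty list means the row passes every check in this sketch, so a still-missing row points at something else, such as its content type not being among the 26 included types.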

Content Quality Rules (content-rules.md)

WORD COUNTS:
Standard illustrations: 180-280 words
Never under 100 words (stubs)
Commentary (churchwiseai_commentary): 300-500 words

BANNED PHRASES (quality_score penalty):
"Consider how [scripture] speaks to [topic]"
"In a world where..."
"A story that demonstrates..."
"This modern example reminds us..."
Any [template brackets] leftover

GOD NAMES: Always capitalized (Yahweh, Jehovah, Elohim, Adonai, El Shaddai)
FOREIGN WORDS: Wrapped in *asterisks* with English meaning
TITLES: Wrapped in *asterisks* (books, movies, songs)

VISIBILITY TIERS (assigned by assign-visibility-tiers.mjs):
public (~18%): historical_illustration + top ~50 per non-premium type
free_signup (~68%): majority of library, requires free account
premium (~14%): premium content types, requires $9.95/mo subscription
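The tier rules can be sketched roughly as follows. This is not the logic of assign-visibility-tiers.mjs: ranking the "top ~50" by quality_score is an assumption, the set of premium content types is passed in because it is not listed here, and the function name is hypothetical.

```python
from collections import defaultdict


def assign_visibility_tiers(rows, premium_types, top_n=50):
    """Map row id -> tier using the three rules described above."""
    tiers = {}
    by_type = defaultdict(list)
    for row in rows:
        ctype = row["content_type"]
        if ctype in premium_types:
            tiers[row["id"]] = "premium"
        elif ctype == "historical_illustration":
            tiers[row["id"]] = "public"
        else:
            by_type[ctype].append(row)

    # Remaining types: the best top_n rows are public, the rest need a free account.
    for group in by_type.values():
        ranked = sorted(group, key=lambda r: r["quality_score"], reverse=True)
        for rank, row in enumerate(ranked):
            tiers[row["id"]] = "public" if rank < top_n else "free_signup"
    return tiers
```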

DATABASE RULES:
Always update embedding, summary, teaser, word_count, quality_score
when changing content
Never delete rows
Never change source_type, content_type, content_category, content_format
Use service role key (bypasses RLS)
Run in small batches with rate limiting
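A sketch of how an update script might enforce these rules before writing, assuming updates are plain dicts of column values. The function names, batch size, and pause length are hypothetical; the field lists come from the rules above.

```python
import time
from itertools import islice

DERIVED_FIELDS = {"embedding", "summary", "teaser", "word_count", "quality_score"}
IMMUTABLE_FIELDS = {"source_type", "content_type", "content_category", "content_format"}


def validate_update(payload):
    """Return a list of rule violations for one update payload."""
    problems = []
    keys = set(payload)
    locked = sorted(IMMUTABLE_FIELDS & keys)
    if locked:
        problems.append(f"immutable fields present: {locked}")
    if "content" in keys:
        missing = sorted(DERIVED_FIELDS - keys)
        if missing:
            problems.append(f"content changed without: {missing}")
    return problems


def run_in_batches(payloads, write_batch, batch_size=25, pause_seconds=1.0, sleep=time.sleep):
    """Write payloads in small batches, pausing between batches."""
    iterator = iter(payloads)
    written = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return written
        write_batch(batch)  # hypothetical callable that performs the DB write
        written += len(batch)
        sleep(pause_seconds)  # crude rate limit between batches
```

Passing sleep as a parameter keeps the loop testable without real delays.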

Key Constraint

ALWAYS use claude -p (Claude CLI) for content generation; NEVER use the Anthropic or OpenAI APIs for it. The founder already pays $200/mo for Claude Max. API calls are reserved for real-time product features (chatbot, voice agent) and for embedding generation, which has no CLI alternative.