PewSearch Data Quality
Why Data Quality Matters
PewSearch displays 218K+ church listings to the public. Every listing represents a real church where real people worship. Data quality directly affects:
- User trust: Wrong addresses, closed churches, or restaurant photos destroy credibility
- SEO authority: Google penalizes directories with stale or inaccurate information
- Conversion: A pastor who sees their church listed with wrong hours will not claim the listing
- Downstream products: Voice Agent and Chatbot inherit church data -- bad data means bad AI responses
This document catalogs known data quality issues, their scope, and the mitigation strategies in place.
Known Data Quality Issues
1. Address-Only-State Records (~128K rows)
Severity: High
Scope: ~128K churches have an address field containing only the state name or abbreviation (e.g., "TX", "California")
These records were imported from OpenStreetMap where the address field was populated with only the state. The church exists and has a valid name, coordinates, and often a denomination, but the address is unusable for display or directions.
Detection:
-- Count address-only-state records
SELECT COUNT(*) FROM churches
WHERE directory_visible = true
AND business_status = 'OPERATIONAL'
AND (
LENGTH(address) <= 3
OR address = state
OR address = state_code
);
Mitigation: address-utils.ts contains isDisplayableAddress():
pseudocode: isDisplayableAddress(address, state, state_code)
    if address is null or empty:
        return false
    if address.trim().length <= 3:
        return false // "TX", "CA", etc.
    if address.trim() == state or address.trim() == state_code:
        return false // "Texas", "TX"
    if address matches pattern /^[A-Z]{2}$/i:
        return false // Any two-letter abbreviation
    return true
When isDisplayableAddress() returns false, the UI shows "Address not available" instead of the misleading state-only value. The church still appears in search results (it has valid coordinates for map display).
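The check and its UI fallback could look roughly like the TypeScript below. This is a minimal sketch, not the contents of address-utils.ts: the case-insensitive comparison and the `addressLabel` helper are assumptions added for illustration.

```typescript
// Hypothetical sketch; the real address-utils.ts is not shown in this document.
function isDisplayableAddress(
  address: string | null | undefined,
  state: string | null | undefined,
  stateCode: string | null | undefined
): boolean {
  if (address == null) return false;
  const trimmed = address.trim();
  // Three characters or fewer covers "", "TX", "CA." and similar fragments,
  // so a separate two-letter regex check is not needed here.
  if (trimmed.length <= 3) return false;
  // Assumption: compare case-insensitively so "texas" matches "Texas".
  const lowered = trimmed.toLowerCase();
  if (state != null && lowered === state.trim().toLowerCase()) return false;
  if (stateCode != null && lowered === stateCode.trim().toLowerCase()) return false;
  return true;
}

// Hypothetical helper showing the UI fallback described above.
function addressLabel(
  address: string | null,
  state: string | null,
  stateCode: string | null
): string {
  if (address != null && isDisplayableAddress(address, state, stateCode)) {
    return address.trim();
  }
  return "Address not available";
}
```

For a record with address "California", `addressLabel` returns the fallback text, while a full street address is passed through trimmed.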
2. Non-Church Business Photos
Severity: Medium
Scope: Unknown (estimated hundreds to low thousands)
Google Maps data sometimes associates photos from nearby businesses with church listings. Known cases include restaurant interiors, park landscapes, dispensary storefronts, and gas station signs appearing as the photo_url for churches.
Root cause: The Outscraper/Google Maps API returns the most prominent photo for a Google Maps place ID. When the place ID points to the wrong place or the business has been recategorized, the photo may belong to a different business.
Detection: Manual review only. No automated detection is in place.
Mitigation strategies:
- Premium churches upload their own photos (overrides scraped photo)
- Category-based filtering during import (exclude categories like "restaurant", "gas_station")
- Community reports via contact form
- Future: AI-based photo classification to flag non-church images
3. Missing Service Hours (~20% of listings)
Severity: Medium
Scope: ~44K visible churches have NULL or empty working_hours
Many churches -- especially smaller congregations and non-denominational churches -- do not have hours listed on Google Maps. OpenStreetMap rarely includes hours data.
Impact: The church detail page shows "Hours not available" and cannot display a "Next Service" highlight. This reduces the page's usefulness and SEO value.
Mitigation:
- Premium churches set their own hours via admin dashboard (custom_hours)
- Website scraping extracts hours from church websites when available
- Google Maps data refresh periodically adds hours for churches that update their Google listing
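A small sketch of how the detail page might choose which hours to show. It assumes that custom_hours takes precedence over the scraped working_hours and that both are stored as plain strings; neither detail is confirmed by this document.

```typescript
// Field names follow this document; treating hours as strings is an assumption.
interface HoursFields {
  custom_hours: string | null;  // set by a premium church in the admin dashboard
  working_hours: string | null; // from Google Maps or website scraping
}

// Returns the first non-empty value, preferring the church's own entry.
function resolveHours(church: HoursFields): string | null {
  const candidates = [church.custom_hours, church.working_hours];
  for (const value of candidates) {
    if (value != null && value.trim().length > 0) return value.trim();
  }
  return null;
}

// Falls back to the placeholder text shown on the church detail page.
function hoursLabel(church: HoursFields): string {
  const hours = resolveHours(church);
  return hours === null ? "Hours not available" : hours;
}
```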
4. Missing Denominations (~15% of listings)
Severity: Low-Medium
Scope: ~33K visible churches have NULL denomination
Non-denominational churches intentionally omit denomination, but many denominational churches also have NULL denomination due to incomplete data sources.
Impact: These churches do not appear in denomination-filtered searches. They can still be found by name, location, or text search.
Mitigation:
- Website scraping attempts to extract denomination from "about" pages
- Name-to-denomination heuristics (e.g., "First Baptist Church of Dallas" → "Baptist")
- Premium churches set denomination during claim flow
- Community submissions
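The name-based heuristic can be illustrated with a keyword lookup. The keyword table and the `guessDenomination` function below are hypothetical; the production taxonomy and matching rules are not described in this document.

```typescript
// Hypothetical keyword table, checked in order; not the production taxonomy.
const DENOMINATION_KEYWORDS: Array<[RegExp, string]> = [
  [/\bbaptist\b/i, "Baptist"],
  [/\bmethodist\b/i, "Methodist"],
  [/\blutheran\b/i, "Lutheran"],
  [/\bpresbyterian\b/i, "Presbyterian"],
  [/\bcatholic\b/i, "Catholic"],
  [/\bepiscopal\b/i, "Episcopal"],
];

// Returns the first denomination whose keyword appears in the church name,
// or null when no keyword matches (the record keeps its NULL denomination).
function guessDenomination(name: string): string | null {
  for (const [pattern, denomination] of DENOMINATION_KEYWORDS) {
    if (pattern.test(name)) return denomination;
  }
  return null;
}
```

A first-match rule like this is coarse: a name containing two keywords, such as "African Methodist Episcopal", resolves to whichever entry comes first in the table, so results are better treated as suggestions for review than as final values.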
5. Duplicate Entries
Severity: Medium
Scope: Estimated 2-5K duplicate pairs
The same physical church can appear multiple times under:
- Different names ("Grace Baptist Church" vs "Grace Baptist")
- Different data sources (one from OSM, one from Google Maps)
- Name changes (old name still in database alongside new name)
- Multi-campus churches (main campus and satellite listed separately)
Detection:
-- Find potential duplicates by proximity + similar name
SELECT a.id, a.name, b.id, b.name,
haversine_distance(a.latitude, a.longitude, b.latitude, b.longitude) as dist_miles
FROM churches a
JOIN churches b ON a.id < b.id
WHERE a.directory_visible = true
AND b.directory_visible = true
AND haversine_distance(a.latitude, a.longitude, b.latitude, b.longitude) < 0.1
AND similarity(a.name, b.name) > 0.6
LIMIT 100;
Mitigation:
- Deduplication script runs periodically (merge lower-quality record into higher-quality)
- directory_visible = false hides identified duplicates without deleting data
- Manual review for high-profile churches
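The proximity-plus-name test from the detection query can be sketched in TypeScript for a small batch of records. The `similarity()` call in the query looks like the pg_trgm function; the trigram measure below only approximates it and will not reproduce its scores exactly. `ChurchPoint` and `findDuplicatePairs` are hypothetical names.

```typescript
interface ChurchPoint { id: number; name: string; latitude: number; longitude: number; }

const EARTH_RADIUS_MILES = 3958.8;

// Great-circle distance in miles between two latitude/longitude points.
function haversineMiles(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const toRad = (deg: number): number => (deg * Math.PI) / 180;
  const sinLat = Math.sin(toRad(lat2 - lat1) / 2);
  const sinLon = Math.sin(toRad(lon2 - lon1) / 2);
  const a = sinLat * sinLat + Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * sinLon * sinLon;
  return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
}

// Distinct three-character windows of each lowercased word, padded like "  word ".
function trigrams(text: string): string[] {
  const seen: Record<string, boolean> = {};
  const words = text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
  for (const word of words) {
    const padded = "  " + word + " ";
    for (let i = 0; i + 3 <= padded.length; i++) seen[padded.substring(i, i + 3)] = true;
  }
  return Object.keys(seen);
}

// Shared trigrams divided by total distinct trigrams (0 = unrelated, 1 = identical).
function trigramSimilarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  if (ta.length === 0 || tb.length === 0) return 0;
  const inB: Record<string, boolean> = {};
  for (const t of tb) inB[t] = true;
  let shared = 0;
  for (const t of ta) if (inB[t] === true) shared++;
  return shared / (ta.length + tb.length - shared);
}

// Compares every pair, so this is only suitable for small batches (one city, say).
function findDuplicatePairs(
  churches: ChurchPoint[],
  maxMiles: number = 0.1,
  minSimilarity: number = 0.6
): Array<[number, number]> {
  const pairs: Array<[number, number]> = [];
  for (let i = 0; i < churches.length; i++) {
    for (let j = i + 1; j < churches.length; j++) {
      const a = churches[i];
      const b = churches[j];
      const close = haversineMiles(a.latitude, a.longitude, b.latitude, b.longitude) < maxMiles;
      if (close && trigramSimilarity(a.name, b.name) > minSimilarity) pairs.push([a.id, b.id]);
    }
  }
  return pairs;
}
```

With this measure, "Grace Baptist Church" and "Grace Baptist" share 14 of 21 distinct trigrams, which clears the 0.6 threshold used in the query.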
6. Permanently Closed Churches
Severity: Low
Scope: Tracked via business_status field
Churches that have permanently closed are marked business_status = 'CLOSED_PERMANENTLY'. The mandatory query filter business_status = 'OPERATIONAL' excludes these from all directory views.
Detection: Google Maps periodically marks businesses as closed. Our data refresh picks up these status changes.
Mitigation: Already handled by the mandatory query filter. No user-visible impact.
7. Incorrect Coordinates
Severity: Low-Medium
Scope: Unknown (estimated <1%)
Some churches have latitude/longitude that places them in the wrong location (sometimes in a different state). This affects map display and "nearby churches" results.
Detection:
-- Find churches where coordinates don't match their state
SELECT id, name, state_code, latitude, longitude
FROM churches
WHERE directory_visible = true
AND latitude IS NOT NULL
AND NOT ST_Contains(
(SELECT geom FROM us_states WHERE state_code = churches.state_code),
ST_SetSRID(ST_MakePoint(longitude, latitude), 4326) -- assumes us_states.geom is stored in SRID 4326
)
LIMIT 50;
Mitigation:
- Cross-reference coordinates with state boundaries
- Geocoding validation during import
- Premium churches can correct their coordinates via admin
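The state-boundary cross-reference can also run outside the database, for example during import. The sketch below uses a standard ray-casting point-in-polygon test; the `coordinatesMatchState` helper and the rectangular approximation of Colorado in the usage note are assumptions for illustration, and real boundaries would need full polygons, including multi-part shapes.

```typescript
type LngLat = [number, number]; // [longitude, latitude]

// Ray casting: count how many polygon edges a horizontal ray from the point crosses.
function pointInPolygon(point: LngLat, ring: LngLat[]): boolean {
  const x = point[0];
  const y = point[1];
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const xi = ring[i][0];
    const yi = ring[i][1];
    const xj = ring[j][0];
    const yj = ring[j][1];
    const crosses = (yi > y) !== (yj > y) && x < ((xj - xi) * (y - yi)) / (yj - yi) + xi;
    if (crosses) inside = !inside;
  }
  return inside;
}

// Returns null when no boundary is loaded for the state, so callers can
// distinguish "unknown" from "coordinates are in the wrong state".
function coordinatesMatchState(
  stateCode: string,
  latitude: number,
  longitude: number,
  rings: Record<string, LngLat[]>
): boolean | null {
  if (!Object.prototype.hasOwnProperty.call(rings, stateCode)) return null;
  return pointInPolygon([longitude, latitude], rings[stateCode]);
}
```

As a rough usage example, a ring of `[[-109.05, 37], [-102.04, 37], [-102.04, 41], [-109.05, 41]]` for "CO" accepts a point in Denver (39.74, -104.99) and rejects one in Dallas (32.78, -96.80).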
Import Pipeline Quality Gates
When new data is imported (from any source), the following quality gates apply:
pseudocode: importQualityGates(church_record)
    // Gate 1: Required fields
    REQUIRE: name is not null and not empty
    REQUIRE: state_code is valid US state/territory
    REQUIRE: latitude and longitude are within US bounds
        latitude: 18.0 to 72.0 (covers Puerto Rico through northern Alaska)
        longitude: -180.0 to -65.0 (covers Alaska west to the antimeridian)
        // Note: Guam, the Northern Mariana Islands, American Samoa, and most of
        // the U.S. Virgin Islands fall outside this box and fail this gate
    // Gate 2: Category filtering
    EXCLUDE if category in:
        ["restaurant", "gas_station", "convenience_store",
         "bar", "liquor_store", "cannabis_dispensary",
         "night_club", "adult_entertainment", "casino",
         "pawn_shop", "tattoo_parlor"]
    // Gate 3: Name filtering
    EXCLUDE if name matches patterns:
        /mosque|synagogue|temple|gurdwara|masjid/i
        (PewSearch is churches-only; other faith directories are separate)
    // Gate 4: Deduplication check
    CHECK for existing church with:
        same state_code AND
        (similar name OR same coordinates within 0.05 miles)
    If duplicate found: merge data (keep higher-quality fields), do not insert
    // Gate 5: Default values
    SET directory_visible = true
    SET business_status = 'OPERATIONAL'
    return PASS / FAIL with reason
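Gates 1 through 3 are pure checks on a single record and can be sketched as one function. This is an illustrative sketch only: the `ImportRecord` shape, the `checkImportGates` name, and the caller-supplied list of valid state codes are assumptions. Gate 4 (deduplication) needs database access and Gate 5 only sets defaults, so both are left out.

```typescript
// Hypothetical record shape; only fields named in the gates above are included.
interface ImportRecord {
  name: string | null;
  state_code: string;
  latitude: number;
  longitude: number;
  category: string | null;
}

interface GateResult { pass: boolean; reason: string | null; }

const EXCLUDED_CATEGORIES = [
  "restaurant", "gas_station", "convenience_store",
  "bar", "liquor_store", "cannabis_dispensary",
  "night_club", "adult_entertainment", "casino",
  "pawn_shop", "tattoo_parlor",
];

const NON_CHURCH_NAME = /mosque|synagogue|temple|gurdwara|masjid/i;

function reject(reason: string): GateResult {
  return { pass: false, reason: reason };
}

// validStateCodes is supplied by the caller; the full list is not part of this sketch.
function checkImportGates(record: ImportRecord, validStateCodes: string[]): GateResult {
  // Gate 1: required fields and coordinate bounds (NaN fails the range checks)
  if (record.name == null || record.name.trim().length === 0) return reject("missing name");
  if (validStateCodes.indexOf(record.state_code) === -1) return reject("invalid state_code");
  if (!(record.latitude >= 18.0 && record.latitude <= 72.0)) return reject("latitude out of bounds");
  if (!(record.longitude >= -180.0 && record.longitude <= -65.0)) return reject("longitude out of bounds");
  // Gate 2: category filtering
  if (record.category != null && EXCLUDED_CATEGORIES.indexOf(record.category) !== -1) {
    return reject("excluded category: " + record.category);
  }
  // Gate 3: name filtering
  if (NON_CHURCH_NAME.test(record.name)) return reject("non-church name pattern");
  return { pass: true, reason: null };
}
```

One consequence worth knowing: because the Gate 3 pattern matches anywhere in the name, a church called "Temple Baptist Church" is rejected by the pattern as written. Whether that is acceptable, or whether an allow-list is needed for such names, is a product decision this sketch does not settle.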
Content Enrichment Pipeline
The enrichment pipeline improves existing records by scraping church websites:
pseudocode: enrichChurch(church)
    if church.website is null:
        return // Nothing to scrape
    if church.website_scraped_at is recent (< 90 days):
        return // Already enriched recently
    // Scrape website
    content = scrapeWebsite(church.website)
    // Extract structured data
    extractedHours = parseHours(content)
    extractedStaff = parseStaff(content)
    extractedDenomination = parseDenomination(content)
    extractedDescription = parseDescription(content)
    // Update church record (only fill gaps, never overwrite existing good data)
    if church.working_hours is null AND extractedHours is valid:
        UPDATE church SET working_hours = extractedHours
    if church.denomination is null AND extractedDenomination is valid:
        UPDATE church SET denomination = extractedDenomination
    if church.description is null AND extractedDescription is valid:
        UPDATE church SET description = extractedDescription
    UPDATE church SET website_scraped_at = now()
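The "fill gaps, never overwrite" rule is the part of the pipeline most worth pinning down, so here is a minimal sketch of it as pure functions. Scraping and parsing are out of scope; `ChurchRecord`, `ExtractedFields`, `needsEnrichment`, and `fillGaps` are hypothetical names, and "valid" is reduced to "non-empty after trimming" as a stated simplification.

```typescript
interface ChurchRecord {
  website: string | null;
  website_scraped_at: Date | null;
  working_hours: string | null;
  denomination: string | null;
  description: string | null;
}

interface ExtractedFields {
  hours: string | null;
  denomination: string | null;
  description: string | null;
}

const RESCRAPE_AFTER_DAYS = 90;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// True when the church has a website and was not scraped in the last 90 days.
function needsEnrichment(church: ChurchRecord, now: Date): boolean {
  if (church.website === null) return false;
  if (church.website_scraped_at === null) return true;
  const ageDays = (now.getTime() - church.website_scraped_at.getTime()) / MS_PER_DAY;
  return ageDays >= RESCRAPE_AFTER_DAYS;
}

// Keeps any existing value; otherwise accepts a non-empty candidate.
function pick(existing: string | null, candidate: string | null): string | null {
  if (existing !== null) return existing;
  return candidate !== null && candidate.trim().length > 0 ? candidate.trim() : null;
}

// Returns a new record with gaps filled and the scrape timestamp updated.
function fillGaps(church: ChurchRecord, extracted: ExtractedFields, now: Date): ChurchRecord {
  return {
    website: church.website,
    website_scraped_at: now,
    working_hours: pick(church.working_hours, extracted.hours),
    denomination: pick(church.denomination, extracted.denomination),
    description: pick(church.description, extracted.description),
  };
}
```

Returning a new record instead of mutating the input keeps the merge easy to test and makes it simple to diff before writing anything back to the database.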
Scraper Exclusion Rules
Certain websites and patterns are excluded from scraping (see memory: feedback_scraper_exclusions):
- Websites behind authentication walls
- Websites with robots.txt disallow
- Websites that return CAPTCHA or anti-bot pages
- Known template platforms that don't contain unique content
- Rate limiting: max 1 request per second per domain
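The per-domain rate limit can be sketched as a small scheduler that tells the caller how long to wait before each request. The `DomainRateLimiter` class and `hostnameOf` helper are hypothetical, and the clock is passed in explicitly so the behavior is easy to test; the actual scraper's mechanism is not described in this document.

```typescript
// Extracts the lowercased hostname from an absolute URL such as "https://a.org/x".
function hostnameOf(url: string): string {
  const match = /^[a-z][a-z0-9+.-]*:\/\/([^\/:?#]+)/i.exec(url);
  if (match === null) throw new Error("Not an absolute URL: " + url);
  return match[1].toLowerCase();
}

class DomainRateLimiter {
  private minIntervalMs: number;
  private nextAllowedAt: Record<string, number> = {};

  constructor(minIntervalMs: number = 1000) {
    this.minIntervalMs = minIntervalMs;
  }

  // Reserves the next free slot for the URL's domain and returns how many
  // milliseconds the caller should wait before sending the request.
  reserve(url: string, nowMs: number): number {
    const domain = hostnameOf(url);
    const known = Object.prototype.hasOwnProperty.call(this.nextAllowedAt, domain);
    const earliest = known ? this.nextAllowedAt[domain] : 0;
    const startAt = Math.max(nowMs, earliest);
    this.nextAllowedAt[domain] = startAt + this.minIntervalMs;
    return startAt - nowMs;
  }
}
```

A caller would do something like `await sleep(limiter.reserve(url, Date.now()))` before each fetch, where `sleep` is whatever delay helper the scraper already has. Requests to different domains do not delay each other.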
Data Quality Metrics
Key metrics to monitor (queryable from Supabase):
| Metric | Query | Target |
|---|---|---|
| Total visible churches | SELECT COUNT(*) FROM churches WHERE directory_visible=true AND business_status='OPERATIONAL' | 218K+ |
| Missing address | SELECT COUNT(*) FROM churches WHERE directory_visible=true AND (address IS NULL OR LENGTH(TRIM(address)) <= 3 OR address = state OR address = state_code) | < 130K |
| Missing hours | SELECT COUNT(*) FROM churches WHERE directory_visible=true AND working_hours IS NULL | < 45K |
| Missing denomination | SELECT COUNT(*) FROM churches WHERE directory_visible=true AND denomination IS NULL | < 35K |
| Missing photo | SELECT COUNT(*) FROM churches WHERE directory_visible=true AND photo_url IS NULL | < 100K |
| Missing phone | SELECT COUNT(*) FROM churches WHERE directory_visible=true AND phone IS NULL | < 80K |
| Missing website | SELECT COUNT(*) FROM churches WHERE directory_visible=true AND website IS NULL | < 100K |
| Has coordinates | SELECT COUNT(*) FROM churches WHERE directory_visible=true AND latitude IS NOT NULL | > 200K |
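If these metrics are checked by a script, the Target column needs to be machine-readable. The sketch below parses the three target formats used in the table ("< 45K", "> 200K", "218K+") and compares a count against them. `parseTarget` and `meetsTarget` are hypothetical helpers, and reading "218K+" as "at least 218,000" is an assumption.

```typescript
interface ParsedTarget {
  comparison: "<" | ">" | ">=";
  limit: number;
}

// Parses targets such as "< 45K", "> 200K", or "218K+"; returns null if unrecognized.
function parseTarget(target: string): ParsedTarget | null {
  const match = /^\s*([<>])?\s*(\d+(?:\.\d+)?)\s*K\s*(\+)?\s*$/i.exec(target);
  if (match === null) return null;
  const limit = parseFloat(match[2]) * 1000;
  if (match[1] === "<") return { comparison: "<", limit: limit };
  if (match[1] === ">") return { comparison: ">", limit: limit };
  if (match[3] === "+") return { comparison: ">=", limit: limit };
  return null;
}

// Compares a metric count against a target string from the table.
function meetsTarget(count: number, target: string): boolean {
  const parsed = parseTarget(target);
  if (parsed === null) throw new Error("Unrecognized target: " + target);
  if (parsed.comparison === "<") return count < parsed.limit;
  if (parsed.comparison === ">") return count > parsed.limit;
  return count >= parsed.limit;
}
```

For example, a missing-hours count of 44,000 meets the "< 45K" target, while a coordinates count of 199,000 misses "> 200K".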
Prioritization: What to Fix First
When allocating resources to data quality improvements:
| Priority | Issue | Rationale |
|---|---|---|
| P0 | Closed churches still visible | Destroys trust immediately |
| P1 | Non-church businesses in directory | Misleading, hurts SEO authority |
| P1 | Premium church data inaccurate | Paying customers see wrong info |
| P2 | Missing hours for high-traffic churches | Reduces conversion for best candidates |
| P2 | Duplicate entries in same city | Confusing search results |
| P3 | Address-only-state display | Mitigated by isDisplayableAddress() |
| P3 | Missing denomination | Affected churches remain findable by name, location, or text search |
| P4 | Missing photos | Nice-to-have, not trust-breaking |
See Also
- PewSearch Directory Overview -- parent document with full context
- Search System -- how data quality affects search results
- Church Detail Page -- where data quality issues are most visible
- Denomination Taxonomy -- denomination data quality specifically