The Content Engine
A multi-stage pipeline that transforms raw RSS feeds into semantically clustered, AI-summarized insights. It also handles repeat detection, automated publishing to Statamic, trend analysis via Pulse, and story-evolution tracking with Narrative Shifts.
1. Scrape: Readability & Markdown
2. Tag: RAKE Keyword Extraction
3. Cluster: DBSCAN & Merge Step
4. Summarize: Gemini AI Multi-Step
5. Pulse: Velocity & Trends
ContentScraper
Fetches and extracts main article content from a URL and cleans it for downstream processing.
Key methods
- scrapeArticleContent(string $url) — Fetches HTML, uses fivefilters/Readability to extract core content (no ads/nav), then converts to clean Markdown.
- scrapeArticleWithMetadata(string $url) — Same, plus title, author, site name, and excerpt.
- Data Retention — Automatically cleans up articles older than 6 months unless they are pinned to a Summary Object.
Content is converted to Markdown via a ContentCleaner for high-density input to clustering.
$readability->parse($html);
$content = $readability->getContent();
$contentCleaner = new ContentCleaner;
$cleanContent = $contentCleaner->htmlToMarkdown($content);
return [
'content' => $cleanContent,
'title' => $readability->getTitle(),
'author' => $readability->getAuthor(),
'site_name' => $readability->getSiteName(),
'excerpt' => $readability->getExcerpt(),
];
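The retention rule listed above (drop articles older than six months unless they are pinned to a Summary Object) could be sketched roughly as follows. The function name `selectArticlesToPrune` and the array keys `scraped_at` and `pinned` are assumptions made for illustration; the engine's real schema is not shown in this document.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: return the IDs of articles eligible for cleanup.
 *
 * @param array<int, array{id:int, scraped_at:string, pinned:bool}> $articles
 * @return int[]
 */
function selectArticlesToPrune(array $articles, DateTimeImmutable $now): array
{
    $cutoff = $now->modify('-6 months');
    $ids = [];
    foreach ($articles as $article) {
        $scrapedAt = new DateTimeImmutable($article['scraped_at']);
        // Only old AND unpinned articles are removed.
        if ($scrapedAt < $cutoff && !$article['pinned']) {
            $ids[] = $article['id'];
        }
    }
    return $ids;
}

$now = new DateTimeImmutable('2025-07-01');
$articles = [
    ['id' => 1, 'scraped_at' => '2024-11-15', 'pinned' => false], // old, unpinned
    ['id' => 2, 'scraped_at' => '2024-11-15', 'pinned' => true],  // old, but pinned
    ['id' => 3, 'scraped_at' => '2025-06-01', 'pinned' => false], // recent
];
$pruneIds = selectArticlesToPrune($articles, $now);
```

Passing `$now` in explicitly keeps the rule easy to test; a scheduled job would presumably pass the current time.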
TagSuggester
Analyzes text and suggests relevant, SEO-friendly tags using the RAKE algorithm.
Key functionality
- suggestTags(string $text, int $numTags = 5) — Uses donatelloza/rake-plus (Rapid Automatic Keyword Extraction), which scores phrases by word frequency and co-occurrence.
- Blacklist filtering — Filters out non-descriptive words (e.g. "today", "article") so only meaningful keywords are suggested.
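$rake = RakePlus::create($text, 'en_US');
$phrases = $rake->keywords();
$filteredTags = [];
foreach ($phrases as $phrase) {
    // Normalize once, then reuse for both the blacklist check and the slug
    $cleanPhrase = Str::lower(trim($phrase));
    if ($this->isBlacklisted($cleanPhrase)) continue;
    $filteredTags[] = Str::slug($cleanPhrase);
}
// Return at most $numTags suggestions
return array_slice($filteredTags, 0, $numTags);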
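To make "word frequency and co-occurrence" concrete, here is a minimal RAKE-style scorer in plain PHP. It is an illustrative sketch of the general algorithm, not the internals of the RAKE package used above; the function name `rakeScores` and the tiny stopword list are assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Minimal RAKE-style scorer (illustrative only).
 *
 * @param string[] $stopwords
 * @return array<string, float> phrase => score, highest first
 */
function rakeScores(string $text, array $stopwords): array
{
    $stop = array_flip($stopwords);
    $phrases = [];
    // Punctuation ends a candidate phrase, so split on it first.
    foreach (preg_split('/[.,;:!?()]+/', strtolower($text)) as $fragment) {
        $current = [];
        foreach (preg_split('/\s+/', trim($fragment), -1, PREG_SPLIT_NO_EMPTY) as $word) {
            if (isset($stop[$word])) {
                // A stopword also ends the current phrase.
                if ($current) { $phrases[] = $current; }
                $current = [];
            } else {
                $current[] = $word;
            }
        }
        if ($current) { $phrases[] = $current; }
    }

    $freq = [];
    $degree = [];
    foreach ($phrases as $words) {
        foreach ($words as $word) {
            $freq[$word] = ($freq[$word] ?? 0) + 1;
            // Degree counts co-occurrences within the phrase, including the word itself.
            $degree[$word] = ($degree[$word] ?? 0) + count($words);
        }
    }

    $scores = [];
    foreach ($phrases as $words) {
        $score = 0.0;
        foreach ($words as $word) {
            $score += $degree[$word] / $freq[$word]; // word score = degree / frequency
        }
        $scores[implode(' ', $words)] = $score;
    }
    arsort($scores);
    return $scores;
}

$scores = rakeScores(
    'Open source models are improving. Open source tooling is catching up.',
    ['are', 'is', 'up', 'the', 'a']
);
```

Multi-word phrases score higher than single words because every word in a phrase adds its degree, which is why RAKE tends to surface phrases such as "open source models" rather than isolated terms.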
Rubix ML Implementation
We use the rubix/ml package to handle the heavy lifting of vectorization and density-based clustering.
// Build dataset from article title+description
$dataset = Unlabeled::quick($samples);
$dataset
->apply(new TextNormalizer())
->apply(new WordCountVectorizer(5000, 1, 0.99, new Word()))
->apply(new TfIdfTransformer());
// DBSCAN with Cosine distance (0–1 scale)
$estimator = new DBSCAN(0.4, 2, new BallTree(20, new Cosine()));
$predictions = $estimator->predict($dataset);
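The radius of 0.4 passed to DBSCAN above is a cosine distance, so it may help to see what that number means for two articles. The sketch below computes cosine distance on small hand-made term-weight vectors; the weights are made up for illustration and are not real TF-IDF output, and `cosineDistance` is a hypothetical helper rather than a Rubix ML call.

```php
<?php
declare(strict_types=1);

/**
 * Cosine distance between two sparse term-weight vectors (term => weight).
 * Returns 1.0 when either vector is empty.
 */
function cosineDistance(array $a, array $b): float
{
    $dot = 0.0;
    foreach ($a as $term => $weight) {
        $dot += $weight * ($b[$term] ?? 0.0);
    }
    $normA = sqrt(array_sum(array_map(fn ($w) => $w * $w, $a)));
    $normB = sqrt(array_sum(array_map(fn ($w) => $w * $w, $b)));
    if ($normA == 0.0 || $normB == 0.0) {
        return 1.0;
    }
    return 1.0 - $dot / ($normA * $normB);
}

$a = ['gpu' => 1.0, 'launch' => 1.0];
$b = ['gpu' => 1.0, 'launch' => 1.0, 'price' => 1.0];
$c = ['museum' => 1.0, 'exhibit' => 1.0];

$near = cosineDistance($a, $b); // shares two of three terms, roughly 0.18
$far  = cosineDistance($a, $c); // no shared terms, exactly 1.0
```

With a radius of 0.4, `$a` and `$b` would count as neighbours while `$c` would not. Because TF-IDF weights are non-negative, distances between such vectors stay within the 0 to 1 range.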
ArticleClusteringService
Groups semantically similar articles into Topic Clusters—the core of trending story detection.
Advanced Features
- Merge Step — Post-clustering, we merge clusters that are semantically identical but were separated by DBSCAN due to slight density variations.
- Hot Clusters — When multiple related stories break in one category, we aggregate them into a "Hot Story" cluster for a single comprehensive report.
- Dynamic Epsilon — Category-specific neighborhood size (e.g. tech-news stricter, media-we-love looser).
- TF-IDF & Cosine Similarity — Used to measure the semantic distance between articles and clusters.
// Post-clustering merge threshold
'cluster_merge_similarity_threshold' => 0.82,
'cluster_merge_max_size' => 8,
// Hot cluster detection
if ($run->repeatClusters()->count() >= 2) {
$this->createHotStoryClustersForRun($run);
}
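One way to read the two merge settings above is as a pairwise rule: merge two clusters when their centroids are similar enough and the result stays small. The sketch below assumes that `cluster_merge_max_size` caps the combined size and that similarity is measured between centroid vectors; both are assumptions, as are the function names.

```php
<?php
declare(strict_types=1);

/** Cosine similarity between two sparse centroid vectors (term => weight). */
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    foreach ($a as $term => $weight) {
        $dot += $weight * ($b[$term] ?? 0.0);
    }
    $norm = sqrt(array_sum(array_map(fn ($w) => $w * $w, $a)))
          * sqrt(array_sum(array_map(fn ($w) => $w * $w, $b)));
    return $norm > 0.0 ? $dot / $norm : 0.0;
}

/**
 * Hypothetical merge rule: similar enough, and the merged cluster stays small.
 *
 * @param array{centroid: array<string,float>, size:int} $a
 * @param array{centroid: array<string,float>, size:int} $b
 */
function shouldMergeClusters(array $a, array $b, float $threshold = 0.82, int $maxSize = 8): bool
{
    if ($a['size'] + $b['size'] > $maxSize) {
        return false;
    }
    return cosineSimilarity($a['centroid'], $b['centroid']) >= $threshold;
}

$x = ['centroid' => ['gpu' => 0.9, 'launch' => 0.4], 'size' => 3];
$y = ['centroid' => ['gpu' => 0.8, 'launch' => 0.5], 'size' => 4];
$z = ['centroid' => ['gpu' => 0.8, 'launch' => 0.5], 'size' => 6];

$mergeXY = shouldMergeClusters($x, $y); // similar and 7 articles in total
$mergeXZ = shouldMergeClusters($x, $z); // similar, but 9 articles exceeds the cap
```

The size cap keeps a run of near-duplicate clusters from collapsing into one oversized cluster that would be hard to summarize.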
AutoSummarizationService
Takes a TopicCluster and generates a human-readable summary via Gemini AI, with a full audit trail in summary_generation_steps.
Step-by-step generation
- scrape_article_* — Scrapes each article in the cluster if needed.
- individual_summary_* — Summary per article (normal clusters).
- repeat_summary_* — Specialized steps for Hot Clusters; summarizes each unique story within the aggregate.
- master_summary — Cohesive synthesis of all articles or repeat summaries.
- suggest_title — AI proposes Huement-style titles from category templates.
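The step names above suggest an ordered plan per cluster. A minimal sketch for a normal (non-hot) cluster is shown below. The numeric suffix stands in for the `*` wildcard and the per-article interleaving is a guess; the real suffix format and ordering are not specified here, and `planSummarySteps` is a hypothetical name.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical: list audit-trail step keys for a normal cluster.
 * The suffix format is an assumption made for illustration.
 *
 * @param int[] $articleIds
 * @return string[]
 */
function planSummarySteps(array $articleIds): array
{
    $steps = [];
    foreach ($articleIds as $id) {
        $steps[] = "scrape_article_{$id}";
        $steps[] = "individual_summary_{$id}";
    }
    // Cluster-level steps run once, after all per-article steps.
    $steps[] = 'master_summary';
    $steps[] = 'suggest_title';
    return $steps;
}

$steps = planSummarySteps([101, 102]);
```

Recording one row per step key is what would make the audit trail in summary_generation_steps replayable step by step.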
PulseService
Generates daily Pulse stats per category: velocity, trend lines, trending topics, and the SVG charts used on the homepage and category pages.
Key functionality
- generateDailyPulse() — Runs per category with configurable time windows (7–21 days).
- Smart trend analysis — Compares the current window against the previous window of the same length; fast-moving categories (e.g. Tech) use a 7-day window, slower ones (e.g. Digital Art) use a 21-day window.
- Pulse type — trend_line (meaningful activity) or minimal_activity (low activity).
- SVG generation — generateTrendGraph() (sparklines), generateLargeTrendGraph() (area charts with gradients/glow).
// Category-specific time windows
'tech-news' => 7,
'artificial-intelligence' => 7,
'webdev' => 10,
'hardware' => 14,
'digital-art' => 21,
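The current-versus-previous comparison can be sketched as a percent change between two adjacent windows of equal length. The function `windowVelocity` and the sample counts are assumptions for illustration; the actual velocity formula used by PulseService is not given in this document.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical velocity: percent change in article count between the
 * previous window and the current window of the same length.
 *
 * @param int[] $dailyCounts oldest first; needs at least 2 * $windowDays entries
 * @return float|null percent change, or null when the previous window is empty
 */
function windowVelocity(array $dailyCounts, int $windowDays): ?float
{
    if (count($dailyCounts) < 2 * $windowDays) {
        throw new InvalidArgumentException('Not enough history for two windows.');
    }
    $current  = array_sum(array_slice($dailyCounts, -$windowDays));
    $previous = array_sum(array_slice($dailyCounts, -2 * $windowDays, $windowDays));
    if ($previous === 0) {
        return null; // growth from zero is undefined
    }
    return ($current - $previous) / $previous * 100.0;
}

$windows = ['tech-news' => 7, 'digital-art' => 21];
$counts = [2, 3, 3, 2, 4, 3, 3, 4, 4, 5, 3, 5, 4, 4]; // 14 days of made-up data
$velocity = windowVelocity($counts, $windows['tech-news']);
```

Here the previous week has 20 articles and the current week has 29, so the velocity is +45%. A longer window for slow categories smooths out days with no articles at all.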
Pulse Cards & Category Stats
Pulse cards (pulse-card.blade) show velocity, trend lines or minimal-activity view, activity trend (↗/↘), and trending topics. Category stats (category-stats.blade) show N-day velocity, trending topic badges, and total articles. Both use stats_json from DailyPulse (velocity, svg_points, trending_topics, confirmation_score, etc.).
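A value like svg_points can be produced by scaling daily counts into the coordinate space of a small SVG. The sketch below is one plausible approach, not the engine's generateTrendGraph() implementation; the function name, the 100 by 20 viewBox, and the rounding are assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical: map daily counts to an SVG polyline "points" string.
 * SVG's y axis grows downward, so the largest count maps to y = 0.
 *
 * @param int[] $counts
 */
function sparklinePoints(array $counts, int $width = 100, int $height = 20): string
{
    $n = count($counts);
    if ($n === 0) {
        return '';
    }
    $max = max(1, max($counts)); // avoid division by zero on all-zero data
    $stepX = $n > 1 ? $width / ($n - 1) : 0;
    $points = [];
    foreach (array_values($counts) as $i => $count) {
        $x = round($i * $stepX, 1);
        $y = round($height - ($count / $max) * $height, 1);
        $points[] = "{$x},{$y}";
    }
    return implode(' ', $points);
}

$points = sparklinePoints([0, 5, 10]);
$svg = '<svg viewBox="0 0 100 20"><polyline fill="none" stroke="currentColor" points="'
     . $points . '"/></svg>';
```

Storing the points string in stats_json rather than the full markup lets each Blade view wrap it in its own SVG styling.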
NarrativeShiftService
Detects how a story evolves over time by comparing the semantic summaries of current clusters against historical ones in the same category.
Key functionality
- What Changed Today — Surfaced on the homepage when a meaningful shift in the narrative is detected.
- Semantic Comparison — Uses AI to compare "Window A" vs "Window B" and identify new developments, rebuttals, or resolution of events.
// Detection window (default 7 days)
$shifts = $narrativeService->runDetectionAndRecordShifts();
// Persisted to narrative_shifts table
NanoBananaService
Automatically generates featured images for blog posts using Gemini 2.5 Flash Image.
Key functionality
- Prompt Synthesis — Builds descriptive image prompts from the article's summary, category, and tags.
- Style Injection — Ensures a consistent Huement visual style across all generated assets.
$result = $nanoBananaService->generateImage($prompt);
// Returns local storage path and token usage metadata
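Prompt synthesis and style injection can be pictured as simple string assembly before the call to generateImage(). The builder below is a hypothetical sketch: the function name, the prompt wording, and the style suffix are placeholders, not the engine's actual prompt text.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical prompt builder; the style suffix is a placeholder, not the
 * engine's real style text.
 *
 * @param string[] $tags
 */
function buildImagePrompt(string $summary, string $category, array $tags, string $styleSuffix): string
{
    // Keep the prompt compact: use only the first sentence of the summary.
    $firstSentence = trim(preg_split('/(?<=[.!?])\s+/', trim($summary), 2)[0]);
    $parts = [
        "Featured image for a {$category} article.",
        "Subject: {$firstSentence}",
    ];
    if ($tags !== []) {
        $parts[] = 'Themes: ' . implode(', ', $tags) . '.';
    }
    // Appending the same suffix to every prompt is what keeps the style consistent.
    $parts[] = $styleSuffix;
    return implode(' ', $parts);
}

$prompt = buildImagePrompt(
    'Chipmakers announced new low-power GPUs. Prices were not disclosed.',
    'hardware',
    ['gpu', 'low-power'],
    'Style: <house style description goes here>.'
);
```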
Statamic Integration
Seamlessly bridges the AI pipeline with our Statamic CMS via the Area of Interest (a_o_i) taxonomy.
- Automated Tagging — Maps AI-suggested tags to Statamic taxonomy terms.
- Blueprint Mapping — Ensures generated content fits the strict structural requirements of our blog blueprints.
Deep Intelligence
- Novelty — ClusterInsightService::computeNovelty() vs historical clusters; labels: New, Recurring, Ongoing.
- Controversy — computeControversyScore() from keywords in titles/descriptions; labels: low, medium, high.
- Hidden trends — Small clusters with high novelty/momentum or rapid 24h growth; "Early Trend Alert" in pulse modal.
- Coverage bias — Diversity (unique feeds / total articles); labels e.g. Industry-driven vs Grassroots.
- Narrative shifts — NarrativeShiftService and "What Changed Today" on the homepage.
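Two of the signals above reduce to short calculations. The sketch below computes diversity as unique feeds divided by total articles, as stated, and derives a controversy label from keyword hits. The keyword list and the label cut-offs (0.5 for high, 0.2 for medium) are illustrative assumptions, as are the function names; the real computeControversyScore() may work differently.

```php
<?php
declare(strict_types=1);

/**
 * Diversity = unique feeds / total articles.
 *
 * @param string[] $feedPerArticle one feed identifier per article
 */
function coverageDiversity(array $feedPerArticle): float
{
    $total = count($feedPerArticle);
    return $total === 0 ? 0.0 : count(array_unique($feedPerArticle)) / $total;
}

/**
 * Hypothetical keyword-based controversy label.
 *
 * @param string[] $texts titles and descriptions
 * @param string[] $keywords
 */
function controversyLabel(array $texts, array $keywords): string
{
    if ($texts === []) {
        return 'low';
    }
    $hits = 0;
    foreach ($texts as $text) {
        foreach ($keywords as $keyword) {
            if (stripos($text, $keyword) !== false) {
                $hits++;
                break; // count each text at most once
            }
        }
    }
    $ratio = $hits / count($texts);
    return $ratio >= 0.5 ? 'high' : ($ratio >= 0.2 ? 'medium' : 'low');
}

$diversity = coverageDiversity(['feed-a', 'feed-a', 'feed-b', 'feed-c']);
$label = controversyLabel(
    ['Lawsuit filed over model training data', 'New GPU benchmarks released',
     'Roadmap update', 'Conference recap'],
    ['lawsuit', 'backlash', 'ban']
);
```

A diversity near 1.0 means almost every article comes from a different feed, while a value near 0 means a few feeds dominate the cluster.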
Admin Dashboard & User Flow
Admin (/admin/content-engine): dashboard (last run, last pulse, tokens, limits), Content Engine settings (DB overrides), and Runs list with Livewire. Clustering and summarization are triggered from here.
User experience: Homepage pulse cards (e.g. "Tech News ↗ +45%") → category deep-dive with trending topics and SVG trend charts → time windows and thresholds adapt per category for meaningful insights.