The Content Engine
A multi-stage pipeline that transforms raw RSS feeds, Reddit posts, and other data sources into semantically clustered, AI-summarized, and trend-analyzed insights, from scrape to Pulse.
1. Scrape: Readability & Markdown
2. Tag: RAKE Keyword Extraction
3. Cluster: DBSCAN & TF-IDF
4. Summarize: Gemini AI Multi-Step
5. Pulse: Velocity & Trends
ContentScraper
Fetches and extracts main article content from a URL and cleans it for downstream processing.
Key methods
- scrapeArticleContent(string $url) — Fetches HTML, uses fivefilters/Readability to extract core content (no ads/nav), then converts to clean Markdown.
- scrapeArticleWithMetadata(string $url) — Same, plus title, author, site name, excerpt.
Content is converted to Markdown via a ContentCleaner, which gives clustering a compact, high-density input.
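// $html holds the fetched page; $readability is a configured Readability instance.
$readability->parse($html);
// Main article body as HTML, with ads and navigation stripped.
$content = $readability->getContent();
// Convert the extracted HTML to Markdown.
$contentCleaner = new ContentCleaner;
$cleanContent = $contentCleaner->htmlToMarkdown($content);
return [
    'content' => $cleanContent,
    'title' => $readability->getTitle(),
    'author' => $readability->getAuthor(),
    'site_name' => $readability->getSiteName(),
    'excerpt' => $readability->getExcerpt(),
];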
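The HTML-to-Markdown step can be illustrated with a deliberately small converter. This is only a sketch: the function name simpleHtmlToMarkdown() is hypothetical, the real ContentCleaner is not shown in this document, and the sketch covers just headings, paragraphs, links, and bold/italic text.

```php
<?php
// Hypothetical helper: a tiny HTML-to-Markdown converter for illustration.
// It is NOT the project's ContentCleaner, whose implementation is not shown.
function simpleHtmlToMarkdown(string $html): string
{
    // Drop script and style blocks entirely, including their contents.
    $html = preg_replace('~<(script|style)\b[^>]*>.*?</\1>~is', '', $html);

    // Headings h1..h6 become "#"-prefixed lines.
    $html = preg_replace_callback(
        '~<h([1-6])\b[^>]*>(.*?)</h\1>~is',
        fn (array $m) => "\n\n" . str_repeat('#', (int) $m[1]) . ' ' . trim($m[2]) . "\n\n",
        $html
    );

    // Links become [text](href).
    $html = preg_replace('~<a\b[^>]*href="([^"]*)"[^>]*>(.*?)</a>~is', '[$2]($1)', $html);

    // Bold and italic.
    $html = preg_replace('~<(strong|b)\b[^>]*>(.*?)</\1>~is', '**$2**', $html);
    $html = preg_replace('~<(em|i)\b[^>]*>(.*?)</\1>~is', '*$2*', $html);

    // Paragraph ends become blank lines, <br> becomes a newline.
    $html = preg_replace('~</p>~i', "\n\n", $html);
    $html = preg_replace('~<br\s*/?>~i', "\n", $html);

    // Remove remaining tags, decode entities, collapse extra blank lines.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $text = preg_replace("~[ \t]+\n~", "\n", $text);
    $text = preg_replace("~\n{3,}~", "\n\n", $text);

    return trim($text);
}
```

A regex-based converter like this breaks on nested or malformed markup, so a production cleaner would more likely rely on a DOM parser or a dedicated conversion library.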
TagSuggester
Analyzes text and suggests relevant, SEO-friendly tags using the RAKE algorithm.
Key functionality
- suggestTags(string $text, int $numTags = 5) — Uses donatelloza/rake-plus (Rapid Automatic Keyword Extraction): word frequency and co-occurrence.
- Blacklist filtering — Filters out non-descriptive words (e.g. "today", "article") so only meaningful keywords are suggested.
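$rake = RakePlus::create($text, 'en_US');
$phrases = $rake->keywords();
$filteredTags = [];
foreach ($phrases as $phrase) {
    // Normalize first so the blacklist check is case-insensitive.
    $cleanPhrase = Str::lower(trim($phrase));
    if ($this->isBlacklisted($cleanPhrase)) continue;
    $filteredTags[] = Str::slug($cleanPhrase);
}
// Return at most $numTags suggestions.
return array_slice($filteredTags, 0, $numTags);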
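To make the "word frequency and co-occurrence" idea concrete, here is a minimal RAKE scorer in plain PHP. It is a sketch, not the library's code: rakeSketch() and its stop-word argument are hypothetical, and real implementations ship full stop-word lists per language.

```php
<?php
// Hypothetical, minimal RAKE scorer for illustration only.
// $stopWords is a caller-supplied list; real libraries bundle much larger ones.
function rakeSketch(string $text, array $stopWords): array
{
    $stop = array_flip($stopWords);
    $phrases = [];

    // Split on punctuation first, then break each fragment at stop words.
    foreach (preg_split('~[.,;:!?()\n]+~', strtolower($text)) as $fragment) {
        $current = [];
        foreach (preg_split('~\s+~', trim($fragment), -1, PREG_SPLIT_NO_EMPTY) as $word) {
            if (isset($stop[$word])) {
                if ($current) { $phrases[] = $current; }
                $current = [];
            } else {
                $current[] = $word;
            }
        }
        if ($current) { $phrases[] = $current; }
    }

    // Word frequency and degree (co-occurrence inside candidate phrases).
    $freq = [];
    $degree = [];
    foreach ($phrases as $words) {
        foreach ($words as $word) {
            $freq[$word] = ($freq[$word] ?? 0) + 1;
            $degree[$word] = ($degree[$word] ?? 0) + count($words);
        }
    }

    // Phrase score = sum of degree/frequency over the words in the phrase.
    $scores = [];
    foreach ($phrases as $words) {
        $score = 0.0;
        foreach ($words as $word) {
            $score += $degree[$word] / $freq[$word];
        }
        $scores[implode(' ', $words)] = $score;
    }
    arsort($scores);

    return $scores; // phrase => score, highest first
}
```

Longer phrases made of words that rarely appear alone score highest, which is why multi-word keywords tend to surface ahead of single common terms.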
ArticleClusteringService
Groups semantically similar articles into Topic Clusters—the core of trending story detection.
Key functionality
- runFullClustering() — Fetches recent articles, categorizes them (TagSuggester + taxonomy), then runs clustering per category.
- DBSCAN (Rubix ML) — Density-based clustering; finds clusters of arbitrary shape and treats outliers as noise.
- TF-IDF — Titles and descriptions are vectorized so DBSCAN can measure semantic distance.
- Dynamic Epsilon — Category-specific neighborhood size (e.g. tech-news stricter, media-we-love looser).
$epsilon = match ($categorySlug) {
    'tech-news', 'artificial-intelligence' => 0.35, // Stricter
    'media-we-love' => 0.45, // Looser
    default => 0.4,
};
$this->epsilon = $epsilon;
$result = $this->group($articlesInCategory);
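The interplay of TF-IDF, epsilon, and noise is easier to see in a self-contained sketch. The code below is an illustration in plain PHP, not the Rubix ML implementation the service uses: the function names are hypothetical, and cosine distance is an assumed metric, since the document does not say which distance kernel is configured.

```php
<?php
// Hypothetical helpers for illustration; the project uses Rubix ML instead.

/** Build L2-normalised TF-IDF vectors (term => weight) for a list of texts. */
function tfidfVectors(array $texts): array
{
    $docs = array_map(
        fn (string $t) => array_count_values(preg_split('~\W+~', strtolower($t), -1, PREG_SPLIT_NO_EMPTY)),
        $texts
    );
    $docFreq = [];
    foreach ($docs as $counts) {
        foreach (array_keys($counts) as $term) {
            $docFreq[$term] = ($docFreq[$term] ?? 0) + 1;
        }
    }
    $n = count($docs);
    $vectors = [];
    foreach ($docs as $counts) {
        $vec = [];
        foreach ($counts as $term => $tf) {
            // Smoothed IDF so terms present in every document keep a small weight.
            $vec[$term] = $tf * (log((1 + $n) / (1 + $docFreq[$term])) + 1);
        }
        $norm = sqrt(array_sum(array_map(fn ($w) => $w * $w, $vec))) ?: 1.0;
        $vectors[] = array_map(fn ($w) => $w / $norm, $vec);
    }
    return $vectors;
}

/** Cosine distance between two unit-length sparse vectors (assumed metric). */
function cosineDistance(array $a, array $b): float
{
    $dot = 0.0;
    foreach ($a as $term => $w) {
        $dot += $w * ($b[$term] ?? 0.0);
    }
    return 1.0 - $dot;
}

/** DBSCAN over sparse vectors; returns one label per item, -1 meaning noise. */
function dbscan(array $vectors, float $epsilon, int $minPoints): array
{
    $n = count($vectors);
    $labels = array_fill(0, $n, null);
    $neighbours = function (int $i) use ($vectors, $epsilon, $n): array {
        $out = [];
        for ($j = 0; $j < $n; $j++) {
            if (cosineDistance($vectors[$i], $vectors[$j]) <= $epsilon) {
                $out[] = $j;
            }
        }
        return $out;
    };
    $cluster = -1;
    for ($i = 0; $i < $n; $i++) {
        if ($labels[$i] !== null) { continue; }
        $seeds = $neighbours($i);
        if (count($seeds) < $minPoints) { $labels[$i] = -1; continue; }
        $cluster++;
        $labels[$i] = $cluster;
        for ($k = 0; $k < count($seeds); $k++) {
            $q = $seeds[$k];
            if ($labels[$q] === -1) { $labels[$q] = $cluster; } // noise becomes a border point
            if ($labels[$q] !== null) { continue; }
            $labels[$q] = $cluster;
            $qNeighbours = $neighbours($q);
            if (count($qNeighbours) >= $minPoints) {
                $seeds = array_merge($seeds, $qNeighbours); // q is a core point, keep expanding
            }
        }
    }
    return $labels;
}
```

A smaller epsilon demands more shared vocabulary before two headlines count as neighbours, which matches the "stricter" setting for fast-moving categories. Articles that never reach the minimum neighbour count stay labelled -1 and are treated as noise rather than forced into a cluster.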
AutoSummarizationService
Takes a TopicCluster and generates a human-readable summary via Gemini AI, with a full audit trail in summary_generation_steps.
Step-by-step generation
- scrape_article_* — Scrapes each article in the cluster if needed.
- individual_summary_* — Summary per article.
- master_summary — Single cohesive summary of the topic.
- suggest_title — SEO-friendly cluster title from the master summary.
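The step sequence above can be sketched as a small function that records every step as it runs. Everything here is an assumption made for illustration: summarizeCluster(), the $scrape and $llm callables (stand-ins for ContentScraper and the Gemini call), the prompt wording, and the name/status/output fields, which are not the real summary_generation_steps schema.

```php
<?php
// Hypothetical sketch: $llm stands in for the Gemini call, $scrape for ContentScraper.
// The step-record fields (name, status, output) are assumed, not the real schema.
function summarizeCluster(array $articles, callable $scrape, callable $llm): array
{
    $steps = [];

    // Run one unit of work and append an audit record whether it succeeds or fails.
    $record = function (string $name, callable $work) use (&$steps) {
        try {
            $output = $work();
            $steps[] = ['name' => $name, 'status' => 'ok', 'output' => $output];
            return $output;
        } catch (Throwable $e) {
            $steps[] = ['name' => $name, 'status' => 'failed', 'output' => $e->getMessage()];
            return null;
        }
    };

    $summaries = [];
    foreach ($articles as $id => $article) {
        // Scrape only when the article has no stored content yet.
        $content = $article['content']
            ?? $record("scrape_article_{$id}", fn () => $scrape($article['url']));
        if ($content === null) { continue; }

        $summary = $record("individual_summary_{$id}", fn () => $llm("Summarize:\n" . $content));
        if ($summary !== null) { $summaries[] = $summary; }
    }

    $master = $record('master_summary', fn () => $llm("Combine into one summary:\n" . implode("\n", $summaries)));
    $title = $master === null
        ? null
        : $record('suggest_title', fn () => $llm("Suggest a title for:\n" . $master));

    return ['summary' => $master, 'title' => $title, 'steps' => $steps];
}
```

Because failures are recorded rather than thrown, one unreachable article does not abort the whole cluster, and the audit trail shows exactly which step went wrong.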
PulseService
Generates daily Pulse stats per category: velocity, trend lines, trending topics, and the SVG charts used on the homepage and category pages.
Key functionality
- generateDailyPulse() — Runs per category with configurable time windows (7–21 days).
- Smart trend analysis — Compares the current window against the previous one; fast-moving categories (e.g. Tech) use a 7-day window, slower ones (e.g. Digital Art) use 21 days.
- Pulse type — trend_line (meaningful activity) or minimal_activity (low activity).
- SVG generation — generateTrendGraph() (sparklines), generateLargeTrendGraph() (area charts with gradients/glow).
// Category-specific time windows (in days)
'tech-news' => 7,
'artificial-intelligence' => 7,
'webdev' => 10,
'hardware' => 14,
'digital-art' => 21,
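A window comparison of this kind might look like the sketch below. The document does not define velocity, so the code assumes it means articles per day; pulseStats(), the returned field names, and the minimum of 5 articles for a trend_line pulse are likewise invented for illustration.

```php
<?php
// Hypothetical sketch: velocity is assumed to mean articles per day, and the
// minimum-activity threshold of 5 articles is an invented illustration value.
function pulseStats(array $dailyCounts, int $windowDays, int $minArticles = 5): array
{
    // $dailyCounts is ordered oldest to newest and must cover two full windows.
    if (count($dailyCounts) < 2 * $windowDays) {
        throw new InvalidArgumentException('Need at least two windows of daily counts.');
    }
    $current = array_sum(array_slice($dailyCounts, -$windowDays));
    $previous = array_sum(array_slice($dailyCounts, -2 * $windowDays, $windowDays));

    // Percentage change versus the previous window; null when there is no baseline.
    $change = $previous > 0 ? round(($current - $previous) / $previous * 100, 1) : null;

    if ($change === null || $change == 0.0) {
        $direction = 'stable';
    } else {
        $direction = $change > 0 ? 'up' : 'down';
    }

    return [
        'velocity' => round($current / $windowDays, 2),
        'change_percent' => $change,
        'direction' => $direction,
        'pulse_type' => $current >= $minArticles ? 'trend_line' : 'minimal_activity',
    ];
}
```

With a 7-day window, 20 articles in the previous week and 29 in the current one give a change of +45%, the kind of figure a pulse card would show next to the ↗ arrow.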
Pulse Cards & Category Stats
Pulse cards (pulse-card.blade) show velocity, trend lines or minimal-activity view, activity trend (↗/↘), and trending topics. Category stats (category-stats.blade) show N-day velocity, trending topic badges, and total articles. Both use stats_json from DailyPulse (velocity, svg_points, trending_topics, confirmation_score, etc.).
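As a rough illustration of how a series of daily counts could become sparkline points, consider the sketch below. It is not the project's generateTrendGraph(): the function name, default dimensions, and markup are assumptions, and the real svg_points format is not documented here.

```php
<?php
// Hypothetical sketch of a sparkline; not the project's generateTrendGraph().
function sparklineSvg(array $values, int $width = 100, int $height = 30): string
{
    $count = count($values);
    if ($count < 2) {
        return ''; // a line needs at least two points
    }
    $min = min($values);
    $range = (max($values) - $min) ?: 1; // avoid division by zero on a flat series

    $points = [];
    foreach (array_values($values) as $i => $value) {
        $x = round($i / ($count - 1) * $width, 1);
        // SVG y grows downward, so larger values get a smaller y.
        $y = round($height - ($value - $min) / $range * $height, 1);
        $points[] = "{$x},{$y}";
    }

    return sprintf(
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 %d %d">'
        . '<polyline fill="none" stroke="currentColor" points="%s"/></svg>',
        $width,
        $height,
        implode(' ', $points)
    );
}
```

Scaling against the series minimum and maximum keeps quiet categories readable, at the cost of making charts from different categories incomparable in absolute terms.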
Deep Intelligence
- Novelty — ClusterInsightService::computeNovelty() vs historical clusters; labels: New, Recurring, Ongoing.
- Controversy — computeControversyScore() from keywords in titles/descriptions; labels: low, medium, high.
- Hidden trends — Small clusters with high novelty/momentum or rapid 24h growth; "Early Trend Alert" in pulse modal.
- Coverage bias — Diversity (unique feeds / total articles); labels e.g. Industry-driven vs Grassroots.
- Narrative shifts — NarrativeShiftService and "What Changed Today" on the homepage.
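Two of these signals are simple enough to sketch directly. The diversity ratio follows the formula given above (unique feeds / total articles). Everything else is assumed for illustration: the function names, the feed_id field, the 0.5 and 0.2 thresholds, the mapping of low diversity to "Industry-driven", and the keyword list passed in by the caller.

```php
<?php
// Hypothetical sketch: thresholds, label mapping, and field names are
// illustration values, not the ones used by ClusterInsightService.
function coverageDiversity(array $articles): array
{
    $total = count($articles);
    if ($total === 0) {
        return ['diversity' => 0.0, 'label' => null];
    }
    $uniqueFeeds = count(array_unique(array_column($articles, 'feed_id')));
    $diversity = $uniqueFeeds / $total;

    // Assumed mapping: few feeds producing most articles reads as industry-driven.
    return [
        'diversity' => round($diversity, 2),
        'label' => $diversity >= 0.5 ? 'Grassroots' : 'Industry-driven',
    ];
}

function controversyLabel(array $articles, array $keywords): string
{
    $hits = 0;
    foreach ($articles as $article) {
        $text = strtolower(($article['title'] ?? '') . ' ' . ($article['description'] ?? ''));
        foreach ($keywords as $keyword) {
            if (strpos($text, strtolower($keyword)) !== false) {
                $hits++;
                break; // count each article at most once
            }
        }
    }
    // Share of articles that mention at least one controversy keyword.
    $share = $articles ? $hits / count($articles) : 0.0;

    return $share >= 0.5 ? 'high' : ($share >= 0.2 ? 'medium' : 'low');
}
```

Plain substring matching over-counts (a keyword like "ban" also matches "urban"), so a real scorer would likely tokenize the text first.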
Admin Dashboard & User Flow
Admin (/admin/content-engine): dashboard (last run, last pulse, tokens, limits), Content Engine settings (DB overrides), and Runs list with Livewire. Clustering and summarization are triggered from here.
User experience: Homepage pulse cards (e.g. "Tech News ↗ +45%") → category deep-dive with trending topics and SVG trend charts → time windows and thresholds adapt per category for meaningful insights.