# Feed

RSS feed downloader and storage. Fetches articles from RSS sources, parses them with Floki, and stores them in SQLite with deduplication via unique indexes.

---

## Table of Contents

1. [Overview](#overview)
2. [Features](#features)
3. [Configuration](#configuration)
4. [Database Schema](#database-schema)
5. [Usage](#usage)
6. [API Reference](#api-reference)
7. [Internals](#internals)
8. [Troubleshooting](#troubleshooting)
9. [Related Documentation](#related-documentation)

---

## Overview

`Feed` is a functional module (no GenServer) that owns a Sqler instance registered as `:feeds_db`. The database file is `data/feeds.sqlite`. Each feed source gets its own table within that database.

Currently supports one source:

| Source | Table | Feed URL |
|--------|-------|----------|
| Crunchbase News | `crunchbase` | `https://news.crunchbase.com/feed/` |

The module is designed for easy extension: adding a new feed source requires a new table, a URL constant, and a `fetch_*`/`list_*` function pair.

### Design Decisions

- **No GenServer** — the module is stateless. The Sqler process handles concurrency and persistence.
- **Lazy initialization** — the database starts on the first API call via `start/0`. No supervision tree wiring required.
- **Floki for RSS parsing** — RSS is structurally close enough to HTML that Floki handles it without a dedicated XML parser, avoiding an extra dependency.
- **Deduplication via unique index** — duplicate titles are rejected at the SQLite level. No upsert logic needed.

## Features

| Feature | Description |
|---------|-------------|
| RSS fetch and parse | Downloads RSS XML via Req, parses with Floki |
| SQLite storage | Each feed source stored in its own table |
| Deduplication | Unique index on `title` prevents duplicate articles |
| Lazy start | Database initializes on first call, idempotent |
| Structured output | Query results returned as maps with atom keys |

## Configuration

No external configuration required.
All constants are module attributes:

| Attribute | Value | Description |
|-----------|-------|-------------|
| `@db` | `:feeds_db` | Registered name for the Sqler process |
| `@crunchbase_table` | `"crunchbase"` | SQLite table name |
| `@crunchbase_url` | `"https://news.crunchbase.com/feed/"` | RSS feed URL |

Database file: `data/feeds.sqlite`

## Database Schema

### `crunchbase` table

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | INTEGER | PRIMARY KEY | Millisecond timestamp (auto-generated by Sqler) |
| `updated_at` | INTEGER | | Optimistic locking (auto-generated by Sqler) |
| `title` | TEXT | NOT NULL, UNIQUE | Article headline |
| `link` | TEXT | | Article URL |
| `author` | TEXT | | Author name (from `dc:creator`) |
| `pub_date` | TEXT | | Publication date string from RSS |
| `description` | TEXT | | Article summary/excerpt |
| `categories` | TEXT | | Comma-separated tags |
| `guid` | TEXT | | RSS globally unique identifier |

### Indexes

| Index | Columns | Type | Purpose |
|-------|---------|------|---------|
| `idx_crunchbase_title` | `title` | UNIQUE | Prevents duplicate articles |

## Usage

### Fetch Latest Articles

```elixir
# First run — inserts all articles from the feed
Feed.fetch_crunchbase()
#=> {:ok, %{inserted: 10, skipped: 0, total: 10}}

# Subsequent run — duplicates skipped
Feed.fetch_crunchbase()
#=> {:ok, %{inserted: 0, skipped: 10, total: 10}}
```

### Query Stored Articles

```elixir
# Latest 5 articles
Feed.list_crunchbase(limit: 5)
#=> [
#=>   %{
#=>     id: 1740643200000,
#=>     updated_at: 1740643200000,
#=>     title: "Fintech Plaid Completes Tender Offer At $8B Valuation",
#=>     link: "https://news.crunchbase.com/fintech/plaid-completes-tender-offer...",
#=>     author: "Judy Rider",
#=>     pub_date: "Thu, 26 Feb 2026 18:02:27 +0000",
#=>     description: "Fintech infrastructure company Plaid revealed...",
#=>     categories: "Artificial intelligence, Fintech, Startups, AI",
#=>     guid: "https://news.crunchbase.com/?p=93186"
#=>   },
#=>   ...
#=> ]

# Default limit is 20
Feed.list_crunchbase()

# Count total stored articles
Feed.count_crunchbase()
#=> 10
```

### Manual Database Initialization

```elixir
# Normally not needed — called automatically by other functions
Feed.start()
#=> :ok
```

## API Reference

### `start/0`

Initializes the Sqler database process and creates tables if they don't exist. Idempotent — safe to call multiple times. Called automatically by all other public functions.

**Returns:** `:ok`

```elixir
Feed.start()
#=> :ok

# Already running — no-op
Feed.start()
#=> :ok
```

---

### `fetch_crunchbase/0`

Fetches the Crunchbase News RSS feed, parses all `<item>` elements, and inserts new articles into the `crunchbase` table. Articles with duplicate titles are silently skipped.

**Returns:**

- `{:ok, %{inserted: integer, skipped: integer, total: integer}}` — on success
- `{:error, reason}` — on HTTP failure, parse error, or exception

```elixir
Feed.fetch_crunchbase()
#=> {:ok, %{inserted: 10, skipped: 0, total: 10}}

# Network error
Feed.fetch_crunchbase()
#=> {:error, %Mint.TransportError{reason: :nxdomain}}
```

> **Note:** The feed typically contains ~10 articles. Crunchbase updates it hourly.

---

### `list_crunchbase/1`

Returns stored articles from the `crunchbase` table, ordered by insertion time (newest first).

**Parameters:**

- `opts` (keyword list, optional)
  - `:limit` (integer) — maximum number of articles to return. Default: `20`

**Returns:** list of maps with atom keys

```elixir
# Get latest 3 articles
Feed.list_crunchbase(limit: 3)
#=> [%{title: "...", link: "...", author: "...", ...}, ...]

# Default (up to 20)
Feed.list_crunchbase()
```

**Map keys:** `:id`, `:updated_at`, `:title`, `:link`, `:author`, `:pub_date`, `:description`, `:categories`, `:guid`

---

### `count_crunchbase/0`

Returns the total number of articles stored in the `crunchbase` table.
**Returns:** integer

```elixir
Feed.count_crunchbase()
#=> 10
```

## Internals

### Fetch Flow

```
Feed.fetch_crunchbase()
│
├── start()                     Ensure Sqler is running
│
├── Req.get(crunchbase_url)     HTTP GET the RSS XML
│
├── parse_rss(body)             Floki.parse_document → find "item" elements
│   ├── Extract: title, link, author, pub_date, description, categories, guid
│   └── Reject items with empty/nil titles
│
└── insert_items(table, items)  Sqler.insert each item
    ├── {:ok, _id}  → count as inserted
    └── {:error, _} → count as skipped (duplicate title)
```

### RSS Parsing Details

- RSS XML is parsed as HTML via `Floki.parse_document/2` with `attributes_as_maps: true`
- `<item>` elements are located with `Floki.find(doc, "item")`
- The `dc:creator` namespace prefix requires escaping in CSS selectors: `"dc\\:creator"`
- Multiple `<category>` elements per item are collected and joined with `", "`
- Items with nil or empty titles are rejected before insertion

### Deduplication

SQLite's unique constraint on `title` causes `Sqler.insert/3` to return `{:error, _}` for duplicates. The module counts these as skips rather than raising.
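A minimal sketch of this skip-counting, assuming `Sqler.insert/3` takes the registered name, table, and item map and returns `{:ok, id}` or `{:error, reason}` as described above (the helper name `insert_items/2` appears in the fetch flow, but its exact body here is an assumption, not the actual implementation):

```elixir
# Hypothetical sketch — tallies successful inserts vs. unique-index rejections.
# Assumes Sqler.insert/3 returns {:ok, id} | {:error, reason} as documented.
defp insert_items(table, items) do
  results = Enum.map(items, fn item -> Sqler.insert(@db, table, item) end)
  inserted = Enum.count(results, &match?({:ok, _}, &1))

  %{inserted: inserted, skipped: length(results) - inserted, total: length(results)}
end
```

Because the unique index does the duplicate detection, this loop never has to query for existing titles first; a failed insert simply becomes a skip.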
This means:

- First fetch: all articles inserted
- Subsequent fetches: only genuinely new articles inserted
- No need for pre-fetch queries to check existence

## Troubleshooting

| Problem | Cause | Fix |
|---------|-------|-----|
| `{:error, %Mint.TransportError{}}` | Network issue or DNS failure | Check internet connectivity |
| All articles skipped on first run | Database already populated from a previous session | Expected behavior — `data/feeds.sqlite` persists across restarts |
| Empty results from `list_crunchbase` | Feed not yet fetched | Run `Feed.fetch_crunchbase()` first |
| `parse_rss` returns empty list | RSS format changed or Floki can't parse | Check feed URL manually; inspect raw XML |

## Related Documentation

- [Sqler](sqler.md) — SQLite wrapper used for database operations
- [BlogStore](blog_store.md) — Similar pattern: a functional module owning a Sqler instance with text storage

---

*Source: `lib/feed.ex` — Last updated: 2026-02-27*