# Feed

RSS feed downloader and storage. Fetches articles from RSS sources, parses them with Floki, and stores them in SQLite with deduplication via unique indexes.

---

## Table of Contents

1. [Overview](#overview)
2. [Features](#features)
3. [Configuration](#configuration)
4. [Database Schema](#database-schema)
5. [Usage](#usage)
6. [API Reference](#api-reference)
7. [Internals](#internals)
8. [Troubleshooting](#troubleshooting)
9. [Related Documentation](#related-documentation)

---

## Overview

`Feed` is a functional module (no GenServer) that owns a Sqler instance registered as `:feeds_db`. The database file is `data/feeds.sqlite`. Each feed source gets its own table within that database.

Currently supports one source:

| Source | Table | Feed URL |
|--------|-------|----------|
| Crunchbase News | `crunchbase` | `https://news.crunchbase.com/feed/` |

The module is designed for easy extension: adding a new feed source requires a new table, a URL constant, and a `fetch_*`/`list_*` function pair.

### Design Decisions

- **No GenServer** — the module is stateless. The Sqler process handles concurrency and persistence.
- **Lazy initialization** — the database starts on the first API call via `start/0`. No supervision tree wiring required.
- **Floki for RSS parsing** — RSS is structurally close enough to HTML that Floki handles it without a dedicated XML parser, avoiding an extra dependency.
- **Deduplication via unique index** — duplicate titles are rejected at the SQLite level. No upsert logic needed.

## Features

| Feature | Description |
|---------|-------------|
| RSS fetch and parse | Downloads RSS XML via Req, parses with Floki |
| SQLite storage | Each feed source stored in its own table |
| Deduplication | Unique index on `title` prevents duplicate articles |
| Lazy start | Database initializes on first call, idempotent |
| Structured output | Query results returned as maps with atom keys |

## Configuration

No external configuration required.
All constants are module attributes:

| Attribute | Value | Description |
|-----------|-------|-------------|
| `@db` | `:feeds_db` | Registered name for the Sqler process |
| `@crunchbase_table` | `"crunchbase"` | SQLite table name |
| `@crunchbase_url` | `"https://news.crunchbase.com/feed/"` | RSS feed URL |

Database file: `data/feeds.sqlite`

## Database Schema

### `crunchbase` table

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `id` | INTEGER | PRIMARY KEY | Millisecond timestamp (auto-generated by Sqler) |
| `updated_at` | INTEGER | | Optimistic locking (auto-generated by Sqler) |
| `title` | TEXT | NOT NULL, UNIQUE | Article headline |
| `link` | TEXT | | Article URL |
| `author` | TEXT | | Author name (from `dc:creator`) |
| `pub_date` | TEXT | | Publication date string from RSS |
| `description` | TEXT | | Article summary/excerpt |
| `categories` | TEXT | | Comma-separated tags |
| `guid` | TEXT | | RSS globally unique identifier |

### Indexes

| Index | Columns | Type | Purpose |
|-------|---------|------|---------|
| `idx_crunchbase_title` | `title` | UNIQUE | Prevents duplicate articles |

## Usage

### Fetch Latest Articles

```elixir
# First run — inserts all articles from the feed
Feed.fetch_crunchbase()
#=> {:ok, %{inserted: 10, skipped: 0, total: 10}}

# Subsequent run — duplicates skipped
Feed.fetch_crunchbase()
#=> {:ok, %{inserted: 0, skipped: 10, total: 10}}
```

### Query Stored Articles

```elixir
# Latest 5 articles
Feed.list_crunchbase(limit: 5)
#=> [
#=>   %{
#=>     id: 1740643200000,
#=>     updated_at: 1740643200000,
#=>     title: "Fintech Plaid Completes Tender Offer At $8B Valuation",
#=>     link: "https://news.crunchbase.com/fintech/plaid-completes-tender-offer...",
#=>     author: "Judy Rider",
#=>     pub_date: "Thu, 26 Feb 2026 18:02:27 +0000",
#=>     description: "Fintech infrastructure company Plaid revealed...",
#=>     categories: "Artificial intelligence, Fintech, Startups, AI",
#=>     guid: "https://news.crunchbase.com/?p=93186"
#=>   },
#=>   ...
#=> ]

# Default limit is 20
Feed.list_crunchbase()

# Count total stored articles
Feed.count_crunchbase()
#=> 10
```

### Manual Database Initialization

```elixir
# Normally not needed — called automatically by other functions
Feed.start()
#=> :ok
```

## API Reference

### `start/0`

Initializes the Sqler database process and creates tables if they don't exist. Idempotent — safe to call multiple times. Called automatically by all other public functions.

**Returns:** `:ok`

```elixir
Feed.start()
#=> :ok

# Already running — no-op
Feed.start()
#=> :ok
```

---

### `fetch_crunchbase/0`

Fetches the Crunchbase News RSS feed, parses all `<item>` elements, and inserts new articles into the `crunchbase` table. Articles with duplicate titles are silently skipped.

**Returns:**

- `{:ok, %{inserted: integer, skipped: integer, total: integer}}` — on success
- `{:error, reason}` — on HTTP failure, parse error, or exception

```elixir
Feed.fetch_crunchbase()
#=> {:ok, %{inserted: 10, skipped: 0, total: 10}}

# Network error
Feed.fetch_crunchbase()
#=> {:error, %Mint.TransportError{reason: :nxdomain}}
```

> **Note:** The feed typically contains ~10 articles. Crunchbase updates it hourly.

---

### `list_crunchbase/1`

Returns stored articles from the `crunchbase` table, ordered by insertion time (newest first).

**Parameters:**

- `opts` (keyword list, optional)
  - `:limit` (integer) — maximum number of articles to return. Default: `20`

**Returns:** list of maps with atom keys

```elixir
# Get latest 3 articles
Feed.list_crunchbase(limit: 3)
#=> [%{title: "...", link: "...", author: "...", ...}, ...]

# Default (up to 20)
Feed.list_crunchbase()
```

**Map keys:** `:id`, `:updated_at`, `:title`, `:link`, `:author`, `:pub_date`, `:description`, `:categories`, `:guid`

---

### `count_crunchbase/0`

Returns the total number of articles stored in the `crunchbase` table.
**Returns:** integer

```elixir
Feed.count_crunchbase()
#=> 10
```

## Internals

### Fetch Flow

```
Feed.fetch_crunchbase()
│
├── start()                     Ensure Sqler is running
│
├── Req.get(crunchbase_url)     HTTP GET the RSS XML
│
├── parse_rss(body)             Floki.parse_document → find "item" elements
│   ├── Extract: title, link, author, pub_date, description, categories, guid
│   └── Reject items with empty/nil titles
│
└── insert_items(table, items)  Sqler.insert each item
    ├── {:ok, _id}  → count as inserted
    └── {:error, _} → count as skipped (duplicate title)
```

### RSS Parsing Details

- RSS XML is parsed as HTML via `Floki.parse_document/2` with `attributes_as_maps: true`
- `<item>` elements are located with `Floki.find(doc, "item")`
- The `dc:creator` namespace prefix requires escaping in CSS selectors: `"dc\\:creator"`
- Multiple `<category>` elements per item are collected and joined with `", "`
- Items with nil or empty titles are rejected before insertion

### Deduplication

SQLite's unique constraint on `title` causes `Sqler.insert/3` to return `{:error, _}` for duplicates. The module counts these as skips rather than raising.
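A minimal sketch of this skip-counting, assuming `Sqler.insert/3` takes the registered name, table, and item map and returns `{:ok, id}` or `{:error, reason}` as described above (the helper name `insert_items/2` appears in the fetch flow, but its exact body here is an assumption, not the actual implementation):

```elixir
# Hypothetical sketch — tallies successful inserts vs. unique-index rejections.
# Assumes Sqler.insert/3 returns {:ok, id} | {:error, reason} as documented.
defp insert_items(table, items) do
  results = Enum.map(items, fn item -> Sqler.insert(@db, table, item) end)
  inserted = Enum.count(results, &match?({:ok, _}, &1))

  %{inserted: inserted, skipped: length(results) - inserted, total: length(results)}
end
```

Because the unique index does the duplicate detection, this loop never has to query for existing titles first; a failed insert simply becomes a skip.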
This means:

- First fetch: all articles inserted
- Subsequent fetches: only genuinely new articles inserted
- No need for pre-fetch queries to check existence

## Troubleshooting

| Problem | Cause | Fix |
|---------|-------|-----|
| `{:error, %Mint.TransportError{}}` | Network issue or DNS failure | Check internet connectivity |
| All articles skipped on first run | Database already populated from a previous session | Expected behavior — `data/feeds.sqlite` persists across restarts |
| Empty results from `list_crunchbase` | Feed not yet fetched | Run `Feed.fetch_crunchbase()` first |
| `parse_rss` returns empty list | RSS format changed or Floki can't parse | Check feed URL manually; inspect raw XML |

## Related Documentation

- [Sqler](sqler.md) — SQLite wrapper used for database operations
- [BlogStore](blog_store.md) — Similar pattern: a functional module owning a Sqler instance with text storage

---

*Source: `lib/feed.ex` — Last updated: 2026-02-27*