# BlogStore

Stateless module that indexes blog posts into SQLite for vector and full-text search. Takes the file-based blog posts from `asset/blogs/` (managed by `Blog`), chunks them into embedding-sized pieces, generates VoyageAI embeddings, and stores everything in a searchable database.

**File:** `lib/blog_store.ex`
**Database:** `Sqler :blogs_store_db` (with `sqlite-vec` extension)
**Permission:** Gated through `Permissions.Blogs` (key 50,001)

---

## Why It Exists

`Blog` is a file-based system -- Markdown and HTML files in `asset/blogs/`. That's great for authoring and reading, but you can't do semantic search over files. BlogStore bridges that gap:

1. Imports blog files into SQLite with extracted metadata (title, author, date, word count)
2. Chunks the text using `TextChunker` (2000-char targets, 3-sentence overlap between chunks)
3. Embeds each chunk via VoyageAI (`voyage-3`, 1024 dimensions)
4. Indexes chunks in both `vec0` (vector cosine search) and FTS5 (keyword search)

The `BlogFileWatcher` GenServer monitors `asset/blogs/` for file changes and triggers automatic import/refresh, so the search index stays in sync without manual intervention.

---

## Database Schema

### `blogs` table

| Column | Type | Description |
|--------|------|-------------|
| id | INTEGER | Millisecond timestamp ID |
| updated_at | INTEGER | Optimistic locking timestamp |
| filename | TEXT | Blog filename (unique) |
| title | TEXT | Extracted from first `<h1>` or `# heading` |
| author | TEXT | Extracted from byline pattern |
| published_at | TEXT | Extracted from byline pattern |
| word_count | INTEGER | Total words in the post |
| char_count | INTEGER | Total characters in the post |
| chunk_count | INTEGER | Number of chunks generated |
| status | TEXT | `active` (default) |
| tags | TEXT | Reserved for future use |

### `blog_chunks` table

| Column | Type | Description |
|--------|------|-------------|
| id | INTEGER | Millisecond timestamp ID |
| updated_at | INTEGER | Optimistic locking timestamp |
| blog_id | INTEGER | FK to `blogs.id` |
| chunk_index | INTEGER | Position within the blog (0-based) |
| text | TEXT | Chunk content |
| char_len | INTEGER | Character length |
| sentence_count | INTEGER | Number of sentences |

### `vec_blog_chunks` (vec0 virtual table)

| Column | Type | Description |
|--------|------|-------------|
| chunk_id | INTEGER | PK, matches `blog_chunks.id` |
| embedding | float[1024] | VoyageAI embedding, cosine distance |

### `fts_blog_chunks` (FTS5 virtual table)

| Column | Description |
|--------|-------------|
| title | Blog title (for boosting) |
| text | Chunk text |

FTS5 rowids match `blog_chunks.id`, so results from either search method resolve to the same chunk records.

---

## Public API

### Import

```elixir
# Import all blog files not already in the database
BlogStore.import_all()
# => [{filename, id}, ...]

# Import a single blog file
BlogStore.import_blog("2026-03-01-14-30-my-article.md")
# => {:ok, 1740000000000}
```

Import reads the file via `Blog.read_post/1`, extracts metadata, chunks the text, inserts chunks with FTS entries, then embeds all chunks in a single batch via VoyageAI.

### Search

```elixir
# Semantic vector search (VoyageAI embedding + cosine similarity)
{:ok, results} = BlogStore.search("how does the permission system work", k: 5)
# => [%{"chunk_id" => ..., "distance" => 0.23, "text" => "...",
#       "blog_title" => "...", "blog_filename" => "..."}]

# Full-text keyword search (SQLite FTS5 + BM25 ranking)
{:ok, results} = BlogStore.search_text("AccessControl roles", k: 10)
# => [%{"chunk_id" => ..., "rank" => -2.34, "text" => "...",
#       "blog_title" => "...", "blog_filename" => "..."}]
```

Both search functions return chunk text with the parent blog's title and filename for context.

### CRUD

```elixir
# List all indexed blogs
BlogStore.list()
# => [%{"id" => ..., "filename" => "...", "title" => "...", ...}]

# Get a single blog by ID
{:ok, blog} = BlogStore.get(1740000000000)

# Refresh a blog (re-extract, re-chunk, re-embed)
{:ok, id} = BlogStore.refresh(1740000000000)

# Delete a blog and all its chunks, vectors, and FTS entries
BlogStore.delete(1740000000000)
```

`refresh/1` is a full re-index: deletes old chunks/vectors/FTS, re-reads the file, re-chunks, re-embeds. Used when a blog file is edited.

### Backfill

```elixir
# Backfill vector embeddings for chunks that don't have them
{:ok, count} = BlogStore.backfill_embeddings()

# Backfill FTS index for chunks not yet indexed
{:ok, count} = BlogStore.backfill_fts()
```

Backfill functions are idempotent -- they skip chunks that already have embeddings or FTS entries. Useful after adding search features to a database with pre-existing chunks.

### Maintenance

```elixir
# Fix Windows-1252 encoding artifacts in blog metadata
{:ok, fixed_count} = BlogStore.fix_invalid_utf8()
```

---

## Chunking Strategy

BlogStore uses `TextChunker` with these settings:

| Parameter | Value | Purpose |
|-----------|-------|---------|
| target_chars | 2000 | Soft limit per chunk |
| max_chars | 2400 | Hard limit -- forces split |
| overlap_sentences | 3 | Last 3 sentences repeated in next chunk |
| min_chars | 300 | Don't create tiny trailing chunks |

TextChunker splits on sentence boundaries (`.`, `!`, `?`), respects paragraph breaks, and adds overlap so that context isn't lost at chunk edges. This produces chunks well-suited for embedding with VoyageAI's 1024-dimension model.

---

## Metadata Extraction

BlogStore extracts metadata differently based on file type:

**Markdown files** (detected by `Blog.markdown?/1`):
- Title: first `# heading` line
- Byline: line matching `By Author -- Date` (with italic markers stripped)
- Plain text: the raw Markdown content

**HTML files:**
- Title: first `<h1>` element text
- Byline: `<p><em>By Author -- Date</em></p>` pattern
- Plain text: `Floki.text(document)` (all tags stripped)

Both paths produce: `title`, `author`, `published_at`, `plain_text`, `word_count`, `char_count`.

---

## Automatic Indexing via BlogFileWatcher

`BlogFileWatcher` is a GenServer started in the supervision tree immediately after the `:blogs_store_db` Sqler instance. It watches `asset/blogs/` using the `file_system` library (fsevents on macOS, inotify on Linux).

**Behavior:**
- New `.html` or `.md` file appears -- calls `BlogStore.import_blog(filename)`
- Existing file changes -- finds the blog ID, calls `BlogStore.refresh(id)`
- File events are debounced per-file with a 2-second window (file sync tools often emit multiple events for a single write)
- Ignores dot-files (`.hidden`, `.DS_Store`)

This means dropping a blog file into `asset/blogs/` (via editor, rsync, scp, or any sync tool) automatically indexes it for search within a few seconds.

---

## UTF-8 Sanitization

Some older HTML blog files contain Windows-1252 encoded bytes (smart quotes, em dashes) that aren't valid UTF-8. BlogStore handles this with a byte-by-byte sanitizer:

- Valid UTF-8 sequences pass through unchanged
- Known Windows-1252 bytes (0x80--0x9F range) are mapped to their Unicode equivalents (e.g., `0x93` becomes `"`, `0x97` becomes `--`)
- Unknown invalid bytes are replaced with the Unicode replacement character

Sanitization runs on import (file content) and on read (database values). The `fix_invalid_utf8/0` function retroactively cleans metadata in existing rows.

---

## Callers

| Caller | Functions Used | Context |
|--------|---------------|---------|
| `Permissions.Blogs` | All public functions | Permission-gated MCP tool dispatch |
| `BlogFileWatcher` | `import_blog/1`, `refresh/1` | Automatic file-change indexing |
| `Mcp.Application` | `setup_database/1` | Schema initialization at startup |

---

## Supervision & Startup

```
Mcp.Supervisor
  |-- {Sqler, name: "blogs_store", register: :blogs_store_db, extensions: [SqliteVec.path()]}
  |-- BlogFileWatcher
```

The Sqler instance loads the `sqlite-vec` extension for vector search. `BlogStore.setup_database(:blogs_store_db)` is called during application startup (in `Mcp.Application.custom_init/0`) to create tables and indexes if they don't exist. BlogFileWatcher starts immediately after and begins watching for file changes.

---

## Key Design Decisions

**Stateless module, not a GenServer.** BlogStore has no process of its own -- it's a collection of functions that operate on the `:blogs_store_db` Sqler instance. This is intentional: there's no state to manage beyond what's in SQLite, and the Sqler GenServer already serializes database access.

**Dual search indexes.** Vector search (cosine similarity via `vec0`) handles semantic queries ("how does authentication work?"). FTS5 handles keyword queries ("AccessControl"). Both use the same chunk IDs, so results are interchangeable.

**Batch embedding.** `embed_chunks/2` and `backfill_embeddings/0` process chunks in batches of 128 to minimize API calls to VoyageAI. A single blog import typically needs just one batch.

**Cascade delete.** `delete/1` removes FTS entries, vector entries, chunks, and the blog record in that order. No orphaned data.

**Overlap for context continuity.** The 3-sentence overlap between chunks means a search hit near a chunk boundary still captures surrounding context. This significantly improves retrieval quality for RAG use cases.

---

*Last updated: 2026-03-03*