# BlogStore Stateless module that indexes blog posts into SQLite for vector and full-text search. Takes the file-based blog posts from `asset/blogs/` (managed by `Blog`), chunks them into embedding-sized pieces, generates VoyageAI embeddings, and stores everything in a searchable database. **File:** `lib/blog_store.ex` **Database:** `Sqler :blogs_store_db` (with `sqlite-vec` extension) **Permission:** Gated through `Permissions.Blogs` (key 50,001) --- ## Why It Exists `Blog` is a file-based system -- Markdown and HTML files in `asset/blogs/`. That's great for authoring and reading, but you can't do semantic search over files. BlogStore bridges that gap: 1. Imports blog files into SQLite with extracted metadata (title, author, date, word count) 2. Chunks the text using `TextChunker` (2000-char targets, 3-sentence overlap between chunks) 3. Embeds each chunk via VoyageAI (`voyage-3`, 1024 dimensions) 4. Indexes chunks in both `vec0` (vector cosine search) and FTS5 (keyword search) The `BlogFileWatcher` GenServer monitors `asset/blogs/` for file changes and triggers automatic import/refresh, so the search index stays in sync without manual intervention. --- ## Database Schema ### `blogs` table | Column | Type | Description | |--------|------|-------------| | id | INTEGER | Millisecond timestamp ID | | updated_at | INTEGER | Optimistic locking timestamp | | filename | TEXT | Blog filename (unique) | | title | TEXT | Extracted from first `
By Author -- Date
` pattern - Plain text: `Floki.text(document)` (all tags stripped) Both paths produce: `title`, `author`, `published_at`, `plain_text`, `word_count`, `char_count`. --- ## Automatic Indexing via BlogFileWatcher `BlogFileWatcher` is a GenServer started in the supervision tree immediately after the `:blogs_store_db` Sqler instance. It watches `asset/blogs/` using the `file_system` library (fsevents on macOS, inotify on Linux). **Behavior:** - New `.html` or `.md` file appears -- calls `BlogStore.import_blog(filename)` - Existing file changes -- finds the blog ID, calls `BlogStore.refresh(id)` - File events are debounced per-file with a 2-second window (file sync tools often emit multiple events for a single write) - Ignores dot-files (`.hidden`, `.DS_Store`) This means dropping a blog file into `asset/blogs/` (via editor, rsync, scp, or any sync tool) automatically indexes it for search within a few seconds. --- ## UTF-8 Sanitization Some older HTML blog files contain Windows-1252 encoded bytes (smart quotes, em dashes) that aren't valid UTF-8. BlogStore handles this with a byte-by-byte sanitizer: - Valid UTF-8 sequences pass through unchanged - Known Windows-1252 bytes (0x80--0x9F range) are mapped to their Unicode equivalents (e.g., `0x93` becomes `"`, `0x97` becomes `--`) - Unknown invalid bytes are replaced with the Unicode replacement character Sanitization runs on import (file content) and on read (database values). The `fix_invalid_utf8/0` function retroactively cleans metadata in existing rows. --- ## Callers | Caller | Functions Used | Context | |--------|---------------|---------| | `Permissions.Blogs` | All public functions | Permission-gated MCP tool dispatch | | `BlogFileWatcher` | `import_blog/1`, `refresh/1` | Automatic file-change indexing | | `Mcp.Application` | `setup_database/1` | Schema initialization at startup | --- ## Supervision & Startup ``` Mcp.Supervisor |-- {Sqler, name: "blogs_store", register: :blogs_store_db, extensions: [SqliteVec.path()]} |-- BlogFileWatcher ``` The Sqler instance loads the `sqlite-vec` extension for vector search. `BlogStore.setup_database(:blogs_store_db)` is called during application startup (in `Mcp.Application.custom_init/0`) to create tables and indexes if they don't exist. BlogFileWatcher starts immediately after and begins watching for file changes. --- ## Key Design Decisions **Stateless module, not a GenServer.** BlogStore has no process of its own -- it's a collection of functions that operate on the `:blogs_store_db` Sqler instance. This is intentional: there's no state to manage beyond what's in SQLite, and the Sqler GenServer already serializes database access. **Dual search indexes.** Vector search (cosine similarity via `vec0`) handles semantic queries ("how does authentication work?"). FTS5 handles keyword queries ("AccessControl"). Both use the same chunk IDs, so results are interchangeable. **Batch embedding.** `embed_chunks/2` and `backfill_embeddings/0` process chunks in batches of 128 to minimize API calls to VoyageAI. A single blog import typically needs just one batch. **Cascade delete.** `delete/1` removes FTS entries, vector entries, chunks, and the blog record in that order. No orphaned data. **Overlap for context continuity.** The 3-sentence overlap between chunks means a search hit near a chunk boundary still captures surrounding context. This significantly improves retrieval quality for RAG use cases. --- *Last updated: 2026-03-03*