# BlogStore Stateless module that indexes blog posts into SQLite for vector and full-text search. Takes the file-based blog posts from `asset/blogs/` (managed by `Blog`), chunks them into embedding-sized pieces, generates VoyageAI embeddings, and stores everything in a searchable database. **File:** `lib/blog_store.ex` **Database:** `Sqler :blogs_store_db` (with `sqlite-vec` extension) **Permission:** Gated through `Permissions.Blogs` (key 50,001) --- ## Why It Exists `Blog` is a file-based system -- Markdown and HTML files in `asset/blogs/`. That's great for authoring and reading, but you can't do semantic search over files. BlogStore bridges that gap: 1. Imports blog files into SQLite with extracted metadata (title, author, date, word count) 2. Chunks the text using `TextChunker` (2000-char targets, 3-sentence overlap between chunks) 3. Embeds each chunk via VoyageAI (`voyage-3`, 1024 dimensions) 4. Indexes chunks in both `vec0` (vector cosine search) and FTS5 (keyword search) The `BlogFileWatcher` GenServer monitors `asset/blogs/` for file changes and triggers automatic import/refresh, so the search index stays in sync without manual intervention. --- ## Database Schema ### `blogs` table | Column | Type | Description | |--------|------|-------------| | id | INTEGER | Millisecond timestamp ID | | updated_at | INTEGER | Optimistic locking timestamp | | filename | TEXT | Blog filename (unique) | | title | TEXT | Extracted from first `

` or `# heading` | | author | TEXT | Extracted from byline pattern | | published_at | TEXT | Extracted from byline pattern | | word_count | INTEGER | Total words in the post | | char_count | INTEGER | Total characters in the post | | chunk_count | INTEGER | Number of chunks generated | | status | TEXT | `active` (default) | | tags | TEXT | Reserved for future use | ### `blog_chunks` table | Column | Type | Description | |--------|------|-------------| | id | INTEGER | Millisecond timestamp ID | | updated_at | INTEGER | Optimistic locking timestamp | | blog_id | INTEGER | FK to `blogs.id` | | chunk_index | INTEGER | Position within the blog (0-based) | | text | TEXT | Chunk content | | char_len | INTEGER | Character length | | sentence_count | INTEGER | Number of sentences | ### `vec_blog_chunks` (vec0 virtual table) | Column | Type | Description | |--------|------|-------------| | chunk_id | INTEGER | PK, matches `blog_chunks.id` | | embedding | float[1024] | VoyageAI embedding, cosine distance | ### `fts_blog_chunks` (FTS5 virtual table) | Column | Description | |--------|-------------| | title | Blog title (for boosting) | | text | Chunk text | FTS5 rowids match `blog_chunks.id`, so results from either search method resolve to the same chunk records. --- ## Public API ### Import ```elixir # Import all blog files not already in the database BlogStore.import_all() # => [{filename, id}, ...] # Import a single blog file BlogStore.import_blog("2026-03-01-14-30-my-article.md") # => {:ok, 1740000000000} ``` Import reads the file via `Blog.read_post/1`, extracts metadata, chunks the text, inserts chunks with FTS entries, then embeds all chunks in a single batch via VoyageAI. ### Search ```elixir # Semantic vector search (VoyageAI embedding + cosine similarity) {:ok, results} = BlogStore.search("how does the permission system work", k: 5) # => [%{"chunk_id" => ..., "distance" => 0.23, "text" => "...", # "blog_title" => "...", "blog_filename" => "..."}] # Full-text keyword search (SQLite FTS5 + BM25 ranking) {:ok, results} = BlogStore.search_text("AccessControl roles", k: 10) # => [%{"chunk_id" => ..., "rank" => -2.34, "text" => "...", # "blog_title" => "...", "blog_filename" => "..."}] ``` Both search functions return chunk text with the parent blog's title and filename for context. ### CRUD ```elixir # List all indexed blogs BlogStore.list() # => [%{"id" => ..., "filename" => "...", "title" => "...", ...}] # Get a single blog by ID {:ok, blog} = BlogStore.get(1740000000000) # Refresh a blog (re-extract, re-chunk, re-embed) {:ok, id} = BlogStore.refresh(1740000000000) # Delete a blog and all its chunks, vectors, and FTS entries BlogStore.delete(1740000000000) ``` `refresh/1` is a full re-index: deletes old chunks/vectors/FTS, re-reads the file, re-chunks, re-embeds. Used when a blog file is edited. ### Backfill ```elixir # Backfill vector embeddings for chunks that don't have them {:ok, count} = BlogStore.backfill_embeddings() # Backfill FTS index for chunks not yet indexed {:ok, count} = BlogStore.backfill_fts() ``` Backfill functions are idempotent -- they skip chunks that already have embeddings or FTS entries. Useful after adding search features to a database with pre-existing chunks. ### Maintenance ```elixir # Fix Windows-1252 encoding artifacts in blog metadata {:ok, fixed_count} = BlogStore.fix_invalid_utf8() ``` --- ## Chunking Strategy BlogStore uses `TextChunker` with these settings: | Parameter | Value | Purpose | |-----------|-------|---------| | target_chars | 2000 | Soft limit per chunk | | max_chars | 2400 | Hard limit -- forces split | | overlap_sentences | 3 | Last 3 sentences repeated in next chunk | | min_chars | 300 | Don't create tiny trailing chunks | TextChunker splits on sentence boundaries (`.`, `!`, `?`), respects paragraph breaks, and adds overlap so that context isn't lost at chunk edges. This produces chunks well-suited for embedding with VoyageAI's 1024-dimension model. --- ## Metadata Extraction BlogStore extracts metadata differently based on file type: **Markdown files** (detected by `Blog.markdown?/1`): - Title: first `# heading` line - Byline: line matching `By Author -- Date` (with italic markers stripped) - Plain text: the raw Markdown content **HTML files:** - Title: first `

` element text - Byline: `

By Author -- Date

` pattern - Plain text: `Floki.text(document)` (all tags stripped) Both paths produce: `title`, `author`, `published_at`, `plain_text`, `word_count`, `char_count`. --- ## Automatic Indexing via BlogFileWatcher `BlogFileWatcher` is a GenServer started in the supervision tree immediately after the `:blogs_store_db` Sqler instance. It watches `asset/blogs/` using the `file_system` library (fsevents on macOS, inotify on Linux). **Behavior:** - New `.html` or `.md` file appears -- calls `BlogStore.import_blog(filename)` - Existing file changes -- finds the blog ID, calls `BlogStore.refresh(id)` - File events are debounced per-file with a 2-second window (file sync tools often emit multiple events for a single write) - Ignores dot-files (`.hidden`, `.DS_Store`) This means dropping a blog file into `asset/blogs/` (via editor, rsync, scp, or any sync tool) automatically indexes it for search within a few seconds. --- ## UTF-8 Sanitization Some older HTML blog files contain Windows-1252 encoded bytes (smart quotes, em dashes) that aren't valid UTF-8. BlogStore handles this with a byte-by-byte sanitizer: - Valid UTF-8 sequences pass through unchanged - Known Windows-1252 bytes (0x80--0x9F range) are mapped to their Unicode equivalents (e.g., `0x93` becomes `"`, `0x97` becomes `--`) - Unknown invalid bytes are replaced with the Unicode replacement character Sanitization runs on import (file content) and on read (database values). The `fix_invalid_utf8/0` function retroactively cleans metadata in existing rows. --- ## Callers | Caller | Functions Used | Context | |--------|---------------|---------| | `Permissions.Blogs` | All public functions | Permission-gated MCP tool dispatch | | `BlogFileWatcher` | `import_blog/1`, `refresh/1` | Automatic file-change indexing | | `Mcp.Application` | `setup_database/1` | Schema initialization at startup | --- ## Supervision & Startup ``` Mcp.Supervisor |-- {Sqler, name: "blogs_store", register: :blogs_store_db, extensions: [SqliteVec.path()]} |-- BlogFileWatcher ``` The Sqler instance loads the `sqlite-vec` extension for vector search. `BlogStore.setup_database(:blogs_store_db)` is called during application startup (in `Mcp.Application.custom_init/0`) to create tables and indexes if they don't exist. BlogFileWatcher starts immediately after and begins watching for file changes. --- ## Key Design Decisions **Stateless module, not a GenServer.** BlogStore has no process of its own -- it's a collection of functions that operate on the `:blogs_store_db` Sqler instance. This is intentional: there's no state to manage beyond what's in SQLite, and the Sqler GenServer already serializes database access. **Dual search indexes.** Vector search (cosine similarity via `vec0`) handles semantic queries ("how does authentication work?"). FTS5 handles keyword queries ("AccessControl"). Both use the same chunk IDs, so results are interchangeable. **Batch embedding.** `embed_chunks/2` and `backfill_embeddings/0` process chunks in batches of 128 to minimize API calls to VoyageAI. A single blog import typically needs just one batch. **Cascade delete.** `delete/1` removes FTS entries, vector entries, chunks, and the blog record in that order. No orphaned data. **Overlap for context continuity.** The 3-sentence overlap between chunks means a search hit near a chunk boundary still captures surrounding context. This significantly improves retrieval quality for RAG use cases. --- *Last updated: 2026-03-03*