# PageScraper

Server-driven browser DOM extraction system. The server reacts to URL patterns visited in Chrome, spins up per-user GenServers, and orchestrates DOM queries, clicks, watches, and polling — all driven from Elixir with results stored in SQLite.

## Overview

When a user with the Chrome extension navigates to a URL that matches a registered pattern (e.g., a GitHub PR page), the server automatically:

1. Starts a `PageScraperServer` GenServer for that user + pattern
2. Checks `handler.url_exists?(url)` — skips extraction if the URL is already stored
3. Calls the handler's `on_match/2` to get an extraction plan
4. Sends DOM commands to the browser's content script
5. Routes results back through the handler's `on_extraction/3` callback
6. Stores completed extractions via `handler.store_result/3` into handler-specific tables

The system supports immediate extraction, periodic polling, persistent MutationObservers, and multi-step sequences (click → wait → extract chains).

## Architecture

```
Content Script (per tab) ←→ Background.js ←→ Offscreen (WebSocket) ←→ WsHandler ←→ PageScraperServer
```

### Message Flow

**Server → Browser** (DOM commands):

```
PageScraperServer
  → send(ws_handler_pid, {:push_json, msg})
  → WsHandler pushes JSON over WebSocket
  → Offscreen relays to Background.js
  → Background.js routes via chrome.tabs.sendMessage(tab_id)
  → content.js executes DOM command
```

**Browser → Server** (results):

```
content.js sends dom_result via chrome.runtime.sendMessage
  → Background.js forwards to Offscreen
  → Offscreen relays over WebSocket
  → WsHandler receives JSON, calls PageScraperServer.dom_result/4
  → PageScraperServer dispatches to handler.on_extraction/3
```

### Process Registry

Each `PageScraperServer` is registered in `PageScraperRegistry` with the key `{username, pattern_id}` — one process per URL pattern per user. Multiple patterns can match the same URL; each gets its own server instance.
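Under that registration scheme, the process name resolves through a standard `Registry` via-tuple. A minimal sketch, assuming the usual GenServer wiring (only `PageScraperRegistry` and the `{username, pattern_id}` key shape come from this doc; the helper name and `start_link` call are hypothetical):

```elixir
# Hypothetical sketch of per-user, per-pattern process naming via Registry.
defp via(username, pattern_id) do
  {:via, Registry, {PageScraperRegistry, {username, pattern_id}}}
end

# GenServer.start_link(__MODULE__, args, name: via("bob", "github_pr"))
#
# A second start for the same key returns {:error, {:already_started, pid}},
# which is what lets a get_or_start-style API fall back to casting an update
# to the existing process instead of spawning a duplicate.
```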
## Database

All handler tables live in a single Sqler instance: `:page_scraper_db` (`data/page_scraper.sqlite`).

### Per-Handler Tables

Each handler defines its own table with typed columns matching the data it extracts. Tables are created at application startup — `application.ex` iterates all registered patterns and calls `handler.setup_database/0` for each.

#### `github_prs` — GithubPR Handler

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key (ms timestamp) |
| `updated_at` | INTEGER | Last update timestamp |
| `owner` | TEXT | Repository owner (NOT NULL) |
| `repo` | TEXT | Repository name (NOT NULL) |
| `number` | INTEGER | PR number (NOT NULL) |
| `url` | TEXT | Full URL (NOT NULL, UNIQUE) |
| `pr_header` | TEXT | PR title |
| `pr_state` | TEXT | Open/Closed/Merged (`data-status` attribute) |
| `pr_author` | TEXT | Author username |
| `pr_body` | TEXT | PR description (HTML) |
| `pr_files_count` | TEXT | Number of changed files |
| `pr_commits_count` | TEXT | Number of commits |

Index: `idx_gpr_owner_repo` on `(owner, repo)`

#### `crunchbase_companies` — CrunchbaseFunding Handler

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key (ms timestamp) |
| `updated_at` | INTEGER | Last update timestamp |
| `slug` | TEXT | Company slug from URL (NOT NULL) |
| `url` | TEXT | Full URL (NOT NULL, UNIQUE) |
| `company_name` | TEXT | Company display name |
| `company_tagline` | TEXT | One-line description |
| `company_scores` | TEXT | JSON array of score-and-trend values |
| `company_overview` | TEXT | Founded year, IPO status, location, etc. |
| `company_categories` | TEXT | JSON array of category chips |
| `details_enums` | TEXT | JSON array: Operating Status + Company Type |
| `details_founders` | TEXT | Founders list |
| `details_description` | TEXT | Full company description |
| `details_contact` | TEXT | Contact email |
| `total_funding` | TEXT | Total funding amount |
| `num_rounds` | TEXT | Number of funding rounds |
| `lead_investors_summary` | TEXT | Lead investors |
| `funding_rounds` | TEXT | JSON: `{headers, rows}` from funding table |

Index: `idx_cb_slug` on `(slug)`

Rows with a nil or empty `company_name` are rejected at storage time.

#### `page_scraper_extractions` — Legacy Generic Table

Still exists for backward compatibility. New handlers should define their own table.

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key (ms timestamp) |
| `updated_at` | INTEGER | Last update timestamp |
| `user_id` | INTEGER | User who triggered the extraction (NOT NULL) |
| `pattern_id` | TEXT | Pattern identifier (NOT NULL) |
| `url` | TEXT | Full URL (NOT NULL) |
| `tab_id` | INTEGER | Chrome tab ID |
| `data` | TEXT | JSON extraction results (NOT NULL) |
| `extraction_type` | TEXT | snapshot, poll, sequence, mutation (NOT NULL) |
| `metadata` | TEXT | JSON metadata (default `'{}'`) |

### URL Deduplication

Before extracting, `PageScraperServer.init/1` checks `handler.url_exists?(url)`. If the handler already has a row for this URL, the server stops immediately (`{:stop, :normal}`) without sending any DOM commands. This prevents duplicate extractions.
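That dedup branch can be sketched roughly as follows. This is an illustration, not the actual implementation: `url_exists?/1`, `on_match/2`, and `{:stop, :normal}` are from this doc, while the clause shape, `args` fields, and the `build_state/2` helper are assumptions:

```elixir
# Hypothetical sketch of the dedup check in PageScraperServer.init/1.
def init(%{handler: handler, url: url, captures: captures} = args) do
  if handler.url_exists?(url) do
    # Row already stored for this URL: stop before sending any DOM commands.
    {:stop, :normal}
  else
    plan = handler.on_match(url, captures)
    {:ok, build_state(args, plan), {:continue, :start_extraction}}
  end
end
```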
To re-scrape a URL, delete the existing row first:

```elixir
Sqler.sql(:page_scraper_db, "DELETE FROM github_prs WHERE url = 'https://...'")
Sqler.sql(:page_scraper_db, "DELETE FROM crunchbase_companies WHERE slug = 'anthropic'")
```

## Startup Configuration

In `Mcp.Application`:

```elixir
# Supervision tree children:
{Sqler, name: "page_scraper", register: :page_scraper_db}
{Registry, keys: :unique, name: PageScraperRegistry}

# Post-startup schema init:
PageScraper.setup_database(:page_scraper_db)

for {_pattern_id, _regex, handler} <- PageScraper.patterns() do
  handler.setup_database()
end
```

## Modules

### PageScraper (`lib/page_scraper.ex`)

Pure functional module. Two responsibilities:

1. **URL Pattern Registry** — Compile-time list of `{pattern_id, regex, handler_module}` tuples. `match_url/1` checks a URL against all patterns.
2. **Legacy Extraction Database** — CRUD for the generic `page_scraper_extractions` table. New handlers use their own tables instead.

### PageScraperServer (`lib/page_scraper_server.ex`)

Per-`{username, pattern_id}` GenServer that orchestrates extraction. Manages the lifecycle from page visit to departure, tracks pending DOM commands, and dispatches results to the handler.

Key behaviors:

- **URL dedup** — checks `handler.url_exists?` before extracting
- **Handler storage** — calls `handler.store_result/3` instead of generic storage
- **Timeout management** — dom_wait commands get `wait_timeout + 5s` to avoid race conditions with the browser-side timeout

### PageScraper.Handler (`lib/page_scraper/handler.ex`)

Behaviour definition for URL-pattern handlers. Handlers are pure logic — no GenServer, no state management. The server owns the state and passes it through callbacks.

### PageScraper.Handlers.GithubPR (`lib/page_scraper/handlers/github_pr.ex`)

Extracts PR metadata from GitHub pull request pages.
- **Triggers on:** `https://github.com/<owner>/<repo>/pull/<number>`
- **Table:** `github_prs`
- **Extracts 6 fields:** pr_header, pr_state, pr_author, pr_body, pr_files_count, pr_commits_count
- **Wait selector:** `.markdown-title` (10s timeout)

### PageScraper.Handlers.CrunchbaseFunding (`lib/page_scraper/handlers/crunchbase_funding.ex`)

Extracts company data from Crunchbase organization pages. Covers 3 page sections.

- **Triggers on:** `https://www.crunchbase.com/organization/<slug>`
- **Table:** `crunchbase_companies`
- **Wait selector:** `#company_funding table` (15s timeout — the Angular SPA renders slowly)
- **Extracts 13 fields from 3 sections:**

| Section | Fields |
|---------|--------|
| Profile header | company_name, company_tagline, company_scores, company_overview, company_categories |
| #overview_details | details_enums, details_founders, details_description, details_contact |
| #company_funding | total_funding, num_rounds, lead_investors_summary, funding_rounds |

## Registered URL Patterns

| pattern_id | Regex | Handler |
|-----------|-------|---------|
| `github_pr` | `^https://github\.com/(?P<owner>[^/]+)/(?P<repo>[^/]+)/pull/(?P<number>\d+)` | GithubPR |
| `crunchbase_funding` | `^https://www\.crunchbase\.com/organization/(?P<slug>[^/]+)` | CrunchbaseFunding |

## Handler Callbacks

### Required

#### on_match(url, captures) → extraction_plan

Called once when the URL first matches. Returns a plan map with any combination of:

| Key | Type | Description |
|-----|------|-------------|
| `:wait` | `wait_spec` | Wait for selector before extracting |
| `:extractions` | `[extraction_spec]` | Immediate extractions |
| `:watches` | `[watch_spec]` | Persistent MutationObservers |
| `:poll` | `poll_spec` | Periodic re-extraction |
| `:sequence` | `[sequence_step]` | Ordered click/wait/extract chain |
| `:handler_state` | `map` | Initial handler state |

#### on_extraction(step, data, state) → {action, state}

Called for each extraction result. `step` is the extraction name (string). `data` is the unwrapped value.
**Actions:**

| Action | Description |
|--------|-------------|
| `:store` | Persist an empty map to DB |
| `{:store, data}` | Persist the given map to DB via `handler.store_result/3` |
| `:next_step` | Advance to the next sequence step |
| `{:next_step, state}` | Advance + update handler state |
| `:stop` | Shut down this scraper |
| `:noop` | Do nothing (still accumulating) |

### Optional (with defaults via `use PageScraper.Handler`)

#### setup_database() → :ok

Create the handler's table in `:page_scraper_db`. Called once at application startup. Default: no-op.

#### url_exists?(url) → boolean

Check if a URL has already been extracted and stored. Called before starting extraction — if true, the server stops without extracting. Default: `false` (always extract).

#### store_result(url, captures, data) → {:ok, id} | {:error, reason}

Store extraction results into the handler's own table. Receives the URL, regex captures, and the data map from `{:store, data}`. Default: `{:ok, 0}` (no-op).

#### on_departure(state) → :ok

Called when the user navigates away. Default: no-op.

#### on_mutation(selector, data, state) → {action, state}

Called when a MutationObserver fires. Default: `{:noop, state}`.

## Writing a Handler

### 1. Create the handler module
```elixir
defmodule PageScraper.Handlers.MyHandler do
  use PageScraper.Handler
  require Logger

  @db :page_scraper_db
  @table "my_table"

  # ── Database ─────────────────────────────────────────────────────────────

  @impl PageScraper.Handler
  def setup_database do
    Sqler.sql(@db, """
    CREATE TABLE IF NOT EXISTS #{@table} (
      id INTEGER PRIMARY KEY,
      updated_at INTEGER,
      slug TEXT NOT NULL,
      url TEXT NOT NULL UNIQUE,
      title TEXT,
      status TEXT
    )
    """)

    Sqler.sql(@db, "CREATE INDEX IF NOT EXISTS idx_my_slug ON #{@table}(slug)")
    :ok
  end

  @impl PageScraper.Handler
  def url_exists?(url) do
    case Sqler.sql(@db, "SELECT COUNT(*) FROM #{@table} WHERE url = ?", [url]) do
      [[count]] when count > 0 -> true
      _ -> false
    end
  end

  @impl PageScraper.Handler
  def store_result(url, captures, data) do
    row =
      data
      |> Map.put("slug", captures["slug"])
      |> Map.put("url", url)

    Sqler.insert(@db, @table, row)
  end

  # ── Extraction ───────────────────────────────────────────────────────────

  @impl PageScraper.Handler
  def on_match(url, captures) do
    %{
      wait: %{selector: ".main-content", timeout: 10_000},
      extractions: [
        %{name: "title", selector: "h1", type: :text},
        %{name: "status", selector: ".badge", type: :text}
      ],
      handler_state: %{
        slug: captures["slug"],
        url: url,
        extractions: %{}
      }
    }
  end

  @expected ~w(title status)

  @impl PageScraper.Handler
  def on_extraction(step, data, state) do
    extractions = Map.put(state.extractions, step, data)
    state = %{state | extractions: extractions}

    if MapSet.subset?(MapSet.new(@expected), MapSet.new(Map.keys(extractions))) do
      {{:store, extractions}, state}
    else
      {:noop, state}
    end
  end
end
```

### 2. Register the URL pattern
Add an entry to `@patterns` in `lib/page_scraper.ex`:

```elixir
@patterns [
  {"github_pr",
   ~r{^https://github\.com/(?P<owner>[^/]+)/(?P<repo>[^/]+)/pull/(?P<number>\d+)},
   PageScraper.Handlers.GithubPR},
  {"crunchbase_funding",
   ~r{^https://www\.crunchbase\.com/organization/(?P<slug>[^/]+)},
   PageScraper.Handlers.CrunchbaseFunding},
  {"my_pattern",
   ~r{^https://example\.com/dashboard/(?P<slug>[^/]+)},
   PageScraper.Handlers.MyHandler}
]
```

### 3. That's it

The system automatically:

- Matches URLs on every `page_visit` event from the Chrome extension
- Checks `url_exists?` — skips if already stored
- Starts a `PageScraperServer` for each new match
- Calls your handler callbacks
- Stores results via `store_result/3` to your handler's table
- Stops the server on page departure (5s grace period) or idle timeout (30min)

## Content Script Commands

The content script (`mcp-extension/content.js`) supports these DOM commands:

| Command | Payload | Description |
|---------|---------|-------------|
| `dom_query` | `{selector, type, attribute}` | querySelector, return single element data |
| `dom_query_all` | `{selector, type, attribute}` | querySelectorAll, return array |
| `dom_click` | `{selector}` | Click an element |
| `dom_wait` | `{selector, timeout}` | MutationObserver until selector appears |
| `dom_watch` | `{selector, attributes, children, subtree}` | Persistent MutationObserver |
| `dom_unwatch` | `{selector}` | Disconnect a watcher |
| `dom_extract` | `{extractions: [specs]}` | Structured multi-field extraction |

### Extraction Types

| Type | JavaScript | Description |
|------|-----------|-------------|
| `text` | `el.textContent.trim()` | Plain text content |
| `html` | `el.innerHTML` | Raw HTML |
| `attribute` | `el.getAttribute(name)` | Element attribute value |
| `table` | `{headers, rows}` | Structured table data |
| `list` | `[text, ...]` | Array of textContent from querySelectorAll |

## Message Protocol

### Browser → Server

| Type | Key Fields | Description |
|------|-----------|-------------|
| `page_visit` | `url, title, tab_id, timestamp` | User navigated to a page |
| `page_departure` | `url, tab_id, timestamp` | User left a page or closed tab |
| `dom_result` | `pattern_id, request_id, result` | Content script extraction result |
| `dom_mutation` | `pattern_id, selector, data` | MutationObserver fired |

### Server → Browser

| Type | Key Fields | Description |
|------|-----------|-------------|
| `dom_command` | `command, payload, tab_id, pattern_id, request_id` | DOM command for content script |

## Server Lifecycle

### Startup

1. `WsHandler` receives `page_visit` with URL and tab_id
2. `PageScraper.match_url(url)` checks against registered patterns
3. For each match, `PageScraperServer.get_or_start/7` is called
4. `init` checks `handler.url_exists?(url)` — stops if already stored
5. If new, calls `handler.on_match/2` and begins the extraction pipeline
6. If already running, casts `{:update_page, ...}` — only re-extracts if the URL changed

### Extraction Pipeline

```
start_extraction
├── plan has :wait? → send dom_wait → on result → run_extractions
└── no :wait → run_extractions
    ├── send_extractions (one dom_extract per spec)
    ├── dom_watch (set up MutationObservers)
    ├── poll timer (Process.send_after for periodic re-extract)
    └── sequence steps (process one at a time)
```

### Request Tracking

Each DOM command gets a monotonic `request_id`. The browser echoes it back in `dom_result`.
The server maps `request_id → request_info` tuples to dispatch results:

| request_info | Meaning |
|-------------|---------|
| `{:wait_complete}` | dom_wait finished — run extractions |
| `{:watch_setup, selector}` | dom_watch confirmed — record active watch |
| `{:sequence_click}` | Click done — advance sequence |
| `{:sequence_wait}` | Wait done — advance sequence |
| `{:extraction, phase, step_name}` | Extraction result — call handler |

### Timeout Handling

- **Default request timeout:** 15 seconds — if the browser doesn't respond, the pending request is cleaned up
- **dom_wait timeout:** `wait_timeout + 5s` — the server-side timeout must exceed the browser-side wait to avoid a race condition where the server expires the request before the browser responds
- **Idle timeout:** 30 minutes of no activity shuts down the server
- **Departure grace period:** 5 seconds — cancelled if the user returns

### Shutdown

- **Page departure:** 5-second grace period. Cancelled if the user returns.
- **Idle timeout:** 30 minutes of no activity.
- **Handler stop:** Handler returns the `:stop` action.

## IEx Cheatsheet

```elixir
# ── URL Matching ─────────────────────────────────────────────────────────

PageScraper.match_url("https://github.com/owner/repo/pull/123")
#=> [{"github_pr", PageScraper.Handlers.GithubPR,
#=>   %{"owner" => ..., "repo" => ..., "number" => ...}}]

PageScraper.match_url("https://www.crunchbase.com/organization/anthropic")
#=> [{"crunchbase_funding", PageScraper.Handlers.CrunchbaseFunding,
#=>   %{"slug" => "anthropic"}}]

PageScraper.patterns()
#=> [{pattern_id, regex, handler}, ...]
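# ── List every running scraper (sketch) ──────────────────────────────────
# Registry.select/2 is the standard Registry API; this match spec over the
# {key, pid, value} entries is a sketch, not taken from the codebase.
Registry.select(PageScraperRegistry, [{{:"$1", :"$2", :_}, [], [{{:"$1", :"$2"}}]}])
#=> list of {{username, pattern_id}, pid} tuples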
# ── Check Running Scrapers ───────────────────────────────────────────────

Registry.lookup(PageScraperRegistry, {"bob", "github_pr"})
#=> [{#PID<0.123.0>, nil}]

# ── Query Handler Tables (preferred) ─────────────────────────────────────

alias PageScraper.Handlers.GithubPR
GithubPR.query(limit: 5)
GithubPR.query(owner: "elixir-lang", repo: "elixir")

alias PageScraper.Handlers.CrunchbaseFunding
CrunchbaseFunding.query(limit: 5)
CrunchbaseFunding.query(slug: "anthropic")

# ── Row Counts ───────────────────────────────────────────────────────────

Sqler.sql(:page_scraper_db, "SELECT COUNT(*) FROM github_prs")
Sqler.sql(:page_scraper_db, "SELECT COUNT(*) FROM crunchbase_companies")

# ── Re-scrape (delete first, then revisit the page) ──────────────────────

Sqler.sql(:page_scraper_db, "DELETE FROM github_prs WHERE url = 'https://...'")
Sqler.sql(:page_scraper_db, "DELETE FROM crunchbase_companies WHERE slug = 'composio'")

# ── Delete bad rows ──────────────────────────────────────────────────────

Sqler.sql(:page_scraper_db, "DELETE FROM crunchbase_companies WHERE company_name IS NULL")

# ── Query Legacy Generic Table ───────────────────────────────────────────

PageScraper.query_extractions("github_pr", limit: 5)
PageScraper.query_extractions("github_pr", user_id: 1, limit: 10)
```

## Public API

### PageScraper

#### match_url/1

Match a URL against registered patterns. Returns `[{pattern_id, handler, captures}]`.

```elixir
PageScraper.match_url("https://github.com/owner/repo/pull/123")
#=> [{"github_pr", PageScraper.Handlers.GithubPR, %{"owner" => "owner", ...}}]
```

#### store_extraction/2

Store to the legacy generic table. New handlers should use `store_result/3` instead.

#### query_extractions/2

Query the legacy generic table by pattern_id.
#### GithubPR.query/1

```elixir
GithubPR.query()                                      # all PRs, limit 50
GithubPR.query(owner: "elixir-lang", repo: "elixir")  # filter by repo
GithubPR.query(limit: 5)                              # limit results
```

#### CrunchbaseFunding.query/1

```elixir
CrunchbaseFunding.query()                   # all companies, limit 50
CrunchbaseFunding.query(slug: "anthropic")  # filter by slug
CrunchbaseFunding.query(limit: 5)           # limit results
```

### PageScraperServer

#### get_or_start/7

Start or update a scraper. If one is already running, casts an update (only re-extracts if the URL changed).

```elixir
PageScraperServer.get_or_start(
  "bob",
  1,
  "github_pr",
  PageScraper.Handlers.GithubPR,
  "https://github.com/owner/repo/pull/123",
  %{"owner" => "owner"},
  42
)
```

#### page_departed/2

Notify a specific scraper of page departure. Starts a 5-second grace period.

#### page_departed_all/2

Notify all scrapers for a user of a departure. Each checks whether the URL matches.

#### dom_result/4

Deliver a DOM extraction result from the browser.

#### dom_mutation/4

Deliver a DOM mutation notification.

## Dependencies

| Module | Purpose |
|--------|---------|
| `Sqler` | SQLite database wrapper (single `:page_scraper_db` instance shared by all handlers) |
| `Jason` | JSON encoding/decoding for list/map columns |
| `WsRegistry` | Lookup the WsHandler pid for sending DOM commands to the browser |
| `PageScraperRegistry` | Process registry for per-user per-pattern GenServers |
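Putting the pieces together, the `page_visit` dispatch that WsHandler performs (Startup steps 1–3 in the Server Lifecycle section) could look roughly like this. `PageScraper.match_url/1` and `PageScraperServer.get_or_start/7` are documented here; the function name `handle_page_visit/2` and the message/user field shapes are assumptions for illustration:

```elixir
# Hypothetical sketch of the page_visit dispatch inside WsHandler.
def handle_page_visit(msg, %{username: username, user_id: user_id}) do
  %{"url" => url, "tab_id" => tab_id} = msg

  # One get_or_start per matching pattern: multiple patterns can match
  # the same URL, and each gets its own PageScraperServer instance.
  for {pattern_id, handler, captures} <- PageScraper.match_url(url) do
    PageScraperServer.get_or_start(
      username, user_id, pattern_id, handler, url, captures, tab_id
    )
  end

  :ok
end
```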