# PageScraper

Server-driven browser DOM extraction system. The server reacts to URL patterns visited in Chrome, spins up per-user GenServers, and orchestrates DOM queries, clicks, watches, and polling — all driven from Elixir with results stored in SQLite.

## Overview

When a user with the Chrome extension navigates to a URL that matches a registered pattern (e.g., a GitHub PR page), the server automatically:

1. Starts a `PageScraperServer` GenServer for that user + pattern
2. Checks `handler.url_exists?(url)` — skips extraction if the URL is already stored
3. Calls the handler's `on_match/2` to get an extraction plan
4. Sends DOM commands to the browser's content script
5. Routes results back through the handler's `on_extraction/3` callback
6. Stores completed extractions via `handler.store_result/3` into handler-specific tables

The system supports immediate extraction, periodic polling, persistent MutationObservers, and multi-step sequences (click → wait → extract chains).

## Architecture

```
Content Script (per tab) ←→ Background.js ←→ Offscreen (WebSocket) ←→ WsHandler ←→ PageScraperServer
```

### Message Flow

**Server → Browser** (DOM commands):

```
PageScraperServer
  → send(ws_handler_pid, {:push_json, msg})
  → WsHandler pushes JSON over WebSocket
  → Offscreen relays to Background.js
  → Background.js routes via chrome.tabs.sendMessage(tab_id)
  → content.js executes DOM command
```

**Browser → Server** (results):

```
content.js sends dom_result via chrome.runtime.sendMessage
  → Background.js forwards to Offscreen
  → Offscreen relays over WebSocket
  → WsHandler receives JSON, calls PageScraperServer.dom_result/4
  → PageScraperServer dispatches to handler.on_extraction/3
```

### Process Registry

Each `PageScraperServer` is registered in `PageScraperRegistry` with the key `{username, pattern_id}` — one process per URL pattern per user. Multiple patterns can match the same URL; each gets its own server instance.
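Under that registration scheme, the process name resolves through a standard `Registry` via-tuple. A minimal sketch, assuming the usual GenServer wiring (only `PageScraperRegistry` and the `{username, pattern_id}` key shape come from this doc; the helper name and `start_link` call are hypothetical):

```elixir
# Hypothetical sketch of per-user, per-pattern process naming via Registry.
defp via(username, pattern_id) do
  {:via, Registry, {PageScraperRegistry, {username, pattern_id}}}
end

# GenServer.start_link(__MODULE__, args, name: via("bob", "github_pr"))
#
# A second start for the same key returns {:error, {:already_started, pid}},
# which is what lets a get_or_start-style API fall back to casting an update
# to the existing process instead of spawning a duplicate.
```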
## Database

All handler tables live in a single Sqler instance: `:page_scraper_db` (`data/page_scraper.sqlite`).

### Per-Handler Tables

Each handler defines its own table with typed columns matching the data it extracts. Tables are created at application startup — `application.ex` iterates all registered patterns and calls `handler.setup_database/0` for each.

#### `github_prs` — GithubPR Handler

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key (ms timestamp) |
| `updated_at` | INTEGER | Last update timestamp |
| `owner` | TEXT | Repository owner (NOT NULL) |
| `repo` | TEXT | Repository name (NOT NULL) |
| `number` | INTEGER | PR number (NOT NULL) |
| `url` | TEXT | Full URL (NOT NULL, UNIQUE) |
| `pr_header` | TEXT | PR title |
| `pr_state` | TEXT | Open/Closed/Merged (`data-status` attribute) |
| `pr_author` | TEXT | Author username |
| `pr_body` | TEXT | PR description (HTML) |
| `pr_files_count` | TEXT | Number of changed files |
| `pr_commits_count` | TEXT | Number of commits |

Index: `idx_gpr_owner_repo` on `(owner, repo)`

#### `crunchbase_companies` — CrunchbaseFunding Handler

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key (ms timestamp) |
| `updated_at` | INTEGER | Last update timestamp |
| `slug` | TEXT | Company slug from URL (NOT NULL) |
| `url` | TEXT | Full URL (NOT NULL, UNIQUE) |
| `company_name` | TEXT | Company display name |
| `company_tagline` | TEXT | One-line description |
| `company_scores` | TEXT | JSON array of score-and-trend values |
| `company_overview` | TEXT | Founded year, IPO status, location, etc. |
| `company_categories` | TEXT | JSON array of category chips |
| `details_enums` | TEXT | JSON array: Operating Status + Company Type |
| `details_founders` | TEXT | Founders list |
| `details_description` | TEXT | Full company description |
| `details_contact` | TEXT | Contact email |
| `total_funding` | TEXT | Total funding amount |
| `num_rounds` | TEXT | Number of funding rounds |
| `lead_investors_summary` | TEXT | Lead investors |
| `funding_rounds` | TEXT | JSON: `{headers, rows}` from funding table |

Index: `idx_cb_slug` on `(slug)`

Rows with a nil or empty `company_name` are rejected at storage time.

#### `page_scraper_extractions` — Legacy Generic Table

Still exists for backward compatibility. New handlers should define their own table.

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key (ms timestamp) |
| `updated_at` | INTEGER | Last update timestamp |
| `user_id` | INTEGER | User who triggered the extraction (NOT NULL) |
| `pattern_id` | TEXT | Pattern identifier (NOT NULL) |
| `url` | TEXT | Full URL (NOT NULL) |
| `tab_id` | INTEGER | Chrome tab ID |
| `data` | TEXT | JSON extraction results (NOT NULL) |
| `extraction_type` | TEXT | snapshot, poll, sequence, mutation (NOT NULL) |
| `metadata` | TEXT | JSON metadata (default `'{}'`) |

### URL Deduplication

Before extracting, `PageScraperServer.init/1` checks `handler.url_exists?(url)`. If the handler already has a row for this URL, the server stops immediately (`{:stop, :normal}`) without sending any DOM commands. This prevents duplicate extractions.
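That dedup branch can be sketched roughly as follows. This is an illustration, not the actual implementation: `url_exists?/1`, `on_match/2`, and `{:stop, :normal}` are from this doc, while the clause shape, `args` fields, and the `build_state/2` helper are assumptions:

```elixir
# Hypothetical sketch of the dedup check in PageScraperServer.init/1.
def init(%{handler: handler, url: url, captures: captures} = args) do
  if handler.url_exists?(url) do
    # Row already stored for this URL: stop before sending any DOM commands.
    {:stop, :normal}
  else
    plan = handler.on_match(url, captures)
    {:ok, build_state(args, plan), {:continue, :start_extraction}}
  end
end
```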
To re-scrape a URL, delete the existing row first:

```elixir
Sqler.sql(:page_scraper_db, "DELETE FROM github_prs WHERE url = 'https://...'")
Sqler.sql(:page_scraper_db, "DELETE FROM crunchbase_companies WHERE slug = 'anthropic'")
```

## Startup Configuration

In `Mcp.Application`:

```elixir
# Supervision tree children:
{Sqler, name: "page_scraper", register: :page_scraper_db}
{Registry, keys: :unique, name: PageScraperRegistry}

# Post-startup schema init:
PageScraper.setup_database(:page_scraper_db)

for {_pattern_id, _regex, handler} <- PageScraper.patterns() do
  handler.setup_database()
end
```

## Modules

### PageScraper (`lib/page_scraper.ex`)

Pure functional module. Two responsibilities:

1. **URL Pattern Registry** — Compile-time list of `{pattern_id, regex, handler_module}` tuples. `match_url/1` checks a URL against all patterns.
2. **Legacy Extraction Database** — CRUD for the generic `page_scraper_extractions` table. New handlers use their own tables instead.

### PageScraperServer (`lib/page_scraper_server.ex`)

Per-`{username, pattern_id}` GenServer that orchestrates extraction. Manages the lifecycle from page visit to departure, tracks pending DOM commands, and dispatches results to the handler.

Key behaviors:

- **URL dedup** — checks `handler.url_exists?` before extracting
- **Handler storage** — calls `handler.store_result/3` instead of generic storage
- **Timeout management** — dom_wait commands get `wait_timeout + 5s` to avoid race conditions with the browser-side timeout

### PageScraper.Handler (`lib/page_scraper/handler.ex`)

Behaviour definition for URL-pattern handlers. Handlers are pure logic — no GenServer, no state management. The server owns the state and passes it through callbacks.

### PageScraper.Handlers.GithubPR (`lib/page_scraper/handlers/github_pr.ex`)

Extracts PR metadata from GitHub pull request pages.
- **Triggers on:** `https://github.com/<owner>/<repo>/pull/<number>`
- **Table:** `github_prs`
- **Extracts 6 fields:** pr_header, pr_state, pr_author, pr_body, pr_files_count, pr_commits_count
- **Wait selector:** `.markdown-title` (10s timeout)

### PageScraper.Handlers.CrunchbaseFunding (`lib/page_scraper/handlers/crunchbase_funding.ex`)

Extracts company data from Crunchbase organization pages. Covers 3 page sections.

- **Triggers on:** `https://www.crunchbase.com/organization/<slug>`
- **Table:** `crunchbase_companies`
- **Wait selector:** `#company_funding table` (15s timeout — the Angular SPA renders slowly)
- **Extracts 13 fields from 3 sections:**

| Section | Fields |
|---------|--------|
| Profile header | company_name, company_tagline, company_scores, company_overview, company_categories |
| #overview_details | details_enums, details_founders, details_description, details_contact |
| #company_funding | total_funding, num_rounds, lead_investors_summary, funding_rounds |

## Registered URL Patterns

| pattern_id | Regex | Handler |
|-----------|-------|---------|
| `github_pr` | `^https://github\.com/(?P<owner>[^/]+)/(?P<repo>[^/]+)/pull/(?P<number>\d+)` | GithubPR |
| `crunchbase_funding` | `^https://www\.crunchbase\.com/organization/(?P<slug>[^/]+)` | CrunchbaseFunding |

## Handler Callbacks

### Required

#### on_match(url, captures) → extraction_plan

Called once when the URL first matches. Returns a plan map with any combination of:

| Key | Type | Description |
|-----|------|-------------|
| `:wait` | `wait_spec` | Wait for selector before extracting |
| `:extractions` | `[extraction_spec]` | Immediate extractions |
| `:watches` | `[watch_spec]` | Persistent MutationObservers |
| `:poll` | `poll_spec` | Periodic re-extraction |
| `:sequence` | `[sequence_step]` | Ordered click/wait/extract chain |
| `:handler_state` | `map` | Initial handler state |

#### on_extraction(step, data, state) → {action, state}

Called for each extraction result. `step` is the extraction name (string). `data` is the unwrapped value.
**Actions:**

| Action | Description |
|--------|-------------|
| `:store` | Persist an empty map to DB |
| `{:store, data}` | Persist the given map to DB via `handler.store_result/3` |
| `:next_step` | Advance to the next sequence step |
| `{:next_step, state}` | Advance + update handler state |
| `:stop` | Shut down this scraper |
| `:noop` | Do nothing (still accumulating) |

### Optional (with defaults via `use PageScraper.Handler`)

#### setup_database() → :ok

Create the handler's table in `:page_scraper_db`. Called once at application startup. Default: no-op.

#### url_exists?(url) → boolean

Check if a URL has already been extracted and stored. Called before starting extraction — if true, the server stops without extracting. Default: `false` (always extract).

#### store_result(url, captures, data) → {:ok, id} | {:error, reason}

Store extraction results into the handler's own table. Receives the URL, regex captures, and the data map from `{:store, data}`. Default: `{:ok, 0}` (no-op).

#### on_departure(state) → :ok

Called when the user navigates away. Default: no-op.

#### on_mutation(selector, data, state) → {action, state}

Called when a MutationObserver fires. Default: `{:noop, state}`.

## Writing a Handler

### 1. Create the handler module
```elixir
defmodule PageScraper.Handlers.MyHandler do
  use PageScraper.Handler
  require Logger

  @db :page_scraper_db
  @table "my_table"

  # ── Database ─────────────────────────────────────────────────────────────

  @impl PageScraper.Handler
  def setup_database do
    Sqler.sql(@db, """
    CREATE TABLE IF NOT EXISTS #{@table} (
      id INTEGER PRIMARY KEY,
      updated_at INTEGER,
      slug TEXT NOT NULL,
      url TEXT NOT NULL UNIQUE,
      title TEXT,
      status TEXT
    )
    """)

    Sqler.sql(@db, "CREATE INDEX IF NOT EXISTS idx_my_slug ON #{@table}(slug)")
    :ok
  end

  @impl PageScraper.Handler
  def url_exists?(url) do
    case Sqler.sql(@db, "SELECT COUNT(*) FROM #{@table} WHERE url = ?", [url]) do
      [[count]] when count > 0 -> true
      _ -> false
    end
  end

  @impl PageScraper.Handler
  def store_result(url, captures, data) do
    row =
      data
      |> Map.put("slug", captures["slug"])
      |> Map.put("url", url)

    Sqler.insert(@db, @table, row)
  end

  # ── Extraction ───────────────────────────────────────────────────────────

  @impl PageScraper.Handler
  def on_match(url, captures) do
    %{
      wait: %{selector: ".main-content", timeout: 10_000},
      extractions: [
        %{name: "title", selector: "h1", type: :text},
        %{name: "status", selector: ".badge", type: :text}
      ],
      handler_state: %{
        slug: captures["slug"],
        url: url,
        extractions: %{}
      }
    }
  end

  @expected ~w(title status)

  @impl PageScraper.Handler
  def on_extraction(step, data, state) do
    extractions = Map.put(state.extractions, step, data)
    state = %{state | extractions: extractions}

    if MapSet.subset?(MapSet.new(@expected), MapSet.new(Map.keys(extractions))) do
      {{:store, extractions}, state}
    else
      {:noop, state}
    end
  end
end
```

### 2. Register the URL pattern
Add an entry to `@patterns` in `lib/page_scraper.ex`:

```elixir
@patterns [
  {"github_pr",
   ~r{^https://github\.com/(?P<owner>[^/]+)/(?P<repo>[^/]+)/pull/(?P<number>\d+)},
   PageScraper.Handlers.GithubPR},
  {"crunchbase_funding",
   ~r{^https://www\.crunchbase\.com/organization/(?P<slug>[^/]+)},
   PageScraper.Handlers.CrunchbaseFunding},
  {"my_pattern",
   ~r{^https://example\.com/dashboard/(?P<slug>[^/]+)},
   PageScraper.Handlers.MyHandler}
]
```

### 3. That's it

The system automatically:

- Matches URLs on every `page_visit` event from the Chrome extension
- Checks `url_exists?` — skips if already stored
- Starts a `PageScraperServer` for each new match
- Calls your handler callbacks
- Stores results via `store_result/3` to your handler's table
- Stops the server on page departure (5s grace period) or idle timeout (30min)

## Content Script Commands

The content script (`mcp-extension/content.js`) supports these DOM commands:

| Command | Payload | Description |
|---------|---------|-------------|
| `dom_query` | `{selector, type, attribute}` | querySelector, return single element data |
| `dom_query_all` | `{selector, type, attribute}` | querySelectorAll, return array |
| `dom_click` | `{selector}` | Click an element |
| `dom_wait` | `{selector, timeout}` | MutationObserver until selector appears |
| `dom_watch` | `{selector, attributes, children, subtree}` | Persistent MutationObserver |
| `dom_unwatch` | `{selector}` | Disconnect a watcher |
| `dom_extract` | `{extractions: [specs]}` | Structured multi-field extraction |

### Extraction Types

| Type | JavaScript | Description |
|------|-----------|-------------|
| `text` | `el.textContent.trim()` | Plain text content |
| `html` | `el.innerHTML` | Raw HTML |
| `attribute` | `el.getAttribute(name)` | Element attribute value |
| `table` | `{headers, rows}` | Structured table data |
| `list` | `[text, ...]` | Array of textContent from querySelectorAll |

## Message Protocol

### Browser → Server

| Type | Key Fields | Description |
|------|-----------|-------------|
| `page_visit` | `url, title, tab_id, timestamp` | User navigated to a page |
| `page_departure` | `url, tab_id, timestamp` | User left a page or closed tab |
| `dom_result` | `pattern_id, request_id, result` | Content script extraction result |
| `dom_mutation` | `pattern_id, selector, data` | MutationObserver fired |

### Server → Browser

| Type | Key Fields | Description |
|------|-----------|-------------|
| `dom_command` | `command, payload, tab_id, pattern_id, request_id` | DOM command for content script |

## Server Lifecycle

### Startup

1. `WsHandler` receives `page_visit` with URL and tab_id
2. `PageScraper.match_url(url)` checks against registered patterns
3. For each match, `PageScraperServer.get_or_start/7` is called
4. `init` checks `handler.url_exists?(url)` — stops if already stored
5. If new, calls `handler.on_match/2` and begins the extraction pipeline
6. If already running, casts `{:update_page, ...}` — only re-extracts if the URL changed

### Extraction Pipeline

```
start_extraction
├── plan has :wait? → send dom_wait → on result → run_extractions
└── no :wait → run_extractions
    ├── send_extractions (one dom_extract per spec)
    ├── dom_watch (set up MutationObservers)
    ├── poll timer (Process.send_after for periodic re-extract)
    └── sequence steps (process one at a time)
```

### Request Tracking

Each DOM command gets a monotonic `request_id`. The browser echoes it back in `dom_result`.
The server maps `request_id → request_info` tuples to dispatch results:

| request_info | Meaning |
|-------------|---------|
| `{:wait_complete}` | dom_wait finished — run extractions |
| `{:watch_setup, selector}` | dom_watch confirmed — record active watch |
| `{:sequence_click}` | Click done — advance sequence |
| `{:sequence_wait}` | Wait done — advance sequence |
| `{:extraction, phase, step_name}` | Extraction result — call handler |

### Timeout Handling

- **Default request timeout:** 15 seconds — if the browser doesn't respond, the pending request is cleaned up
- **dom_wait timeout:** `wait_timeout + 5s` — the server-side timeout must exceed the browser-side wait to avoid a race condition where the server expires the request before the browser responds
- **Idle timeout:** 30 minutes of no activity shuts down the server
- **Departure grace period:** 5 seconds — cancelled if the user returns

### Shutdown

- **Page departure:** 5-second grace period. Cancelled if the user returns.
- **Idle timeout:** 30 minutes of no activity.
- **Handler stop:** Handler returns the `:stop` action.

## IEx Cheatsheet

```elixir
# ── URL Matching ─────────────────────────────────────────────────────────

PageScraper.match_url("https://github.com/owner/repo/pull/123")
#=> [{"github_pr", PageScraper.Handlers.GithubPR,
#=>   %{"owner" => ..., "repo" => ..., "number" => ...}}]

PageScraper.match_url("https://www.crunchbase.com/organization/anthropic")
#=> [{"crunchbase_funding", PageScraper.Handlers.CrunchbaseFunding,
#=>   %{"slug" => "anthropic"}}]

PageScraper.patterns()
#=> [{pattern_id, regex, handler}, ...]
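# ── List every running scraper (sketch) ──────────────────────────────────
# Registry.select/2 is the standard Registry API; this match spec over the
# {key, pid, value} entries is a sketch, not taken from the codebase.
Registry.select(PageScraperRegistry, [{{:"$1", :"$2", :_}, [], [{{:"$1", :"$2"}}]}])
#=> list of {{username, pattern_id}, pid} tuples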
# ── Check Running Scrapers ───────────────────────────────────────────────

Registry.lookup(PageScraperRegistry, {"bob", "github_pr"})
#=> [{#PID<0.123.0>, nil}]

# ── Query Handler Tables (preferred) ─────────────────────────────────────

alias PageScraper.Handlers.GithubPR
GithubPR.query(limit: 5)
GithubPR.query(owner: "elixir-lang", repo: "elixir")

alias PageScraper.Handlers.CrunchbaseFunding
CrunchbaseFunding.query(limit: 5)
CrunchbaseFunding.query(slug: "anthropic")

# ── Row Counts ───────────────────────────────────────────────────────────

Sqler.sql(:page_scraper_db, "SELECT COUNT(*) FROM github_prs")
Sqler.sql(:page_scraper_db, "SELECT COUNT(*) FROM crunchbase_companies")

# ── Re-scrape (delete first, then revisit the page) ──────────────────────

Sqler.sql(:page_scraper_db, "DELETE FROM github_prs WHERE url = 'https://...'")
Sqler.sql(:page_scraper_db, "DELETE FROM crunchbase_companies WHERE slug = 'composio'")

# ── Delete bad rows ──────────────────────────────────────────────────────

Sqler.sql(:page_scraper_db, "DELETE FROM crunchbase_companies WHERE company_name IS NULL")

# ── Query Legacy Generic Table ───────────────────────────────────────────

PageScraper.query_extractions("github_pr", limit: 5)
PageScraper.query_extractions("github_pr", user_id: 1, limit: 10)
```

## Public API

### PageScraper

#### match_url/1

Match a URL against registered patterns. Returns `[{pattern_id, handler, captures}]`.

```elixir
PageScraper.match_url("https://github.com/owner/repo/pull/123")
#=> [{"github_pr", PageScraper.Handlers.GithubPR, %{"owner" => "owner", ...}}]
```

#### store_extraction/2

Store to the legacy generic table. New handlers should use `store_result/3` instead.

#### query_extractions/2

Query the legacy generic table by pattern_id.
#### GithubPR.query/1

```elixir
GithubPR.query()                                      # all PRs, limit 50
GithubPR.query(owner: "elixir-lang", repo: "elixir")  # filter by repo
GithubPR.query(limit: 5)                              # limit results
```

#### CrunchbaseFunding.query/1

```elixir
CrunchbaseFunding.query()                   # all companies, limit 50
CrunchbaseFunding.query(slug: "anthropic")  # filter by slug
CrunchbaseFunding.query(limit: 5)           # limit results
```

### PageScraperServer

#### get_or_start/7

Start or update a scraper. If one is already running, casts an update (only re-extracts if the URL changed).

```elixir
PageScraperServer.get_or_start(
  "bob",
  1,
  "github_pr",
  PageScraper.Handlers.GithubPR,
  "https://github.com/owner/repo/pull/123",
  %{"owner" => "owner"},
  42
)
```

#### page_departed/2

Notify a specific scraper of page departure. Starts a 5-second grace period.

#### page_departed_all/2

Notify all scrapers for a user of a departure. Each checks whether the URL matches.

#### dom_result/4

Deliver a DOM extraction result from the browser.

#### dom_mutation/4

Deliver a DOM mutation notification.

## Dependencies

| Module | Purpose |
|--------|---------|
| `Sqler` | SQLite database wrapper (single `:page_scraper_db` instance shared by all handlers) |
| `Jason` | JSON encoding/decoding for list/map columns |
| `WsRegistry` | Lookup the WsHandler pid for sending DOM commands to the browser |
| `PageScraperRegistry` | Process registry for per-user per-pattern GenServers |
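Putting the pieces together, the `page_visit` dispatch that WsHandler performs (Startup steps 1–3 in the Server Lifecycle section) could look roughly like this. `PageScraper.match_url/1` and `PageScraperServer.get_or_start/7` are documented here; the function name `handle_page_visit/2` and the message/user field shapes are assumptions for illustration:

```elixir
# Hypothetical sketch of the page_visit dispatch inside WsHandler.
def handle_page_visit(msg, %{username: username, user_id: user_id}) do
  %{"url" => url, "tab_id" => tab_id} = msg

  # One get_or_start per matching pattern: multiple patterns can match
  # the same URL, and each gets its own PageScraperServer instance.
  for {pattern_id, handler, captures} <- PageScraper.match_url(url) do
    PageScraperServer.get_or_start(
      username, user_id, pattern_id, handler, url, captures, tab_id
    )
  end

  :ok
end
```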