<h2>Data Layer</h2>
<p class="subtitle">LanceDB Schema, Indexer Bridge, Chunking &amp; Embedding Pipeline</p>

<div class="section">
<h3>LanceDB Table Schema</h3>
<div class="mockup">
<div class="mockup-header">obsidian_chunks table</div>
<div class="mockup-body" style="font-family: monospace; font-size: 12px; padding: 16px; background: #0d1117; color: #c9d1d9;">
<pre style="margin: 0;">┌──────────────────┬──────────────┬─────────────────────────────────────┐
│ Column           │ Type         │ Description                         │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ vector           │ FixedList    │ 1024-dim float32 embedding          │
│                  │ float32[1024]│ (mxbai-embed-large via Ollama)      │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ chunk_id         │ string       │ UUID v4, primary key                │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ chunk_text       │ string       │ Raw markdown text of the chunk      │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ source_file      │ string       │ Relative path from vault root       │
│                  │              │ e.g. "Journal/2024-01-15.md"        │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ source_directory │ string       │ Top-level directory name            │
│                  │              │ e.g. "Journal", "Entertainment"     │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ section          │ string|null  │ Section heading (structured notes)  │
│                  │              │ e.g. "#mentalhealth", "#finance"    │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ date             │ string|null  │ ISO 8601 date parsed from filename  │
│                  │              │ e.g. "2024-01-15"                   │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ tags             │ list&lt;string&gt; │ All hashtags in chunk               │
│                  │              │ e.g. ["#mentalhealth", "#therapy"]  │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ chunk_index      │ int32        │ Position within source document     │
│                  │              │ 0-indexed                           │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ total_chunks     │ int32        │ Total chunks for this source file   │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ modified_at      │ string       │ File mtime, ISO 8601                │
│                  │              │ Used for incremental sync           │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ indexed_at       │ string       │ When this chunk was indexed         │
│                  │              │ ISO 8601 timestamp                  │
└──────────────────┴──────────────┴─────────────────────────────────────┘</pre>
</div>
</div>
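<p>The schema can be mirrored on the Python side as a plain record type. The following is a minimal sketch using only the standard library, not the indexer's actual code: the names <code>ChunkRow</code>, <code>top_level_directory</code> and <code>date_from_filename</code> are hypothetical, and returning an empty string for files at the vault root is an assumption.</p>

```python
import re
import uuid
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import List, Optional

EMBEDDING_DIM = 1024  # mxbai-embed-large output size, per the schema table
_DATE_IN_NAME = re.compile(r"\d{4}-\d{2}-\d{2}")


def top_level_directory(source_file: str) -> str:
    """First path component, e.g. 'Journal' for 'Journal/2024-01-15.md'."""
    parts = PurePosixPath(source_file).parts
    # Assumption: a file directly in the vault root has no top-level directory.
    return parts[0] if len(parts) > 1 else ""


def date_from_filename(source_file: str) -> Optional[str]:
    """ISO 8601 date found in the file name, or None if there is none."""
    match = _DATE_IN_NAME.search(PurePosixPath(source_file).name)
    return match.group(0) if match else None


@dataclass
class ChunkRow:
    """Hypothetical in-memory mirror of one obsidian_chunks row."""
    vector: List[float]
    chunk_text: str
    source_file: str       # relative path from the vault root
    source_directory: str  # top-level directory name
    chunk_index: int       # 0-indexed position within the source file
    total_chunks: int
    modified_at: str       # file mtime, ISO 8601
    indexed_at: str        # indexing time, ISO 8601
    section: Optional[str] = None
    date: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    chunk_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def validate(self) -> None:
        """Reject rows that cannot match the fixed-size vector column."""
        if len(self.vector) != EMBEDDING_DIM:
            raise ValueError(f"expected {EMBEDDING_DIM} dims, got {len(self.vector)}")
        if not 0 <= self.chunk_index < self.total_chunks:
            raise ValueError("chunk_index must be in [0, total_chunks)")
```

<p>Validating the vector length before writing is useful because a fixed-size list column cannot hold an embedding of any other dimension.</p>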
</div>

<div class="section" style="margin-top: 16px;">
<h3>Indexer Bridge: TS ↔ Python</h3>
<div class="mockup">
<div class="mockup-header">How the TypeScript plugin invokes the Python indexer</div>
<div class="mockup-body" style="font-family: monospace; font-size: 12px; line-height: 1.7; padding: 16px;">
<div style="color: #53a8b6; font-weight: bold; margin-bottom: 8px;">INVOCATION</div>
<div style="color: #eee; margin-left: 12px;">child_process.spawn("obsidian-rag", [mode], { env: { ...process.env, ...config } })</div>
<div style="color: #888; margin-left: 12px; margin-bottom: 8px;">Plugin spawns the Python CLI as a subprocess; config is passed as environment variables on top of the inherited environment (PATH, HOME)</div>

<div style="color: #53a8b6; font-weight: bold; margin-bottom: 8px;">COMMUNICATION</div>
<div style="color: #eee; margin-left: 12px;">stdin: Not used</div>
<div style="color: #eee; margin-left: 12px;">stdout: NDJSON progress lines (streamed to the agent via the status tool)</div>
<div style="color: #eee; margin-left: 12px;">stderr: Error output (logged, surfaced on non-zero exit)</div>
<div style="color: #eee; margin-left: 12px;">exit code: 0 = success, 1 = partial failure, 2 = fatal</div>

<div style="color: #53a8b6; font-weight: bold; margin-top: 12px; margin-bottom: 8px;">PROGRESS EVENTS (stdout NDJSON)</div>
<div style="background: #0d1117; padding: 10px; border-radius: 6px; color: #c9d1d9; line-height: 1.6;">
<div>{"type":"progress","phase":"embedding","current":150,"total":677}</div>
<div>{"type":"progress","phase":"storing","current":150,"total":677}</div>
<div>{"type":"complete","indexed_files":677,"total_chunks":3421,"duration_ms":45230,"errors":[]}</div>
</div>

<div style="color: #53a8b6; font-weight: bold; margin-top: 12px; margin-bottom: 8px;">SYNC RESULT FILE</div>
<div style="color: #eee; margin-left: 12px;">Written to ~/.obsidian-rag/sync-result.json on completion</div>
<div style="color: #eee; margin-left: 12px;">Read by plugin on next tool call to update health state</div>
<div style="color: #eee; margin-left: 12px;">Also serves as the source for status tool responses</div>
</div>
</div>
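<p>The stdout protocol can be sketched from the indexer's side as follows. This is an illustration rather than the real CLI: the function names are hypothetical, and mapping a non-empty <code>errors</code> list to exit code 1 is an assumption based on the "partial failure" label above. Each event is one compact JSON object per line, flushed immediately so the parent process sees progress as it happens.</p>

```python
import json
import sys
from typing import IO, Iterable, Iterator, List

EXIT_SUCCESS, EXIT_PARTIAL, EXIT_FATAL = 0, 1, 2


def emit(event: dict, out: IO[str] = sys.stdout) -> None:
    """Write one NDJSON event and flush so the parent can stream it."""
    out.write(json.dumps(event, separators=(",", ":")) + "\n")
    out.flush()


def emit_progress(phase: str, current: int, total: int, out: IO[str] = sys.stdout) -> None:
    emit({"type": "progress", "phase": phase, "current": current, "total": total}, out)


def emit_complete(indexed_files: int, total_chunks: int, duration_ms: int,
                  errors: List[str], out: IO[str] = sys.stdout) -> int:
    """Emit the final event and return the exit code the CLI should use."""
    emit({
        "type": "complete",
        "indexed_files": indexed_files,
        "total_chunks": total_chunks,
        "duration_ms": duration_ms,
        "errors": errors,
    }, out)
    # Assumption: any per-file error downgrades the run to "partial failure".
    return EXIT_PARTIAL if errors else EXIT_SUCCESS


def parse_events(stream: Iterable[str]) -> Iterator[dict]:
    """Consumer-side view: decode NDJSON lines, skipping blank ones."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)
```

<p>A fatal error would bypass <code>emit_complete</code> entirely: the CLI would write to stderr and exit with <code>EXIT_FATAL</code>.</p>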
</div>

<div class="section" style="margin-top: 16px;">
<h3>Chunking Pipeline</h3>
<div class="mockup">
<div class="mockup-header">From markdown file → embeddable chunks</div>
<div class="mockup-body" style="font-family: monospace; font-size: 12px; line-height: 1.7; padding: 16px;">
<div style="display: flex; flex-direction: column; gap: 4px;">
<div style="display: flex; align-items: center; gap: 8px;">
<div style="background: #1a472a; color: #2ecc71; padding: 2px 8px; border-radius: 4px; font-size: 11px;">1. SCAN</div>
<div style="color: #eee;">Walk vault dirs (respect deny/allow lists) → collect *.md files</div>
</div>
<div style="color: #666; margin-left: 40px;">↓</div>
<div style="display: flex; align-items: center; gap: 8px;">
<div style="background: #1a472a; color: #2ecc71; padding: 2px 8px; border-radius: 4px; font-size: 11px;">2. PARSE</div>
<div style="color: #eee;">Extract frontmatter, headings, tags, dates from filename/path</div>
</div>
<div style="color: #666; margin-left: 40px;">↓</div>
<div style="display: flex; align-items: center; gap: 8px;">
<div style="background: #3d3200; color: #f0a500; padding: 2px 8px; border-radius: 4px; font-size: 11px;">3. CHUNK</div>
<div style="color: #eee;">Structured notes → split by section headers</div>
<div style="color: #eee;">Unstructured → sliding window (500 tokens, 100-token overlap)</div>
</div>
<div style="color: #666; margin-left: 40px;">↓</div>
<div style="display: flex; align-items: center; gap: 8px;">
<div style="background: #3d3200; color: #f0a500; padding: 2px 8px; border-radius: 4px; font-size: 11px;">4. ENRICH</div>
<div style="color: #eee;">Attach metadata: source_file, section, date, tags, chunk_index</div>
</div>
<div style="color: #666; margin-left: 40px;">↓</div>
<div style="display: flex; align-items: center; gap: 8px;">
<div style="background: #1a3a5c; color: #53a8b6; padding: 2px 8px; border-radius: 4px; font-size: 11px;">5. EMBED</div>
<div style="color: #eee;">Batch chunks → Ollama /api/embed (mxbai-embed-large, 1024-dim)</div>
<div style="color: #888;">Batch size: 64 chunks per request</div>
</div>
<div style="color: #666; margin-left: 40px;">↓</div>
<div style="display: flex; align-items: center; gap: 8px;">
<div style="background: #1a3a5c; color: #53a8b6; padding: 2px 8px; border-radius: 4px; font-size: 11px;">6. STORE</div>
<div style="color: #eee;">Write vectors + metadata to LanceDB (upsert by chunk_id)</div>
</div>
</div>

<div style="color: #53a8b6; font-weight: bold; margin-top: 16px; margin-bottom: 6px;">INCREMENTAL SYNC LOGIC</div>
<div style="color: #eee; line-height: 1.6;">
<div>1. Read last_sync_mtime from sync-result.json</div>
<div>2. Walk the vault and compare each file's mtime against last_sync_mtime</div>
<div>3. New/modified files → full chunk pipeline</div>
<div>4. Deleted files → remove chunks by source_file from LanceDB</div>
<div>5. Unchanged files → skip entirely</div>
</div>
</div>
</div>
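<p>The CHUNK and EMBED steps can be sketched as three small helpers. This is a simplified illustration under stated assumptions, not the indexer's implementation: all function names are hypothetical, "tokens" are whatever sequence the caller passes in (a whitespace split only approximates model tokens), and section headers are assumed to be ATX markdown headings (<code>#</code> followed by a space), which may differ from the hashtag-style section names in the schema examples.</p>

```python
import re
from typing import Iterator, List, Optional, Sequence, Tuple

_HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def split_by_headings(markdown: str) -> List[Tuple[Optional[str], str]]:
    """Structured notes: one (heading, body) pair per section.

    Text before the first heading is returned with heading None.
    Simplification: heading-like lines inside code fences are not skipped.
    """
    sections: List[Tuple[Optional[str], str]] = []
    heading: Optional[str] = None
    body: List[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if heading is not None or text:
            sections.append((heading, text))

    for line in markdown.splitlines():
        match = _HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(1).strip(), []
        else:
            body.append(line)
    flush()
    return sections


def sliding_window(tokens: Sequence[str], size: int = 500,
                   overlap: int = 100) -> List[List[str]]:
    """Unstructured notes: fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def batched(items: Sequence, batch_size: int = 64) -> Iterator[Sequence]:
    """Group chunks into embedding requests of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

<p>With the defaults, a 1000-token note yields windows starting at tokens 0, 400 and 800. At 64 chunks per request, the 3421 chunks from the example run above would take 54 embedding requests.</p>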
</div>
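<p>The incremental sync decision can be expressed as a pure function over file mtimes. This sketch assumes the caller supplies two inputs that the steps above imply but do not spell out: a mapping of vault-relative paths to mtimes (from the walk) and the set of <code>source_file</code> values currently in the table. The names <code>SyncPlan</code> and <code>plan_sync</code> are hypothetical.</p>

```python
from typing import Dict, Iterable, List, NamedTuple


class SyncPlan(NamedTuple):
    to_index: List[str]   # new or modified files: run the full chunk pipeline
    to_delete: List[str]  # indexed files no longer in the vault
    unchanged: List[str]  # skipped entirely


def plan_sync(vault_mtimes: Dict[str, float],
              indexed_files: Iterable[str],
              last_sync_mtime: float) -> SyncPlan:
    """Classify files by comparing each mtime against the last sync time."""
    indexed = set(indexed_files)
    to_index: List[str] = []
    unchanged: List[str] = []
    for path, mtime in sorted(vault_mtimes.items()):
        # A file that was never indexed counts as new even if its mtime is old
        # (e.g. a note copied into the vault with its original timestamp).
        if path not in indexed or mtime > last_sync_mtime:
            to_index.append(path)
        else:
            unchanged.append(path)
    to_delete = sorted(indexed - set(vault_mtimes))
    return SyncPlan(to_index, to_delete, unchanged)
```

<p>Keeping the classification separate from the filesystem walk and the database calls makes the rule easy to test without a vault or a LanceDB table.</p>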

<div class="section" style="margin-top: 20px;">
<h3>Key Design Points</h3>
<div class="pros-cons">
<div class="pros"><h4>Why this works</h4>
<ul>
<li>Subprocess bridge is simple: no socket, no HTTP, just spawn + stdout</li>
<li>NDJSON progress lets the plugin stream indexing status to the agent in real time</li>
<li>LanceDB upsert by chunk_id makes incremental sync idempotent, provided each chunk keeps the same chunk_id across runs</li>
<li>Batch embedding (64 chunks per request) balances throughput against Ollama memory use</li>
<li>Two chunking strategies handle both structured and unstructured notes</li>
</ul>
</div>
<div class="cons"><h4>Trade-offs</h4>
<ul>
<li>Subprocess spawn adds ~200 ms per invocation (acceptable for a CLI)</li>
<li>sync-result.json is a shared filesystem contract between the plugin and the indexer, so writes need careful locking</li>
<li>500-token sliding window is configurable but not adaptive to content</li>
</ul>
</div>
</div>
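<p>One way to soften the sync-result.json trade-off is to write the file atomically, so a reader never sees a half-written result. The sketch below uses a different technique from locking (write to a temporary file, then rename) and is offered as an assumption about what could work here, not as what the indexer does. It protects readers from partial files but does not coordinate two indexers writing at once. The function names are hypothetical.</p>

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def write_sync_result(result: dict, path: Path) -> None:
    """Write JSON to a temp file in the same directory, then rename over path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(
        dir=str(path.parent), prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(result, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk first
        os.replace(tmp_name, path)  # atomic swap on the same filesystem
    except BaseException:
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def read_sync_result(path: Path) -> Optional[dict]:
    """Return the last sync result, or None if no sync has completed yet."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return None
```

<p>The temporary file is created next to the target because a rename is only atomic within a single filesystem.</p>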
</div>