Data Layer

LanceDB Schema, Indexer Bridge, Chunking & Embedding Pipeline

LanceDB Table Schema

obsidian_chunks table
┌──────────────────┬──────────────┬─────────────────────────────────────┐
│ Column           │ Type         │ Description                         │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ vector           │ FixedList    │ 1024-dim float32 embedding          │
│                  │ float32[1024]│ (mxbai-embed-large via Ollama)     │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ chunk_id         │ string       │ UUID v4, primary key                │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ chunk_text       │ string       │ Raw markdown text of the chunk      │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ source_file      │ string       │ Relative path from vault root       │
│                  │              │ e.g. "Journal/2024-01-15.md"       │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ source_directory │ string       │ Top-level directory name            │
│                  │              │ e.g. "Journal", "Entertainment"    │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ section          │ string|null  │ Section heading (structured notes) │
│                  │              │ e.g. "#mentalhealth", "#finance"   │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ date             │ string|null  │ ISO 8601 date parsed from filename │
│                  │              │ e.g. "2024-01-15"                  │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ tags             │ list<string> │ All hashtags in chunk              │
│                  │              │ e.g. ["#mentalhealth", "#therapy"] │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ chunk_index      │ int32        │ Position within source document    │
│                  │              │ 0-indexed                          │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ total_chunks     │ int32        │ Total chunks for this source file  │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ modified_at      │ string       │ File mtime, ISO 8601               │
│                  │              │ Used for incremental sync           │
├──────────────────┼──────────────┼─────────────────────────────────────┤
│ indexed_at       │ string       │ When this chunk was indexed        │
│                  │              │ ISO 8601 timestamp                 │
└──────────────────┴──────────────┴─────────────────────────────────────┘
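The schema above can be expressed in Python with PyArrow when the table is first created. A minimal sketch, assuming the LanceDB Python client and a database path of `~/.obsidian-rag/lancedb` (the path and `exist_ok` usage are assumptions, not part of the spec above):

```python
import lancedb
import pyarrow as pa

# Arrow schema mirroring the obsidian_chunks table above.
schema = pa.schema([
    pa.field("vector", pa.list_(pa.float32(), 1024)),  # fixed-size 1024-dim embedding
    pa.field("chunk_id", pa.string()),
    pa.field("chunk_text", pa.string()),
    pa.field("source_file", pa.string()),
    pa.field("source_directory", pa.string()),
    pa.field("section", pa.string()),        # nullable by default
    pa.field("date", pa.string()),           # nullable, ISO 8601
    pa.field("tags", pa.list_(pa.string())),
    pa.field("chunk_index", pa.int32()),
    pa.field("total_chunks", pa.int32()),
    pa.field("modified_at", pa.string()),
    pa.field("indexed_at", pa.string()),
])

db = lancedb.connect("~/.obsidian-rag/lancedb")
table = db.create_table("obsidian_chunks", schema=schema, exist_ok=True)
```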

Indexer Bridge: TS ↔ Python

How the TypeScript plugin invokes the Python indexer
INVOCATION
child_process.spawn("obsidian-rag", [mode], { env: { ...process.env, ...config } })
Plugin spawns the Python CLI as a subprocess; process.env is spread in first so PATH still resolves the executable
COMMUNICATION
stdin: Not used
stdout: NDJSON progress lines (streamed to agent via status)
stderr: Error output (logged, surfaced on non-zero exit)
exit code: 0 = success, 1 = partial failure, 2 = fatal
PROGRESS EVENTS (stdout NDJSON)
{"type":"progress","phase":"embedding","current":150,"total":677}
{"type":"progress","phase":"storing","current":150,"total":677}
{"type":"complete","indexed_files":677,"total_chunks":3421,"duration_ms":45230,"errors":[]}
SYNC RESULT FILE
Written to ~/.obsidian-rag/sync-result.json on completion
Read by plugin on next tool call to update health state
Also serves as the source for status tool responses

Chunking Pipeline

From markdown file → embeddable chunks
1. SCAN
Walk vault dirs (respect deny/allow lists) → collect *.md files
2. PARSE
Extract frontmatter, headings, tags, dates from filename/path
3. CHUNK
Structured notes → split by section headers
Unstructured → sliding window (500 tokens, 100-token overlap)
4. ENRICH
Attach metadata: source_file, section, date, tags, chunk_index
5. EMBED
Batch chunks → Ollama /api/embed (mxbai-embed-large, 1024-dim)
Batch size: 64 chunks per request
6. STORE
Write vectors + metadata to LanceDB (upsert by chunk_id)
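Steps 3 and 5 can be sketched as below. This is a simplified sketch: tokens are approximated by whitespace splitting (the real pipeline may use a proper tokenizer), and `embed_batch` assumes Ollama's `/api/embed` endpoint, which accepts a list of inputs and returns an `embeddings` array:

```python
import json
import urllib.request

def sliding_window_chunks(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split unstructured text into overlapping windows.
    Tokens are approximated as whitespace-separated words."""
    tokens = text.split()
    if not tokens:
        return []
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # final window reached the end of the document
    return chunks

def embed_batch(texts: list[str], model: str = "mxbai-embed-large",
                url: str = "http://localhost:11434/api/embed") -> list[list[float]]:
    """POST one batch (<= 64 chunks) to Ollama; returns 1024-dim vectors."""
    body = json.dumps({"model": model, "input": texts}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embeddings"]
```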
INCREMENTAL SYNC LOGIC
1. Read last_sync_mtime from sync-result.json
2. Walk vault, compare each file's mtime vs last_sync_mtime
3. New/modified files → full chunk pipeline
4. Deleted files → remove chunks by source_file from LanceDB
5. Unchanged files → skip entirely
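The sync decision above can be sketched as a single walk that partitions files into changed and deleted sets. A sketch under stated assumptions: `last_sync_mtime` has already been read from sync-result.json and converted to epoch seconds, and `indexed_files` holds the relative paths known to LanceDB (both names hypothetical):

```python
from pathlib import Path

def classify_files(vault: Path, indexed_files: set[str],
                   last_sync_mtime: float) -> tuple[list[str], list[str]]:
    """Walk the vault and decide what needs re-indexing.
    Returns (changed, deleted): changed = new/modified *.md files to run
    through the full chunk pipeline; deleted = previously indexed files
    that no longer exist (their chunks are removed by source_file)."""
    changed, present = [], set()
    for path in vault.rglob("*.md"):
        rel = str(path.relative_to(vault))
        present.add(rel)
        if path.stat().st_mtime > last_sync_mtime:
            changed.append(rel)   # new or modified -> full pipeline
        # else: unchanged -> skip entirely
    deleted = sorted(indexed_files - present)
    return changed, deleted
```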

Key Design Points

Why this works

  • Subprocess bridge is simple — no socket, no HTTP, just spawn+stdout
  • NDJSON progress lets the plugin stream indexing status to the agent in real time
  • LanceDB upsert by chunk_id makes incremental sync idempotent
  • Batch embedding (64 chunks) balances throughput vs Ollama memory
  • Two chunking strategies handle both structured and unstructured notes

Trade-offs

  • Subprocess spawn adds ~200ms per invocation (acceptable for CLI)
  • sync-result.json is a shared filesystem contract — needs careful locking
  • 500-token sliding window is configurable but not adaptive to content