Obsidian RAG Plugin for OpenClaw — Design Spec

Date: 2026-04-10
Status: Approved
Author: Santhosh Janardhanan

Overview

An OpenClaw plugin that enables semantic search through Obsidian vault notes using RAG (Retrieval-Augmented Generation). The plugin allows OpenClaw to respond to natural language queries about personal journal entries, shopping lists, financial records, health data, podcast notes, and project ideas stored in an Obsidian vault.

Problem Statement

Personal knowledge is fragmented across 677+ markdown files in an Obsidian vault, organized by topic but not searchable by meaning. Questions like "How was my mental health in 2024?" or "How much do I owe Sreenivas?" require reading multiple files across directories and synthesizing the answer. The plugin provides semantic search to surface relevant context.

Architecture

Approach: Separate Indexer Service + Thin Plugin

KnowledgeVault → Python Indexer (CLI) → LanceDB (filesystem)
                                           ↑ query
OpenClaw → TS Plugin (tools) ─────────────┘

Python Indexer: Handles vault scanning, markdown parsing, chunking, embedding generation via Ollama, and LanceDB storage. Runs as a CLI tool.
TypeScript Plugin: Registers OpenClaw tools that query the pre-built LanceDB index. Thin wrapper that provides the agent interface.
LanceDB: Embedded vector database stored on local filesystem at ~/.obsidian-rag/vectors.lance. No server required.

Technology Choices

Component	Choice	Rationale
Embedding model	`mxbai-embed-large` (1024-dim) via Ollama	Local, free, meets the 1024+ dimension requirement, strong retrieval accuracy for its size
Vector store	LanceDB (embedded)	No server, file-based, Rust-based efficiency, zero-copy versioning for incremental updates
Indexer language	Python	Richer embedding/ML ecosystem, better markdown parsing libraries
Plugin language	TypeScript	Native OpenClaw ecosystem, type safety, SDK examples
Config	Separate `.obsidian-rag/config.json`	Keeps plugin config separate from OpenClaw config

CLI Commands (Python Indexer)

Command	Purpose
`obsidian-rag index`	Initial full index of the vault (first-time setup)
`obsidian-rag sync`	Incremental — only process files modified since last sync
`obsidian-rag reindex`	Force a full reindex (delete the existing index and rebuild from scratch)
`obsidian-rag status`	Show index health: total docs, last sync time, unindexed files
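
The four commands above could be wired up with the standard library's `argparse`. This is a minimal sketch, not the actual `cli.py`: `build_parser` and `main` are hypothetical names, and the dispatch is stubbed out to return the chosen command name.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the four subcommands listed in the table."""
    parser = argparse.ArgumentParser(prog="obsidian-rag")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("index", help="Initial full index of the vault")
    sub.add_parser("sync", help="Process only files modified since last sync")
    sub.add_parser("reindex", help="Delete the existing index and rebuild it")
    sub.add_parser("status", help="Show doc count, last sync time, unindexed files")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> str:
    args = build_parser().parse_args(argv)
    # Hypothetical dispatch point: a real CLI would call into the indexer here.
    return args.command
```

An unknown subcommand makes `argparse` exit with a usage error, so no extra validation is needed for the command name.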

Plugin Tools (TypeScript)

`obsidian_rag_search`

Primary search tool for OpenClaw agent.

Parameters:

query (required, string): Natural language question
max_results (optional, default 5): Max chunks to return
directory_filter (optional, string or string[]): Limit to subdirectories (e.g., ["Journal", "Entertainment Index"])
date_range (optional, object): { from: "2025-01-01", to: "2025-12-31" }
tags (optional, string[]): Filter by hashtags
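
The filter parameters above can be read as predicates over chunk metadata. The sketch below is written in Python for illustration, although the plugin tool itself is TypeScript. It assumes that `tags` matches a chunk carrying any of the requested tags, that `date_range` is inclusive, and that undated chunks are excluded whenever a date range is set; the spec does not pin down any of these. `SearchParams` and `matches` are hypothetical names.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class SearchParams:
    query: str
    max_results: int = 5
    directory_filter: List[str] = field(default_factory=list)
    date_range: Optional[Tuple[date, date]] = None  # inclusive (from, to)
    tags: List[str] = field(default_factory=list)


def matches(chunk: dict, params: SearchParams) -> bool:
    """Return True if the chunk's metadata passes every filter that is set."""
    if params.directory_filter and chunk.get("source_directory") not in params.directory_filter:
        return False
    if params.date_range is not None:
        chunk_date = chunk.get("date")
        if chunk_date is None:
            return False  # assumption: undated chunks never match a date filter
        start, end = params.date_range
        if not (start <= chunk_date <= end):
            return False
    # Assumption: a chunk matches if it carries at least one requested tag.
    if params.tags and not set(params.tags) & set(chunk.get("tags", [])):
        return False
    return True
```

An empty filter is treated as "no restriction", so a call with only `query` set matches every chunk.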

`obsidian_rag_index`

Trigger indexing from within OpenClaw.

Parameters:

mode (required, enum): "full" | "sync" | "reindex"

`obsidian_rag_status`

Check index health — doc count, last sync, unindexed files.
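
One way the "unindexed files" figure could be computed is to compare file mtimes in the vault against the `modified_at` values stored with each chunk. The sketch assumes the index can return a map from `source_file` to the recorded mtime; `indexed_mtimes` and `find_unindexed` are hypothetical names, and deny-list handling is left out for brevity.

```python
from pathlib import Path
from typing import Dict, List


def find_unindexed(vault: Path, indexed_mtimes: Dict[str, float]) -> List[str]:
    """List markdown files that are missing from the index or newer than it."""
    stale = []
    for path in sorted(vault.rglob("*.md")):
        rel = path.relative_to(vault).as_posix()
        recorded = indexed_mtimes.get(rel)
        # A file is stale if it was never indexed or changed after indexing.
        if recorded is None or path.stat().st_mtime > recorded:
            stale.append(rel)
    return stale
```

The same comparison would also give `obsidian-rag sync` its list of files to reprocess.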

`obsidian_rag_memory_store`

Commit important facts to OpenClaw's memory for faster future retrieval.

Parameters:

key (string): Identifier
value (string): The fact to remember
source (string): Source file path

Auto-suggest logic: When search results contain financial, health, or commitment patterns, the plugin suggests the agent use obsidian_rag_memory_store. The agent decides whether to commit.
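
A minimal sketch of the auto-suggest check, using the pattern lists from the `memory.patterns` config in this spec. Case-insensitive substring matching is an assumption, and `suggest_categories` is a hypothetical name.

```python
from typing import Dict, List

# Pattern lists copied from the `memory.patterns` section of the config.
PATTERNS: Dict[str, List[str]] = {
    "financial": ["owe", "owed", "debt", "paid", "$", "spent", "spend"],
    "health": ["#mentalhealth", "#physicalhealth", "medication", "therapy"],
    "commitments": ["shopping list", "costco", "amazon", "grocery"],
}


def suggest_categories(text: str, patterns: Dict[str, List[str]] = PATTERNS) -> List[str]:
    """Return the pattern categories whose keywords appear in the text."""
    lowered = text.lower()
    return [
        name
        for name, words in patterns.items()
        if any(word.lower() in lowered for word in words)
    ]
```

Plain substring matching is cheap but over-triggers: "owe" also matches "however". Word-boundary matching may be worth considering if false suggestions become noisy.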

Chunking Strategy

Structured notes (Journal entries)

Chunk by section headers (#mentalhealth, #finance, etc.). Each section becomes its own chunk with metadata: source_file, section_name, date, tags.
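
Section-based chunking might look like the following. The sketch assumes a section starts on any line beginning with `#`, which covers both a bare tag line such as `#mentalhealth` and a markdown heading such as `# Finance`; the exact header format in the vault is not stated in this spec. `chunk_by_sections` is a hypothetical name, and only `source_file` and `section_name` are attached here.

```python
import re
from typing import Dict, List

# Assumption: a header is a line of one or more '#' followed by a name.
HEADER = re.compile(r"^#+\s*(\S.*)$")


def chunk_by_sections(text: str, source_file: str) -> List[Dict[str, str]]:
    """Split a note into one chunk per section header."""
    chunks: List[Dict[str, str]] = []
    name = None
    lines: List[str] = []

    def flush() -> None:
        # Emit the section collected so far, skipping empty bodies.
        body = "\n".join(lines).strip()
        if body:
            chunks.append(
                {"source_file": source_file, "section_name": name or "", "text": body}
            )

    for line in text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            flush()
            name, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Text before the first header is kept as a chunk with an empty `section_name`. Lines starting with `#` inside code blocks would be misread as headers, so a real chunker would need to track fences.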

Unstructured notes (shopping lists, project ideas, entertainment index)

Sliding window chunking (500 tokens, 100 token overlap). Each chunk gets metadata: source_file, chunk_index, total_chunks, headings.
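
With a 500-token window and 100-token overlap, each window starts 400 tokens after the previous one. A minimal sketch, assuming the text has already been split into tokens (the tokenizer is not specified here, and `sliding_window` is a hypothetical name):

```python
from typing import List


def sliding_window(tokens: List[str], size: int = 500, overlap: int = 100) -> List[List[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        # Stop once a window reaches the end, to avoid a trailing duplicate.
        if start + size >= len(tokens):
            break
    return windows
```

For a 1,200-token note this yields three chunks starting at tokens 0, 400, and 800; the last one holds the remaining 400 tokens. `total_chunks` in the metadata is then `len(windows)`.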

Metadata per chunk

source_file: Relative path from vault root
source_directory: Top-level directory (enables directory filtering)
section: Section heading (for structured notes)
date: Parsed from filename (journal entries)
tags: All hashtags found in the chunk
chunk_index: Position within the document
modified_at: File mtime for incremental sync
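
The fields above map naturally onto a small record type. The sketch below assumes journal filenames contain a `YYYY-MM-DD` date and that a hashtag is `#` followed by a letter and then word characters; both are assumptions about the vault, and `ChunkMetadata` and `build_metadata` are hypothetical names.

```python
import datetime as dt
import re
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List, Optional

TAG = re.compile(r"(?<!\w)#([A-Za-z][\w/-]*)")
FILENAME_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")  # assumed filename format


@dataclass
class ChunkMetadata:
    source_file: str
    source_directory: str
    section: Optional[str]
    date: Optional[dt.date]
    tags: List[str]
    chunk_index: int
    modified_at: float


def build_metadata(rel_path: str, text: str, chunk_index: int,
                   modified_at: float, section: Optional[str] = None) -> ChunkMetadata:
    """Derive per-chunk metadata from the vault-relative path and chunk text."""
    path = PurePosixPath(rel_path)
    parsed = None
    match = FILENAME_DATE.search(path.name)
    if match:
        try:
            parsed = dt.date(*(int(g) for g in match.groups()))
        except ValueError:
            parsed = None  # digits that do not form a real calendar date
    # Files at the vault root have no top-level directory.
    top = path.parts[0] if len(path.parts) > 1 else ""
    tags = sorted({"#" + t for t in TAG.findall(text)})
    return ChunkMetadata(rel_path, top, section, parsed, tags, chunk_index, modified_at)
```

Markdown headings such as `# Title` are not picked up as tags, because the pattern requires a letter directly after the `#`.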

Security & Privacy

Path traversal prevention — All file reads restricted to configured vault path. No ../, symlinks outside vault, or absolute paths.
Input sanitization — Strip HTML tags, remove executable code blocks, normalize whitespace. All vault content treated as untrusted.
Local-only enforcement — Ollama on localhost, LanceDB on filesystem. Network audit test verifies no outbound requests.
Directory allow/deny lists — Config supports deny_dirs (default: .obsidian, .trash, zzz-Archive, .git) and allow_dirs.
Sensitive content guard — Detects health (#mentalhealth, #physicalhealth), financial debt, and personal relationship content. Blocks external API transmission of sensitive content. Requires user confirmation if an external embedding endpoint is configured.
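
The path traversal and deny-list rules can be enforced in one place before any file read. This is a sketch rather than the actual `security.py`: `resolve_in_vault` is a hypothetical helper, and it assumes a deny-listed name blocks the path wherever it appears, not only at the top level.

```python
from pathlib import Path
from typing import Iterable

DEFAULT_DENY = (".obsidian", ".trash", "zzz-Archive", ".git")


def resolve_in_vault(vault: Path, candidate: str,
                     deny_dirs: Iterable[str] = DEFAULT_DENY) -> Path:
    """Resolve a vault-relative path, rejecting anything that escapes the vault."""
    if Path(candidate).is_absolute():
        raise PermissionError(f"absolute paths are not allowed: {candidate}")
    root = vault.resolve()
    # resolve() collapses ".." and follows symlinks, so the check sees the real target.
    target = (root / candidate).resolve()
    try:
        rel = target.relative_to(root)
    except ValueError:
        raise PermissionError(f"path escapes vault: {candidate}") from None
    denied = set(deny_dirs)
    if any(part in denied for part in rel.parts):
        raise PermissionError(f"path is in a denied directory: {candidate}")
    return target
```

Checking the resolved path rather than the raw string means a symlink inside the vault that points outside it is rejected as well.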

Configuration

Config file at ~/.obsidian-rag/config.json:

{
  "vault_path": "/home/san/KnowledgeVault/Default",
  "embedding": {
    "provider": "ollama",
    "model": "mxbai-embed-large",
    "base_url": "http://localhost:11434",
    "dimensions": 1024
  },
  "vector_store": {
    "type": "lancedb",
    "path": "~/.obsidian-rag/vectors.lance"
  },
  "indexing": {
    "chunk_size": 500,
    "chunk_overlap": 100,
    "file_patterns": ["*.md"],
    "deny_dirs": [".obsidian", ".trash", "zzz-Archive", ".git"],
    "allow_dirs": []
  },
  "security": {
    "require_confirmation_for": ["health", "financial_debt"],
    "sensitive_sections": ["#mentalhealth", "#physicalhealth", "#Relations"],
    "local_only": true
  },
  "memory": {
    "auto_suggest": true,
    "patterns": {
      "financial": ["owe", "owed", "debt", "paid", "$", "spent", "spend"],
      "health": ["#mentalhealth", "#physicalhealth", "medication", "therapy"],
      "commitments": ["shopping list", "costco", "amazon", "grocery"]
    }
  }
}
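
JSON has no notion of `~`, so the loader has to expand paths such as `~/.obsidian-rag/vectors.lance` itself. A minimal loader sketch follows; the `DEFAULTS` subset, the merge behaviour, and the overlap validation are assumptions rather than part of this spec, and `load_config` is a hypothetical name.

```python
import copy
import json
from pathlib import Path
from typing import Any, Dict

# Assumed defaults, taken from the example values in the config.
DEFAULTS: Dict[str, Any] = {
    "indexing": {"chunk_size": 500, "chunk_overlap": 100, "file_patterns": ["*.md"]},
    "security": {"local_only": True},
}


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of base with override applied, merging nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(path: Path) -> Dict[str, Any]:
    """Load config.json, fill in defaults, and expand '~' in path settings."""
    cfg = deep_merge(DEFAULTS, json.loads(path.read_text(encoding="utf-8")))
    if "vault_path" not in cfg:
        raise ValueError("vault_path is required")
    cfg["vault_path"] = str(Path(cfg["vault_path"]).expanduser())
    store = cfg.get("vector_store", {})
    if "path" in store:
        store["path"] = str(Path(store["path"]).expanduser())
    indexing = cfg["indexing"]
    if not 0 <= indexing["chunk_overlap"] < indexing["chunk_size"]:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    return cfg
```

Rejecting an overlap that is not smaller than the chunk size guards against a sliding window that never advances.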

Project Structure

obsidian-rag-skill/
├── README.md
├── LICENSE
├── .gitignore
├── openclaw.plugin.json
├── package.json
├── tsconfig.json
├── src/
│   ├── index.ts
│   ├── tools/
│   │   ├── search.ts
│   │   ├── index.ts
│   │   ├── status.ts
│   │   └── memory.ts
│   ├── services/
│   │   ├── vault-watcher.ts
│   │   ├── indexer-bridge.ts
│   │   └── security-guard.ts
│   └── utils/
│       ├── config.ts
│       └── lancedb.ts
├── python/
│   ├── pyproject.toml
│   ├── obsidian_rag/
│   │   ├── __init__.py
│   │   ├── cli.py
│   │   ├── indexer.py
│   │   ├── chunker.py
│   │   ├── embedder.py
│   │   ├── vector_store.py
│   │   ├── security.py
│   │   └── config.py
│   └── tests/
│       ├── test_chunker.py
│       ├── test_security.py
│       ├── test_embedder.py
│       ├── test_vector_store.py
│       └── test_indexer.py
├── tests/
│   ├── tools/
│   │   ├── search.test.ts
│   │   ├── index.test.ts
│   │   └── memory.test.ts
│   └── services/
│       ├── vault-watcher.test.ts
│       └── security-guard.test.ts
└── docs/
    └── superpowers/
        └── specs/

Testing Strategy

Python: pytest with mocked Ollama, path traversal tests, input sanitization, LanceDB CRUD
TypeScript: vitest with tool parameter validation, security guard, search filter logic
Security: Dedicated test suites for path traversal, XSS, prompt injection, network audit, sensitive content detection

Publishing

Published to ClawHub as both a skill (SKILL.md) and a plugin package:

clawhub skill publish ./skill --slug obsidian-rag --version 1.0.0
clawhub package publish santhosh/obsidian-rag

Install: openclaw plugins install clawhub:obsidian-rag
