Personal Companion AI — Design Spec
Date: 2026-04-13
Status: Approved
Author: Santhosh Janardhanan
1. Overview
A fully local, privacy-first AI companion trained on Santhosh's Obsidian vault. The companion is a reflective confidante—not a digital twin or advisor—that can answer questions about past events, summarize relationships, and explore life patterns alongside him.
The system combines a fine-tuned local LLM (for reasoning style and reflective voice) with a RAG layer (for factual retrieval from 677+ vault notes).
2. Core Philosophy
- Companion, not clone: The AI does not speak as Santhosh. It speaks to him.
- Fully local: No vault data leaves the machine. Ollama, LanceDB, and inference all run locally.
- Evolving self: Quarterly model retraining + daily RAG sync keeps the companion aligned with his changing life.
- Minimal noise: Notifications are quiet (streaming text + logs). No pop-ups.
3. Architecture
Approach: Decoupled Services
Three independent processes:
┌─────────────────────────────────────────────────────────────┐
│             Companion Chat (Web UI / CLI)                   │
│       React + Vite ←───→ FastAPI backend (orchestrator)     │
└─────────────────────────────────────────────────────────────┘
                           │
      ┌────────────────────┼────────────────────┐
      ↓                    ↓                    ↓
┌──────────────┐  ┌─────────────────┐  ┌─────────────────┐
│  Fine-tuned  │  │   RAG Engine    │  │  Vault Indexer  │
│   8B Model   │  │   (LanceDB)     │  │  (daemon/CLI)   │
│ (llama.cpp)  │  │                 │  │                 │
│              │  │ • semantic      │  │ • watches vault │
│  Quarterly   │  │   search        │  │ • chunks/embeds │
│  retrain     │  │ • hybrid        │  │ • daily auto    │
│              │  │   filters       │  │   sync          │
│              │  │ • relationship  │  │ • manual        │
│              │  │   graph         │  │   trigger       │
└──────────────┘  └─────────────────┘  └─────────────────┘
Service Responsibilities
| Service | Language | Role |
|---|---|---|
| Companion Engine | Python (FastAPI) + TS (React) | Chat UI, session memory, prompt orchestration |
| RAG Engine | Python | LanceDB queries, embedding cache, hybrid search |
| Vault Indexer | Python (CLI + daemon) | File watching, chunking, embedding via Ollama |
| Model Forge | Python (on-demand) | QLoRA fine-tuning, GGUF export |
4. Data Flow: A Chat Turn
- User asks: "I've been feeling off about my friendship with Vinay. What do you think?"
- Orchestrator detects this needs relationship context + reflective reasoning.
- RAG Engine queries LanceDB for `Vinay`, `friendship`, `#Relations` — returns top 8 chunks.
- Prompt construction:
- System: Companion persona + reasoning instructions
- Retrieved context: Relevant vault entries
- Conversation history: Last 20 turns
- User message
- Local LLM streams a reflective response.
- Optional: Companion asks a gentle follow-up to deepen reflection.
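The prompt-construction step above can be sketched as a single assembly function. This is an illustrative sketch only: `build_prompt`, the chunk dict shape (`source_file`/`text`), and the context formatting are assumptions, not the actual orchestrator API.

```python
# Minimal sketch of prompt assembly for one chat turn.
# build_prompt and the chunk dict shape are illustrative, not the real API.

SYSTEM_PROMPT = "You are a thoughtful companion..."  # persona + reasoning instructions

def build_prompt(user_message, retrieved_chunks, history, max_history_turns=20):
    """Assemble the message list sent to the local LLM: persona, retrieved
    vault context, the last N conversation turns, then the user message."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    if retrieved_chunks:
        context = "\n\n".join(
            f"[{c['source_file']}] {c['text']}" for c in retrieved_chunks
        )
        messages.append(
            {"role": "system", "content": f"Relevant vault entries:\n{context}"}
        )
    messages.extend(history[-max_history_turns:])  # session memory window (20 turns)
    messages.append({"role": "user", "content": user_message})
    return messages
```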
5. Technology Choices
| Component | Choice | Rationale |
|---|---|---|
| Base model | Meta-Llama-3.1-8B-Instruct | Strong reasoning, fits 12GB VRAM quantized |
| Fine-tuning | Unsloth + QLoRA (4-bit) | Fast, memory-efficient, runs on RTX 5070 |
| Inference | llama.cpp (GGUF) | Mature, fast local inference, easy GPU layer tuning |
| Embedding | mxbai-embed-large via Ollama | 1024-dim, local, high quality |
| Vector store | LanceDB (embedded) | File-based, no server, Rust-backed |
| Backend | FastAPI + WebSockets | Streaming chat, simple Python API |
| Frontend | React + Vite | Lightweight, fast dev loop |
| File watcher | watchdog (Python) | Reliable cross-platform vault monitoring |
6. Fine-Tuning Strategy
What the model learns
- Reflective reasoning style: How Santhosh thinks through situations
- Values and priorities: What he tends to weigh in decisions
- Communication patterns: His tone in journal entries (direct, questioning, humorous)
- Relationship dynamics: Patterns in how he describes people over time
What stays in RAG
- Specific dates, events, amounts
- Exact quotes and conversations
- Recent updates (between retrainings)
- Granular facts
Training data format
Curated "reflection examples" from the vault, formatted as conversation turns:
{
"messages": [
{"role": "system", "content": "You are a thoughtful companion..."},
{"role": "user", "content": "Journal entry about Vinay visit... What do you notice?"},
{"role": "assistant", "content": "It seems like you value these drop-ins..."}
]
}
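Curated examples in this shape are typically stored as JSONL, one conversation per line, which is the common input format for chat fine-tuning. A small sketch (helper names and the file layout are assumptions, not prescribed by the spec):

```python
import json

def write_training_jsonl(examples, path):
    """Write curated reflection examples as JSONL: one {"messages": [...]}
    object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps({"messages": ex["messages"]}, ensure_ascii=False) + "\n")

def read_training_jsonl(path):
    """Read the JSONL back into a list of example dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```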
Training schedule
- Quarterly retrain: Automatic reminder (log + chat stream) every 90 days.
- Manual trigger: User can initiate retrain anytime via CLI/UI.
- Pipeline:
vault → extract reflections → curate → train → export GGUF → swap model file
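The final "swap model file" step benefits from being atomic, so the inference backend never reads a half-written GGUF. One way to do it, sketched with stdlib only (function name and paths are hypothetical):

```python
import os
import shutil
import tempfile

def swap_model_file(new_gguf_path, live_model_path):
    """Replace the live GGUF atomically: copy the freshly exported model into
    the live model's directory, then os.replace(), which is atomic on POSIX
    when source and destination are on the same filesystem."""
    live_dir = os.path.dirname(live_model_path) or "."
    fd, tmp = tempfile.mkstemp(dir=live_dir, suffix=".gguf.partial")
    os.close(fd)
    try:
        shutil.copy2(new_gguf_path, tmp)
        os.replace(tmp, live_model_path)  # atomic rename over the old model
    finally:
        if os.path.exists(tmp):  # only if the replace never happened
            os.remove(tmp)
```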
7. RAG Engine Design
Indexing Modes
- `index`: Initial full build of the vector store.
- `sync`: Incremental — only process files modified since last sync.
- `reindex`: Force full rebuild.
- `status`: Show doc count, last sync, unindexed files.
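The four modes map naturally onto argparse subcommands. A minimal CLI skeleton (handler wiring omitted; the `vault-indexer` program name is illustrative):

```python
import argparse

def make_parser():
    """CLI skeleton for the Vault Indexer's four modes."""
    p = argparse.ArgumentParser(prog="vault-indexer")
    sub = p.add_subparsers(dest="command", required=True)
    sub.add_parser("index", help="Initial full build of the vector store")
    sub.add_parser("sync", help="Incremental: only files modified since last sync")
    sub.add_parser("reindex", help="Force a full rebuild")
    sub.add_parser("status", help="Show doc count, last sync, unindexed files")
    return p
```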
Auto-Sync Strategy
- File system watcher: `watchdog` monitors the vault root and triggers incremental sync on any `.md` change.
- Daily full sync: At 3:00 AM, run a full sync to catch any missed events.
- Manual trigger: `POST /index/trigger` from chat or CLI.
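Raw file events tend to arrive in bursts (editors often write a file several times within a second), so a small debounce between the watcher and the sync avoids redundant re-embedding. A stdlib-only sketch of that piece; the class name and the 5-second quiet window are assumptions, and in practice `watchdog`'s event handler would call `note_change()`:

```python
import time

class SyncDebouncer:
    """Coalesce bursts of file events into a single incremental sync.
    A watcher callback calls note_change(); a background loop polls
    maybe_sync(), which fires only after a quiet period with no changes."""

    def __init__(self, quiet_seconds=5.0, sync_fn=None, clock=time.monotonic):
        self.quiet_seconds = quiet_seconds
        self.sync_fn = sync_fn or (lambda: None)
        self.clock = clock  # injectable for testing
        self._last_change = None

    def note_change(self):
        """Record that a vault file changed."""
        self._last_change = self.clock()

    def maybe_sync(self):
        """Run the sync if the quiet window has elapsed; return True if it ran."""
        if self._last_change is None:
            return False
        if self.clock() - self._last_change >= self.quiet_seconds:
            self._last_change = None
            self.sync_fn()
            return True
        return False
```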
Per-Directory Chunking Rules
Different vault directories need different granularity:
"chunking_rules": {
"default": {
"strategy": "sliding_window",
"chunk_size": 500,
"chunk_overlap": 100
},
"Journal/**": {
"strategy": "section",
"section_tags": ["#DayInShort", "#mentalhealth", "#physicalhealth", "#work", "#finance", "#Relations"],
"chunk_size": 300,
"chunk_overlap": 50
},
"zzz-Archive/**": {
"strategy": "sliding_window",
"chunk_size": 800,
"chunk_overlap": 150
}
}
Rationale: Journal entries contain dense emotional/factual tags and benefit from section-based chunking with smaller chunks. Archives are reference material and can be chunked more coarsely.
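Rule resolution can be a simple first-match over the glob keys, with `default` as the fallback. A sketch assuming `fnmatch` semantics, where `*` also matches `/`, so `Journal/**` covers nested files (the function name is illustrative):

```python
from fnmatch import fnmatch

CHUNKING_RULES = {
    "default": {"strategy": "sliding_window", "chunk_size": 500, "chunk_overlap": 100},
    "Journal/**": {"strategy": "section", "chunk_size": 300, "chunk_overlap": 50},
    "zzz-Archive/**": {"strategy": "sliding_window", "chunk_size": 800, "chunk_overlap": 150},
}

def resolve_rule(rel_path, rules=CHUNKING_RULES):
    """Pick the chunking rule for a vault-relative path.
    Returns (rule_applied, rule) so the chunk metadata can record which
    rule was used; falls back to 'default' when no glob matches."""
    for pattern, rule in rules.items():
        if pattern != "default" and fnmatch(rel_path, pattern):
            return pattern, rule
    return "default", rules["default"]
```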
Metadata per Chunk
- `source_file`: Relative path from vault root
- `source_directory`: Top-level directory
- `section`: Section heading (for structured notes)
- `date`: Parsed from filename or frontmatter
- `tags`: All hashtags and wikilinks found in chunk
- `chunk_index`: Position in document
- `modified_at`: File mtime for incremental sync
- `rule_applied`: Which chunking rule was used
Search
- Default top-k: 8 chunks
- Max top-k: 20
- Similarity threshold: 0.75
- Hybrid search: Enabled by default (30% keyword, 70% semantic)
- Filters: date range, tag list, directory glob
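The weighted combination reduces to one score per chunk. A sketch of the ranking step; note it is an assumption here that the 0.75 threshold applies to the combined score rather than the raw semantic similarity:

```python
def hybrid_score(semantic_sim, keyword_score, semantic_weight=0.7, keyword_weight=0.3):
    """Combine semantic and keyword relevance into one score in [0, 1]."""
    return semantic_weight * semantic_sim + keyword_weight * keyword_score

def rank(candidates, top_k=8, threshold=0.75, max_top_k=20):
    """candidates: iterable of (chunk_id, semantic_sim, keyword_score).
    Keeps chunks whose combined score clears the threshold, best first,
    capped at min(top_k, max_top_k)."""
    top_k = min(top_k, max_top_k)
    scored = [(cid, hybrid_score(s, k)) for cid, s, k in candidates]
    scored = [(cid, sc) for cid, sc in scored if sc >= threshold]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:top_k]
```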
8. Configuration Schema
{
"companion": {
"name": "SAN",
"persona": {
"role": "companion",
"tone": "reflective",
"style": "questioning",
"boundaries": [
"does_not_impersonate_user",
"no_future_predictions",
"no_medical_or_legal_advice"
]
},
"memory": {
"session_turns": 20,
"persistent_store": "~/.companion/memory.db",
"summarize_after": 10
},
"chat": {
"streaming": true,
"max_response_tokens": 2048,
"default_temperature": 0.7,
"allow_temperature_override": true
}
},
"vault": {
"path": "/home/san/KnowledgeVault/Default",
"indexing": {
"auto_sync": true,
"auto_sync_interval_minutes": 1440,
"watch_fs_events": true,
"file_patterns": ["*.md"],
"deny_dirs": [".obsidian", ".trash", "zzz-Archive", ".git", ".logseq"],
"deny_patterns": ["*.tmp", "*.bak", "*conflict*", ".*"]
},
"chunking_rules": {
"default": { "strategy": "sliding_window", "chunk_size": 500, "chunk_overlap": 100 },
"Journal/**": { "strategy": "section", "chunk_size": 300, "chunk_overlap": 50 },
"zzz-Archive/**": { "strategy": "sliding_window", "chunk_size": 800, "chunk_overlap": 150 }
}
},
"rag": {
"embedding": {
"provider": "ollama",
"model": "mxbai-embed-large",
"base_url": "http://localhost:11434",
"dimensions": 1024,
"batch_size": 32
},
"vector_store": {
"type": "lancedb",
"path": "~/.companion/vectors.lance"
},
"search": {
"default_top_k": 8,
"max_top_k": 20,
"similarity_threshold": 0.75,
"hybrid_search": { "enabled": true, "keyword_weight": 0.3, "semantic_weight": 0.7 },
"filters": { "date_range_enabled": true, "tag_filter_enabled": true, "directory_filter_enabled": true }
}
},
"model": {
"inference": {
"backend": "llama.cpp",
"model_path": "~/.companion/models/companion-7b-q4.gguf",
"context_length": 8192,
"gpu_layers": 35,
"batch_size": 512,
"threads": 8
},
"fine_tuning": {
"base_model": "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
"output_dir": "~/.companion/training",
"lora_rank": 16,
"lora_alpha": 32,
"learning_rate": 0.0002,
"batch_size": 4,
"gradient_accumulation_steps": 4,
"num_epochs": 3,
"warmup_steps": 100,
"save_steps": 500,
"eval_steps": 250,
"training_data_path": "~/.companion/training_data/",
"validation_split": 0.1
},
"retrain_schedule": {
"auto_reminder": true,
"default_interval_days": 90,
"reminder_channels": ["chat_stream", "log"]
}
},
"api": {
"host": "127.0.0.1",
"port": 7373,
"cors_origins": ["http://localhost:5173"],
"auth": { "enabled": false }
},
"ui": {
"web": {
"enabled": true,
"theme": "obsidian",
"features": { "streaming": true, "citations": true, "source_preview": true }
},
"cli": { "enabled": true, "rich_output": true }
},
"logging": {
"level": "INFO",
"file": "~/.companion/logs/companion.log",
"max_size_mb": 100,
"backup_count": 5
},
"security": {
"local_only": true,
"vault_path_traversal_check": true,
"sensitive_content_detection": true,
"sensitive_patterns": ["#mentalhealth", "#physicalhealth", "#finance", "#Relations"],
"require_confirmation_for_external_apis": true
}
}
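Several path fields in the schema use `~`, which must be expanded before use. A loading sketch keyed to the schema above; the list of path fields is illustrative, not exhaustive:

```python
import json
import os

# Fields in the schema above that hold filesystem paths (illustrative subset).
PATH_FIELDS = [
    ("companion", "memory", "persistent_store"),
    ("rag", "vector_store", "path"),
    ("model", "inference", "model_path"),
    ("logging", "file"),
]

def load_config(path):
    """Load config.json and expand '~' in known path fields."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    for keys in PATH_FIELDS:
        node = cfg
        for k in keys[:-1]:
            node = node.get(k, {})  # walk to the parent object
        if keys[-1] in node:
            node[keys[-1]] = os.path.expanduser(node[keys[-1]])
    return cfg
```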
9. Project Structure
companion/
├── README.md
├── pyproject.toml
├── config.json
├── src/
│ ├── companion/ # FastAPI backend + orchestrator
│ │ ├── main.py
│ │ ├── api/
│ │ │ ├── chat.py
│ │ │ ├── index.py
│ │ │ └── status.py
│ │ ├── core/
│ │ │ ├── orchestrator.py
│ │ │ ├── memory.py
│ │ │ └── prompts.py
│ │ └── config.py
│ ├── rag/ # RAG engine
│ │ ├── indexer.py
│ │ ├── chunker.py
│ │ ├── embedder.py
│ │ ├── vector_store.py
│ │ └── search.py
│ ├── indexer_daemon/ # Vault watcher + indexer CLI
│ │ ├── daemon.py
│ │ ├── cli.py
│ │ └── watcher.py
│ └── forge/ # Fine-tuning pipeline
│ ├── extract.py
│ ├── train.py
│ ├── export.py
│ └── evaluate.py
├── ui/ # React frontend
│ ├── src/
│ │ ├── App.tsx
│ │ ├── components/
│ │ │ ├── Chat.tsx
│ │ │ ├── Message.tsx
│ │ │ └── Settings.tsx
│ │ └── api.ts
│ └── package.json
├── tests/
│ ├── companion/
│ ├── rag/
│ └── forge/
└── docs/
└── superpowers/
└── specs/
└── 2026-04-13-personal-companion-ai-design.md
10. Testing Strategy
| Layer | Tests |
|---|---|
| RAG | Chunking correctness, search relevance, incremental sync accuracy |
| API | Chat streaming, parameter validation, error handling |
| Security | Path traversal, sensitive content detection, local-only enforcement |
| Forge | Training convergence, eval loss trends, output GGUF validity |
| E2E | Full chat turn with RAG retrieval, citation rendering |
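The path-traversal check in the Security row is easiest to test when it lives in a pure helper. A sketch of such a helper (the name `is_within_vault` is an assumption):

```python
import os

def is_within_vault(vault_root, candidate):
    """Return True only if candidate resolves inside the vault root,
    rejecting escapes like 'Journal/../../etc/passwd'. Works on paths
    that need not exist, since realpath only normalizes them."""
    root = os.path.realpath(vault_root)
    target = os.path.realpath(os.path.join(root, candidate))
    return os.path.commonpath([root, target]) == root
```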
11. Risks & Mitigations
| Risk | Mitigation |
|---|---|
| Overfitting on small dataset | LoRA rank 16, strong regularization, 10% validation split, human eval |
| Temporal drift | Quarterly retrain + daily RAG sync |
| Privacy leak in training data | Manual curation of training examples; exclude others' private details |
| Emotional weight / uncanny valley | Persona is companion, not predictor; framed as reflection |
| Hardware limits | QLoRA 4-bit, 8B model, ~35 GPU layers; fallback to CPU offloading if needed |
| Maintenance fatigue | Auto-sync removes daily work; retrain is one script + reminder |
12. Success Criteria
- Chat interface streams responses locally with sub-second first-token latency
- RAG retrieves relevant vault context for >80% of personal questions
- Fine-tuned model produces responses that "feel" recognizably aligned with Santhosh's reflective style
- Quarterly retrain completes successfully on RTX 5070 in <6 hours
- Daily auto-sync and manual trigger both work reliably
- No vault data leaves the local machine
13. Next Steps
- Write implementation plan using the `writing-plans` skill
- Scaffold the repository structure
- Build Vault Indexer + RAG engine first (Week 1-2)
- Integrate chat UI with base model (Week 3)
- Curate training data and begin fine-tuning experiments (Week 4-6)
- Polish, evaluate, and integrate (Week 7-8)