
Personal Companion AI — Design Spec

Date: 2026-04-13
Status: Approved
Author: Santhosh Janardhanan

1. Overview

A fully local, privacy-first AI companion trained on Santhosh's Obsidian vault. The companion is a reflective confidante—not a digital twin or advisor—that can answer questions about past events, summarize relationships, and explore life patterns alongside him.

The system combines a fine-tuned local LLM (for reasoning style and reflective voice) with a RAG layer (for factual retrieval from 677+ vault notes).

2. Core Philosophy

  • Companion, not clone: The AI does not speak as Santhosh. It speaks to him.
  • Fully local: No vault data leaves the machine. Ollama, LanceDB, and inference all run locally.
  • Evolving self: Quarterly model retraining + daily RAG sync keeps the companion aligned with his changing life.
  • Minimal noise: Notifications are quiet (streaming text + logs). No pop-ups.

3. Architecture

Approach: Decoupled Services

Three independent runtime services, plus an on-demand Model Forge for fine-tuning:

┌─────────────────────────────────────────────────────────────┐
│  Companion Chat (Web UI / CLI)                              │
│  React + Vite  ←───→  FastAPI backend (orchestrator)        │
└─────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        ↓                     ↓                     ↓
┌──────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ Fine-tuned   │    │  RAG Engine     │    │  Vault Indexer  │
│ 8B Model     │    │  (LanceDB)      │    │  (daemon/CLI)   │
│ (llama.cpp)  │    │                 │    │                 │
│              │    │  • semantic     │    │  • watches vault│
│  Quarterly   │    │    search       │    │  • chunks/embeds│
│  retrain     │    │  • hybrid       │    │  • daily auto   │
│              │    │    filters      │    │    sync         │
│              │    │  • relationship │    │  • manual       │
│              │    │    graph        │    │    trigger      │
└──────────────┘    └─────────────────┘    └─────────────────┘

Service Responsibilities

| Service | Language | Role |
| --- | --- | --- |
| Companion Engine | Python (FastAPI) + TS (React) | Chat UI, session memory, prompt orchestration |
| RAG Engine | Python | LanceDB queries, embedding cache, hybrid search |
| Vault Indexer | Python (CLI + daemon) | File watching, chunking, embedding via Ollama |
| Model Forge | Python (on-demand) | QLoRA fine-tuning, GGUF export |

4. Data Flow: A Chat Turn

  1. User asks: "I've been feeling off about my friendship with Vinay. What do you think?"
  2. Orchestrator detects this needs relationship context + reflective reasoning.
  3. RAG Engine queries LanceDB for Vinay, friendship, #Relations — returns top 8 chunks.
  4. Prompt construction:
    • System: Companion persona + reasoning instructions
    • Retrieved context: Relevant vault entries
    • Conversation history: Last 20 turns
    • User message
  5. Local LLM streams a reflective response.
  6. Optional: Companion asks a gentle follow-up to deepen reflection.
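
The prompt-construction step above can be sketched as follows; `build_prompt` and its argument names are illustrative, not the actual orchestrator API:

```python
def build_prompt(persona, chunks, history, user_msg, max_turns=20):
    """Assemble the message list sent to the local LLM (step 4 above)."""
    context = "\n\n".join(f"[vault] {c}" for c in chunks)
    messages = [{"role": "system",
                 "content": f"{persona}\n\nRetrieved context:\n{context}"}]
    messages += history[-max_turns:]          # last 20 turns of session memory
    messages.append({"role": "user", "content": user_msg})
    return messages

msgs = build_prompt(
    "You are a thoughtful companion...",
    ["2026-03-02: Vinay dropped by unannounced..."],   # hypothetical retrieved chunk
    [{"role": "user", "content": "hi"},
     {"role": "assistant", "content": "Hello."}],
    "I've been feeling off about my friendship with Vinay. What do you think?",
)
```

Folding retrieved context into the system message keeps the conversation history clean; interleaving context as separate messages is an equally valid layout.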

5. Technology Choices

| Component | Choice | Rationale |
| --- | --- | --- |
| Base model | Meta-Llama-3.1-8B-Instruct | Strong reasoning, fits 12GB VRAM quantized |
| Fine-tuning | Unsloth + QLoRA (4-bit) | Fast, memory-efficient, runs on RTX 5070 |
| Inference | llama.cpp (GGUF) | Mature, fast local inference, easy GPU layer tuning |
| Embedding | mxbai-embed-large via Ollama | 1024-dim, local, high quality |
| Vector store | LanceDB (embedded) | File-based, no server, Rust-backed |
| Backend | FastAPI + WebSockets | Streaming chat, simple Python API |
| Frontend | React + Vite | Lightweight, fast dev loop |
| File watcher | watchdog (Python) | Reliable cross-platform vault monitoring |

6. Fine-Tuning Strategy

What the model learns

  • Reflective reasoning style: How Santhosh thinks through situations
  • Values and priorities: What he tends to weigh in decisions
  • Communication patterns: His tone in journal entries (direct, questioning, humorous)
  • Relationship dynamics: Patterns in how he describes people over time

What stays in RAG

  • Specific dates, events, amounts
  • Exact quotes and conversations
  • Recent updates (between retrainings)
  • Granular facts

Training data format

Curated "reflection examples" from the vault, formatted as conversation turns:

{
  "messages": [
    {"role": "system", "content": "You are a thoughtful companion..."},
    {"role": "user", "content": "Journal entry about Vinay visit... What do you notice?"},
    {"role": "assistant", "content": "It seems like you value these drop-ins..."}
  ]
}
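
A minimal sketch of emitting curated examples in this format as JSONL (one example per line, a layout most chat fine-tuning loaders accept); the pair list and file name here are hypothetical:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical curated pairs: (journal excerpt + question, companion reflection).
PAIRS = [
    ("Journal entry about Vinay visit... What do you notice?",
     "It seems like you value these drop-ins..."),
]
SYSTEM = "You are a thoughtful companion..."

def write_jsonl(pairs, path):
    """Write one chat-format training example per line."""
    with open(path, "w", encoding="utf-8") as f:
        for user, assistant in pairs:
            record = {"messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": user},
                {"role": "assistant", "content": assistant},
            ]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

out = Path(tempfile.mkdtemp()) / "reflections.jsonl"
write_jsonl(PAIRS, out)
first = json.loads(out.read_text(encoding="utf-8").splitlines(), )[0] if False else \
    json.loads(out.read_text(encoding="utf-8").splitlines()[0])
```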

Training schedule

  • Quarterly retrain: Automatic reminder (log + chat stream) every 90 days.
  • Manual trigger: User can initiate retrain anytime via CLI/UI.
  • Pipeline: vault → extract reflections → curate → train → export GGUF → swap model file
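
The quarterly reminder reduces to a date comparison; a sketch assuming the last-retrain date is persisted somewhere (the storage location is not specified here):

```python
from datetime import date

def retrain_due(last_retrain, today, interval_days=90):
    """True once the configured interval (default_interval_days) has elapsed."""
    return (today - last_retrain).days >= interval_days

# 2026-01-01 to 2026-04-01 is exactly 90 days, so a reminder would fire.
```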

7. RAG Engine Design

Indexing Modes

  • index: Initial full build of the vector store.
  • sync: Incremental update; only processes files modified since the last sync.
  • reindex: Force a full rebuild, discarding the existing store.
  • status: Show document count, last sync time, and unindexed files.
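
The core of sync mode is an mtime comparison against the last sync timestamp; a stdlib-only sketch (the real indexer presumably also consults the stored modified_at metadata):

```python
import os
import tempfile
import time
from pathlib import Path

def files_to_sync(vault, last_sync):
    """sync mode: return only .md files modified after the last sync timestamp."""
    return [p for p in Path(vault).rglob("*.md") if p.stat().st_mtime > last_sync]

# Throwaway demo vault: one file older than the sync point, one newer.
vault = Path(tempfile.mkdtemp())
(vault / "old.md").write_text("indexed already")
last_sync = time.time() + 1
(vault / "new.md").write_text("changed since sync")
os.utime(vault / "new.md", (last_sync + 5, last_sync + 5))  # force a newer mtime
changed = files_to_sync(vault, last_sync)
```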

Auto-Sync Strategy

  • File system watcher: watchdog monitors vault root and triggers incremental sync on any .md change.
  • Daily full sync: At 3:00 AM, run a full sync to catch any missed events.
  • Manual trigger: POST /index/trigger from chat or CLI.
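
Whatever raises the event (watchdog callback, daily sync, or manual trigger), a changed path should pass one gate before indexing; a sketch using an illustrative subset of the deny lists from the configuration schema:

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Illustrative subset of deny_dirs / deny_patterns from the configuration schema.
DENY_DIRS = {".obsidian", ".trash", ".git"}
DENY_PATTERNS = ["*.tmp", "*.bak", "*conflict*", ".*"]

def should_index(rel_path):
    """Gate a changed file before it triggers an incremental sync."""
    p = PurePosixPath(rel_path)
    if any(part in DENY_DIRS for part in p.parts[:-1]):
        return False                      # lives under a denied directory
    if any(fnmatch(p.name, pat) for pat in DENY_PATTERNS):
        return False                      # matches a denied filename pattern
    return p.suffix == ".md"              # only markdown is indexed
```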

Per-Directory Chunking Rules

Different vault directories need different granularity:

"chunking_rules": {
  "default": {
    "strategy": "sliding_window",
    "chunk_size": 500,
    "chunk_overlap": 100
  },
  "Journal/**": {
    "strategy": "section",
    "section_tags": ["#DayInShort", "#mentalhealth", "#physicalhealth", "#work", "#finance", "#Relations"],
    "chunk_size": 300,
    "chunk_overlap": 50
  },
  "zzz-Archive/**": {
    "strategy": "sliding_window",
    "chunk_size": 800,
    "chunk_overlap": 150
  }
}

Rationale: Journal entries contain dense emotional/factual tags and benefit from section-based chunking with smaller chunks. Archives are reference material and can be chunked more coarsely.
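
A sketch of resolving which rule applies to a given file; the precedence (first matching non-default glob wins) and the use of fnmatch-style globbing are assumptions, not the specified behavior:

```python
from fnmatch import fnmatch

# Mirrors the chunking_rules block above; "default" is the fallback.
CHUNKING_RULES = {
    "Journal/**": {"strategy": "section", "chunk_size": 300, "chunk_overlap": 50},
    "zzz-Archive/**": {"strategy": "sliding_window", "chunk_size": 800, "chunk_overlap": 150},
    "default": {"strategy": "sliding_window", "chunk_size": 500, "chunk_overlap": 100},
}

def rule_for(rel_path):
    """Pick the first glob matching the vault-relative path, else the default.

    Note: fnmatch's * also matches "/", so "Journal/**" here behaves as
    "anything under Journal/", which is what the rules intend.
    """
    for pattern, rule in CHUNKING_RULES.items():
        if pattern != "default" and fnmatch(rel_path, pattern):
            return rule
    return CHUNKING_RULES["default"]
```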

Metadata per Chunk

  • source_file: Relative path from vault root
  • source_directory: Top-level directory
  • section: Section heading (for structured notes)
  • date: Parsed from filename or frontmatter
  • tags: All hashtags and wikilinks found in chunk
  • chunk_index: Position in document
  • modified_at: File mtime for incremental sync
  • rule_applied: Which chunking rule was used

Search Defaults

  • Default top-k: 8 chunks
  • Max top-k: 20
  • Similarity threshold: 0.75
  • Hybrid search: enabled by default (30% keyword, 70% semantic)
  • Filters: date range, tag list, directory glob
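
One way these defaults combine at query time, sketched with made-up scores; applying the 0.75 threshold to the semantic score alone is an assumption, and LanceDB's actual hybrid mode may blend differently:

```python
def rank(candidates, kw_weight=0.3, sem_weight=0.7, threshold=0.75, top_k=8):
    """Blend normalized keyword and semantic scores, drop weak matches, keep top-k."""
    kept = [c for c in candidates if c["semantic"] >= threshold]
    kept.sort(key=lambda c: kw_weight * c["keyword"] + sem_weight * c["semantic"],
              reverse=True)
    return kept[:top_k]

hits = rank([
    {"id": "a", "keyword": 0.9, "semantic": 0.80},   # strong on both signals
    {"id": "b", "keyword": 0.1, "semantic": 0.95},   # semantic-only match
    {"id": "c", "keyword": 0.9, "semantic": 0.50},   # below threshold, dropped
])
```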

8. Configuration Schema

{
  "companion": {
    "name": "SAN",
    "persona": {
      "role": "companion",
      "tone": "reflective",
      "style": "questioning",
      "boundaries": [
        "does_not_impersonate_user",
        "no_future_predictions",
        "no_medical_or_legal_advice"
      ]
    },
    "memory": {
      "session_turns": 20,
      "persistent_store": "~/.companion/memory.db",
      "summarize_after": 10
    },
    "chat": {
      "streaming": true,
      "max_response_tokens": 2048,
      "default_temperature": 0.7,
      "allow_temperature_override": true
    }
  },
  "vault": {
    "path": "/home/san/KnowledgeVault/Default",
    "indexing": {
      "auto_sync": true,
      "auto_sync_interval_minutes": 1440,
      "watch_fs_events": true,
      "file_patterns": ["*.md"],
      "deny_dirs": [".obsidian", ".trash", "zzz-Archive", ".git", ".logseq"],
      "deny_patterns": ["*.tmp", "*.bak", "*conflict*", ".*"]
    },
    "chunking_rules": {
      "default": { "strategy": "sliding_window", "chunk_size": 500, "chunk_overlap": 100 },
      "Journal/**": { "strategy": "section", "chunk_size": 300, "chunk_overlap": 50 },
      "zzz-Archive/**": { "strategy": "sliding_window", "chunk_size": 800, "chunk_overlap": 150 }
    }
  },
  "rag": {
    "embedding": {
      "provider": "ollama",
      "model": "mxbai-embed-large",
      "base_url": "http://localhost:11434",
      "dimensions": 1024,
      "batch_size": 32
    },
    "vector_store": {
      "type": "lancedb",
      "path": "~/.companion/vectors.lance"
    },
    "search": {
      "default_top_k": 8,
      "max_top_k": 20,
      "similarity_threshold": 0.75,
      "hybrid_search": { "enabled": true, "keyword_weight": 0.3, "semantic_weight": 0.7 },
      "filters": { "date_range_enabled": true, "tag_filter_enabled": true, "directory_filter_enabled": true }
    }
  },
  "model": {
    "inference": {
      "backend": "llama.cpp",
      "model_path": "~/.companion/models/companion-7b-q4.gguf",
      "context_length": 8192,
      "gpu_layers": 35,
      "batch_size": 512,
      "threads": 8
    },
    "fine_tuning": {
      "base_model": "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
      "output_dir": "~/.companion/training",
      "lora_rank": 16,
      "lora_alpha": 32,
      "learning_rate": 0.0002,
      "batch_size": 4,
      "gradient_accumulation_steps": 4,
      "num_epochs": 3,
      "warmup_steps": 100,
      "save_steps": 500,
      "eval_steps": 250,
      "training_data_path": "~/.companion/training_data/",
      "validation_split": 0.1
    },
    "retrain_schedule": {
      "auto_reminder": true,
      "default_interval_days": 90,
      "reminder_channels": ["chat_stream", "log"]
    }
  },
  "api": {
    "host": "127.0.0.1",
    "port": 7373,
    "cors_origins": ["http://localhost:5173"],
    "auth": { "enabled": false }
  },
  "ui": {
    "web": {
      "enabled": true,
      "theme": "obsidian",
      "features": { "streaming": true, "citations": true, "source_preview": true }
    },
    "cli": { "enabled": true, "rich_output": true }
  },
  "logging": {
    "level": "INFO",
    "file": "~/.companion/logs/companion.log",
    "max_size_mb": 100,
    "backup_count": 5
  },
  "security": {
    "local_only": true,
    "vault_path_traversal_check": true,
    "sensitive_content_detection": true,
    "sensitive_patterns": ["#mentalhealth", "#physicalhealth", "#finance", "#Relations"],
    "require_confirmation_for_external_apis": true
  }
}
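
Sensitive-content detection can start as a plain tag match against sensitive_patterns; a sketch (the real implementation might instead inspect retrieved chunks' tag metadata, and the detected flag would feed require_confirmation_for_external_apis):

```python
SENSITIVE_PATTERNS = ["#mentalhealth", "#physicalhealth", "#finance", "#Relations"]

def contains_sensitive(text):
    """Flag text carrying any configured sensitive tag."""
    return any(tag in text for tag in SENSITIVE_PATTERNS)
```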

9. Project Structure

companion/
├── README.md
├── pyproject.toml
├── config.json
├── src/
│   ├── companion/              # FastAPI backend + orchestrator
│   │   ├── main.py
│   │   ├── api/
│   │   │   ├── chat.py
│   │   │   ├── index.py
│   │   │   └── status.py
│   │   ├── core/
│   │   │   ├── orchestrator.py
│   │   │   ├── memory.py
│   │   │   └── prompts.py
│   │   └── config.py
│   ├── rag/                    # RAG engine
│   │   ├── indexer.py
│   │   ├── chunker.py
│   │   ├── embedder.py
│   │   ├── vector_store.py
│   │   └── search.py
│   ├── indexer_daemon/         # Vault watcher + indexer CLI
│   │   ├── daemon.py
│   │   ├── cli.py
│   │   └── watcher.py
│   └── forge/                  # Fine-tuning pipeline
│       ├── extract.py
│       ├── train.py
│       ├── export.py
│       └── evaluate.py
├── ui/                         # React frontend
│   ├── src/
│   │   ├── App.tsx
│   │   ├── components/
│   │   │   ├── Chat.tsx
│   │   │   ├── Message.tsx
│   │   │   └── Settings.tsx
│   │   └── api.ts
│   └── package.json
├── tests/
│   ├── companion/
│   ├── rag/
│   └── forge/
└── docs/
    └── superpowers/
        └── specs/
            └── 2026-04-13-personal-companion-ai-design.md

10. Testing Strategy

| Layer | Tests |
| --- | --- |
| RAG | Chunking correctness, search relevance, incremental sync accuracy |
| API | Chat streaming, parameter validation, error handling |
| Security | Path traversal, sensitive content detection, local-only enforcement |
| Forge | Training convergence, eval loss trends, output GGUF validity |
| E2E | Full chat turn with RAG retrieval, citation rendering |
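
As one concrete instance of "chunking correctness", a sliding-window reference implementation with the properties a test should pin down (units are tokens here; the real chunker's API and unit are not specified above):

```python
def sliding_window(tokens, size, overlap):
    """Reference chunker: fixed-size windows that advance by size - overlap."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

chunks = sliding_window(list("abcdefghij"), size=4, overlap=1)
assert chunks[0] == list("abcd")   # first window is full-size
assert chunks[1][0] == "d"         # adjacent windows share the overlap token
assert chunks[-1][-1] == "j"       # no trailing tokens are dropped
```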

11. Risks & Mitigations

| Risk | Mitigation |
| --- | --- |
| Overfitting on small dataset | LoRA rank 16, strong regularization, 10% validation split, human eval |
| Temporal drift | Quarterly retrain + daily RAG sync |
| Privacy leak in training data | Manual curation of training examples; exclude others' private details |
| Emotional weight / uncanny valley | Persona is companion, not predictor; framed as reflection |
| Hardware limits | QLoRA 4-bit, 8B model, ~35 GPU layers; fallback to CPU offloading if needed |
| Maintenance fatigue | Auto-sync removes daily work; retrain is one script + reminder |

12. Success Criteria

  • Chat interface streams responses locally with sub-second first-token latency
  • RAG retrieves relevant vault context for >80% of personal questions
  • Fine-tuned model produces responses that "feel" recognizably aligned with Santhosh's reflective style
  • Quarterly retrain completes successfully on RTX 5070 in <6 hours
  • Daily auto-sync and manual trigger both work reliably
  • No vault data leaves the local machine

13. Next Steps

  1. Write implementation plan using writing-plans skill
  2. Scaffold the repository structure
  3. Build Vault Indexer + RAG engine first (Week 1-2)
  4. Integrate chat UI with base model (Week 3)
  5. Curate training data and begin fine-tuning experiments (Week 4-6)
  6. Polish, evaluate, and integrate (Week 7-8)