Personal Companion AI — Design Spec
Date: 2026-04-13
Status: Approved
Author: Santhosh Janardhanan
1. Overview
A fully local, privacy-first AI companion trained on Santhosh's Obsidian vault. The companion is a reflective confidante—not a digital twin or advisor—that can answer questions about past events, summarize relationships, and explore life patterns alongside him.
The system combines a fine-tuned local LLM (for reasoning style and reflective voice) with a RAG layer (for factual retrieval from 677+ vault notes).
2. Core Philosophy
- Companion, not clone: The AI does not speak as Santhosh. It speaks to him.
- Fully local: No vault data leaves the machine. Ollama, LanceDB, and inference all run locally.
- Evolving self: Quarterly model retraining + daily RAG sync keeps the companion aligned with his changing life.
- Minimal noise: Notifications are quiet (streaming text + logs). No pop-ups.
3. Architecture
Approach: Decoupled Services
Three independent processes:
┌─────────────────────────────────────────────────────────────┐
│             Companion Chat (Web UI / CLI)                   │
│       React + Vite ←───→ FastAPI backend (orchestrator)     │
└─────────────────────────────────────────────────────────────┘
                           │
      ┌────────────────────┼────────────────────┐
      ↓                    ↓                    ↓
┌──────────────┐  ┌─────────────────┐  ┌─────────────────┐
│  Fine-tuned  │  │   RAG Engine    │  │  Vault Indexer  │
│   8B Model   │  │   (LanceDB)     │  │  (daemon/CLI)   │
│ (llama.cpp)  │  │                 │  │                 │
│              │  │ • semantic      │  │ • watches vault │
│  Quarterly   │  │   search        │  │ • chunks/embeds │
│  retrain     │  │ • hybrid        │  │ • daily auto    │
│              │  │   filters       │  │   sync          │
│              │  │ • relationship  │  │ • manual        │
│              │  │   graph         │  │   trigger       │
└──────────────┘  └─────────────────┘  └─────────────────┘
Service Responsibilities
| Service | Language | Role |
|---|---|---|
| Companion Engine | Python (FastAPI) + TS (React) | Chat UI, session memory, prompt orchestration |
| RAG Engine | Python | LanceDB queries, embedding cache, hybrid search |
| Vault Indexer | Python (CLI + daemon) | File watching, chunking, embedding via Ollama |
| Model Forge | Python (on-demand) | QLoRA fine-tuning, GGUF export |
4. Data Flow: A Chat Turn
- User asks: "I've been feeling off about my friendship with Vinay. What do you think?"
- Orchestrator detects this needs relationship context + reflective reasoning.
- RAG Engine queries LanceDB for `Vinay`, `friendship`, `#Relations` — returns top 8 chunks.
- Prompt construction:
- System: Companion persona + reasoning instructions
- Retrieved context: Relevant vault entries
- Conversation history: Last 20 turns
- User message
- Local LLM streams a reflective response.
- Optional: Companion asks a gentle follow-up to deepen reflection.
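The prompt-construction step above can be sketched as a single assembly function. This is an illustrative sketch only: `build_prompt`, the chunk dict shape (`source_file`/`text`), and the context formatting are assumptions, not the actual orchestrator API.

```python
# Minimal sketch of prompt assembly for one chat turn.
# build_prompt and the chunk dict shape are illustrative, not the real API.

SYSTEM_PROMPT = "You are a thoughtful companion..."  # persona + reasoning instructions

def build_prompt(user_message, retrieved_chunks, history, max_history_turns=20):
    """Assemble the message list sent to the local LLM: persona, retrieved
    vault context, the last N conversation turns, then the user message."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    if retrieved_chunks:
        context = "\n\n".join(
            f"[{c['source_file']}] {c['text']}" for c in retrieved_chunks
        )
        messages.append(
            {"role": "system", "content": f"Relevant vault entries:\n{context}"}
        )
    messages.extend(history[-max_history_turns:])  # session memory window (20 turns)
    messages.append({"role": "user", "content": user_message})
    return messages
```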
5. Technology Choices
| Component | Choice | Rationale |
|---|---|---|
| Base model | Meta-Llama-3.1-8B-Instruct | Strong reasoning, fits 12GB VRAM quantized |
| Fine-tuning | Unsloth + QLoRA (4-bit) | Fast, memory-efficient, runs on RTX 5070 |
| Inference | llama.cpp (GGUF) | Mature, fast local inference, easy GPU layer tuning |
| Embedding | mxbai-embed-large via Ollama | 1024-dim, local, high quality |
| Vector store | LanceDB (embedded) | File-based, no server, Rust-backed |
| Backend | FastAPI + WebSockets | Streaming chat, simple Python API |
| Frontend | React + Vite | Lightweight, fast dev loop |
| File watcher | watchdog (Python) | Reliable cross-platform vault monitoring |
6. Fine-Tuning Strategy
What the model learns
- Reflective reasoning style: How Santhosh thinks through situations
- Values and priorities: What he tends to weigh in decisions
- Communication patterns: His tone in journal entries (direct, questioning, humorous)
- Relationship dynamics: Patterns in how he describes people over time
What stays in RAG
- Specific dates, events, amounts
- Exact quotes and conversations
- Recent updates (between retrainings)
- Granular facts
Training data format
Curated "reflection examples" from the vault, formatted as conversation turns:
{
"messages": [
{"role": "system", "content": "You are a thoughtful companion..."},
{"role": "user", "content": "Journal entry about Vinay visit... What do you notice?"},
{"role": "assistant", "content": "It seems like you value these drop-ins..."}
]
}
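Curated examples in this shape are typically stored as JSONL, one conversation per line, which is the common input format for chat fine-tuning. A small sketch (helper names and the file layout are assumptions, not prescribed by the spec):

```python
import json

def write_training_jsonl(examples, path):
    """Write curated reflection examples as JSONL: one {"messages": [...]}
    object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps({"messages": ex["messages"]}, ensure_ascii=False) + "\n")

def read_training_jsonl(path):
    """Read the JSONL back into a list of example dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```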
Training schedule
- Quarterly retrain: Automatic reminder (log + chat stream) every 90 days.
- Manual trigger: User can initiate retrain anytime via CLI/UI.
- Pipeline:
vault → extract reflections → curate → train → export GGUF → swap model file
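The final "swap model file" step benefits from being atomic, so the inference backend never reads a half-written GGUF. One way to do it, sketched with stdlib only (function name and paths are hypothetical):

```python
import os
import shutil
import tempfile

def swap_model_file(new_gguf_path, live_model_path):
    """Replace the live GGUF atomically: copy the freshly exported model into
    the live model's directory, then os.replace(), which is atomic on POSIX
    when source and destination are on the same filesystem."""
    live_dir = os.path.dirname(live_model_path) or "."
    fd, tmp = tempfile.mkstemp(dir=live_dir, suffix=".gguf.partial")
    os.close(fd)
    try:
        shutil.copy2(new_gguf_path, tmp)
        os.replace(tmp, live_model_path)  # atomic rename over the old model
    finally:
        if os.path.exists(tmp):  # only if the replace never happened
            os.remove(tmp)
```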
7. RAG Engine Design
Indexing Modes
- `index`: Initial full build of the vector store.
- `sync`: Incremental — only process files modified since last sync.
- `reindex`: Force full rebuild.
- `status`: Show doc count, last sync, unindexed files.
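The four modes map naturally onto argparse subcommands. A minimal CLI skeleton (handler wiring omitted; the `vault-indexer` program name is illustrative):

```python
import argparse

def make_parser():
    """CLI skeleton for the Vault Indexer's four modes."""
    p = argparse.ArgumentParser(prog="vault-indexer")
    sub = p.add_subparsers(dest="command", required=True)
    sub.add_parser("index", help="Initial full build of the vector store")
    sub.add_parser("sync", help="Incremental: only files modified since last sync")
    sub.add_parser("reindex", help="Force a full rebuild")
    sub.add_parser("status", help="Show doc count, last sync, unindexed files")
    return p
```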
Auto-Sync Strategy
- File system watcher: `watchdog` monitors the vault root and triggers incremental sync on any `.md` change.
- Daily full sync: At 3:00 AM, run a full sync to catch any missed events.
- Manual trigger: `POST /index/trigger` from chat or CLI.
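Raw file events tend to arrive in bursts (editors often write a file several times within a second), so a small debounce between the watcher and the sync avoids redundant re-embedding. A stdlib-only sketch of that piece; the class name and the 5-second quiet window are assumptions, and in practice `watchdog`'s event handler would call `note_change()`:

```python
import time

class SyncDebouncer:
    """Coalesce bursts of file events into a single incremental sync.
    A watcher callback calls note_change(); a background loop polls
    maybe_sync(), which fires only after a quiet period with no changes."""

    def __init__(self, quiet_seconds=5.0, sync_fn=None, clock=time.monotonic):
        self.quiet_seconds = quiet_seconds
        self.sync_fn = sync_fn or (lambda: None)
        self.clock = clock  # injectable for testing
        self._last_change = None

    def note_change(self):
        """Record that a vault file changed."""
        self._last_change = self.clock()

    def maybe_sync(self):
        """Run the sync if the quiet window has elapsed; return True if it ran."""
        if self._last_change is None:
            return False
        if self.clock() - self._last_change >= self.quiet_seconds:
            self._last_change = None
            self.sync_fn()
            return True
        return False
```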
Per-Directory Chunking Rules
Different vault directories need different granularity:
"chunking_rules": {
"default": {
"strategy": "sliding_window",
"chunk_size": 500,
"chunk_overlap": 100
},
"Journal/**": {
"strategy": "section",
"section_tags": ["#DayInShort", "#mentalhealth", "#physicalhealth", "#work", "#finance", "#Relations"],
"chunk_size": 300,
"chunk_overlap": 50
},
"zzz-Archive/**": {
"strategy": "sliding_window",
"chunk_size": 800,
"chunk_overlap": 150
}
}
Rationale: Journal entries contain dense emotional/factual tags and benefit from section-based chunking with smaller chunks. Archives are reference material and can be chunked more coarsely.
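Rule resolution can be a simple first-match over the glob keys, with `default` as the fallback. A sketch assuming `fnmatch` semantics, where `*` also matches `/`, so `Journal/**` covers nested files (the function name is illustrative):

```python
from fnmatch import fnmatch

CHUNKING_RULES = {
    "default": {"strategy": "sliding_window", "chunk_size": 500, "chunk_overlap": 100},
    "Journal/**": {"strategy": "section", "chunk_size": 300, "chunk_overlap": 50},
    "zzz-Archive/**": {"strategy": "sliding_window", "chunk_size": 800, "chunk_overlap": 150},
}

def resolve_rule(rel_path, rules=CHUNKING_RULES):
    """Pick the chunking rule for a vault-relative path.
    Returns (rule_applied, rule) so the chunk metadata can record which
    rule was used; falls back to 'default' when no glob matches."""
    for pattern, rule in rules.items():
        if pattern != "default" and fnmatch(rel_path, pattern):
            return pattern, rule
    return "default", rules["default"]
```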
Metadata per Chunk
- `source_file`: Relative path from vault root
- `source_directory`: Top-level directory
- `section`: Section heading (for structured notes)
- `date`: Parsed from filename or frontmatter
- `tags`: All hashtags and wikilinks found in chunk
- `chunk_index`: Position in document
- `modified_at`: File mtime for incremental sync
- `rule_applied`: Which chunking rule was used
Search
- Default top-k: 8 chunks
- Max top-k: 20
- Similarity threshold: 0.75
- Hybrid search: Enabled by default (30% keyword, 70% semantic)
- Filters: date range, tag list, directory glob
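The weighted combination reduces to one score per chunk. A sketch of the ranking step; note it is an assumption here that the 0.75 threshold applies to the combined score rather than the raw semantic similarity:

```python
def hybrid_score(semantic_sim, keyword_score, semantic_weight=0.7, keyword_weight=0.3):
    """Combine semantic and keyword relevance into one score in [0, 1]."""
    return semantic_weight * semantic_sim + keyword_weight * keyword_score

def rank(candidates, top_k=8, threshold=0.75, max_top_k=20):
    """candidates: iterable of (chunk_id, semantic_sim, keyword_score).
    Keeps chunks whose combined score clears the threshold, best first,
    capped at min(top_k, max_top_k)."""
    top_k = min(top_k, max_top_k)
    scored = [(cid, hybrid_score(s, k)) for cid, s, k in candidates]
    scored = [(cid, sc) for cid, sc in scored if sc >= threshold]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:top_k]
```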
8. Configuration Schema
{
"companion": {
"name": "SAN",
"persona": {
"role": "companion",
"tone": "reflective",
"style": "questioning",
"boundaries": [
"does_not_impersonate_user",
"no_future_predictions",
"no_medical_or_legal_advice"
]
},
"memory": {
"session_turns": 20,
"persistent_store": "~/.companion/memory.db",
"summarize_after": 10
},
"chat": {
"streaming": true,
"max_response_tokens": 2048,
"default_temperature": 0.7,
"allow_temperature_override": true
}
},
"vault": {
"path": "/home/san/KnowledgeVault/Default",
"indexing": {
"auto_sync": true,
"auto_sync_interval_minutes": 1440,
"watch_fs_events": true,
"file_patterns": ["*.md"],
"deny_dirs": [".obsidian", ".trash", "zzz-Archive", ".git", ".logseq"],
"deny_patterns": ["*.tmp", "*.bak", "*conflict*", ".*"]
},
"chunking_rules": {
"default": { "strategy": "sliding_window", "chunk_size": 500, "chunk_overlap": 100 },
"Journal/**": { "strategy": "section", "chunk_size": 300, "chunk_overlap": 50 },
"zzz-Archive/**": { "strategy": "sliding_window", "chunk_size": 800, "chunk_overlap": 150 }
}
},
"rag": {
"embedding": {
"provider": "ollama",
"model": "mxbai-embed-large",
"base_url": "http://localhost:11434",
"dimensions": 1024,
"batch_size": 32
},
"vector_store": {
"type": "lancedb",
"path": "~/.companion/vectors.lance"
},
"search": {
"default_top_k": 8,
"max_top_k": 20,
"similarity_threshold": 0.75,
"hybrid_search": { "enabled": true, "keyword_weight": 0.3, "semantic_weight": 0.7 },
"filters": { "date_range_enabled": true, "tag_filter_enabled": true, "directory_filter_enabled": true }
}
},
"model": {
"inference": {
"backend": "llama.cpp",
"model_path": "~/.companion/models/companion-7b-q4.gguf",
"context_length": 8192,
"gpu_layers": 35,
"batch_size": 512,
"threads": 8
},
"fine_tuning": {
"base_model": "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
"output_dir": "~/.companion/training",
"lora_rank": 16,
"lora_alpha": 32,
"learning_rate": 0.0002,
"batch_size": 4,
"gradient_accumulation_steps": 4,
"num_epochs": 3,
"warmup_steps": 100,
"save_steps": 500,
"eval_steps": 250,
"training_data_path": "~/.companion/training_data/",
"validation_split": 0.1
},
"retrain_schedule": {
"auto_reminder": true,
"default_interval_days": 90,
"reminder_channels": ["chat_stream", "log"]
}
},
"api": {
"host": "127.0.0.1",
"port": 7373,
"cors_origins": ["http://localhost:5173"],
"auth": { "enabled": false }
},
"ui": {
"web": {
"enabled": true,
"theme": "obsidian",
"features": { "streaming": true, "citations": true, "source_preview": true }
},
"cli": { "enabled": true, "rich_output": true }
},
"logging": {
"level": "INFO",
"file": "~/.companion/logs/companion.log",
"max_size_mb": 100,
"backup_count": 5
},
"security": {
"local_only": true,
"vault_path_traversal_check": true,
"sensitive_content_detection": true,
"sensitive_patterns": ["#mentalhealth", "#physicalhealth", "#finance", "#Relations"],
"require_confirmation_for_external_apis": true
}
}
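Several path fields in the schema use `~`, which must be expanded before use. A loading sketch keyed to the schema above; the list of path fields is illustrative, not exhaustive:

```python
import json
import os

# Fields in the schema above that hold filesystem paths (illustrative subset).
PATH_FIELDS = [
    ("companion", "memory", "persistent_store"),
    ("rag", "vector_store", "path"),
    ("model", "inference", "model_path"),
    ("logging", "file"),
]

def load_config(path):
    """Load config.json and expand '~' in known path fields."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    for keys in PATH_FIELDS:
        node = cfg
        for k in keys[:-1]:
            node = node.get(k, {})  # walk to the parent object
        if keys[-1] in node:
            node[keys[-1]] = os.path.expanduser(node[keys[-1]])
    return cfg
```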
9. Project Structure
companion/
├── README.md
├── pyproject.toml
├── config.json
├── src/
│ ├── companion/ # FastAPI backend + orchestrator
│ │ ├── main.py
│ │ ├── api/
│ │ │ ├── chat.py
│ │ │ ├── index.py
│ │ │ └── status.py
│ │ ├── core/
│ │ │ ├── orchestrator.py
│ │ │ ├── memory.py
│ │ │ └── prompts.py
│ │ └── config.py
│ ├── rag/ # RAG engine
│ │ ├── indexer.py
│ │ ├── chunker.py
│ │ ├── embedder.py
│ │ ├── vector_store.py
│ │ └── search.py
│ ├── indexer_daemon/ # Vault watcher + indexer CLI
│ │ ├── daemon.py
│ │ ├── cli.py
│ │ └── watcher.py
│ └── forge/ # Fine-tuning pipeline
│ ├── extract.py
│ ├── train.py
│ ├── export.py
│ └── evaluate.py
├── ui/ # React frontend
│ ├── src/
│ │ ├── App.tsx
│ │ ├── components/
│ │ │ ├── Chat.tsx
│ │ │ ├── Message.tsx
│ │ │ └── Settings.tsx
│ │ └── api.ts
│ └── package.json
├── tests/
│ ├── companion/
│ ├── rag/
│ └── forge/
└── docs/
└── superpowers/
└── specs/
└── 2026-04-13-personal-companion-ai-design.md
10. Testing Strategy
| Layer | Tests |
|---|---|
| RAG | Chunking correctness, search relevance, incremental sync accuracy |
| API | Chat streaming, parameter validation, error handling |
| Security | Path traversal, sensitive content detection, local-only enforcement |
| Forge | Training convergence, eval loss trends, output GGUF validity |
| E2E | Full chat turn with RAG retrieval, citation rendering |
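The path-traversal check in the Security row is easiest to test when it lives in a pure helper. A sketch of such a helper (the name `is_within_vault` is an assumption):

```python
import os

def is_within_vault(vault_root, candidate):
    """Return True only if candidate resolves inside the vault root,
    rejecting escapes like 'Journal/../../etc/passwd'. Works on paths
    that need not exist, since realpath only normalizes them."""
    root = os.path.realpath(vault_root)
    target = os.path.realpath(os.path.join(root, candidate))
    return os.path.commonpath([root, target]) == root
```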
11. Risks & Mitigations
| Risk | Mitigation |
|---|---|
| Overfitting on small dataset | LoRA rank 16, strong regularization, 10% validation split, human eval |
| Temporal drift | Quarterly retrain + daily RAG sync |
| Privacy leak in training data | Manual curation of training examples; exclude others' private details |
| Emotional weight / uncanny valley | Persona is companion, not predictor; framed as reflection |
| Hardware limits | QLoRA 4-bit, 8B model, ~35 GPU layers; fallback to CPU offloading if needed |
| Maintenance fatigue | Auto-sync removes daily work; retrain is one script + reminder |
12. Success Criteria
- Chat interface streams responses locally with sub-second first-token latency
- RAG retrieves relevant vault context for >80% of personal questions
- Fine-tuned model produces responses that "feel" recognizably aligned with Santhosh's reflective style
- Quarterly retrain completes successfully on RTX 5070 in <6 hours
- Daily auto-sync and manual trigger both work reliably
- No vault data leaves the local machine
13. Next Steps
- Write implementation plan using the `writing-plans` skill
- Scaffold the repository structure
- Build Vault Indexer + RAG engine first (Week 1-2)
- Integrate chat UI with base model (Week 3)
- Curate training data and begin fine-tuning experiments (Week 4-6)
- Polish, evaluate, and integrate (Week 7-8)