# RAG Module Documentation

The RAG (Retrieval-Augmented Generation) module provides semantic search over your Obsidian vault. It handles document chunking, embedding generation, and vector similarity search.
## Architecture

```
Vault Markdown Files
         ↓
┌─────────────────┐
│     Chunker     │ - Split by strategy (sliding window / section)
│  (chunker.py)   │ - Extract metadata (tags, dates, sections)
└────────┬────────┘
         ↓
┌─────────────────┐
│    Embedder     │ - HTTP client for Ollama API
│  (embedder.py)  │ - Batch processing with retries
└────────┬────────┘
         ↓
┌─────────────────┐
│  Vector Store   │ - LanceDB persistence
│(vector_store.py)│ - Upsert, delete, search
└────────┬────────┘
         ↓
┌─────────────────┐
│     Indexer     │ - Full/incremental sync
│  (indexer.py)   │ - File watching
└─────────────────┘
```
## Components

### Chunker (`companion.rag.chunker`)

Splits markdown files into searchable chunks.
```python
from pathlib import Path

from companion.rag.chunker import chunk_file, ChunkingRule

rules = {
    "default": ChunkingRule(strategy="sliding_window", chunk_size=500, chunk_overlap=100),
    "Journal/**": ChunkingRule(strategy="section", section_tags=["#DayInShort"], chunk_size=300, chunk_overlap=50),
}

chunks = chunk_file(
    file_path=Path("journal/2026-04-12.md"),
    vault_root=Path("~/vault").expanduser(),  # Path() does not expand "~" on its own
    rules=rules,
    modified_at=1234567890.0,
)

for chunk in chunks:
    print(f"{chunk.source_file}:{chunk.chunk_index}")
    print(f"Text: {chunk.text[:100]}...")
    print(f"Tags: {chunk.tags}")
    print(f"Date: {chunk.date}")
```
#### Chunking Strategies

**Sliding Window**
- Fixed-size chunks with overlap
- Best for: longform text, articles
```python
ChunkingRule(
    strategy="sliding_window",
    chunk_size=500,     # words per chunk
    chunk_overlap=100,  # words of overlap between adjacent chunks
)
```
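To make the mechanics concrete, here is a minimal sketch of a sliding window over words. The function name and signature are illustrative, not the module's actual internals; the real chunker also attaches metadata.

```python
# Illustrative sketch: fixed-size word windows that advance by
# (chunk_size - chunk_overlap) words per step, so consecutive chunks
# share chunk_overlap words.
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    words = text.split()
    step = chunk_size - chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already reached the end of the text
    return chunks
```

With `chunk_size=4, chunk_overlap=2`, a 10-word text yields four chunks, each sharing its last two words with the next chunk's first two.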
**Section-Based**
- Split on section headers (tags)
- Best for: structured journals, daily notes
```python
ChunkingRule(
    strategy="section",
    section_tags=["#DayInShort", "#mentalhealth", "#work"],
    chunk_size=300,
    chunk_overlap=50,
)
```
#### Metadata Extraction

Each chunk includes:

- `source_file` - Relative path from the vault root
- `source_directory` - Top-level directory
- `section` - Section header (for the section strategy)
- `date` - Date parsed from the filename
- `tags` - Hashtags and wikilinks
- `chunk_index` - Position in the document
- `modified_at` - File mtime, used for sync
### Embedder (`companion.rag.embedder`)

Generates embeddings via the Ollama API.
```python
from companion.rag.embedder import OllamaEmbedder

embedder = OllamaEmbedder(
    base_url="http://localhost:11434",
    model="mxbai-embed-large",
    batch_size=32,
)

# Single text
embeddings = embedder.embed(["Hello world"])
print(len(embeddings[0]))  # 1024 dimensions

# Many texts (split into batches of `batch_size` automatically)
texts = [f"text {i}" for i in range(100)]
embeddings = embedder.embed(texts)
```
#### Features

- **Batching**: Automatically splits large requests
- **Retries**: Exponential backoff on failures
- **Context Manager**: Proper resource cleanup
```python
with OllamaEmbedder(...) as embedder:
    embeddings = embedder.embed(texts)
```
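The "Retries" feature can be pictured as a generic retry-with-exponential-backoff wrapper. This is a sketch of the pattern, not the embedder's actual code; the attempt count and delays are assumptions.

```python
import time

# Illustrative sketch of exponential backoff: retry a callable, doubling
# the delay after each failure, and re-raise once attempts are exhausted.
def with_retries(fn, max_attempts: int = 3, base_delay: float = 0.5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```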
### Vector Store (`companion.rag.vector_store`)

LanceDB wrapper for vector storage.
```python
from companion.rag.vector_store import VectorStore

store = VectorStore(
    uri="~/.companion/vectors.lance",
    dimensions=1024,
)

# Upsert chunks
store.upsert(
    ids=["file.md::0", "file.md::1"],
    texts=["chunk 1", "chunk 2"],
    embeddings=[[0.1, ...], [0.2, ...]],
    metadatas=[
        {"source_file": "file.md", "source_directory": "docs"},
        {"source_file": "file.md", "source_directory": "docs"},
    ],
)

# Search
results = store.search(
    query_vector=[0.1, ...],
    top_k=8,
    filters={"source_directory": "Journal"},
)
```
#### Schema

| Field | Type | Nullable |
|-------|------|----------|
| id | string | No |
| text | string | No |
| vector | list[float32] | No |
| source_file | string | No |
| source_directory | string | No |
| section | string | Yes |
| date | string | Yes |
| tags | list[string] | Yes |
| chunk_index | int32 | No |
| total_chunks | int32 | No |
| modified_at | float64 | Yes |
| rule_applied | string | No |
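The nullability column can be restated as a small check you could run on a record before upserting it. This is an illustrative sketch derived from the table above, not part of the module's API:

```python
# Nullability rules from the schema table: section, date, tags, and
# modified_at may be None; everything else must be present.
NULLABLE_FIELDS = {"section", "date", "tags", "modified_at"}
REQUIRED_FIELDS = {
    "id", "text", "vector", "source_file", "source_directory",
    "chunk_index", "total_chunks", "rule_applied",
}

def missing_required(record: dict) -> set[str]:
    """Return the required fields that are absent or None."""
    return {f for f in REQUIRED_FIELDS if record.get(f) is None}
```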
### Indexer (`companion.rag.indexer`)

Orchestrates vault indexing.
```python
from companion.config import load_config
from companion.rag.indexer import Indexer
from companion.rag.vector_store import VectorStore

config = load_config()
store = VectorStore(
    uri=config.rag.vector_store.path,
    dimensions=config.rag.embedding.dimensions,
)

indexer = Indexer(config, store)

# Full reindex (clear + rebuild)
indexer.full_index()

# Incremental sync (only changed files)
indexer.sync()

# Get status
status = indexer.status()
print(f"Total chunks: {status['total_chunks']}")
print(f"Unindexed files: {status['unindexed_files']}")
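The core idea behind incremental sync is an mtime comparison: re-chunk a file only when its mtime on disk differs from the `modified_at` recorded at index time. A sketch of that selection step, with illustrative names and dict shapes rather than the real API:

```python
# Illustrative sketch: a file needs reindexing if it is new (not in the
# index) or its on-disk mtime no longer matches the indexed modified_at.
def files_to_reindex(
    disk_mtimes: dict[str, float],
    indexed_mtimes: dict[str, float],
) -> list[str]:
    return [
        path for path, mtime in disk_mtimes.items()
        if indexed_mtimes.get(path) != mtime
    ]
```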
### Search (`companion.rag.search`)

High-level search interface.
```python
from companion.rag.search import SearchEngine

engine = SearchEngine(
    vector_store=store,
    embedder_base_url="http://localhost:11434",
    embedder_model="mxbai-embed-large",
    default_top_k=8,
    similarity_threshold=0.75,
    hybrid_search_enabled=False,
)

results = engine.search(
    query="What did I learn about friendships?",
    top_k=8,
    filters={"source_directory": "Journal"},
)

for result in results:
    print(f"Source: {result['source_file']}")
    print(f"Relevance: {1 - result['_distance']:.2f}")
```
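The loop above treats `1 - _distance` as a relevance score, which suggests how `similarity_threshold` could be applied to raw hits. A sketch of that filtering step (the function is illustrative, not the engine's internals):

```python
# Illustrative sketch: keep only hits whose relevance (1 - _distance)
# meets the configured similarity threshold.
def filter_by_threshold(results: list[dict], threshold: float) -> list[dict]:
    return [r for r in results if 1 - r["_distance"] >= threshold]
```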
## CLI Commands

```bash
# Full index
python -m companion.indexer_daemon.cli index

# Incremental sync
python -m companion.indexer_daemon.cli sync

# Check status
python -m companion.indexer_daemon.cli status

# Reindex (same as index)
python -m companion.indexer_daemon.cli reindex
```
## Performance Tips

1. **Chunk Size**: Smaller chunks give more precise retrieval; larger chunks carry more context per result
2. **Batch Size**: 32 is a good default for Ollama embeddings
3. **Filters**: Use directory filters to narrow the search scope
4. **Sync vs Index**: Use `sync` for daily updates and `index` for full rebuilds
## Troubleshooting

**Slow indexing**
- Check that Ollama is running: `ollama ps`
- Reduce `batch_size` if Ollama runs out of memory

**No results**
- Verify the vault path in your config
- Check `indexer.status()` for unindexed files

**Duplicate chunks**
- Each chunk ID is `{source_file}::{chunk_index}`
- Use `full_index()` to clear and rebuild