# Compare commits

10 commits on `model/mini`, head `9f333c6f26`:

21b9704e21, 4ab504e87c, 9d919dc237, fe428511d1, a12e27b83a, 34f3ce97f7, a744c0c566, d946cf34e1, 928a027cec, fabdd48877
---

**AGENTS.md** (new file, 76 lines)
# AGENTS.md

## Stack

Two independent packages in one repo:
| Directory | Role | Entry | Build |
|-----------|------|-------|-------|
| `src/` | TypeScript OpenClaw plugin | `src/index.ts` | esbuild → `dist/index.js` |
| `python/` | Python CLI indexer | `obsidian_rag/cli.py` | `pip install -e python/` |
## Commands

**TypeScript (OpenClaw plugin):**

```bash
npm run build      # esbuild → dist/index.js
npm run typecheck  # tsc --noEmit
npm run test       # vitest run
```
**Python (RAG indexer):**

```bash
pip install -e python/                  # editable install
obsidian-rag index|sync|reindex|status  # CLI
pytest python/                          # tests
ruff check python/                      # lint
```
## OpenClaw Plugin Install

Plugin `package.json` MUST have:

```json
"openclaw": {
  "extensions": ["./dist/index.js"],
  "hook": []
}
```

- `extensions` = array of string paths
- `hook` = singular, not `hooks`
## Config

User config lives at `~/.obsidian-rag/config.json`, or `./obsidian-rag/` for a dev config.

Key indexing fields:

- `indexing.chunk_size` — sliding-window chunk size (default 500)
- `indexing.chunk_overlap` — overlap between chunks (default 100)
- `indexing.max_section_chars` — max chars per section before hierarchical split (default 4000)

Key security fields:

- `security.require_confirmation_for` — list of categories (e.g. `["health", "financial_debt"]`). An empty list disables the guard.
- `security.auto_approve_sensitive` — `true` bypasses sensitive-content prompts.
- `security.local_only` — `true` blocks non-localhost Ollama.
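Putting the fields above together, a minimal `config.json` sketch (field paths and defaults come from this section; the surrounding JSON nesting is an assumption):

```json
{
  "indexing": {
    "chunk_size": 500,
    "chunk_overlap": 100,
    "max_section_chars": 4000
  },
  "security": {
    "require_confirmation_for": ["health", "financial_debt"],
    "auto_approve_sensitive": false,
    "local_only": true
  }
}
```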
## Ollama Context Length

`python/obsidian_rag/embedder.py` truncates chunks at `MAX_CHUNK_CHARS = 8000` before embedding. If Ollama returns a 500 error, lower `max_section_chars` (to reduce section sizes) or reduce `chunk_size` in config.
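The truncation rule above amounts to a one-line clip; a minimal sketch (illustrative, not the actual embedder code):

```python
MAX_CHUNK_CHARS = 8000  # character budget enforced per chunk before embedding


def truncate_for_embedding(text: str) -> str:
    """Clip a chunk to MAX_CHUNK_CHARS before sending it to Ollama."""
    return text[:MAX_CHUNK_CHARS]


print(len(truncate_for_embedding("x" * 10_000)))  # → 8000
```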
## Hierarchical Chunking

Structured notes (date-named files) are section-split first; a sliding window is then applied within any section that exceeds `max_section_chars`. Small sections stay intact; large sections are broken into sub-chunks with the parent section heading preserved.
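A minimal sketch of that split strategy (helper names chosen to mirror `python/obsidian_rag/chunker.py`, but this is an illustration, not the real implementation):

```python
def sliding_window_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size windows that overlap by `overlap` characters."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def chunk_section(section_text: str, max_section_chars: int = 4000,
                  chunk_size: int = 500, overlap: int = 100) -> list[str]:
    # Small sections stay intact; large ones are broken into sub-chunks.
    if len(section_text) <= max_section_chars:
        return [section_text]
    return sliding_window_chunks(section_text, chunk_size, overlap)
```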
## Sensitive Content Guard

Triggered by categories in `require_confirmation_for`. Raises `SensitiveContentError` from `obsidian_rag/indexer.py`.

To disable, set `require_confirmation_for: []` or `auto_approve_sensitive: true` in config.
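In outline, the guard behaves roughly like this (a sketch; the detector's `{category: bool}` return shape is an assumption based on the diffs below):

```python
class SensitiveContentError(Exception):
    """Raised when a chunk matches a category the user must confirm."""


def check_sensitive_content(sensitivity: dict[str, bool],
                            require_confirmation_for: list[str],
                            auto_approve_sensitive: bool = False) -> None:
    # An empty category list or auto-approval disables the guard entirely.
    if not require_confirmation_for or auto_approve_sensitive:
        return
    for category in require_confirmation_for:
        if sensitivity.get(category, False):
            raise SensitiveContentError(f"confirmation required for: {category}")
```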
## Architecture

```
User query → OpenClaw (TypeScript plugin src/index.ts)
    → obsidian_rag_* tools (python/obsidian_rag/)
    → Ollama embeddings (http://localhost:11434)
    → LanceDB vector store
```
---

**docs/troubleshooting/MISSING_OPENCLAW_HOOKS_ERROR.md** (new file, 56 lines)
# Troubleshooting: "missing openclaw.hooks" Error

## Symptoms

When installing a plugin with `openclaw plugins install --link <path>`, the following error appears:

```
package.json missing openclaw.hooks; update the plugin package to include openclaw.extensions (for example ["./dist/index.js"]). See https://docs.openclaw.ai/help/troubleshooting#plugin-install-fails-with-missing-openclaw-extensions
Also not a valid hook pack: Error: package.json missing openclaw.hooks
```
## Root Cause

This error message is **a misleading fallback cascade**: the plugin installation fails for a different reason, and the "missing `openclaw.hooks`" message is a secondary error produced by the last fallback path, not the actual problem.

### How the Error Cascade Works

1. When `--link` is used with a local plugin path, OpenClaw first attempts to install the plugin via `installPluginFromPath()`.
2. The installation flow calls `installPluginFromDir()` → `installPluginFromSourceDir()` → `detectNativePackageInstallSource()`.
3. If `detectNativePackageInstallSource()` returns `false` (e.g. due to a dangerous-code scan failure), it falls through to `installPluginFromPackageDir()`.
4. When that also fails (e.g. because `child_process` usage is flagged), the code falls back to `tryInstallHookPackFromLocalPath()`.
5. The hook pack installer calls `ensureOpenClawHooks()`, which expects a `hooks` array in the manifest.
6. Since a normal plugin has no `hooks` field, it throws "missing openclaw.hooks".
### Primary vs Secondary Errors

The **primary error** blocking installation is:

```
Plugin "obsidian-rag" installation blocked: dangerous code patterns detected: Shell command execution detected (child_process)
```

The **secondary error** ("missing openclaw.hooks") is a fallback diagnostic that appears because:

- The plugin installation path failed
- OpenClaw tried to interpret the path as a hook pack as a last resort
- Hook packs require a `hooks` field, which normal plugins don't have
## Common Primary Errors

1. **Dangerous code patterns detected**: plugins using `child_process`, `eval()`, file-system operations, or network requests may be blocked
2. **Plugin ID mismatch**: the `id` in `openclaw.plugin.json` doesn't match expected values
3. **Missing `openclaw.extensions`**: the extensions array is missing or malformed
## Solution

The "missing openclaw.hooks" error can be **safely ignored** once the primary installation error is resolved. For the dangerous-code block specifically, use the `--dangerously-force-unsafe-install` flag:

```bash
openclaw plugins install --link --dangerously-force-unsafe-install /path/to/plugin/
```
## Why This Matters for AI Analysis

When analyzing this error, focus on the **first error shown**, not the secondary hook-related message. The OpenClaw installer attempts multiple installation strategies in sequence, and the final "missing openclaw.hooks" error is a diagnostic artifact from the fallback path, not the actual problem.

If the primary error is about dangerous code patterns, the plugin is functioning correctly from a technical standpoint: it simply uses APIs (like `child_process`) that OpenClaw's security scanning flags as potentially unsafe. Whether this is acceptable depends on the plugin's use case and trust level.
```diff
@@ -65,6 +65,11 @@
       "type": "integer",
       "minimum": 0
     },
+    "max_section_chars": {
+      "type": "integer",
+      "minimum": 1,
+      "description": "Max chars per section before splitting into sub-chunks. Default 4000."
+    },
     "file_patterns": {
       "type": "array",
       "items": {
```
```diff
@@ -5,8 +5,7 @@
   "main": "dist/index.js",
   "type": "module",
   "openclaw": {
-    "extensions": ["./dist/index.js"],
-    "hook": []
+    "extensions": ["./dist/index.js"]
   },
   "scripts": {
     "build": "esbuild src/index.ts --bundle --platform=node --target=node18 --outfile=dist/index.js --format=esm --external:@lancedb/lancedb --external:@lancedb/lancedb-darwin-arm64 --external:fsevents --external:chokidar",
```
```diff
@@ -3,7 +3,6 @@
 from __future__ import annotations
 
 import re
-import unicodedata
 import hashlib
 from dataclasses import dataclass, field
 from pathlib import Path
```
```diff
@@ -181,9 +180,7 @@ def chunk_file(
     Uses section-split for structured notes (journal entries with date filenames),
     sliding window for everything else.
     """
-    import uuid
 
-    vault_path = Path(config.vault_path)
     rel_path = filepath if filepath.is_absolute() else filepath
     source_file = str(rel_path)
     source_directory = rel_path.parts[0] if rel_path.parts else ""
```
```diff
@@ -201,7 +198,6 @@ def chunk_file(
     chunks: list[Chunk] = []
 
     if is_structured_note(filepath):
-        # Section-split for journal/daily notes
        sections = split_by_sections(body, metadata)
        total = len(sections)
 
```
```diff
@@ -211,20 +207,38 @@ def chunk_file(
             section_tags = extract_tags(section_text)
             combined_tags = list(dict.fromkeys([*tags, *section_tags]))
 
-            chunk_text = section_text
-            chunk = Chunk(
-                chunk_id=f"{chunk_id_prefix}{_stable_chunk_id(content_hash, idx)}",
-                text=chunk_text,
-                source_file=source_file,
-                source_directory=source_directory,
-                section=f"#{section}" if section else None,
-                date=date,
-                tags=combined_tags,
-                chunk_index=idx,
-                total_chunks=total,
-                modified_at=modified_at,
-            )
-            chunks.append(chunk)
+            section_heading = f"#{section}" if section else None
+            if len(section_text) > config.indexing.max_section_chars:
+                sub_chunks = sliding_window_chunks(section_text, chunk_size, overlap)
+                sub_total = len(sub_chunks)
+                for sub_idx, sub_text in enumerate(sub_chunks):
+                    chunk = Chunk(
+                        chunk_id=f"{chunk_id_prefix}{_stable_chunk_id(content_hash, idx)}_{sub_idx}",
+                        text=sub_text,
+                        source_file=source_file,
+                        source_directory=source_directory,
+                        section=section_heading,
+                        date=date,
+                        tags=combined_tags,
+                        chunk_index=sub_idx,
+                        total_chunks=sub_total,
+                        modified_at=modified_at,
+                    )
+                    chunks.append(chunk)
+            else:
+                chunk = Chunk(
+                    chunk_id=f"{chunk_id_prefix}{_stable_chunk_id(content_hash, idx)}",
+                    text=section_text,
+                    source_file=source_file,
+                    source_directory=source_directory,
+                    section=section_heading,
+                    date=date,
+                    tags=combined_tags,
+                    chunk_index=idx,
+                    total_chunks=total,
+                    modified_at=modified_at,
+                )
+                chunks.append(chunk)
     else:
         # Sliding window for unstructured notes
         text_chunks = sliding_window_chunks(body, chunk_size, overlap)
```
```diff
@@ -247,4 +261,4 @@ def chunk_file(
         )
         chunks.append(chunk)
 
     return chunks
```
```diff
@@ -51,7 +51,10 @@ def _index(config) -> int:
     gen = indexer.full_index()
     result: dict = {"indexed_files": 0, "total_chunks": 0, "errors": []}
     for item in gen:
-        result = item  # progress yields are dicts; final dict from return
+        if item.get("type") == "complete":
+            result = item
+        elif item.get("type") == "progress":
+            pass  # skip progress logs in result
     duration_ms = int((time.monotonic() - t0) * 1000)
     print(
         json.dumps(
```
```diff
@@ -3,7 +3,6 @@
 from __future__ import annotations
 
 import json
-import os
 from enum import Enum
 from dataclasses import dataclass, field
 from pathlib import Path
```
```diff
@@ -32,6 +31,7 @@ class VectorStoreConfig:
 class IndexingConfig:
     chunk_size: int = 500
     chunk_overlap: int = 100
+    max_section_chars: int = 4000
     file_patterns: list[str] = field(default_factory=lambda: ["*.md"])
     deny_dirs: list[str] = field(
         default_factory=lambda: [
```
```diff
@@ -12,6 +12,7 @@ if TYPE_CHECKING:
     from obsidian_rag.config import ObsidianRagConfig
 
 DEFAULT_TIMEOUT = 120.0  # seconds
+MAX_CHUNK_CHARS = 8000  # safe default for most Ollama models
 
 
 class EmbeddingError(Exception):
```
```diff
@@ -42,9 +43,9 @@ class OllamaEmbedder:
         """Validate that embedding service is local when local_only is True."""
         if not self.local_only:
             return
 
         parsed = urllib.parse.urlparse(self.base_url)
-        if parsed.hostname not in ['localhost', '127.0.0.1', '::1']:
+        if parsed.hostname not in ["localhost", "127.0.0.1", "::1"]:
             raise SecurityError(
                 f"Remote embedding service not allowed when local_only=True: {self.base_url}"
             )
```
```diff
@@ -84,23 +85,31 @@ class OllamaEmbedder:
         # For batch, call /api/embeddings multiple times sequentially
         if len(batch) == 1:
             endpoint = f"{self.base_url}/api/embeddings"
-            payload = {"model": self.model, "prompt": batch[0]}
+            prompt = batch[0][:MAX_CHUNK_CHARS]
+            payload = {"model": self.model, "prompt": prompt}
         else:
             # For batch, use /api/embeddings with "input" (multiple calls)
             results = []
             for text in batch:
+                truncated = text[:MAX_CHUNK_CHARS]
                 try:
                     resp = self._client.post(
                         f"{self.base_url}/api/embeddings",
-                        json={"model": self.model, "prompt": text},
+                        json={"model": self.model, "prompt": truncated},
                         timeout=DEFAULT_TIMEOUT,
                     )
                 except httpx.ConnectError as e:
-                    raise OllamaUnavailableError(f"Cannot connect to Ollama at {self.base_url}") from e
+                    raise OllamaUnavailableError(
+                        f"Cannot connect to Ollama at {self.base_url}"
+                    ) from e
                 except httpx.TimeoutException as e:
-                    raise EmbeddingError(f"Embedding request timed out after {DEFAULT_TIMEOUT}s") from e
+                    raise EmbeddingError(
+                        f"Embedding request timed out after {DEFAULT_TIMEOUT}s"
+                    ) from e
                 if resp.status_code != 200:
-                    raise EmbeddingError(f"Ollama returned {resp.status_code}: {resp.text}")
+                    raise EmbeddingError(
+                        f"Ollama returned {resp.status_code}: {resp.text}"
+                    )
                 data = resp.json()
                 embedding = data.get("embedding", [])
                 if not embedding:
```
```diff
@@ -111,9 +120,13 @@ class OllamaEmbedder:
         try:
             resp = self._client.post(endpoint, json=payload, timeout=DEFAULT_TIMEOUT)
         except httpx.ConnectError as e:
-            raise OllamaUnavailableError(f"Cannot connect to Ollama at {self.base_url}") from e
+            raise OllamaUnavailableError(
+                f"Cannot connect to Ollama at {self.base_url}"
+            ) from e
         except httpx.TimeoutException as e:
-            raise EmbeddingError(f"Embedding request timed out after {DEFAULT_TIMEOUT}s") from e
+            raise EmbeddingError(
+                f"Embedding request timed out after {DEFAULT_TIMEOUT}s"
+            ) from e
 
         if resp.status_code != 200:
             raise EmbeddingError(f"Ollama returned {resp.status_code}: {resp.text}")
```
```diff
@@ -4,8 +4,6 @@ from __future__ import annotations
 
 import json
 import os
-import time
-import uuid
 from datetime import datetime, timezone
 from pathlib import Path
 from typing import TYPE_CHECKING, Any, Generator, Iterator
@@ -16,10 +14,13 @@ if TYPE_CHECKING:
     import obsidian_rag.config as config_mod
 from obsidian_rag.config import _resolve_data_dir
 from obsidian_rag.chunker import chunk_file
-from obsidian_rag.embedder import EmbeddingError, OllamaUnavailableError, SecurityError
+from obsidian_rag.embedder import OllamaUnavailableError
 from obsidian_rag.security import should_index_dir, validate_path
-from obsidian_rag.audit_logger import AuditLogger
-from obsidian_rag.vector_store import create_table_if_not_exists, delete_by_source_file, get_db, upsert_chunks
+from obsidian_rag.vector_store import (
+    create_table_if_not_exists,
+    get_db,
+    upsert_chunks,
+)
 
 # ----------------------------------------------------------------------
 # Pipeline
```
```diff
@@ -43,6 +44,7 @@ class Indexer:
     def embedder(self):
         if self._embedder is None:
             from obsidian_rag.embedder import OllamaEmbedder
+
             self._embedder = OllamaEmbedder(self.config)
         return self._embedder
 
```
```diff
@@ -50,6 +52,7 @@ class Indexer:
     def audit_logger(self):
         if self._audit_logger is None:
             from obsidian_rag.audit_logger import AuditLogger
+
             log_dir = _resolve_data_dir() / "audit"
             self._audit_logger = AuditLogger(log_dir / "audit.log")
         return self._audit_logger
```
```diff
@@ -57,18 +60,18 @@ class Indexer:
     def _check_sensitive_content_approval(self, chunks: list[dict[str, Any]]) -> None:
         """Enforce user approval for sensitive content before indexing."""
         from obsidian_rag import security
 
         sensitive_categories = self.config.security.require_confirmation_for
         if not sensitive_categories:
             return
 
         for chunk in chunks:
             sensitivity = security.detect_sensitive(
-                chunk['chunk_text'],
+                chunk["chunk_text"],
                 self.config.security.sensitive_sections,
-                self.config.memory.patterns
+                self.config.memory.patterns,
             )
 
             for category in sensitive_categories:
                 if sensitivity.get(category, False):
                     if not self.config.security.auto_approve_sensitive:
```
```diff
@@ -99,7 +102,11 @@ class Indexer:
         """Index a single file. Returns (num_chunks, enriched_chunks)."""
         from obsidian_rag import security
 
-        mtime = str(datetime.fromtimestamp(filepath.stat().st_mtime, tz=timezone.utc).isoformat())
+        mtime = str(
+            datetime.fromtimestamp(
+                filepath.stat().st_mtime, tz=timezone.utc
+            ).isoformat()
+        )
         content = filepath.read_text(encoding="utf-8")
         # Sanitize
         content = security.sanitize_text(content)
```
```diff
@@ -147,24 +154,25 @@ class Indexer:
                 num_chunks, enriched = self.process_file(filepath)
                 # Enforce sensitive content policies
                 self._check_sensitive_content_approval(enriched)
 
                 # Log sensitive content access
                 for chunk in enriched:
                     from obsidian_rag import security
 
                     sensitivity = security.detect_sensitive(
-                        chunk['chunk_text'],
+                        chunk["chunk_text"],
                         self.config.security.sensitive_sections,
-                        self.config.memory.patterns
+                        self.config.memory.patterns,
                     )
-                    for category in ['health', 'financial', 'relations']:
+                    for category in ["health", "financial", "relations"]:
                         if sensitivity.get(category, False):
                             self.audit_logger.log_sensitive_access(
-                                str(chunk['source_file']),
+                                str(chunk["source_file"]),
                                 category,
-                                'index',
-                                {'chunk_id': chunk['chunk_id']}
+                                "index",
+                                {"chunk_id": chunk["chunk_id"]},
                             )
 
                 # Embed chunks
                 texts = [e["chunk_text"] for e in enriched]
                 try:
```
```diff
@@ -176,8 +184,8 @@ class Indexer:
                     for e, v in zip(enriched, vectors):
                         e["vector"] = v
                     # Store
-                    upsert_chunks(table, enriched)
-                    total_chunks += num_chunks
+                    stored = upsert_chunks(table, enriched)
+                    total_chunks += stored
                     indexed_files += 1
                 except Exception as exc:
                     errors.append({"file": str(filepath), "error": str(exc)})
```
```diff
@@ -249,9 +257,16 @@ class Indexer:
         db = get_db(self.config)
         if "obsidian_chunks" in db.list_tables():
             db.drop_table("obsidian_chunks")
-        # full_index is a generator — materialize it to get the final dict
         results = list(self.full_index())
-        return results[-1] if results else {"indexed_files": 0, "total_chunks": 0, "errors": []}
+        final = (
+            results[-1]
+            if results
+            else {"indexed_files": 0, "total_chunks": 0, "errors": []}
+        )
+        self._write_sync_result(
+            final["indexed_files"], final["total_chunks"], final["errors"]
+        )
+        return final
 
     def _sync_result_path(self) -> Path:
         # Use the same dev-data-dir convention as config.py
```
```diff
@@ -309,24 +324,25 @@ class Indexer:
                 num_chunks, enriched = self.process_file(filepath)
                 # Enforce sensitive content policies
                 self._check_sensitive_content_approval(enriched)
 
                 # Log sensitive content access
                 for chunk in enriched:
                     from obsidian_rag import security
 
                     sensitivity = security.detect_sensitive(
-                        chunk['chunk_text'],
+                        chunk["chunk_text"],
                         self.config.security.sensitive_sections,
-                        self.config.memory.patterns
+                        self.config.memory.patterns,
                     )
-                    for category in ['health', 'financial', 'relations']:
+                    for category in ["health", "financial", "relations"]:
                         if sensitivity.get(category, False):
                             self.audit_logger.log_sensitive_access(
-                                str(chunk['source_file']),
+                                str(chunk["source_file"]),
                                 category,
-                                'index',
-                                {'chunk_id': chunk['chunk_id']}
+                                "index",
+                                {"chunk_id": chunk["chunk_id"]},
                             )
 
                 # Embed chunks
                 texts = [e["chunk_text"] for e in enriched]
                 try:
```
```diff
@@ -353,4 +369,3 @@ class Indexer:
         if not data_dir.exists() and not (project_root / "KnowledgeVault").exists():
             data_dir = Path(os.path.expanduser("~/.obsidian-rag"))
         return data_dir / "sync-result.json"
-
```
```diff
@@ -206,6 +206,7 @@ def _mock_config(tmp_path: Path) -> MagicMock:
     cfg.vault_path = str(tmp_path)
     cfg.indexing.chunk_size = 500
     cfg.indexing.chunk_overlap = 100
+    cfg.indexing.max_section_chars = 4000
     cfg.indexing.file_patterns = ["*.md"]
     cfg.indexing.deny_dirs = [".obsidian", ".trash", "zzz-Archive", ".git"]
     cfg.indexing.allow_dirs = []
```
```diff
@@ -248,3 +249,41 @@ def test_chunk_file_unstructured(tmp_path: Path):
     assert len(chunks) > 1
     assert all(c.section is None for c in chunks)
     assert chunks[0].chunk_index == 0
+
+
+def test_large_section_split_into_sub_chunks(tmp_path: Path):
+    """Large section (exceeding max_section_chars) is split via sliding window."""
+    vault = tmp_path / "Notes"
+    vault.mkdir()
+    fpath = vault / "2024-03-15-Podcast.md"
+    large_content = "word " * 3000  # ~15000 chars, exceeds MAX_SECTION_CHARS
+    fpath.write_text(f"# Episode Notes\n\n{large_content}")
+
+    cfg = _mock_config(tmp_path)
+    cfg.indexing.max_section_chars = 4000
+    chunks = chunk_file(fpath, fpath.read_text(), "2024-03-15T10:00:00Z", cfg)
+
+    # Large section should be split into multiple sub-chunks
+    assert len(chunks) > 1
+    # Each sub-chunk should preserve the section heading
+    for chunk in chunks:
+        assert chunk.section == "#Episode Notes", (
+            f"Expected #Episode Notes, got {chunk.section}"
+        )
+
+
+def test_small_section_kept_intact(tmp_path: Path):
+    """Small section (under max_section_chars) remains a single chunk."""
+    vault = tmp_path / "Notes"
+    vault.mkdir()
+    fpath = vault / "2024-03-15-Short.md"
+    fpath.write_text("# Notes\n\nShort content here.")
+
+    cfg = _mock_config(tmp_path)
+    cfg.indexing.max_section_chars = 4000
+    chunks = chunk_file(fpath, fpath.read_text(), "2024-03-15T10:00:00Z", cfg)
+
+    # Small section → single chunk
+    assert len(chunks) == 1
+    assert chunks[0].section == "#Notes"
+    assert chunks[0].text.strip().endswith("Short content here.")
```
```diff
@@ -98,7 +98,8 @@ export async function probeAll(config: ObsidianRagConfig): Promise<ProbeResult>
 
   if (indexExists) {
     try {
-      const syncPath = resolve(dbPath, "..", "sync-result.json");
+      const dataDir = resolveDataDir();
+      const syncPath = resolve(dataDir, "sync-result.json");
       if (existsSync(syncPath)) {
         const data = JSON.parse(readFileSync(syncPath, "utf-8"));
         lastSync = data.timestamp ?? null;
```
```diff
@@ -120,6 +121,17 @@ export async function probeAll(config: ObsidianRagConfig): Promise<ProbeResult>
   };
 }
 
+function resolveDataDir(): string {
+  const cwd = process.cwd();
+  const devDataDir = resolve(cwd, "obsidian-rag");
+  const devVaultMarker = resolve(cwd, "KnowledgeVault");
+  if (existsSync(devDataDir) || existsSync(devVaultMarker)) {
+    return devDataDir;
+  }
+  const home = process.env.HOME ?? process.env.USERPROFILE ?? "";
+  return resolve(home, ".obsidian-rag");
+}
+
 async function probeOllama(baseUrl: string): Promise<boolean> {
   try {
     const res = await fetch(`${baseUrl}/api/tags`, { signal: AbortSignal.timeout(3000) });
```
```diff
@@ -109,7 +109,7 @@ export function readSyncResult(config: ObsidianRagConfig): {
   total_chunks: number;
   errors: Array<{ file: string; error: string }>;
 } | null {
-  const dataDir = resolve(process.cwd(), ".obsidian-rag");
+  const dataDir = _resolveDataDir();
   const path = resolve(dataDir, "sync-result.json");
   if (!existsSync(path)) return null;
   try {
```
```diff
@@ -118,3 +118,14 @@ export function readSyncResult(config: ObsidianRagConfig): {
     return null;
   }
 }
+
+function _resolveDataDir(): string {
+  const cwd = process.cwd();
+  const devDataDir = resolve(cwd, "obsidian-rag");
+  const devVaultMarker = resolve(cwd, "KnowledgeVault");
+  if (existsSync(devDataDir) || existsSync(devVaultMarker)) {
+    return devDataDir;
+  }
+  const home = process.env.HOME ?? process.env.USERPROFILE ?? "";
+  return resolve(home, ".obsidian-rag");
+}
```
```diff
@@ -88,7 +88,7 @@ function defaults(): ObsidianRagConfig {
 }
 
 export function loadConfig(configPath?: string): ObsidianRagConfig {
-  const defaultPath = resolve(process.cwd(), ".obsidian-rag", "config.json");
+  const defaultPath = resolve(process.cwd(), "obsidian-rag", "config.json");
   const path = configPath ?? defaultPath;
   try {
     const raw = JSON.parse(readFileSync(path, "utf-8"));
```