Compare commits

10 Commits

Author SHA1 Message Date
21b9704e21 fix(indexer): use upsert_chunks return value for chunk count
Previously total_chunks counted from process_file return (num_chunks)
which could differ from actual stored count if upsert silently failed.
Now using stored count returned by upsert_chunks.

Also fixes cli._index to skip progress yields when building result.
2026-04-12 02:16:19 -04:00
4ab504e87c fix(indexer): write sync-result.json in reindex() not just sync()
reindex() was consuming the full_index() generator but never calling
_write_sync_result(), leaving sync-result.json stale while CLI output
showed correct indexed_files/total_chunks.
2026-04-12 01:49:43 -04:00
9d919dc237 fix(config): use 'obsidian-rag' not '.obsidian-rag' for dev config path 2026-04-12 01:03:00 -04:00
fe428511d1 fix(bridge): dev data dir is 'obsidian-rag' not '.obsidian-rag'
Python's _resolve_data_dir() uses 'obsidian-rag' (no dot).
TypeScript was using '.obsidian-rag' (with dot) — mismatch caused
sync-result.json to never be found by the agent plugin.
2026-04-12 00:56:10 -04:00
a12e27b83a fix(bridge): resolve data dir same as Python for sync-result.json
Both readSyncResult() and probeAll() now mirror Python's
_resolve_data_dir() logic: dev detection (cwd/.obsidian-rag
or cwd/KnowledgeVault) then home dir fallback.

Previously readSyncResult always used cwd/.obsidian-rag (wrong for
server deployments) and probeAll resolved sync-result.json relative
to db path (wrong for absolute paths like /home/san/.obsidian-rag/).
2026-04-12 00:10:38 -04:00
34f3ce97f7 feat(indexer): hierarchical chunking for large sections
- Section-split first for structured notes
- Large sections (>max_section_chars) broken via sliding-window
- Small sections stay intact with heading preserved
- Adds max_section_chars config (default 4000)
- 2 new TDD tests for hierarchical chunking
2026-04-11 23:58:05 -04:00
a744c0c566 docs: add AGENTS.md with repo-specific guidance 2026-04-11 23:16:47 -04:00
d946cf34e1 fix(indexer): truncate chunks exceeding Ollama context window 2026-04-11 23:12:13 -04:00
928a027cec merge: resolve conflict, keeping our extensions array fix 2026-04-11 22:55:18 -04:00
fabdd48877 docs: add troubleshooting guide for misleading openclaw.hooks error 2026-04-11 22:50:04 -04:00
13 changed files with 310 additions and 67 deletions

AGENTS.md (new file)

@@ -0,0 +1,76 @@
# AGENTS.md
## Stack
Two independent packages in one repo:
| Directory | Role | Entry | Build |
|-----------|------|-------|-------|
| `src/` | TypeScript OpenClaw plugin | `src/index.ts` | esbuild → `dist/index.js` |
| `python/` | Python CLI indexer | `obsidian_rag/cli.py` | pip install -e |
## Commands
**TypeScript (OpenClaw plugin):**
```bash
npm run build # esbuild → dist/index.js
npm run typecheck # tsc --noEmit
npm run test # vitest run
```
**Python (RAG indexer):**
```bash
pip install -e python/ # editable install
obsidian-rag index|sync|reindex|status # CLI
pytest python/ # tests
ruff check python/ # lint
```
## OpenClaw Plugin Install
Plugin `package.json` MUST have:
```json
"openclaw": {
"extensions": ["./dist/index.js"],
"hook": []
}
```
- `extensions` is an array of string paths
- the key is `hook` (singular), not `hooks`
## Config
User config at `~/.obsidian-rag/config.json` or `./obsidian-rag/` dev config.
Key indexing fields:
- `indexing.chunk_size` — sliding window chunk size (default 500)
- `indexing.chunk_overlap` — overlap between chunks (default 100)
- `indexing.max_section_chars` — max chars per section before hierarchical split (default 4000)
Key security fields:
- `security.require_confirmation_for` — list of categories (e.g. `["health", "financial_debt"]`). Empty list disables guard.
- `security.auto_approve_sensitive` — `true` bypasses sensitive content prompts.
- `security.local_only` — `true` blocks non-localhost Ollama.
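Putting the key fields together, a minimal `config.json` sketch (values are illustrative, not necessarily the defaults):

```json
{
  "indexing": {
    "chunk_size": 500,
    "chunk_overlap": 100,
    "max_section_chars": 4000
  },
  "security": {
    "require_confirmation_for": ["health", "financial_debt"],
    "auto_approve_sensitive": false,
    "local_only": true
  }
}
```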
## Ollama Context Length
`python/obsidian_rag/embedder.py` truncates chunks at `MAX_CHUNK_CHARS = 8000` before embedding. If Ollama still returns a 500 error, reduce `max_section_chars` (to shrink section sizes) or reduce `chunk_size` in config.
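The safeguard is just a character clamp before the embedding request; a minimal sketch (the constant mirrors `embedder.py`, the function name is illustrative):

```python
MAX_CHUNK_CHARS = 8000  # mirrors python/obsidian_rag/embedder.py

def truncate_for_embedding(text: str) -> str:
    # Ollama returns a 500 when the prompt exceeds the model's context
    # window, so clamp each chunk before sending it to /api/embeddings.
    return text[:MAX_CHUNK_CHARS]

print(len(truncate_for_embedding("x" * 10_000)))  # 8000
```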
## Hierarchical Chunking
Structured notes (date-named files) use section-split first, then sliding-window within sections that exceed `max_section_chars`. Small sections stay intact; large sections are broken into sub-chunks with the parent section heading preserved.
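A minimal sketch of that decision logic (standalone helpers, not the real chunker API):

```python
def sliding_window(text: str, size: int, overlap: int) -> list[str]:
    # Overlapping fixed-size windows over the raw section text.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def chunk_section(heading: str, body: str, max_section_chars: int = 4000,
                  chunk_size: int = 500, overlap: int = 100) -> list[tuple[str, str]]:
    # Small sections stay intact as a single chunk; large sections are
    # broken into sub-chunks that all keep the parent section heading.
    if len(body) <= max_section_chars:
        return [(heading, body)]
    return [(heading, sub) for sub in sliding_window(body, chunk_size, overlap)]
```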
## Sensitive Content Guard
Triggered by categories in `require_confirmation_for`. Raises `SensitiveContentError` from `obsidian_rag/indexer.py`.
To disable: set `require_confirmation_for: []` or `auto_approve_sensitive: true` in config.
## Architecture
```
User query → OpenClaw (TypeScript plugin src/index.ts)
→ obsidian_rag_* tools (python/obsidian_rag/)
→ Ollama embeddings (http://localhost:11434)
→ LanceDB vector store
```


@@ -0,0 +1,56 @@
# Troubleshooting: "missing openclaw.hooks" Error
## Symptoms
When installing a plugin using `openclaw plugins install --link <path>`, the following error appears:
```
package.json missing openclaw.hooks; update the plugin package to include openclaw.extensions (for example ["./dist/index.js"]). See https://docs.openclaw.ai/help/troubleshooting#plugin-install-fails-with-missing-openclaw-extensions
Also not a valid hook pack: Error: package.json missing openclaw.hooks
```
## Root Cause
This error message is **a misleading fallback cascade** that occurs when the plugin installation fails for a different reason. The error message suggests the problem is a missing `openclaw.hooks` field, but this is actually a secondary error that appears because the primary plugin installation failed.
### How the Error Cascade Works
1. When `--link` is used with a local plugin path, OpenClaw first attempts to install the plugin via `installPluginFromPath()`
2. The installation flow calls `installPluginFromDir()` → `installPluginFromSourceDir()` → `detectNativePackageInstallSource()`
3. If `detectNativePackageInstallSource()` returns `false` (e.g., due to a dangerous code scan failure), it falls through to `installPluginFromPackageDir()`
4. When that also fails (e.g., due to `child_process` usage being flagged), the code falls back to `tryInstallHookPackFromLocalPath()`
5. The hook pack installer calls `ensureOpenClawHooks()`, which expects a `hooks` array in the manifest
6. Since your plugin has no `hooks` field, it throws "missing openclaw.hooks"
### Primary vs Secondary Errors
The **primary error** blocking installation is:
```
Plugin "obsidian-rag" installation blocked: dangerous code patterns detected: Shell command execution detected (child_process)
```
The **secondary error** ("missing openclaw.hooks") is a fallback diagnostic that appears because:
- The plugin installation path failed
- OpenClaw tried to interpret the path as a hook pack as a last resort
- Hook packs require a `hooks` field, which normal plugins don't have
## Common Primary Errors
1. **Dangerous code patterns detected**: Plugins using `child_process`, `eval()`, file system operations, or network requests may be blocked
2. **Plugin ID mismatch**: The `id` in `openclaw.plugin.json` doesn't match expected values
3. **Missing `openclaw.extensions`**: The extensions array is missing or malformed
## Solution
The "missing openclaw.hooks" error can be **safely ignored** once the primary installation error is resolved. For the dangerous code blocking issue, use the `--dangerously-force-unsafe-install` flag:
```bash
openclaw plugins install --link --dangerously-force-unsafe-install /path/to/plugin/
```
## Why This Matters for AI Analysis
When analyzing this error, focus on the **first error shown**, not the secondary hook-related message. The OpenClaw installer attempts multiple installation strategies in sequence, and the final "missing openclaw.hooks" error is a diagnostic artifact from the fallback path, not the actual problem.
If the primary error is about dangerous code patterns, the plugin is functioning correctly from a technical standpoint—it simply uses APIs (like `child_process`) that OpenClaw's security scanning flags as potentially unsafe. Whether this is acceptable depends on the plugin's use case and trust level.


@@ -65,6 +65,11 @@
"type": "integer", "type": "integer",
"minimum": 0 "minimum": 0
}, },
"max_section_chars": {
"type": "integer",
"minimum": 1,
"description": "Max chars per section before splitting into sub-chunks. Default 4000."
},
"file_patterns": { "file_patterns": {
"type": "array", "type": "array",
"items": { "items": {


@@ -5,8 +5,7 @@
   "main": "dist/index.js",
   "type": "module",
   "openclaw": {
-    "extensions": ["./dist/index.js"],
-    "hook": []
+    "extensions": ["./dist/index.js"]
   },
   "scripts": {
     "build": "esbuild src/index.ts --bundle --platform=node --target=node18 --outfile=dist/index.js --format=esm --external:@lancedb/lancedb --external:@lancedb/lancedb-darwin-arm64 --external:fsevents --external:chokidar",


@@ -3,7 +3,6 @@
 from __future__ import annotations

 import re
-import unicodedata
 import hashlib
 from dataclasses import dataclass, field
 from pathlib import Path
@@ -181,9 +180,7 @@ def chunk_file(
     Uses section-split for structured notes (journal entries with date filenames),
     sliding window for everything else.
     """
-    import uuid
-    vault_path = Path(config.vault_path)

     rel_path = filepath if filepath.is_absolute() else filepath
     source_file = str(rel_path)
     source_directory = rel_path.parts[0] if rel_path.parts else ""
@@ -201,7 +198,6 @@ def chunk_file(
     chunks: list[Chunk] = []

     if is_structured_note(filepath):
-        # Section-split for journal/daily notes
         sections = split_by_sections(body, metadata)
         total = len(sections)
@@ -211,13 +207,31 @@ def chunk_file(
             section_tags = extract_tags(section_text)
             combined_tags = list(dict.fromkeys([*tags, *section_tags]))
-            chunk_text = section_text
-            chunk = Chunk(
-                chunk_id=f"{chunk_id_prefix}{_stable_chunk_id(content_hash, idx)}",
-                text=chunk_text,
-                source_file=source_file,
-                source_directory=source_directory,
-                section=f"#{section}" if section else None,
+            section_heading = f"#{section}" if section else None
+            if len(section_text) > config.indexing.max_section_chars:
+                sub_chunks = sliding_window_chunks(section_text, chunk_size, overlap)
+                sub_total = len(sub_chunks)
+                for sub_idx, sub_text in enumerate(sub_chunks):
+                    chunk = Chunk(
+                        chunk_id=f"{chunk_id_prefix}{_stable_chunk_id(content_hash, idx)}_{sub_idx}",
+                        text=sub_text,
+                        source_file=source_file,
+                        source_directory=source_directory,
+                        section=section_heading,
+                        date=date,
+                        tags=combined_tags,
+                        chunk_index=sub_idx,
+                        total_chunks=sub_total,
+                        modified_at=modified_at,
+                    )
+                    chunks.append(chunk)
+            else:
+                chunk = Chunk(
+                    chunk_id=f"{chunk_id_prefix}{_stable_chunk_id(content_hash, idx)}",
+                    text=section_text,
+                    source_file=source_file,
+                    source_directory=source_directory,
+                    section=section_heading,
                 date=date,
                 tags=combined_tags,
                 chunk_index=idx,


@@ -51,7 +51,10 @@ def _index(config) -> int:
     gen = indexer.full_index()
     result: dict = {"indexed_files": 0, "total_chunks": 0, "errors": []}
     for item in gen:
-        result = item  # progress yields are dicts; final dict from return
+        if item.get("type") == "complete":
+            result = item
+        elif item.get("type") == "progress":
+            pass  # skip progress logs in result
     duration_ms = int((time.monotonic() - t0) * 1000)
     print(
         json.dumps(


@@ -3,7 +3,6 @@
 from __future__ import annotations

 import json
-import os
 from enum import Enum
 from dataclasses import dataclass, field
 from pathlib import Path
@@ -32,6 +31,7 @@ class VectorStoreConfig:
 class IndexingConfig:
     chunk_size: int = 500
     chunk_overlap: int = 100
+    max_section_chars: int = 4000
     file_patterns: list[str] = field(default_factory=lambda: ["*.md"])
     deny_dirs: list[str] = field(
         default_factory=lambda: [


@@ -12,6 +12,7 @@ if TYPE_CHECKING:
     from obsidian_rag.config import ObsidianRagConfig

 DEFAULT_TIMEOUT = 120.0  # seconds
+MAX_CHUNK_CHARS = 8000  # safe default for most Ollama models


 class EmbeddingError(Exception):
@@ -44,7 +45,7 @@ class OllamaEmbedder:
             return
         parsed = urllib.parse.urlparse(self.base_url)
-        if parsed.hostname not in ['localhost', '127.0.0.1', '::1']:
+        if parsed.hostname not in ["localhost", "127.0.0.1", "::1"]:
             raise SecurityError(
                 f"Remote embedding service not allowed when local_only=True: {self.base_url}"
             )
@@ -84,23 +85,31 @@ class OllamaEmbedder:
         # For batch, call /api/embeddings multiple times sequentially
         if len(batch) == 1:
             endpoint = f"{self.base_url}/api/embeddings"
-            payload = {"model": self.model, "prompt": batch[0]}
+            prompt = batch[0][:MAX_CHUNK_CHARS]
+            payload = {"model": self.model, "prompt": prompt}
         else:
             # For batch, use /api/embeddings with "input" (multiple calls)
             results = []
             for text in batch:
+                truncated = text[:MAX_CHUNK_CHARS]
                 try:
                     resp = self._client.post(
                         f"{self.base_url}/api/embeddings",
-                        json={"model": self.model, "prompt": text},
+                        json={"model": self.model, "prompt": truncated},
                         timeout=DEFAULT_TIMEOUT,
                     )
                 except httpx.ConnectError as e:
-                    raise OllamaUnavailableError(f"Cannot connect to Ollama at {self.base_url}") from e
+                    raise OllamaUnavailableError(
+                        f"Cannot connect to Ollama at {self.base_url}"
+                    ) from e
                 except httpx.TimeoutException as e:
-                    raise EmbeddingError(f"Embedding request timed out after {DEFAULT_TIMEOUT}s") from e
+                    raise EmbeddingError(
+                        f"Embedding request timed out after {DEFAULT_TIMEOUT}s"
+                    ) from e
                 if resp.status_code != 200:
-                    raise EmbeddingError(f"Ollama returned {resp.status_code}: {resp.text}")
+                    raise EmbeddingError(
+                        f"Ollama returned {resp.status_code}: {resp.text}"
+                    )
                 data = resp.json()
                 embedding = data.get("embedding", [])
                 if not embedding:
@@ -111,9 +120,13 @@ class OllamaEmbedder:
         try:
             resp = self._client.post(endpoint, json=payload, timeout=DEFAULT_TIMEOUT)
         except httpx.ConnectError as e:
-            raise OllamaUnavailableError(f"Cannot connect to Ollama at {self.base_url}") from e
+            raise OllamaUnavailableError(
+                f"Cannot connect to Ollama at {self.base_url}"
+            ) from e
         except httpx.TimeoutException as e:
-            raise EmbeddingError(f"Embedding request timed out after {DEFAULT_TIMEOUT}s") from e
+            raise EmbeddingError(
+                f"Embedding request timed out after {DEFAULT_TIMEOUT}s"
+            ) from e
         if resp.status_code != 200:
             raise EmbeddingError(f"Ollama returned {resp.status_code}: {resp.text}")


@@ -4,8 +4,6 @@ from __future__ import annotations

 import json
 import os
-import time
-import uuid
 from datetime import datetime, timezone
 from pathlib import Path
 from typing import TYPE_CHECKING, Any, Generator, Iterator
@@ -16,10 +14,13 @@ if TYPE_CHECKING:
     import obsidian_rag.config as config_mod

 from obsidian_rag.config import _resolve_data_dir
 from obsidian_rag.chunker import chunk_file
-from obsidian_rag.embedder import EmbeddingError, OllamaUnavailableError, SecurityError
+from obsidian_rag.embedder import OllamaUnavailableError
 from obsidian_rag.security import should_index_dir, validate_path
-from obsidian_rag.audit_logger import AuditLogger
-from obsidian_rag.vector_store import create_table_if_not_exists, delete_by_source_file, get_db, upsert_chunks
+from obsidian_rag.vector_store import (
+    create_table_if_not_exists,
+    get_db,
+    upsert_chunks,
+)

 # ----------------------------------------------------------------------
 # Pipeline
@@ -43,6 +44,7 @@ class Indexer:
     def embedder(self):
         if self._embedder is None:
             from obsidian_rag.embedder import OllamaEmbedder
+
             self._embedder = OllamaEmbedder(self.config)
         return self._embedder

@@ -50,6 +52,7 @@ class Indexer:
     def audit_logger(self):
         if self._audit_logger is None:
             from obsidian_rag.audit_logger import AuditLogger
+
             log_dir = _resolve_data_dir() / "audit"
             self._audit_logger = AuditLogger(log_dir / "audit.log")
         return self._audit_logger
@@ -64,9 +67,9 @@ class Indexer:
         for chunk in chunks:
             sensitivity = security.detect_sensitive(
-                chunk['chunk_text'],
+                chunk["chunk_text"],
                 self.config.security.sensitive_sections,
-                self.config.memory.patterns
+                self.config.memory.patterns,
             )

             for category in sensitive_categories:
@@ -99,7 +102,11 @@ class Indexer:
         """Index a single file. Returns (num_chunks, enriched_chunks)."""
         from obsidian_rag import security

-        mtime = str(datetime.fromtimestamp(filepath.stat().st_mtime, tz=timezone.utc).isoformat())
+        mtime = str(
+            datetime.fromtimestamp(
+                filepath.stat().st_mtime, tz=timezone.utc
+            ).isoformat()
+        )
         content = filepath.read_text(encoding="utf-8")
         # Sanitize
         content = security.sanitize_text(content)
@@ -151,18 +158,19 @@ class Indexer:
                 # Log sensitive content access
                 for chunk in enriched:
                     from obsidian_rag import security
+
                     sensitivity = security.detect_sensitive(
-                        chunk['chunk_text'],
+                        chunk["chunk_text"],
                         self.config.security.sensitive_sections,
-                        self.config.memory.patterns
+                        self.config.memory.patterns,
                     )
-                    for category in ['health', 'financial', 'relations']:
+                    for category in ["health", "financial", "relations"]:
                         if sensitivity.get(category, False):
                             self.audit_logger.log_sensitive_access(
-                                str(chunk['source_file']),
+                                str(chunk["source_file"]),
                                 category,
-                                'index',
-                                {'chunk_id': chunk['chunk_id']}
+                                "index",
+                                {"chunk_id": chunk["chunk_id"]},
                             )

                 # Embed chunks
@@ -176,8 +184,8 @@ class Indexer:
                 for e, v in zip(enriched, vectors):
                     e["vector"] = v

                 # Store
-                upsert_chunks(table, enriched)
-                total_chunks += num_chunks
+                stored = upsert_chunks(table, enriched)
+                total_chunks += stored
                 indexed_files += 1
             except Exception as exc:
                 errors.append({"file": str(filepath), "error": str(exc)})
@@ -249,9 +257,16 @@ class Indexer:
         db = get_db(self.config)
         if "obsidian_chunks" in db.list_tables():
             db.drop_table("obsidian_chunks")
+        # full_index is a generator — materialize it to get the final dict
         results = list(self.full_index())
-        return results[-1] if results else {"indexed_files": 0, "total_chunks": 0, "errors": []}
+        final = (
+            results[-1]
+            if results
+            else {"indexed_files": 0, "total_chunks": 0, "errors": []}
+        )
+        self._write_sync_result(
+            final["indexed_files"], final["total_chunks"], final["errors"]
+        )
+        return final

     def _sync_result_path(self) -> Path:
         # Use the same dev-data-dir convention as config.py
@@ -313,18 +328,19 @@ class Indexer:
                 # Log sensitive content access
                 for chunk in enriched:
                     from obsidian_rag import security
+
                     sensitivity = security.detect_sensitive(
-                        chunk['chunk_text'],
+                        chunk["chunk_text"],
                         self.config.security.sensitive_sections,
-                        self.config.memory.patterns
+                        self.config.memory.patterns,
                     )
-                    for category in ['health', 'financial', 'relations']:
+                    for category in ["health", "financial", "relations"]:
                         if sensitivity.get(category, False):
                             self.audit_logger.log_sensitive_access(
-                                str(chunk['source_file']),
+                                str(chunk["source_file"]),
                                 category,
-                                'index',
-                                {'chunk_id': chunk['chunk_id']}
+                                "index",
+                                {"chunk_id": chunk["chunk_id"]},
                             )

                 # Embed chunks
@@ -353,4 +369,3 @@ class Indexer:
         if not data_dir.exists() and not (project_root / "KnowledgeVault").exists():
             data_dir = Path(os.path.expanduser("~/.obsidian-rag"))
         return data_dir / "sync-result.json"
-


@@ -206,6 +206,7 @@ def _mock_config(tmp_path: Path) -> MagicMock:
     cfg.vault_path = str(tmp_path)
     cfg.indexing.chunk_size = 500
     cfg.indexing.chunk_overlap = 100
+    cfg.indexing.max_section_chars = 4000
     cfg.indexing.file_patterns = ["*.md"]
     cfg.indexing.deny_dirs = [".obsidian", ".trash", "zzz-Archive", ".git"]
     cfg.indexing.allow_dirs = []
@@ -248,3 +249,41 @@ def test_chunk_file_unstructured(tmp_path: Path):
     assert len(chunks) > 1
     assert all(c.section is None for c in chunks)
     assert chunks[0].chunk_index == 0
+
+
+def test_large_section_split_into_sub_chunks(tmp_path: Path):
+    """Large section (exceeding max_section_chars) is split via sliding window."""
+    vault = tmp_path / "Notes"
+    vault.mkdir()
+    fpath = vault / "2024-03-15-Podcast.md"
+    large_content = "word " * 3000  # ~15000 chars, exceeds MAX_SECTION_CHARS
+    fpath.write_text(f"# Episode Notes\n\n{large_content}")
+
+    cfg = _mock_config(tmp_path)
+    cfg.indexing.max_section_chars = 4000
+
+    chunks = chunk_file(fpath, fpath.read_text(), "2024-03-15T10:00:00Z", cfg)
+
+    # Large section should be split into multiple sub-chunks
+    assert len(chunks) > 1
+    # Each sub-chunk should preserve the section heading
+    for chunk in chunks:
+        assert chunk.section == "#Episode Notes", (
+            f"Expected #Episode Notes, got {chunk.section}"
+        )
+
+
+def test_small_section_kept_intact(tmp_path: Path):
+    """Small section (under max_section_chars) remains a single chunk."""
+    vault = tmp_path / "Notes"
+    vault.mkdir()
+    fpath = vault / "2024-03-15-Short.md"
+    fpath.write_text("# Notes\n\nShort content here.")
+
+    cfg = _mock_config(tmp_path)
+    cfg.indexing.max_section_chars = 4000
+
+    chunks = chunk_file(fpath, fpath.read_text(), "2024-03-15T10:00:00Z", cfg)
+
+    # Small section → single chunk
+    assert len(chunks) == 1
+    assert chunks[0].section == "#Notes"
+    assert chunks[0].text.strip().endswith("Short content here.")


@@ -98,7 +98,8 @@ export async function probeAll(config: ObsidianRagConfig): Promise<ProbeResult>
   if (indexExists) {
     try {
-      const syncPath = resolve(dbPath, "..", "sync-result.json");
+      const dataDir = resolveDataDir();
+      const syncPath = resolve(dataDir, "sync-result.json");
       if (existsSync(syncPath)) {
         const data = JSON.parse(readFileSync(syncPath, "utf-8"));
         lastSync = data.timestamp ?? null;
@@ -120,6 +121,17 @@ export async function probeAll(config: ObsidianRagConfig): Promise<ProbeResult>
   };
 }

+function resolveDataDir(): string {
+  const cwd = process.cwd();
+  const devDataDir = resolve(cwd, "obsidian-rag");
+  const devVaultMarker = resolve(cwd, "KnowledgeVault");
+  if (existsSync(devDataDir) || existsSync(devVaultMarker)) {
+    return devDataDir;
+  }
+  const home = process.env.HOME ?? process.env.USERPROFILE ?? "";
+  return resolve(home, ".obsidian-rag");
+}
+
 async function probeOllama(baseUrl: string): Promise<boolean> {
   try {
     const res = await fetch(`${baseUrl}/api/tags`, { signal: AbortSignal.timeout(3000) });


@@ -109,7 +109,7 @@ export function readSyncResult(config: ObsidianRagConfig): {
   total_chunks: number;
   errors: Array<{ file: string; error: string }>;
 } | null {
-  const dataDir = resolve(process.cwd(), ".obsidian-rag");
+  const dataDir = _resolveDataDir();
   const path = resolve(dataDir, "sync-result.json");
   if (!existsSync(path)) return null;
   try {
@@ -118,3 +118,14 @@ export function readSyncResult(config: ObsidianRagConfig): {
     return null;
   }
 }
+
+function _resolveDataDir(): string {
+  const cwd = process.cwd();
+  const devDataDir = resolve(cwd, "obsidian-rag");
+  const devVaultMarker = resolve(cwd, "KnowledgeVault");
+  if (existsSync(devDataDir) || existsSync(devVaultMarker)) {
+    return devDataDir;
+  }
+  const home = process.env.HOME ?? process.env.USERPROFILE ?? "";
+  return resolve(home, ".obsidian-rag");
+}


@@ -88,7 +88,7 @@ function defaults(): ObsidianRagConfig {
 }

 export function loadConfig(configPath?: string): ObsidianRagConfig {
-  const defaultPath = resolve(process.cwd(), ".obsidian-rag", "config.json");
+  const defaultPath = resolve(process.cwd(), "obsidian-rag", "config.json");
   const path = configPath ?? defaultPath;
   try {
     const raw = JSON.parse(readFileSync(path, "utf-8"));