clawfort/openspec/changes/archive/2026-02-13-p09-image-choices/design.md

## Context

ClawFort enriches news articles with royalty-free images via `fetch_royalty_free_image()` in `backend/news_service.py` (lines 297-344). The current implementation follows a priority chain:

1. **MCP Endpoint** (if configured) — POST `{"query": headline}` to custom endpoint
2. **Wikimedia Commons** — Search API with headline, returns first result
3. **Picsum** (default) — Deterministic URL from MD5 hash, returns random unrelated images

The problem: Picsum is the effective default, producing visually random images with no relevance to article content. Wikimedia search quality is inconsistent. The MCP endpoint infrastructure exists but requires external service configuration.
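
To illustrate why the Picsum fallback cannot be relevant, here is a minimal sketch of a hash-seeded URL. It assumes Picsum's `/seed/{seed}/{width}/{height}` URL form; the function name `picsum_url`, the 12-character seed, and the 1200x630 dimensions are hypothetical and may differ from what `backend/news_service.py` does.

```python
import hashlib


def picsum_url(headline: str, width: int = 1200, height: int = 630) -> str:
    """Map a headline to a stable Picsum URL (hypothetical helper).

    The seed depends only on the headline's bytes, so the same headline
    always yields the same image, but the image has no semantic link to
    the article's content.
    """
    # Assumption: truncate the MD5 hex digest to keep the URL short.
    seed = hashlib.md5(headline.encode("utf-8")).hexdigest()[:12]
    return f"https://picsum.photos/seed/{seed}/{width}/{height}"
```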

Three MCP-based image services are available and verified:
- **Pixabay MCP** (`zym9863/pixabay-mcp`) — npm package, `search_pixabay_images` tool
- **Unsplash MCP** (`cevatkerim/unsplash-mcp`) — Python/FastMCP, `search` tool
- **Pexels MCP** (`garylab/pexels-mcp-server`) — PyPI package, `photos_search` tool

## Goals / Non-Goals

**Goals:**
- Integrate Pixabay, Unsplash, and Pexels as first-class image providers via direct API calls
- Implement configurable provider priority chain with automatic fallback
- Improve query construction for better image relevance (keyword extraction)
- Maintain backward compatibility with existing MCP endpoint and Wikimedia/Picsum fallbacks
- Handle provider-specific attribution requirements

**Non-Goals:**
- Replacing the existing MCP endpoint mechanism (it remains as override option)
- Caching image search results (can be added later)
- User-facing image selection UI (future enhancement)
- Video retrieval from Pixabay/Pexels (images only)

## Decisions

### Decision 1: Direct API Integration vs MCP Server Dependency

**Choice:** Direct HTTP API calls to Pixabay/Unsplash/Pexels APIs

**Alternatives Considered:**
- *MCP Server Sidecar:* Run MCP servers as separate processes, call via MCP protocol
  - Rejected: Adds deployment complexity (multiple processes, stdio communication)
- *Existing MCP Endpoint Override:* Require users to run their own MCP bridge
  - Rejected: High friction, existing behavior preserved as optional override

**Rationale:** The provider APIs are simple REST endpoints. Direct `httpx` calls match the existing Wikimedia pattern, require no additional infrastructure, and keep the codebase consistent.

### Decision 2: Provider Priority Chain Architecture

**Choice:** Ordered provider list with per-provider enable/disable via environment variables

```
ROYALTY_IMAGE_PROVIDERS=pixabay,unsplash,pexels,wikimedia,picsum
PIXABAY_API_KEY=xxx
UNSPLASH_ACCESS_KEY=xxx
PEXELS_API_KEY=xxx
```

**Alternatives Considered:**
- *Single Provider Selection:* `ROYALTY_IMAGE_PROVIDER=pixabay` (current pattern)
  - Rejected: No fallback when provider fails or returns empty results
- *Hardcoded Priority:* Always try Pixabay → Unsplash → Pexels
  - Rejected: Users may prefer different ordering or want to disable specific providers

**Rationale:** Flexible ordering lets users prioritize by image quality preference, rate limits, or API cost. The fallback chain can still return an image when the primary provider fails or returns no results.
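
One way the ordered list could be resolved from the environment is sketched below. The function name `resolve_provider_chain`, the `picsum` default when the variable is unset, and the rule that a provider without its key is skipped are all assumptions for illustration, not confirmed behavior.

```python
import os
from typing import List, Mapping, Optional

KNOWN_PROVIDERS = ("pixabay", "unsplash", "pexels", "wikimedia", "picsum")

# Providers that need a credential, mapped to the variables listed above.
REQUIRED_KEYS = {
    "pixabay": "PIXABAY_API_KEY",
    "unsplash": "UNSPLASH_ACCESS_KEY",
    "pexels": "PEXELS_API_KEY",
}


def resolve_provider_chain(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the enabled providers in the order the user configured them."""
    env = os.environ if env is None else env
    # Assumption: fall back to Picsum alone when nothing is configured.
    raw = env.get("ROYALTY_IMAGE_PROVIDERS", "picsum")
    chain: List[str] = []
    for name in raw.split(","):
        name = name.strip().lower()
        if name not in KNOWN_PROVIDERS or name in chain:
            continue  # ignore typos and duplicate entries
        key_var = REQUIRED_KEYS.get(name)
        if key_var and not env.get(key_var, "").strip():
            continue  # assumption: a missing key disables the provider
        chain.append(name)
    return chain
```

With this shape, removing a provider from the list or leaving its key empty both disable it, so no separate on/off flag is needed.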

### Decision 3: Query Refinement Strategy

**Choice:** Remove stop words from the headline, keep the first 3-5 remaining keywords, and join them with spaces

**Alternatives Considered:**
- *Full Headline:* Pass entire headline to image APIs
  - Rejected: Long queries often return poor results; APIs work better with keywords
- *LLM Keyword Extraction:* Use Perplexity/OpenRouter to extract search terms
  - Rejected: Adds latency and API cost for marginal improvement
- *NLP Library:* Use spaCy or NLTK for keyword extraction
  - Rejected: Heavy dependency for simple task

**Rationale:** Simple regex-based extraction (remove common stop words, take first N significant words) is fast, dependency-free, and effective for image search.
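
A minimal sketch of this extraction follows. The stop-word list is a small illustrative sample, and the function name `build_image_query`, the `\w+` tokenizer, and the rule that drops single-character tokens are assumptions rather than the final implementation.

```python
import re

# Illustrative sample only; a real list would likely be longer.
STOP_WORDS = frozenset({
    "a", "an", "the", "and", "or", "but", "of", "in", "on", "at", "to",
    "for", "from", "by", "with", "as", "is", "are", "was", "were", "be",
    "it", "its", "this", "that", "has", "have", "had", "will", "how",
    "why", "what", "when", "who", "after", "over", "into",
})


def build_image_query(headline: str, max_keywords: int = 5) -> str:
    """Reduce a headline to its first few significant words."""
    words = re.findall(r"\w+", headline)
    keywords = [
        w for w in words
        # Single characters are dropped too (e.g. the "s" left by "'s").
        if len(w) > 1 and w.lower() not in STOP_WORDS
    ]
    return " ".join(keywords[:max_keywords])
```

For example, `build_image_query("OpenAI Announces a New Model for the Enterprise")` returns `"OpenAI Announces New Model Enterprise"`.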

### Decision 4: Attribution Handling

**Choice:** Provider-specific credit format stored in `summary_image_credit`

| Provider | Credit Format |
|----------|---------------|
| Pixabay | `"Photo by {user} on Pixabay"` |
| Unsplash | `"Photo by {user} on Unsplash"` (required by TOS) |
| Pexels | `"Photo by {photographer} on Pexels"` |
| Wikimedia | `"Wikimedia Commons"` (existing) |
| Picsum | `"Picsum Photos"` (existing) |

**Rationale:** Attribution requirements differ by provider, and Unsplash's TOS explicitly requires photographer credit. Using the same "Photo by ... on ..." format for Pixabay, Unsplash, and Pexels keeps credits consistent and human-readable.
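
The table above could be encoded as a small lookup, sketched here. The helper name `format_image_credit` and the fallback to the bare site name when no creator is returned are assumptions for illustration.

```python
from typing import Optional

# Providers whose credit names the creator.
ATTRIBUTED = {"pixabay": "Pixabay", "unsplash": "Unsplash", "pexels": "Pexels"}

# Providers with a fixed credit string (existing behavior per the table).
STATIC_CREDITS = {"wikimedia": "Wikimedia Commons", "picsum": "Picsum Photos"}


def format_image_credit(provider: str, creator: Optional[str] = None) -> str:
    """Build the string stored in summary_image_credit."""
    key = provider.strip().lower()
    if key in STATIC_CREDITS:
        return STATIC_CREDITS[key]
    site = ATTRIBUTED[key]  # unknown providers raise KeyError on purpose
    creator = (creator or "").strip()
    # Assumption: credit the site alone if the API returned no creator name.
    return f"Photo by {creator} on {site}" if creator else site
```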

## Risks / Trade-offs

| Risk | Mitigation |
|------|------------|
| API rate limits exhausted | Fallback chain continues to next provider; Picsum always succeeds as final fallback |
| API keys exposed in logs | Never log API keys; use `***` masking if debugging request URLs |
| Provider API changes | Isolate each provider in separate function; easy to update one without affecting others |
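| Slow image search adds latency | Each provider keeps the existing 15s timeout, so the worst case grows with chain length: 45s if the three new providers all time out, and more if later providers in the chain also time out |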
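| Empty search results | Fall back to the next provider; if every configured provider fails (possible only when Picsum is omitted from `ROYALTY_IMAGE_PROVIDERS`), use the existing article image or a placeholder |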
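
The key-masking mitigation matters most for Pixabay, whose key travels in the query string. A minimal sketch using only the standard library is shown below; the helper name `mask_url` and the set of parameter names treated as sensitive are assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: query parameter names that may carry credentials.
SENSITIVE_PARAMS = {"key", "api_key", "access_key", "client_id"}


def mask_url(url: str) -> str:
    """Return the URL with sensitive query values replaced by ***."""
    parts = urlsplit(url)
    masked = [
        (name, "***" if name.lower() in SENSITIVE_PARAMS else value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    # safe="*" keeps the mask readable instead of percent-encoding it.
    return urlunsplit(parts._replace(query=urlencode(masked, safe="*")))
```

Passing request URLs through a helper like this before logging keeps the debugging output useful without exposing the key. Header-based credentials (Unsplash, Pexels) would need the equivalent treatment if headers are ever logged.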

## Provider API Contracts

### Pixabay API
- **Endpoint:** `https://pixabay.com/api/`
- **Auth:** Query param `key={PIXABAY_API_KEY}`
- **Search:** `GET /?key={key}&q={query}&image_type=photo&per_page=3&safesearch=true`
- **Response:** `{"hits": [{"webformatURL": "...", "user": "..."}]}`
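
A sketch of the request parameters and response handling for this contract, kept free of network code so it can be tested in isolation. The names `pixabay_params` and `parse_pixabay` are hypothetical; the actual request would presumably be sent with `httpx` as in the existing Wikimedia code.

```python
from typing import Any, Dict, Optional, Tuple

PIXABAY_ENDPOINT = "https://pixabay.com/api/"


def pixabay_params(api_key: str, query: str) -> Dict[str, str]:
    """Query parameters matching the search contract above."""
    return {
        "key": api_key,
        "q": query,
        "image_type": "photo",
        "per_page": "3",
        "safesearch": "true",
    }


def parse_pixabay(payload: Dict[str, Any]) -> Optional[Tuple[str, str]]:
    """Return (image_url, user) for the first usable hit, else None."""
    for hit in payload.get("hits") or []:
        url = hit.get("webformatURL")
        if url:
            return url, hit.get("user") or ""
    return None  # empty result: caller moves on to the next provider
```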

### Unsplash API
- **Endpoint:** `https://api.unsplash.com/search/photos`
- **Auth:** Header `Authorization: Client-ID {UNSPLASH_ACCESS_KEY}`
- **Search:** `GET ?query={query}&per_page=3`
- **Response:** `{"results": [{"urls": {"regular": "..."}, "user": {"name": "..."}}]}`
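
The Unsplash contract differs mainly in its header-based auth and nested response fields. A sketch under the same assumptions (hypothetical names `unsplash_request` and `parse_unsplash`, no network code):

```python
from typing import Any, Dict, Optional, Tuple

UNSPLASH_ENDPOINT = "https://api.unsplash.com/search/photos"


def unsplash_request(access_key: str, query: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (headers, params) for the search contract above."""
    headers = {"Authorization": f"Client-ID {access_key}"}
    params = {"query": query, "per_page": "3"}
    return headers, params


def parse_unsplash(payload: Dict[str, Any]) -> Optional[Tuple[str, str]]:
    """Return (image_url, photographer_name) for the first usable result."""
    for item in payload.get("results") or []:
        url = (item.get("urls") or {}).get("regular")
        if url:
            name = (item.get("user") or {}).get("name") or ""
            return url, name
    return None
```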

### Pexels API
- **Endpoint:** `https://api.pexels.com/v1/search`
- **Auth:** Header `Authorization: {PEXELS_API_KEY}`
- **Search:** `GET ?query={query}&per_page=3`
- **Response:** `{"photos": [{"src": {"large": "..."}, "photographer": "..."}]}`
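
For Pexels the key goes into the `Authorization` header without a scheme prefix. A matching sketch, again with hypothetical helper names (`pexels_request`, `parse_pexels`) and no network code:

```python
from typing import Any, Dict, Optional, Tuple

PEXELS_ENDPOINT = "https://api.pexels.com/v1/search"


def pexels_request(api_key: str, query: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (headers, params) for the search contract above."""
    headers = {"Authorization": api_key}  # raw key, no "Bearer"/"Client-ID"
    params = {"query": query, "per_page": "3"}
    return headers, params


def parse_pexels(payload: Dict[str, Any]) -> Optional[Tuple[str, str]]:
    """Return (image_url, photographer) for the first usable photo."""
    for photo in payload.get("photos") or []:
        url = (photo.get("src") or {}).get("large")
        if url:
            return url, photo.get("photographer") or ""
    return None
```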

## Migration Plan

1. **Phase 1:** Add new provider functions (non-breaking)
   - Add `fetch_pixabay_image()`, `fetch_unsplash_image()`, `fetch_pexels_image()`
   - Add new config variables with empty defaults
   - Existing behavior unchanged

2. **Phase 2:** Integrate provider chain
   - Modify `fetch_royalty_free_image()` to iterate configured providers
   - Preserve existing MCP endpoint override as highest priority
   - Preserve Wikimedia and Picsum as fallback options

3. **Phase 3:** Documentation and rollout
   - Update README with new environment variables
   - Update `.env.example` with provider key placeholders

**Rollback:** Set `ROYALTY_IMAGE_PROVIDERS=picsum` to restore the previous default behavior.
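
The Phase 2 iteration could look roughly like the sketch below. It is synchronous for brevity, whereas the real `fetch_royalty_free_image()` may well be async. The name `fetch_from_chain`, the `(image_url, credit)` result tuple, and the `override` parameter standing in for the MCP endpoint are assumptions.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

ImageResult = Tuple[str, str]  # (image_url, credit)
Fetcher = Callable[[str], Optional[ImageResult]]


def fetch_from_chain(
    query: str,
    chain: Iterable[str],
    fetchers: Dict[str, Fetcher],
    override: Optional[Fetcher] = None,
) -> Optional[ImageResult]:
    """Try the override first, then each configured provider in order."""
    candidates = [override] if override is not None else []
    candidates += [fetchers[name] for name in chain if name in fetchers]
    for fetcher in candidates:
        try:
            result = fetcher(query)
        except Exception:
            # Sketch-level handling: timeouts, rate limits, and bad
            # responses all fall through to the next provider.
            result = None
        if result:
            return result
    return None  # caller keeps the existing article image or a placeholder
```

Because each provider is just a callable, a provider function can be replaced or stubbed in tests without touching the loop.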

## Open Questions

- Should we add retry logic per provider (currently single attempt each)?
- Should provider timeouts be configurable, or stay fixed at 15s?
- Should search results be cached by query hash to reduce API calls (currently a non-goal for this change)?