## Context

ClawFort enriches news articles with royalty-free images via `fetch_royalty_free_image()` in `backend/news_service.py` (lines 297-344). The current implementation follows a priority chain:

1. **MCP Endpoint** (if configured): POST `{"query": headline}` to a custom endpoint
2. **Wikimedia Commons**: Search API queried with the headline; returns the first result
3. **Picsum** (default): Deterministic URL derived from an MD5 hash; the image is stable per hash but unrelated to the article

The problem: Picsum is the effective default, so most articles get an arbitrary image with no relevance to their content. Wikimedia search quality is inconsistent. The MCP endpoint infrastructure exists but requires external service configuration.

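To illustrate why the Picsum fallback is deterministic yet irrelevant, here is a minimal sketch. It assumes the MD5 digest is computed over the headline and used with Picsum's `/seed/{seed}/{width}/{height}` route; the function name `picsum_url` and the default dimensions are hypothetical, and the actual code in `backend/news_service.py` may build the URL differently.

```python
import hashlib


def picsum_url(headline: str, width: int = 800, height: int = 450) -> str:
    """Build a stable Picsum URL for a headline (hypothetical helper)."""
    # Assumption: the seed is the MD5 hex digest of the headline text.
    seed = hashlib.md5(headline.encode("utf-8")).hexdigest()
    # The same seed always maps to the same image, but the image itself
    # has no semantic connection to the headline.
    return f"https://picsum.photos/seed/{seed}/{width}/{height}"
```

The same headline always yields the same URL, which explains the stable but unrelated images described above.
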
Three MCP-based image services are available and verified:
- **Pixabay MCP** (`zym9863/pixabay-mcp`): npm package, `search_pixabay_images` tool
- **Unsplash MCP** (`cevatkerim/unsplash-mcp`): Python/FastMCP, `search` tool
- **Pexels MCP** (`garylab/pexels-mcp-server`): PyPI package, `photos_search` tool

## Goals / Non-Goals

**Goals:**
- Integrate Pixabay, Unsplash, and Pexels as first-class image providers via direct API calls
- Implement a configurable provider priority chain with automatic fallback
- Improve query construction for better image relevance (keyword extraction)
- Maintain backward compatibility with the existing MCP endpoint and the Wikimedia/Picsum fallbacks
- Handle provider-specific attribution requirements

**Non-Goals:**
- Replacing the existing MCP endpoint mechanism (it remains as an override option)
- Caching image search results (can be added later)
- User-facing image selection UI (future enhancement)
- Video retrieval from Pixabay/Pexels (images only)

## Decisions

### Decision 1: Direct API Integration vs MCP Server Dependency

**Choice:** Direct HTTP API calls to the Pixabay, Unsplash, and Pexels APIs

**Alternatives Considered:**
- *MCP Server Sidecar:* Run MCP servers as separate processes, call via MCP protocol
  - Rejected: Adds deployment complexity (multiple processes, stdio communication)
- *Existing MCP Endpoint Override:* Require users to run their own MCP bridge
  - Rejected: High friction; the existing behavior is preserved as an optional override

**Rationale:** The provider APIs are simple REST endpoints. Direct `httpx` calls match the existing Wikimedia pattern, require no additional infrastructure, and keep the codebase consistent.

### Decision 2: Provider Priority Chain Architecture

**Choice:** Ordered provider list with per-provider enable/disable via environment variables

```
ROYALTY_IMAGE_PROVIDERS=pixabay,unsplash,pexels,wikimedia,picsum
PIXABAY_API_KEY=xxx
UNSPLASH_ACCESS_KEY=xxx
PEXELS_API_KEY=xxx
```

**Alternatives Considered:**
- *Single Provider Selection:* `ROYALTY_IMAGE_PROVIDER=pixabay` (current pattern)
  - Rejected: No fallback when the provider fails or returns empty results
- *Hardcoded Priority:* Always try Pixabay → Unsplash → Pexels
  - Rejected: Users may prefer a different ordering or want to disable specific providers

**Rationale:** Flexible ordering lets users prioritize by image quality preference, rate limits, or API cost. The fallback chain ensures an image is still found when the primary provider fails.

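The chain described above can be sketched as follows. This is a minimal, synchronous illustration rather than the actual implementation: `configured_providers`, `fetch_from_chain`, and the `registry` mapping are hypothetical names, and it assumes a provider is treated as disabled when it is absent from the registry (for example because its API key is empty).

```python
import os
from typing import Callable, List, Mapping, Optional, Tuple

ImageResult = Tuple[str, str]  # (image_url, credit)
Provider = Callable[[str], Optional[ImageResult]]

DEFAULT_ORDER = "pixabay,unsplash,pexels,wikimedia,picsum"


def configured_providers(env: Mapping[str, str] = os.environ) -> List[str]:
    """Parse ROYALTY_IMAGE_PROVIDERS into an ordered list of provider names."""
    raw = env.get("ROYALTY_IMAGE_PROVIDERS", DEFAULT_ORDER)
    return [name.strip().lower() for name in raw.split(",") if name.strip()]


def fetch_from_chain(
    query: str, order: List[str], registry: Mapping[str, Provider]
) -> Optional[ImageResult]:
    """Try each provider in order and return the first non-empty result."""
    for name in order:
        provider = registry.get(name)
        if provider is None:
            continue  # unknown or disabled provider (assumption: skip silently)
        try:
            result = provider(query)
        except Exception:
            continue  # a failing provider must not break the chain
        if result:
            return result
    return None
```

Keeping the registry separate from the ordering means a provider can be disabled either by removing it from `ROYALTY_IMAGE_PROVIDERS` or by not registering it when its key is missing.
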
### Decision 3: Query Refinement Strategy

**Choice:** Remove stop words from the headline, keep the first 3-5 remaining keywords, and join them with spaces

**Alternatives Considered:**
- *Full Headline:* Pass the entire headline to image APIs
  - Rejected: Long queries often return poor results; the APIs work better with keywords
- *LLM Keyword Extraction:* Use Perplexity/OpenRouter to extract search terms
  - Rejected: Adds latency and API cost for marginal improvement
- *NLP Library:* Use spaCy or NLTK for keyword extraction
  - Rejected: Heavy dependency for a simple task

**Rationale:** Simple regex-based extraction (remove common stop words, take the first N significant words) is fast, dependency-free, and effective for image search.

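A minimal sketch of the regex-based extraction described above. The function name `build_image_query`, the stop-word list, and the choice to lowercase and drop single-character tokens are all assumptions made for illustration; the real list would likely be longer.

```python
import re

# Assumption: a small illustrative stop-word list, not the project's actual one.
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has",
    "have", "in", "is", "it", "its", "of", "on", "or", "that", "the", "this",
    "to", "was", "were", "will", "with",
})


def build_image_query(headline: str, max_keywords: int = 5) -> str:
    """Reduce a headline to its first few significant words."""
    # Tokenize on runs of letters and digits; punctuation acts as a separator.
    words = re.findall(r"[A-Za-z0-9]+", headline.lower())
    # Drop stop words and single-character fragments such as a possessive "s".
    keywords = [w for w in words if w not in STOP_WORDS and len(w) > 1]
    return " ".join(keywords[:max_keywords])
```

For example, `build_image_query("The Fed Raises Interest Rates as Inflation Cools in the US")` returns `"fed raises interest rates inflation"`.
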
### Decision 4: Attribution Handling

**Choice:** Provider-specific credit format stored in `summary_image_credit`

| Provider | Credit Format |
|----------|---------------|
| Pixabay | `"Photo by {user} on Pixabay"` |
| Unsplash | `"Photo by {user} on Unsplash"` (required by TOS) |
| Pexels | `"Photo by {photographer} on Pexels"` |
| Wikimedia | `"Wikimedia Commons"` (existing) |
| Picsum | `"Picsum Photos"` (existing) |

**Rationale:** Each provider has specific attribution requirements. Unsplash TOS requires photographer credit. The format is consistent and human-readable.

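The credit table above could be encoded roughly like this. `format_credit` is a hypothetical helper, and raising `ValueError` when a photographer name is missing is an assumption (it lets a calling chain treat the result as a provider failure instead of storing an incomplete credit).

```python
from typing import Optional

# Templates for providers that credit an individual author.
CREDIT_TEMPLATES = {
    "pixabay": "Photo by {name} on Pixabay",
    "unsplash": "Photo by {name} on Unsplash",
    "pexels": "Photo by {name} on Pexels",
}

# Existing fixed credits for the legacy fallbacks.
STATIC_CREDITS = {
    "wikimedia": "Wikimedia Commons",
    "picsum": "Picsum Photos",
}


def format_credit(provider: str, name: Optional[str] = None) -> str:
    """Return the value to store in summary_image_credit."""
    if provider in STATIC_CREDITS:
        return STATIC_CREDITS[provider]
    if provider not in CREDIT_TEMPLATES:
        raise ValueError(f"unknown provider: {provider}")
    cleaned = (name or "").strip()
    if not cleaned:
        # Assumption: refuse to build a credit without an author name.
        raise ValueError(f"missing author name for {provider}")
    return CREDIT_TEMPLATES[provider].format(name=cleaned)
```
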
## Risks / Trade-offs

| Risk | Mitigation |
|------|------------|
| API rate limits exhausted | Fallback chain continues to the next provider; Picsum always succeeds as the final fallback |
| API keys exposed in logs | Never log API keys; mask them as `***` when logging request URLs for debugging |
| Provider API changes | Isolate each provider in a separate function so one can be updated without affecting the others |
| Slow image search adds latency | Each provider keeps the existing 15s timeout, so the worst case grows with chain length: 45s if the three new providers all time out, longer if later fallbacks also stall |
| Empty search results | Fall back to the next provider; if all fail, use the existing article image or a placeholder |

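As one way to apply the `***` masking mitigation, the sketch below redacts key-like query parameters before a URL is logged. `mask_url` and the `SENSITIVE_PARAMS` set are hypothetical; of the three new providers only Pixabay passes its key in the query string, while Unsplash and Pexels send it in the `Authorization` header, which should simply never be logged.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameter names that may carry credentials.
SENSITIVE_PARAMS = frozenset({"key", "api_key", "client_id", "access_key"})


def mask_url(url: str) -> str:
    """Return the URL with sensitive query values replaced by ***."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    masked = [
        (name, "***" if name.lower() in SENSITIVE_PARAMS else value)
        for name, value in pairs
    ]
    # safe="*" keeps the mask readable instead of percent-encoding it.
    return urlunsplit(parts._replace(query=urlencode(masked, safe="*")))
```
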
## Provider API Contracts

### Pixabay API
- **Endpoint:** `https://pixabay.com/api/`
- **Auth:** Query param `key={PIXABAY_API_KEY}`
- **Search:** `GET /?key={key}&q={query}&image_type=photo&per_page=3&safesearch=true`
- **Response:** `{"hits": [{"webformatURL": "...", "user": "..."}]}`

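A minimal sketch of the Pixabay contract above, split into pure request-building and response-parsing helpers so it can be exercised without network access. The names `pixabay_params` and `parse_pixabay` are hypothetical; the HTTP call itself is left to the caller.

```python
from typing import Any, Dict, Mapping, Optional, Tuple

PIXABAY_ENDPOINT = "https://pixabay.com/api/"


def pixabay_params(query: str, api_key: str) -> Dict[str, str]:
    """Query parameters for a photo search, as listed in the contract."""
    return {
        "key": api_key,
        "q": query,
        "image_type": "photo",
        "per_page": "3",
        "safesearch": "true",
    }


def parse_pixabay(payload: Mapping[str, Any]) -> Optional[Tuple[str, str]]:
    """Return (image_url, user) for the first usable hit, or None."""
    for hit in payload.get("hits") or []:
        url = hit.get("webformatURL")
        if url:
            return url, hit.get("user", "")
    return None
```

With `httpx`, the request would be along the lines of `httpx.get(PIXABAY_ENDPOINT, params=pixabay_params(query, key), timeout=15.0)`, with the decoded JSON passed to `parse_pixabay`.
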
### Unsplash API
- **Endpoint:** `https://api.unsplash.com/search/photos`
- **Auth:** Header `Authorization: Client-ID {UNSPLASH_ACCESS_KEY}`
- **Search:** `GET ?query={query}&per_page=3`
- **Response:** `{"results": [{"urls": {"regular": "..."}, "user": {"name": "..."}}]}`

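The Unsplash contract differs mainly in its header-based auth and nested response fields. The sketch below shows both under the same assumptions as before: `unsplash_request` and `parse_unsplash` are hypothetical names, and the caller performs the HTTP request.

```python
from typing import Any, Dict, Mapping, Optional, Tuple

UNSPLASH_ENDPOINT = "https://api.unsplash.com/search/photos"


def unsplash_request(query: str, access_key: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (headers, params) for a photo search."""
    headers = {"Authorization": f"Client-ID {access_key}"}
    params = {"query": query, "per_page": "3"}
    return headers, params


def parse_unsplash(payload: Mapping[str, Any]) -> Optional[Tuple[str, str]]:
    """Return (image_url, photographer_name) for the first usable result."""
    for photo in payload.get("results") or []:
        url = (photo.get("urls") or {}).get("regular")
        if url:
            name = (photo.get("user") or {}).get("name", "")
            return url, name
    return None
```
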
### Pexels API
- **Endpoint:** `https://api.pexels.com/v1/search`
- **Auth:** Header `Authorization: {PEXELS_API_KEY}`
- **Search:** `GET ?query={query}&per_page=3`
- **Response:** `{"photos": [{"src": {"large": "..."}, "photographer": "..."}]}`

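For Pexels, the key goes into the `Authorization` header without a scheme prefix, and the photographer name is a top-level field of each photo. A sketch with hypothetical helper names `pexels_request` and `parse_pexels`:

```python
from typing import Any, Dict, Mapping, Optional, Tuple

PEXELS_ENDPOINT = "https://api.pexels.com/v1/search"


def pexels_request(query: str, api_key: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (headers, params) for a photo search."""
    headers = {"Authorization": api_key}  # bare key, no "Bearer" or "Client-ID"
    params = {"query": query, "per_page": "3"}
    return headers, params


def parse_pexels(payload: Mapping[str, Any]) -> Optional[Tuple[str, str]]:
    """Return (image_url, photographer) for the first usable photo."""
    for photo in payload.get("photos") or []:
        url = (photo.get("src") or {}).get("large")
        if url:
            return url, photo.get("photographer", "")
    return None
```
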
## Migration Plan

1. **Phase 1:** Add new provider functions (non-breaking)
   - Add `fetch_pixabay_image()`, `fetch_unsplash_image()`, `fetch_pexels_image()`
   - Add new config variables with empty defaults
   - Existing behavior unchanged

2. **Phase 2:** Integrate provider chain
   - Modify `fetch_royalty_free_image()` to iterate over the configured providers
   - Preserve the existing MCP endpoint override as highest priority
   - Preserve Wikimedia and Picsum as fallback options

3. **Phase 3:** Documentation and rollout
   - Update README with the new environment variables
   - Update `.env.example` with provider key placeholders

**Rollback:** Set `ROYALTY_IMAGE_PROVIDERS=picsum` to restore the original default behavior.

## Open Questions

- Should we add retry logic per provider (currently a single attempt each)?
- Should provider timeouts be configurable, or stay fixed at 15s?
- Should search results be cached by query hash to reduce API calls?