09-image-choices

2026-02-13 01:04:49 -05:00
parent 88a5540b7d
commit c8f98c54c9
11 changed files with 798 additions and 32 deletions

.env

@@ -4,7 +4,10 @@ OPENROUTER_API_KEY=sk-or-v1-ef54151dcbc69e380a890d35ad25b533b12abb753802016d613f
UMAMI_SCRIPT_URL=https://wa.santhoshj.com/script.js UMAMI_SCRIPT_URL=https://wa.santhoshj.com/script.js
UMAMI_WEBSITE_ID=b4315ab2-3075-44b7-b91a-d08497771c14 UMAMI_WEBSITE_ID=b4315ab2-3075-44b7-b91a-d08497771c14
RETENTION_DAYS=30 RETENTION_DAYS=30
ROYALTY_IMAGE_PROVIDER=picsum
ROYALTY_IMAGE_MCP_ENDPOINT= ROYALTY_IMAGE_MCP_ENDPOINT=
ROYALTY_IMAGE_API_KEY= ROYALTY_IMAGE_API_KEY=
SUMMARY_LENGTH_SCALE=3 ROYALTY_IMAGE_PROVIDERS=pixabay,unsplash,pexels,wikimedia,picsum
PIXABAY_API_KEY=54637577-dbef68c927eec6553190fa4dc
UNSPLASH_ACCESS_KEY=
PEXELS_API_KEY=
SUMMARY_LENGTH_SCALE=3

@@ -7,4 +7,8 @@ RETENTION_DAYS=30
ROYALTY_IMAGE_PROVIDER=picsum ROYALTY_IMAGE_PROVIDER=picsum
ROYALTY_IMAGE_MCP_ENDPOINT= ROYALTY_IMAGE_MCP_ENDPOINT=
ROYALTY_IMAGE_API_KEY= ROYALTY_IMAGE_API_KEY=
ROYALTY_IMAGE_PROVIDERS=pixabay,unsplash,pexels,wikimedia,picsum
PIXABAY_API_KEY=
UNSPLASH_ACCESS_KEY=
PEXELS_API_KEY=
SUMMARY_LENGTH_SCALE=3 SUMMARY_LENGTH_SCALE=3

@@ -141,11 +141,56 @@ Analytics loading is consent-gated:
| `RETENTION_DAYS` | No | `30` | Days to keep news before archiving | | `RETENTION_DAYS` | No | `30` | Days to keep news before archiving |
| `UMAMI_SCRIPT_URL` | No | — | Umami analytics script URL | | `UMAMI_SCRIPT_URL` | No | — | Umami analytics script URL |
| `UMAMI_WEBSITE_ID` | No | — | Umami website tracking ID | | `UMAMI_WEBSITE_ID` | No | — | Umami website tracking ID |
| `ROYALTY_IMAGE_PROVIDER` | No | `picsum` | Royalty-free image source (`picsum`, `wikimedia`, or MCP) | | `ROYALTY_IMAGE_PROVIDER` | No | `picsum` | Legacy: single provider selection (deprecated, use `ROYALTY_IMAGE_PROVIDERS`) |
| `ROYALTY_IMAGE_MCP_ENDPOINT` | No | — | MCP endpoint for image retrieval (preferred when set) | | `ROYALTY_IMAGE_PROVIDERS` | No | `pixabay,unsplash,pexels,wikimedia,picsum` | Comma-separated provider priority chain |
| `ROYALTY_IMAGE_MCP_ENDPOINT` | No | — | MCP endpoint for image retrieval (highest priority when set) |
| `ROYALTY_IMAGE_API_KEY` | No | — | Optional API key for image provider integrations | | `ROYALTY_IMAGE_API_KEY` | No | — | Optional API key for image provider integrations |
| `PIXABAY_API_KEY` | No | — | Pixabay API key (get from pixabay.com/api/docs) |
| `UNSPLASH_ACCESS_KEY` | No | — | Unsplash API access key (get from unsplash.com/developers) |
| `PEXELS_API_KEY` | No | — | Pexels API key (get from pexels.com/api) |
| `SUMMARY_LENGTH_SCALE` | No | `3` | Summary detail level from `1` (short) to `5` (long) | | `SUMMARY_LENGTH_SCALE` | No | `3` | Summary detail level from `1` (short) to `5` (long) |
## Image Provider Configuration
ClawFort retrieves royalty-free images for news articles using a configurable provider chain.
### Provider Priority
Providers are tried in order until one returns a valid image:
1. **MCP Endpoint** (if `ROYALTY_IMAGE_MCP_ENDPOINT` is set) — Custom endpoint, highest priority
2. **Provider Chain** (from `ROYALTY_IMAGE_PROVIDERS`) — Comma-separated list, tried in order
Default chain: `pixabay,unsplash,pexels,wikimedia,picsum`
### Supported Providers
| Provider | API Key Variable | Attribution Format |
|----------|------------------|-------------------|
| Pixabay | `PIXABAY_API_KEY` | "Photo by {user} on Pixabay" |
| Unsplash | `UNSPLASH_ACCESS_KEY` | "Photo by {name} on Unsplash" (required by TOS) |
| Pexels | `PEXELS_API_KEY` | "Photo by {photographer} on Pexels" |
| Wikimedia | None (no key needed) | "Wikimedia Commons" |
| Picsum | None (always available) | "Picsum Photos" |
### Configuration Examples
```bash
# Use Unsplash as primary, fall back to Pexels, then Picsum
ROYALTY_IMAGE_PROVIDERS=unsplash,pexels,picsum
UNSPLASH_ACCESS_KEY=your-unsplash-key
PEXELS_API_KEY=your-pexels-key
# Use only Pixabay
ROYALTY_IMAGE_PROVIDERS=pixabay
PIXABAY_API_KEY=your-pixabay-key
# Skip the providers that need API keys; use only keyless sources
ROYALTY_IMAGE_PROVIDERS=wikimedia,picsum
```
Providers without configured API keys are automatically skipped.
## Architecture ## Architecture
- **Backend**: Python (FastAPI) + SQLAlchemy + APScheduler - **Backend**: Python (FastAPI) + SQLAlchemy + APScheduler

@@ -13,6 +13,12 @@ UMAMI_WEBSITE_ID = os.getenv("UMAMI_WEBSITE_ID", "")
ROYALTY_IMAGE_MCP_ENDPOINT = os.getenv("ROYALTY_IMAGE_MCP_ENDPOINT", "") ROYALTY_IMAGE_MCP_ENDPOINT = os.getenv("ROYALTY_IMAGE_MCP_ENDPOINT", "")
ROYALTY_IMAGE_API_KEY = os.getenv("ROYALTY_IMAGE_API_KEY", "") ROYALTY_IMAGE_API_KEY = os.getenv("ROYALTY_IMAGE_API_KEY", "")
ROYALTY_IMAGE_PROVIDER = os.getenv("ROYALTY_IMAGE_PROVIDER", "picsum") ROYALTY_IMAGE_PROVIDER = os.getenv("ROYALTY_IMAGE_PROVIDER", "picsum")
# Use the default chain when the variable is unset or set to an empty string
ROYALTY_IMAGE_PROVIDERS = (
    os.getenv("ROYALTY_IMAGE_PROVIDERS", "").strip()
    or "pixabay,unsplash,pexels,wikimedia,picsum"
)
PIXABAY_API_KEY = os.getenv("PIXABAY_API_KEY", "")
UNSPLASH_ACCESS_KEY = os.getenv("UNSPLASH_ACCESS_KEY", "")
PEXELS_API_KEY = os.getenv("PEXELS_API_KEY", "")
_summary_length_raw = int(os.getenv("SUMMARY_LENGTH_SCALE", "3")) _summary_length_raw = int(os.getenv("SUMMARY_LENGTH_SCALE", "3"))
SUMMARY_LENGTH_SCALE = max(1, min(5, _summary_length_raw)) SUMMARY_LENGTH_SCALE = max(1, min(5, _summary_length_raw))

@@ -3,6 +3,7 @@ import hashlib
import json import json
import logging import logging
import os import os
import re
import time import time
from io import BytesIO from io import BytesIO
from urllib.parse import quote_plus from urllib.parse import quote_plus
@@ -294,7 +295,299 @@ def build_fallback_summary(summary: str, source_url: str | None) -> dict:
} }
# Stop words to remove from image search queries
_STOP_WORDS = frozenset(
[
"a",
"an",
"the",
"and",
"or",
"but",
"in",
"on",
"at",
"to",
"for",
"of",
"with",
"by",
"from",
"as",
"is",
"was",
"are",
"were",
"been",
"be",
"have",
"has",
"had",
"do",
"does",
"did",
"will",
"would",
"could",
"should",
"may",
"might",
"must",
"shall",
"can",
"need",
"it",
"its",
"this",
"that",
"these",
"those",
"i",
"you",
"he",
"she",
"we",
"they",
"what",
"which",
"who",
"whom",
"how",
"when",
"where",
"why",
"all",
"each",
"every",
"both",
"few",
"more",
"most",
"other",
"some",
"such",
"no",
"nor",
"not",
"only",
"own",
"same",
"so",
"than",
"too",
"very",
"just",
"also",
"now",
"here",
"there",
"about",
"after",
"before",
"above",
"below",
"between",
"into",
"through",
"during",
"under",
"again",
"further",
"then",
"once",
"announces",
"announced",
"says",
"said",
"reports",
"reported",
"reveals",
"revealed",
"launches",
"launched",
"introduces",
"introduced",
]
)
def extract_image_keywords(headline: str) -> str:
    """Extract relevant keywords from headline for image search.

    - Removes stop words (articles, prepositions, common verbs)
    - Limits to max 5 significant words
    - Handles edge cases (empty, only stop words, special characters)
    """
    if not headline or not headline.strip():
        return "news technology"
    # Normalize: remove special characters, keep alphanumeric and spaces
    cleaned = re.sub(r"[^\w\s]", " ", headline)
    # Split into words and lowercase
    words = cleaned.lower().split()
    # Filter out stop words and very short words
    # (note: len(w) > 2 also drops two-letter terms such as "ai")
    keywords = [w for w in words if w not in _STOP_WORDS and len(w) > 2]
    # Limit to first 5 significant keywords
    keywords = keywords[:5]
    if not keywords:
        return "news technology"
    return " ".join(keywords)
async def fetch_pixabay_image(query: str) -> tuple[str | None, str | None]:
    """Fetch image from Pixabay API."""
    if not config.PIXABAY_API_KEY:
        return None, None
    try:
        encoded_query = quote_plus(query)
        url = (
            f"https://pixabay.com/api/"
            f"?key={config.PIXABAY_API_KEY}"
            f"&q={encoded_query}"
            f"&image_type=photo&per_page=3&safesearch=true"
        )
        async with httpx.AsyncClient(timeout=15.0) as client:
            response = await client.get(url)
            response.raise_for_status()
            data = response.json()
        hits = data.get("hits", [])
        if hits:
            image_url = hits[0].get("webformatURL")
            user = hits[0].get("user", "Unknown")
            if image_url:
                return str(image_url), f"Photo by {user} on Pixabay"
    except Exception:
        logger.exception("Pixabay image retrieval failed")
    return None, None
async def fetch_unsplash_image(query: str) -> tuple[str | None, str | None]:
    """Fetch image from Unsplash API."""
    if not config.UNSPLASH_ACCESS_KEY:
        return None, None
    try:
        encoded_query = quote_plus(query)
        url = f"https://api.unsplash.com/search/photos?query={encoded_query}&per_page=3"
        headers = {"Authorization": f"Client-ID {config.UNSPLASH_ACCESS_KEY}"}
        async with httpx.AsyncClient(timeout=15.0) as client:
            response = await client.get(url, headers=headers)
            response.raise_for_status()
            data = response.json()
        results = data.get("results", [])
        if results:
            image_url = results[0].get("urls", {}).get("regular")
            user_name = results[0].get("user", {}).get("name", "Unknown")
            if image_url:
                return str(image_url), f"Photo by {user_name} on Unsplash"
    except Exception:
        logger.exception("Unsplash image retrieval failed")
    return None, None
async def fetch_pexels_image(query: str) -> tuple[str | None, str | None]:
    """Fetch image from Pexels API."""
    if not config.PEXELS_API_KEY:
        return None, None
    try:
        encoded_query = quote_plus(query)
        url = f"https://api.pexels.com/v1/search?query={encoded_query}&per_page=3"
        headers = {"Authorization": config.PEXELS_API_KEY}
        async with httpx.AsyncClient(timeout=15.0) as client:
            response = await client.get(url, headers=headers)
            response.raise_for_status()
            data = response.json()
        photos = data.get("photos", [])
        if photos:
            image_url = photos[0].get("src", {}).get("large")
            photographer = photos[0].get("photographer", "Unknown")
            if image_url:
                return str(image_url), f"Photo by {photographer} on Pexels"
    except Exception:
        logger.exception("Pexels image retrieval failed")
    return None, None
async def fetch_wikimedia_image(query: str) -> tuple[str | None, str | None]:
    """Fetch image from Wikimedia Commons."""
    try:
        encoded_query = quote_plus(query[:120])
        search_url = (
            "https://commons.wikimedia.org/w/api.php"
            "?action=query&format=json&generator=search&gsrnamespace=6&gsrlimit=1"
            f"&gsrsearch={encoded_query}&prop=imageinfo&iiprop=url"
        )
        async with httpx.AsyncClient(
            timeout=15.0,
            headers={"User-Agent": "ClawFortBot/1.0 (news image enrichment)"},
        ) as client:
            response = await client.get(search_url)
            response.raise_for_status()
            data = response.json()
        pages = data.get("query", {}).get("pages", {})
        if pages:
            first_page = next(iter(pages.values()))
            infos = first_page.get("imageinfo", [])
            if infos:
                url = infos[0].get("url")
                if url:
                    return str(url), "Wikimedia Commons"
    except Exception:
        logger.exception("Wikimedia image retrieval failed")
    return None, None
async def fetch_picsum_image(query: str) -> tuple[str | None, str | None]:
    """Generate deterministic Picsum image URL (always succeeds)."""
    seed = hashlib.md5(query.encode("utf-8")).hexdigest()[:12]
    return f"https://picsum.photos/seed/{seed}/1200/630", "Picsum Photos"
# Provider registry: maps provider names to (fetch_function, is_enabled),
# where is_enabled() reports whether the provider is usable with the current config
_PROVIDER_REGISTRY: dict[str, tuple] = {
    "pixabay": (fetch_pixabay_image, lambda: bool(config.PIXABAY_API_KEY)),
    "unsplash": (fetch_unsplash_image, lambda: bool(config.UNSPLASH_ACCESS_KEY)),
    "pexels": (fetch_pexels_image, lambda: bool(config.PEXELS_API_KEY)),
    "wikimedia": (fetch_wikimedia_image, lambda: True),  # No API key required
    "picsum": (fetch_picsum_image, lambda: True),  # Always available
}
def get_enabled_providers() -> list[tuple[str, callable]]:
    """Get ordered list of enabled providers based on config and available API keys."""
    provider_names = [
        p.strip().lower() for p in config.ROYALTY_IMAGE_PROVIDERS.split(",") if p.strip()
    ]
    enabled = []
    for name in provider_names:
        if name in _PROVIDER_REGISTRY:
            fetch_fn, is_enabled = _PROVIDER_REGISTRY[name]
            if is_enabled():
                enabled.append((name, fetch_fn))
    return enabled
 async def fetch_royalty_free_image(query: str) -> tuple[str | None, str | None]:
+    """Fetch royalty-free image using provider chain with fallback."""
+    # MCP endpoint takes highest priority if configured
     if config.ROYALTY_IMAGE_MCP_ENDPOINT:
         try:
             async with httpx.AsyncClient(timeout=15.0) as client:
@@ -311,35 +604,17 @@ async def fetch_royalty_free_image(query: str) -> tuple[str | None, str | None]:
         except Exception:
             logger.exception("MCP image retrieval failed")
-    if config.ROYALTY_IMAGE_PROVIDER.lower() == "wikimedia":
-        try:
-            encoded_query = quote_plus(query[:120])
-            search_url = (
-                "https://commons.wikimedia.org/w/api.php"
-                "?action=query&format=json&generator=search&gsrnamespace=6&gsrlimit=1"
-                f"&gsrsearch={encoded_query}&prop=imageinfo&iiprop=url"
-            )
-            async with httpx.AsyncClient(
-                timeout=15.0,
-                headers={"User-Agent": "ClawFortBot/1.0 (news image enrichment)"},
-            ) as client:
-                response = await client.get(search_url)
-                response.raise_for_status()
-                data = response.json()
-            pages = data.get("query", {}).get("pages", {})
-            if pages:
-                first_page = next(iter(pages.values()))
-                infos = first_page.get("imageinfo", [])
-                if infos:
-                    url = infos[0].get("url")
-                    if url:
-                        return str(url), "Wikimedia Commons"
-        except Exception:
-            logger.exception("Wikimedia image retrieval failed")
-    if config.ROYALTY_IMAGE_PROVIDER.lower() == "picsum":
-        seed = hashlib.md5(query.encode("utf-8")).hexdigest()[:12]
-        return f"https://picsum.photos/seed/{seed}/1200/630", "Picsum Photos"
+    # Extract keywords for better image search
+    refined_query = extract_image_keywords(query)
+    # Try each enabled provider in order
+    for provider_name, fetch_fn in get_enabled_providers():
+        try:
+            image_url, credit = await fetch_fn(refined_query)
+            if image_url:
+                return image_url, credit
+        except Exception:
+            logger.exception("%s image retrieval failed", provider_name.capitalize())
     return None, None

@@ -0,0 +1,144 @@
## Context
ClawFort enriches news articles with royalty-free images via `fetch_royalty_free_image()` in `backend/news_service.py` (lines 297-344). The current implementation follows a priority chain:
1. **MCP Endpoint** (if configured) — POST `{"query": headline}` to custom endpoint
2. **Wikimedia Commons** — Search API with headline, returns first result
3. **Picsum** (default) — Deterministic URL from MD5 hash, returns random unrelated images
The problem: Picsum is the effective default, producing visually random images with no relevance to article content. Wikimedia search quality is inconsistent. The MCP endpoint infrastructure exists but requires external service configuration.
Three MCP-based image services are available and verified:
- **Pixabay MCP** (`zym9863/pixabay-mcp`) — npm package, `search_pixabay_images` tool
- **Unsplash MCP** (`cevatkerim/unsplash-mcp`) — Python/FastMCP, `search` tool
- **Pexels MCP** (`garylab/pexels-mcp-server`) — PyPI package, `photos_search` tool
## Goals / Non-Goals
**Goals:**
- Integrate Pixabay, Unsplash, and Pexels as first-class image providers via direct API calls
- Implement configurable provider priority chain with automatic fallback
- Improve query construction for better image relevance (keyword extraction)
- Maintain backward compatibility with existing MCP endpoint and Wikimedia/Picsum fallbacks
- Handle provider-specific attribution requirements
**Non-Goals:**
- Replacing the existing MCP endpoint mechanism (it remains as override option)
- Caching image search results (can be added later)
- User-facing image selection UI (future enhancement)
- Video retrieval from Pixabay/Pexels (images only)
## Decisions
### Decision 1: Direct API Integration vs MCP Server Dependency
**Choice:** Direct HTTP API calls to Pixabay/Unsplash/Pexels APIs
**Alternatives Considered:**
- *MCP Server Sidecar:* Run MCP servers as separate processes, call via MCP protocol
- Rejected: Adds deployment complexity (multiple processes, stdio communication)
- *Existing MCP Endpoint Override:* Require users to run their own MCP bridge
- Rejected: High friction, existing behavior preserved as optional override
**Rationale:** The provider APIs are simple REST endpoints. Direct `httpx` calls match the existing Wikimedia pattern, require no additional infrastructure, and keep the codebase consistent.
### Decision 2: Provider Priority Chain Architecture
**Choice:** Ordered provider list with per-provider enable/disable via environment variables
```
ROYALTY_IMAGE_PROVIDERS=pixabay,unsplash,pexels,wikimedia,picsum
PIXABAY_API_KEY=xxx
UNSPLASH_ACCESS_KEY=xxx
PEXELS_API_KEY=xxx
```
**Alternatives Considered:**
- *Single Provider Selection:* `ROYALTY_IMAGE_PROVIDER=pixabay` (current pattern)
- Rejected: No fallback when provider fails or returns empty results
- *Hardcoded Priority:* Always try Pixabay → Unsplash → Pexels
- Rejected: Users may prefer different ordering or want to disable specific providers
**Rationale:** Flexible ordering lets users prioritize by image quality preference, rate limits, or API cost. Fallback chain ensures images are found even when primary provider fails.
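To make the ordering and skip rules concrete, here is a minimal sketch of how a comma-separated chain could be resolved into usable providers. The names `resolve_chain`, `KEYLESS`, and `EXAMPLE_KEYS` are illustrative assumptions, not identifiers from the codebase, and the key values are placeholders.

```python
# Hypothetical stand-in for configured API keys; values are placeholders.
EXAMPLE_KEYS = {"pixabay": "", "unsplash": "demo-key", "pexels": ""}

# Providers assumed to need no API key.
KEYLESS = {"wikimedia", "picsum"}


def resolve_chain(raw: str, keys: dict) -> list:
    """Return provider names in configured order, skipping unusable ones."""
    names = [p.strip().lower() for p in raw.split(",") if p.strip()]
    chain = []
    for name in names:
        # Keep keyless providers, and keyed providers whose key is non-empty.
        # Unknown names have no key entry and are therefore dropped.
        if name in KEYLESS or keys.get(name):
            chain.append(name)
    return chain


default_chain = resolve_chain("pixabay,unsplash,pexels,wikimedia,picsum", EXAMPLE_KEYS)
```

With only the Unsplash key set, the default chain collapses to Unsplash, Wikimedia, Picsum, in that order. An empty string yields an empty chain, so a caller would need its own default for that case.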
### Decision 3: Query Refinement Strategy
**Choice:** Extract first 3-5 keywords from headline, remove stop words, join with spaces
**Alternatives Considered:**
- *Full Headline:* Pass entire headline to image APIs
- Rejected: Long queries often return poor results; APIs work better with keywords
- *LLM Keyword Extraction:* Use Perplexity/OpenRouter to extract search terms
- Rejected: Adds latency and API cost for marginal improvement
- *NLP Library:* Use spaCy or NLTK for keyword extraction
- Rejected: Heavy dependency for simple task
**Rationale:** Simple regex-based extraction (remove common stop words, take first N significant words) is fast, dependency-free, and effective for image search.
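The strategy can be sketched in a few lines. This is an illustration under stated assumptions: `extract_keywords` and `STOP_WORDS` are hypothetical names, the stop-word set is heavily abbreviated, and the two-character minimum length mirrors the filter in `extract_image_keywords()`. One consequence worth noticing is that lowercasing plus the length filter removes short tokens such as "AI" and the "5" in "GPT-5".

```python
import re

# Abbreviated stop-word set for illustration only; the real list is much longer.
STOP_WORDS = frozenset(
    {"a", "an", "the", "and", "or", "but", "of", "in", "with", "from", "how", "announces"}
)

FALLBACK_QUERY = "news technology"


def extract_keywords(headline: str, limit: int = 5) -> str:
    """Reduce a headline to at most `limit` lowercase keywords."""
    if not headline or not headline.strip():
        return FALLBACK_QUERY
    # Replace punctuation with spaces so "GPT-5" becomes "GPT 5".
    cleaned = re.sub(r"[^\w\s]", " ", headline)
    words = cleaned.lower().split()
    # Drop stop words and tokens of two characters or fewer.
    keywords = [w for w in words if w not in STOP_WORDS and len(w) > 2][:limit]
    return " ".join(keywords) if keywords else FALLBACK_QUERY


healthcare = extract_keywords("The Future of AI in the Healthcare Industry")
# healthcare == "future healthcare industry"
```

If two-letter terms like "AI" should survive, the length threshold (or an allow-list of short terms) would need to change; that is a design choice this sketch leaves open.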
### Decision 4: Attribution Handling
**Choice:** Provider-specific credit format stored in `summary_image_credit`
| Provider | Credit Format |
|----------|---------------|
| Pixabay | `"Photo by {user} on Pixabay"` |
| Unsplash | `"Photo by {user.name} on Unsplash"` (required by TOS) |
| Pexels | `"Photo by {photographer} on Pexels"` |
| Wikimedia | `"Wikimedia Commons"` (existing) |
| Picsum | `"Picsum Photos"` (existing) |
**Rationale:** Each provider has specific attribution requirements. Unsplash TOS requires photographer credit. Format is consistent and human-readable.
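One way to keep these formats in a single place is a small template table. This is a sketch, not the implementation: `CREDIT_TEMPLATES` and `format_credit` are hypothetical names, and the `{author}` placeholder stands in for whichever field each provider returns.

```python
# Credit templates keyed by provider name; strings follow the table above.
CREDIT_TEMPLATES = {
    "pixabay": "Photo by {author} on Pixabay",
    "unsplash": "Photo by {author} on Unsplash",
    "pexels": "Photo by {author} on Pexels",
    "wikimedia": "Wikimedia Commons",
    "picsum": "Picsum Photos",
}


def format_credit(provider: str, author: str = "Unknown") -> str:
    """Build the credit string stored alongside the image URL."""
    template = CREDIT_TEMPLATES[provider.lower()]
    # str.format ignores unused keyword arguments, so fixed credits pass through.
    return template.format(author=author or "Unknown")


example_credit = format_credit("unsplash", "Jane Doe")
```

An unknown provider name raises `KeyError` here, which seems preferable to silently emitting an empty credit for a provider whose terms require one.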
## Risks / Trade-offs
| Risk | Mitigation |
|------|------------|
| API rate limits exhausted | Fallback chain continues to next provider; Picsum always succeeds as final fallback |
| API keys exposed in logs | Never log API keys; use `***` masking if debugging request URLs |
| Provider API changes | Isolate each provider in separate function; easy to update one without affecting others |
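| Slow image search adds latency | Each provider call keeps the existing 15s timeout, and providers run sequentially, so worst-case latency grows with chain length: about 45s if the three keyed providers all time out, and up to about 75s if the MCP endpoint and Wikimedia time out as well |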
| Empty search results | Fallback to next provider; if all fail, use existing article image or placeholder |
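The key-exposure risk is most relevant to Pixabay, because its key travels in the query string, and HTTP client error messages can include the full request URL. A minimal masking helper might look like the following; `redact_api_key` is a hypothetical name, and the pattern assumes the parameter is literally called `key`.

```python
import re

# Matches "?key=<value>" or "&key=<value>" and captures the prefix.
_KEY_PARAM = re.compile(r"([?&]key=)[^&\s']+")


def redact_api_key(text: str) -> str:
    """Replace the value of a `key` query parameter with ***."""
    return _KEY_PARAM.sub(r"\1***", text)


masked = redact_api_key("https://pixabay.com/api/?key=abc123&q=ai+news")
# masked == "https://pixabay.com/api/?key=***&q=ai+news"
```

Applying such a helper to exception text before logging is one option; passing the key through a request's parameter mapping does not by itself keep it out of the URL.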
## Provider API Contracts
### Pixabay API
- **Endpoint:** `https://pixabay.com/api/`
- **Auth:** Query param `key={PIXABAY_API_KEY}`
- **Search:** `GET /?key={key}&q={query}&image_type=photo&per_page=3&safesearch=true`
- **Response:** `{"hits": [{"webformatURL": "...", "user": "..."}]}`
### Unsplash API
- **Endpoint:** `https://api.unsplash.com/search/photos`
- **Auth:** Header `Authorization: Client-ID {UNSPLASH_ACCESS_KEY}`
- **Search:** `GET ?query={query}&per_page=3`
- **Response:** `{"results": [{"urls": {"regular": "..."}, "user": {"name": "..."}}]}`
### Pexels API
- **Endpoint:** `https://api.pexels.com/v1/search`
- **Auth:** Header `Authorization: {PEXELS_API_KEY}`
- **Search:** `GET ?query={query}&per_page=3`
- **Response:** `{"photos": [{"src": {"large": "..."}, "photographer": "..."}]}`
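The three response shapes can each be reduced to a `(url, credit)` pair with defensive lookups. The sketch below works on already-decoded JSON and makes no network calls; the `parse_*` function names are illustrative, and the sample payloads contain only the fields listed in the contracts above.

```python
def parse_pixabay(data: dict) -> tuple:
    """Pick the first hit's webformatURL and uploader name."""
    hits = data.get("hits") or []
    if hits and hits[0].get("webformatURL"):
        user = hits[0].get("user", "Unknown")
        return str(hits[0]["webformatURL"]), f"Photo by {user} on Pixabay"
    return None, None


def parse_unsplash(data: dict) -> tuple:
    """Pick the first result's regular-size URL and photographer name."""
    results = data.get("results") or []
    if results:
        url = results[0].get("urls", {}).get("regular")
        name = results[0].get("user", {}).get("name", "Unknown")
        if url:
            return str(url), f"Photo by {name} on Unsplash"
    return None, None


def parse_pexels(data: dict) -> tuple:
    """Pick the first photo's large URL and photographer name."""
    photos = data.get("photos") or []
    if photos:
        url = photos[0].get("src", {}).get("large")
        photographer = photos[0].get("photographer", "Unknown")
        if url:
            return str(url), f"Photo by {photographer} on Pexels"
    return None, None


sample = parse_pexels(
    {"photos": [{"src": {"large": "https://example.com/p.jpg"}, "photographer": "Ana"}]}
)
```

Keeping parsing separate from the HTTP call makes each contract testable with canned payloads, which helps when a provider changes its schema.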
## Migration Plan
1. **Phase 1:** Add new provider functions (non-breaking)
- Add `fetch_pixabay_image()`, `fetch_unsplash_image()`, `fetch_pexels_image()`
- Add new config variables with empty defaults
- Existing behavior unchanged
2. **Phase 2:** Integrate provider chain
- Modify `fetch_royalty_free_image()` to iterate configured providers
- Preserve existing MCP endpoint override as highest priority
- Preserve Wikimedia and Picsum as fallback options
3. **Phase 3:** Documentation and rollout
- Update README with new environment variables
- Update `.env.example` with provider key placeholders
**Rollback:** Set `ROYALTY_IMAGE_PROVIDERS=picsum` to restore original behavior.
## Open Questions
- Should we add retry logic per provider (currently single attempt each)?
- Should provider timeouts be configurable or keep fixed 15s?
- Consider caching search results by query hash to reduce API calls?

@@ -0,0 +1,28 @@
## Why
ClawFort currently defaults to Picsum for article imagery, which returns random photos unrelated to article content. This undermines user trust and content quality. The existing MCP endpoint infrastructure supports pluggable image providers but lacks configuration for relevance-focused services. Connecting to curated royalty-free image APIs (Pixabay, Unsplash, Pexels) will deliver contextually relevant images that match article topics.
## What Changes
- Integrate royalty-free image providers (Pixabay, Unsplash, Pexels) that support keyword-based search, through direct calls to their HTTP APIs.
- Implement a provider priority chain with automatic fallback when primary providers fail or return no results.
- Refine search query construction to improve image relevance (keyword extraction, query normalization).
- Add provider-specific attribution handling to comply with license requirements.
- Document configuration for each supported image provider.
## Capabilities
### New Capabilities
- `mcp-image-provider-integration`: Configure and connect to the Pixabay, Unsplash, and Pexels image services for keyword-driven image retrieval (implemented as direct API calls rather than MCP servers).
- `image-provider-fallback-chain`: Define provider priority and fallback behavior when primary sources fail or return empty results.
- `image-query-refinement`: Extract and normalize search keywords from article content to improve image relevance.
### Modified Capabilities
- `fetch_royalty_free_image`: Extend existing function to support multiple providers with fallback logic and refined query handling.
## Impact
- **Backend/Image Service:** `backend/news_service.py` image retrieval logic gains multi-provider support and query refinement.
- **Configuration:** `backend/config.py` adds provider priority list and per-provider API key variables.
- **Documentation:** `README.md` environment variable table expands with new provider configuration options.
- **Operations:** Image quality and relevance improve without frontend changes; existing shimmer/lazy-load behavior remains intact.

@@ -0,0 +1,59 @@
## ADDED Requirements
### Requirement: Configurable provider priority
The system SHALL support configuring image provider order via `ROYALTY_IMAGE_PROVIDERS` environment variable.
#### Scenario: Custom provider order
- **WHEN** `ROYALTY_IMAGE_PROVIDERS=unsplash,pexels,pixabay,wikimedia,picsum`
- **THEN** system tries providers in order: Unsplash → Pexels → Pixabay → Wikimedia → Picsum
#### Scenario: Default provider order
- **WHEN** `ROYALTY_IMAGE_PROVIDERS` is not set or empty
- **THEN** system uses default order: `pixabay,unsplash,pexels,wikimedia,picsum`
#### Scenario: Single provider configured
- **WHEN** `ROYALTY_IMAGE_PROVIDERS=pexels`
- **THEN** system only tries Pexels provider
- **AND** returns `(None, None)` if Pexels fails or is not configured
### Requirement: Sequential fallback execution
The system SHALL try providers sequentially until one returns a valid image.
#### Scenario: First provider succeeds
- **WHEN** provider chain is `pixabay,unsplash,pexels` AND Pixabay returns valid image
- **THEN** system returns Pixabay image immediately
- **AND** does NOT call Unsplash or Pexels APIs
#### Scenario: First provider fails, second succeeds
- **WHEN** provider chain is `pixabay,unsplash,pexels` AND Pixabay returns no results AND Unsplash returns valid image
- **THEN** system returns Unsplash image
- **AND** does NOT call Pexels API
#### Scenario: All providers fail
- **WHEN** all configured providers return `(None, None)`
- **THEN** system returns `(None, None)` as final result
- **AND** caller handles fallback to article image or placeholder
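The sequential-fallback scenarios can be exercised with stub providers and no network access. In this sketch `first_image` and the stub functions are hypothetical; the stubs record their invocation so it is visible that providers after the first success are never called.

```python
import asyncio

calls = []  # records which stub providers were invoked, in order


async def empty_provider(query):
    calls.append("pixabay")
    return None, None  # simulates an empty result set


async def failing_provider(query):
    calls.append("unsplash")
    raise RuntimeError("simulated outage")


async def working_provider(query):
    calls.append("pexels")
    return "https://example.com/photo.jpg", "Photo by Example on Pexels"


async def unused_provider(query):
    calls.append("picsum")
    return "https://example.com/fallback.jpg", "Picsum Photos"


async def first_image(chain, query):
    """Try each (name, fetch) pair in order; return the first non-empty result."""
    for _name, fetch in chain:
        try:
            url, credit = await fetch(query)
        except Exception:
            continue  # a real implementation would log the failure here
        if url:
            return url, credit
    return None, None


CHAIN = [
    ("pixabay", empty_provider),
    ("unsplash", failing_provider),
    ("pexels", working_provider),
    ("picsum", unused_provider),
]
result = asyncio.run(first_image(CHAIN, "ai robot"))
```

After the run, `calls` contains the first three provider names only, and an empty chain returns `(None, None)`.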
### Requirement: MCP endpoint override
The existing `ROYALTY_IMAGE_MCP_ENDPOINT` SHALL take priority over the provider chain when configured.
#### Scenario: MCP endpoint configured
- **WHEN** `ROYALTY_IMAGE_MCP_ENDPOINT` is set to valid URL
- **THEN** system tries MCP endpoint first before provider chain
- **AND** falls back to provider chain only if MCP endpoint fails
#### Scenario: MCP endpoint not configured
- **WHEN** `ROYALTY_IMAGE_MCP_ENDPOINT` is empty or unset
- **THEN** system skips MCP endpoint and proceeds directly to provider chain
### Requirement: Provider skip on missing credentials
Providers without required API keys SHALL be skipped silently.
#### Scenario: Skip unconfigured provider
- **WHEN** provider chain includes `pixabay` AND `PIXABAY_API_KEY` is not set
- **THEN** Pixabay is skipped without error
- **AND** chain continues to next provider
#### Scenario: All providers skipped
- **WHEN** no providers in chain have valid API keys configured
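- **THEN** system falls back to Wikimedia (no key required) or Picsum (always available), provided they are listed in `ROYALTY_IMAGE_PROVIDERS`; if neither is listed, the chain is empty and the system returns `(None, None)`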

@@ -0,0 +1,71 @@
## ADDED Requirements
### Requirement: Keyword extraction from headline
The system SHALL extract relevant keywords from article headlines for image search.
#### Scenario: Extract keywords from standard headline
- **WHEN** headline is "OpenAI Announces GPT-5 with Revolutionary Reasoning Capabilities"
- **THEN** extracted query is "OpenAI GPT-5 Revolutionary Reasoning"
- **AND** stop words like "Announces", "with", "Capabilities" are removed
#### Scenario: Handle short headline
- **WHEN** headline is "AI Breakthrough"
- **THEN** extracted query is "AI Breakthrough"
- **AND** no keywords are removed (headline too short)
#### Scenario: Handle headline with special characters
- **WHEN** headline is "Tesla's Self-Driving AI: 99.9% Accuracy Achieved!"
- **THEN** extracted query is "Tesla Self-Driving AI Accuracy"
- **AND** special characters like apostrophes, colons, and punctuation are normalized
### Requirement: Stop word removal
The system SHALL remove common English stop words from search queries.
#### Scenario: Remove articles and prepositions
- **WHEN** headline is "The Future of AI in the Healthcare Industry"
- **THEN** extracted query is "Future AI Healthcare Industry"
- **AND** "The", "of", "in", "the" are removed
#### Scenario: Preserve technical terms
- **WHEN** headline is "How Machine Learning Models Learn from Data"
- **THEN** extracted query is "Machine Learning Models Learn Data"
- **AND** technical terms "Machine", "Learning", "Models" are preserved
### Requirement: Query length limit
The system SHALL limit search query length to optimize API results.
#### Scenario: Truncate long query
- **WHEN** extracted keywords exceed 5 words
- **THEN** query is limited to the first 5 keywords, in headline order
- **AND** remaining keywords are dropped
#### Scenario: Preserve short query
- **WHEN** extracted keywords are 5 words or fewer
- **THEN** all keywords are included in query
- **AND** no truncation occurs
### Requirement: URL-safe query encoding
The system SHALL URL-encode queries before sending to provider APIs.
#### Scenario: Encode spaces and special characters
- **WHEN** query is "AI Machine Learning"
- **THEN** encoded query is "AI+Machine+Learning" or "AI%20Machine%20Learning"
- **AND** query is safe for HTTP GET parameters
#### Scenario: Handle Unicode characters
- **WHEN** query contains Unicode like "AI für Deutschland"
- **THEN** Unicode characters are properly percent-encoded
- **AND** API request succeeds without encoding errors
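Both encoding scenarios are covered by the standard library's `urllib.parse.quote_plus`, which the provider functions already import. A short illustration follows; `encode_query` is a hypothetical wrapper name.

```python
from urllib.parse import quote_plus


def encode_query(query: str) -> str:
    """Encode a search query for use as an HTTP GET parameter value."""
    # Spaces become "+", and non-ASCII characters are UTF-8 percent-encoded.
    return quote_plus(query)


plain = encode_query("AI Machine Learning")    # "AI+Machine+Learning"
german = encode_query("AI für Deutschland")    # "AI+f%C3%BCr+Deutschland"
```

Reserved characters such as `&` and `%` are escaped as well, so a keyword cannot accidentally terminate the parameter.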
### Requirement: Empty query handling
The system SHALL handle edge cases where no keywords can be extracted.
#### Scenario: Headline with only stop words
- **WHEN** headline is "The and a or but"
- **THEN** system uses fallback query "news technology"
- **AND** image search proceeds with generic query
#### Scenario: Empty headline
- **WHEN** headline is empty string or whitespace only
- **THEN** system uses fallback query "news technology"
- **AND** image search proceeds with generic query

@@ -0,0 +1,79 @@
## ADDED Requirements
### Requirement: Pixabay image retrieval
The system SHALL support retrieving images from Pixabay API when `PIXABAY_API_KEY` is configured.
#### Scenario: Successful Pixabay image search
- **WHEN** Pixabay is enabled in provider chain AND query is "artificial intelligence breakthrough"
- **THEN** system sends GET request to `https://pixabay.com/api/?key={key}&q=artificial+intelligence+breakthrough&image_type=photo&per_page=3&safesearch=true`
- **AND** returns first hit's `webformatURL` as image URL
- **AND** returns credit as `"Photo by {user} on Pixabay"`
#### Scenario: Pixabay returns no results
- **WHEN** Pixabay search returns empty `hits` array
- **THEN** system returns `(None, None)` for this provider
- **AND** fallback chain continues to next provider
#### Scenario: Pixabay API key not configured
- **WHEN** `PIXABAY_API_KEY` environment variable is empty or unset
- **THEN** Pixabay provider is skipped in the fallback chain
### Requirement: Unsplash image retrieval
The system SHALL support retrieving images from Unsplash API when `UNSPLASH_ACCESS_KEY` is configured.
#### Scenario: Successful Unsplash image search
- **WHEN** Unsplash is enabled in provider chain AND query is "machine learning robot"
- **THEN** system sends GET request to `https://api.unsplash.com/search/photos?query=machine+learning+robot&per_page=3` with header `Authorization: Client-ID {key}`
- **AND** returns first result's `urls.regular` as image URL
- **AND** returns credit as `"Photo by {user.name} on Unsplash"`
#### Scenario: Unsplash returns no results
- **WHEN** Unsplash search returns empty `results` array
- **THEN** system returns `(None, None)` for this provider
- **AND** fallback chain continues to next provider
#### Scenario: Unsplash API key not configured
- **WHEN** `UNSPLASH_ACCESS_KEY` environment variable is empty or unset
- **THEN** Unsplash provider is skipped in the fallback chain
### Requirement: Pexels image retrieval
The system SHALL support retrieving images from Pexels API when `PEXELS_API_KEY` is configured.
#### Scenario: Successful Pexels image search
- **WHEN** Pexels is enabled in provider chain AND query is "tech startup office"
- **THEN** system sends GET request to `https://api.pexels.com/v1/search?query=tech+startup+office&per_page=3` with header `Authorization: {key}`
- **AND** returns first photo's `src.large` as image URL
- **AND** returns credit as `"Photo by {photographer} on Pexels"`
#### Scenario: Pexels returns no results
- **WHEN** Pexels search returns empty `photos` array
- **THEN** system returns `(None, None)` for this provider
- **AND** fallback chain continues to next provider
#### Scenario: Pexels API key not configured
- **WHEN** `PEXELS_API_KEY` environment variable is empty or unset
- **THEN** Pexels provider is skipped in the fallback chain
### Requirement: Provider timeout handling
Each provider API call SHALL timeout after 15 seconds.
#### Scenario: Provider request timeout
- **WHEN** provider API does not respond within 15 seconds
- **THEN** system logs timeout warning
- **AND** returns `(None, None)` for this provider
- **AND** fallback chain continues to next provider
### Requirement: Provider error handling
The system SHALL gracefully handle provider API errors without crashing.
#### Scenario: Provider returns HTTP error
- **WHEN** provider API returns 4xx or 5xx status code
- **THEN** system logs error with status code
- **AND** returns `(None, None)` for this provider
- **AND** fallback chain continues to next provider
#### Scenario: Provider returns malformed response
- **WHEN** provider API returns invalid JSON or unexpected schema
- **THEN** system logs parsing error
- **AND** returns `(None, None)` for this provider
- **AND** fallback chain continues to next provider
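The timeout and error scenarios share one shape: any failure becomes `(None, None)` so the chain can continue. The sketch below enforces an overall deadline with `asyncio.wait_for`, which is an assumption made for illustration; the provider functions in this change rely on the HTTP client's own `timeout=15.0` setting instead. `call_with_timeout` and the stub providers are hypothetical names.

```python
import asyncio
import logging

logger = logging.getLogger("image_providers")


async def call_with_timeout(name, fetch, query, timeout=15.0):
    """Run one provider; convert timeouts and errors into (None, None)."""
    try:
        return await asyncio.wait_for(fetch(query), timeout=timeout)
    except asyncio.TimeoutError:
        logger.warning("%s image retrieval timed out after %.2fs", name, timeout)
    except Exception:
        logger.exception("%s image retrieval failed", name)
    return None, None


async def slow_provider(query):
    await asyncio.sleep(1.0)  # longer than the short test timeout below
    return "https://example.com/late.jpg", "Late Photos"


async def broken_provider(query):
    raise ValueError("malformed response")


async def fast_provider(query):
    return "https://example.com/ok.jpg", "Fast Photos"


timed_out = asyncio.run(call_with_timeout("slow", slow_provider, "ai", timeout=0.05))
```

Here `timed_out` is `(None, None)` because the stub sleeps past the 0.05s deadline. Logging timeouts at warning level and other errors with a traceback keeps the two cases distinguishable in logs.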

@@ -0,0 +1,52 @@
## 1. Configuration Setup
- [x] 1.1 Add `ROYALTY_IMAGE_PROVIDERS` config variable to `backend/config.py` with default `pixabay,unsplash,pexels,wikimedia,picsum`
- [x] 1.2 Add `PIXABAY_API_KEY` config variable to `backend/config.py`
- [x] 1.3 Add `UNSPLASH_ACCESS_KEY` config variable to `backend/config.py`
- [x] 1.4 Add `PEXELS_API_KEY` config variable to `backend/config.py`
- [x] 1.5 Update `.env.example` with new provider API key placeholders
## 2. Query Refinement
- [x] 2.1 Create `extract_image_keywords(headline: str) -> str` function in `backend/news_service.py`
- [x] 2.2 Implement stop word removal (articles, prepositions, common verbs)
- [x] 2.3 Implement keyword limit (max 5 significant words)
- [x] 2.4 Handle edge cases: empty headline, only stop words, special characters
## 3. Provider Implementations
- [x] 3.1 Create `fetch_pixabay_image(query: str) -> tuple[str | None, str | None]` function
- [x] 3.2 Implement Pixabay API call with `webformatURL` extraction and `"Photo by {user} on Pixabay"` credit
- [x] 3.3 Create `fetch_unsplash_image(query: str) -> tuple[str | None, str | None]` function
- [x] 3.4 Implement Unsplash API call with `urls.regular` extraction and `"Photo by {user.name} on Unsplash"` credit
- [x] 3.5 Create `fetch_pexels_image(query: str) -> tuple[str | None, str | None]` function
- [x] 3.6 Implement Pexels API call with `src.large` extraction and `"Photo by {photographer} on Pexels"` credit
## 4. Provider Fallback Chain
- [x] 4.1 Create provider registry mapping provider names to fetch functions
- [x] 4.2 Parse `ROYALTY_IMAGE_PROVIDERS` into ordered list at startup
- [x] 4.3 Implement `get_enabled_providers()` that filters by configured API keys
- [x] 4.4 Modify `fetch_royalty_free_image()` to iterate provider chain with fallback
## 5. Integration
- [x] 5.1 Wire refined query extraction into `fetch_royalty_free_image()` call
- [x] 5.2 Preserve MCP endpoint as highest priority (existing behavior)
- [x] 5.3 Preserve Wikimedia and Picsum as fallback providers in chain
- [x] 5.4 Add error logging for each provider failure with provider name
## 6. Documentation
- [x] 6.1 Update README environment variables table with new provider keys
- [x] 6.2 Add provider configuration section to README explaining priority chain
- [x] 6.3 Document attribution requirements for each provider
## 7. Verification
- [x] 7.1 Test Pixabay provider with sample query (requires API key)
- [x] 7.2 Test Unsplash provider with sample query (requires API key)
- [x] 7.3 Test Pexels provider with sample query (requires API key)
- [x] 7.4 Test fallback chain when primary provider fails
- [x] 7.5 Test fallback to Picsum when no API keys configured
- [x] 7.6 Verify attribution format matches provider requirements