# ClawFort — AI News Aggregation One-Pager

A stunning single-page website that automatically aggregates and displays AI news hourly using the Perplexity API.

## Quick Start

### Docker (Recommended)

```bash
cp .env.example .env
# Edit .env and set PERPLEXITY_API_KEY
docker compose up --build
```

Open http://localhost:8000

### Local Development

```bash
pip install -r requirements.txt
cp .env.example .env
# Edit .env and set PERPLEXITY_API_KEY
python -m uvicorn backend.main:app --reload --port 8000
```

## Force Fetch Command

Use the force-fetch command to run one immediate news ingestion cycle outside the hourly scheduler.

```bash
python -m backend.cli force-fetch
```

Common use cases:

- **Bootstrap**: Populate initial content right after first deployment.
- **Recovery**: Re-run ingestion after a failed provider/API cycle.

Command behavior:

- Reuses existing retry, fallback, dedup, image optimization, and persistence logic.
- Prints success output with stored item count, for example: `force-fetch succeeded: stored=3 elapsed=5.1s`
- Prints actionable error output on fatal failures and exits non-zero.

Exit codes:

- `0`: Command completed successfully (including runs that store zero new rows)
- `1`: Fatal command failure (for example missing API keys or unrecoverable runtime error)

## Admin Maintenance Commands

ClawFort includes an admin command suite to simplify operational recovery and maintenance.
```bash
# List admin subcommands
python -m backend.cli admin --help

# Fetch n articles on demand
python -m backend.cli admin fetch --count 10

# Refetch images for latest 30 articles (sequential queue + exponential backoff)
python -m backend.cli admin refetch-images --limit 30

# Clean archived records older than N days
python -m backend.cli admin clean-archive --days 60 --confirm

# Clear optimized image cache files
python -m backend.cli admin clear-cache --confirm

# Clear existing news items (includes archived when requested)
python -m backend.cli admin clear-news --include-archived --confirm

# Rebuild content from scratch (clear + fetch)
python -m backend.cli admin rebuild-site --count 10 --confirm

# Regenerate translations for existing articles
python -m backend.cli admin regenerate-translations --limit 100
```

Safety guardrails:

- Destructive commands require `--confirm`.
- Dry-run previews are available for applicable commands via `--dry-run`.
- Admin output follows a structured format like: `admin: status= ...`.

## Multilingual Support

ClawFort supports English (`en`), Tamil (`ta`), and Malayalam (`ml`) content delivery.

- New articles are stored in English and translated to Tamil and Malayalam during ingestion.
- Translations are linked to the same base article and served by the existing news endpoints.
- If a requested translation is unavailable, the API falls back to English.

Language-aware API usage:

```bash
# Latest hero item in Tamil
curl "http://localhost:8000/api/news/latest?language=ta"

# Feed page in Malayalam
curl "http://localhost:8000/api/news?limit=10&language=ml"
```

Unsupported language codes default to English.

Frontend language selector behavior:

- Landing page includes a language selector (`English`, `Tamil`, `Malayalam`).
- Selected language is persisted in `localStorage` and mirrored in a client cookie.
- Returning users see content in their previously selected language.
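
The language fallback rules described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not ClawFort's actual implementation: `SUPPORTED_LANGUAGES`, `resolve_language`, and `pick_translation` are hypothetical names, and the mapping of language code to article text is an assumed data shape.

```python
SUPPORTED_LANGUAGES = ("en", "ta", "ml")  # codes listed in this README
DEFAULT_LANGUAGE = "en"


def resolve_language(requested):
    """Normalize a requested code; unsupported or missing codes become English."""
    code = (requested or "").strip().lower()
    return code if code in SUPPORTED_LANGUAGES else DEFAULT_LANGUAGE


def pick_translation(translations, requested):
    """Return (language, text), falling back to English when no translation exists.

    `translations` is an assumed mapping of language code -> article text
    that always contains the English base article.
    """
    language = resolve_language(requested)
    if language in translations:
        return language, translations[language]
    return DEFAULT_LANGUAGE, translations[DEFAULT_LANGUAGE]
```

For example, `pick_translation({"en": "Hello"}, "ml")` returns the English text because no Malayalam translation is stored, and `resolve_language("fr")` resolves to `"en"`.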
## Summary Modal

Each fetched article now includes a concise summary artifact and can be opened in a modal from the feed.

Modal structure:

```text
[Relevant Image]

## TL;DR
[bullet points]

## Summary
[Summarized article]

## Source and Citation
[source of the news]

Powered by Perplexity
```

Backend behavior:

- Summary artifacts are generated during ingestion using Perplexity and persisted with article records.
- If summary generation fails, ingestion still succeeds and a fallback summary artifact is stored.
- Summary image enrichment prefers MCP image retrieval when configured, with deterministic fallback behavior.

Umami events for summary modal:

- `summary-modal-open` with `article_id`
- `summary-modal-close` with `article_id`
- `summary-modal-link-out` with `article_id` and `source_url`

## Policy Pages

The footer now includes:

- `Terms of Use` (`/terms`)
- `Attribution` (`/attribution`)

Both pages are served as static frontend documents through FastAPI routes.

## Theme Switcher

The header includes an icon-based theme switcher with four modes:

- `system` (default when unset)
- `light`
- `dark`
- `contrast` (high contrast)

Theme persistence behavior:

- Primary: `localStorage` (`clawfort_theme`)
- Fallback: cookie (`clawfort_theme`)

Returning users get their previously selected theme.

## Cookie Consent and Analytics Gate

Analytics loading is consent-gated:

- Consent banner appears when no consent is stored.
- Clicking `Accept` stores consent in localStorage and cookie (`clawfort_cookie_consent=accepted`).
- Umami analytics script loads only after consent.
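
The consent rule above reduces to a single check on the stored value. The real gate runs in the browser, so the following Python sketch only makes the rule explicit for anyone who wants to mirror it in a test or a server-side helper. `has_analytics_consent` is a hypothetical function, not part of ClawFort; the cookie name and value come from this README.

```python
from http.cookies import SimpleCookie

CONSENT_COOKIE = "clawfort_cookie_consent"  # cookie name from this README


def has_analytics_consent(cookie_header):
    """Return True only when the consent cookie is present with the value 'accepted'."""
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")  # tolerate a missing Cookie header
    morsel = cookie.get(CONSENT_COOKIE)
    return morsel is not None and morsel.value == "accepted"
```

Any other state (no cookie, or a value other than `accepted`) is treated as "no consent", which matches the banner-first behavior described above.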
## Environment Variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `PERPLEXITY_API_KEY` | Yes | — | Perplexity API key for news fetching |
| `IMAGE_QUALITY` | No | `85` | JPEG compression quality (1-100) |
| `OPENROUTER_API_KEY` | No | — | Fallback LLM provider API key |
| `RETENTION_DAYS` | No | `30` | Days to keep news before archiving |
| `UMAMI_SCRIPT_URL` | No | — | Umami analytics script URL |
| `UMAMI_WEBSITE_ID` | No | — | Umami website tracking ID |
| `ROYALTY_IMAGE_PROVIDER` | No | `picsum` | Legacy: single provider selection (deprecated, use `ROYALTY_IMAGE_PROVIDERS`) |
| `ROYALTY_IMAGE_PROVIDERS` | No | `pixabay,unsplash,pexels,wikimedia,picsum` | Comma-separated provider priority chain |
| `ROYALTY_IMAGE_MCP_ENDPOINT` | No | — | MCP endpoint for image retrieval (highest priority when set) |
| `ROYALTY_IMAGE_API_KEY` | No | — | Optional API key for image provider integrations |
| `PIXABAY_API_KEY` | No | — | Pixabay API key (get from pixabay.com/api/docs) |
| `UNSPLASH_ACCESS_KEY` | No | — | Unsplash API access key (get from unsplash.com/developers) |
| `PEXELS_API_KEY` | No | — | Pexels API key (get from pexels.com/api) |
| `SUMMARY_LENGTH_SCALE` | No | `3` | Summary detail level from `1` (short) to `5` (long) |

## Image Provider Configuration

ClawFort retrieves royalty-free images for news articles using a configurable provider chain.

### Provider Priority

Providers are tried in order until one returns a valid image:

1. **MCP Endpoint** (if `ROYALTY_IMAGE_MCP_ENDPOINT` is set) — Custom endpoint, highest priority
2. **Provider Chain** (from `ROYALTY_IMAGE_PROVIDERS`) — Comma-separated list, tried in order

Default chain: `pixabay,unsplash,pexels,wikimedia,picsum`

### Supported Providers

| Provider | API Key Variable | Attribution Format |
|----------|------------------|-------------------|
| Pixabay | `PIXABAY_API_KEY` | "Photo by {user} on Pixabay" |
| Unsplash | `UNSPLASH_ACCESS_KEY` | "Photo by {name} on Unsplash" (required by TOS) |
| Pexels | `PEXELS_API_KEY` | "Photo by {photographer} on Pexels" |
| Wikimedia | None (no key needed) | "Wikimedia Commons" |
| Picsum | None (always available) | "Picsum Photos" |

### Configuration Examples

```bash
# Use Unsplash as primary, fall back to Pexels, then Picsum
ROYALTY_IMAGE_PROVIDERS=unsplash,pexels,picsum
UNSPLASH_ACCESS_KEY=your-unsplash-key
PEXELS_API_KEY=your-pexels-key

# Use only Pixabay
ROYALTY_IMAGE_PROVIDERS=pixabay
PIXABAY_API_KEY=your-pixabay-key

# Disable premium providers, use only free sources
ROYALTY_IMAGE_PROVIDERS=wikimedia,picsum
```

Providers without configured API keys are automatically skipped.

## Architecture

- **Backend**: Python (FastAPI) + SQLAlchemy + APScheduler
- **Frontend**: Alpine.js + Tailwind CSS (CDN, no build step)
- **Database**: SQLite with 30-day retention
- **Container**: Single Docker container

## API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/` | Serve frontend |
| `GET` | `/api/news/latest` | Latest news item for hero block |
| `GET` | `/api/news` | Paginated news feed |
| `GET` | `/api/health` | Health check with news count |
| `GET` | `/config` | Frontend config (analytics) |

## SEO and Structured Data Contract

Homepage (`/`) contract:

- Core metadata: `title`, `description`, `robots`, canonical link.
- Social metadata: Open Graph (`og:type`, `og:site_name`, `og:title`, `og:description`, `og:url`, `og:image`) and Twitter (`twitter:card`, `twitter:title`, `twitter:description`, `twitter:image`).
- JSON-LD graph includes:
  - `Newspaper` entity for site-level identity.
  - `NewsArticle` entities for hero and feed articles.

Policy page contract (`/terms`, `/attribution`):

- Page-specific `title` and `description`.
- `robots` metadata.
- Canonical link for the route.
- Open Graph and Twitter preview metadata.

Structured data field baseline for each `NewsArticle`:

- `headline`, `description`, `image`, `datePublished`, `dateModified`, `url`, `mainEntityOfPage`, `inLanguage`, `publisher`, `author`.

## Delivery Performance Header Contract

FastAPI middleware applies route-specific cache and compression behavior:

| Route Class | Cache-Control | Notes |
|-------------|---------------|-------|
| `/static/*` | `public, max-age=604800, immutable` | Long-lived static assets |
| `/api/*` | `public, max-age=60, stale-while-revalidate=120` | Short-lived feed data |
| `/`, `/terms`, `/attribution` | `public, max-age=300, stale-while-revalidate=600` | HTML routes |

Additional headers:

- `Vary: Accept-Encoding` for API responses.
- `X-Content-Type-Options: nosniff` for all responses.
- Gzip compression enabled via `GZipMiddleware` for eligible payloads.

## SEO and Performance Verification Checklist

Run after local startup (`python -m uvicorn backend.main:app --reload --port 8000`):

```bash
# HTML route cache checks
curl -I http://localhost:8000/
curl -I http://localhost:8000/terms
curl -I http://localhost:8000/attribution

# API cache + vary checks
curl -I "http://localhost:8000/api/news/latest?language=en"
curl -I "http://localhost:8000/api/news?limit=5&language=en"

# Compression check (expect Content-Encoding: gzip for eligible payloads)
curl -s -H "Accept-Encoding: gzip" -D - "http://localhost:8000/api/news?limit=10&language=en" -o /dev/null
```

Manual acceptance checks:

1. Homepage source contains one canonical link and Open Graph/Twitter metadata fields.
2. Homepage JSON-LD contains one `Newspaper` entity and deduplicated `NewsArticle` entries.
3. Hero/feed/modal images show shimmer placeholders until load/fallback completion.
4. Feed and modal images use `loading="lazy"`, explicit `width`/`height`, and `decoding="async"`.
5. Smooth scrolling behavior is enabled for in-page navigation interactions.

Structured-data validation:

- Validate JSON-LD output using schema-aware validators (e.g., Schema.org validator or equivalent tooling) and confirm `Newspaper` + `NewsArticle` entities pass required field checks.

Regression checks:

- Verify homepage rendering (hero, feed, modal).
- Verify policy-page metadata output.
- Verify cache/compression headers remain unchanged after SEO-related edits.

### GET /api/news

Query parameters:

- `cursor` (int, optional): Last item ID for pagination
- `limit` (int, default 10): Items per page (max 50)
- `exclude_hero` (int, optional): Hero item ID to exclude

Response:

```json
{
  "items": [{
    "id": 1,
    "headline": "...",
    "summary": "...",
    "source_url": "...",
    "image_url": "...",
    "image_credit": "...",
    "published_at": "...",
    "created_at": "..."
  }],
  "next_cursor": 5,
  "has_more": true
}
```

## Deployment

```bash
docker build -t clawfort .
docker run -d \
  -e PERPLEXITY_API_KEY=pplx-xxx \
  -v clawfort-data:/app/data \
  -v clawfort-images:/app/backend/static/images \
  -p 8000:8000 \
  clawfort
```

Data persists across restarts via Docker volumes.

## Scheduled Jobs

- **Hourly**: Fetch latest AI news from Perplexity API
- **Nightly (3 AM)**: Archive news older than 30 days, delete archived items older than 60 days
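
The nightly retention rule can be expressed as a small classification function. This is a minimal sketch, not the scheduler's actual code: `retention_action` and its parameters are hypothetical, and measuring age from `published_at` is an assumption, since this README does not say which timestamp the job uses.

```python
from datetime import datetime, timedelta, timezone

ARCHIVE_AFTER_DAYS = 30  # matches the RETENTION_DAYS default
DELETE_AFTER_DAYS = 60   # archived items older than this are deleted


def retention_action(published_at, now, archived):
    """Classify one item as 'keep', 'archive', or 'delete' for the nightly job.

    Assumes age is measured from `published_at`; both datetimes must be
    timezone-aware (or both naive) so they can be subtracted.
    """
    age = now - published_at
    if archived and age > timedelta(days=DELETE_AFTER_DAYS):
        return "delete"
    if not archived and age > timedelta(days=ARCHIVE_AFTER_DAYS):
        return "archive"
    return "keep"
```

Under these assumptions an item is archived first and deleted only on a later run, so a live item that is already more than 60 days old is archived rather than deleted outright.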