# ClawFort — AI News Aggregation One-Pager

A stunning single-page website that automatically aggregates and displays AI news hourly using the Perplexity API.

## Quick Start

### Docker (Recommended)

```bash
cp .env.example .env
# Edit .env and set PERPLEXITY_API_KEY

docker compose up --build
```

Open http://localhost:8000

### Local Development

```bash
pip install -r requirements.txt
cp .env.example .env
# Edit .env and set PERPLEXITY_API_KEY

python -m uvicorn backend.main:app --reload --port 8000
```

## Force Fetch Command

Use the force-fetch command to run one immediate news ingestion cycle outside the hourly scheduler.

```bash
python -m backend.cli force-fetch
```

Common use cases:
- **Bootstrap**: Populate initial content right after first deployment.
- **Recovery**: Re-run ingestion after a failed provider/API cycle.

Command behavior:
- Reuses existing retry, fallback, dedup, image optimization, and persistence logic.
- Prints success output with stored item count, for example: `force-fetch succeeded: stored=3 elapsed=5.1s`
- Prints actionable error output on fatal failures and exits non-zero.

Exit codes:
- `0`: Command completed successfully (including runs that store zero new rows)
- `1`: Fatal command failure (for example missing API keys or unrecoverable runtime error)

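For scripted recovery, the exit codes and success line above can be checked from a small wrapper. The following is a minimal sketch, not part of ClawFort: `run_force_fetch` is a hypothetical helper, and it assumes the success line is written to stdout in exactly the format shown in the example.

```python
import re
import subprocess
import sys

# Assumed success format, copied from the example output in this README.
SUCCESS_RE = re.compile(r"force-fetch succeeded: stored=(\d+) elapsed=([\d.]+)s")


def run_force_fetch(cmd=None):
    """Run one ingestion cycle and return (ok, stored_count).

    ok is True when the command exits with code 0. stored_count is None
    when the command failed or the success line could not be parsed.
    """
    cmd = cmd or [sys.executable, "-m", "backend.cli", "force-fetch"]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        return False, None
    match = SUCCESS_RE.search(proc.stdout)
    stored = int(match.group(1)) if match else None
    return True, stored
```

A run that stores zero new rows still returns `ok=True`, matching exit code `0` above.
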
## Quality and Test Suite

Run local quality gates:

```bash
# Quote the extras spec so shells such as zsh do not treat the brackets as a glob
pip install -e ".[dev]"
pytest
ruff check backend tests
```

CI quality gates are defined in `.github/workflows/quality-gates.yml`.

Monitoring baseline, thresholds, and alert runbook are documented in `docs/quality-and-monitoring.md`.

## Admin Maintenance Commands

ClawFort includes an admin command suite to simplify operational recovery and maintenance.

```bash
# List admin subcommands
python -m backend.cli admin --help

# Fetch n articles on demand
python -m backend.cli admin fetch --count 10

# Refetch images for latest 30 articles (sequential queue + exponential backoff)
python -m backend.cli admin refetch-images --limit 30

# Clean archived records older than N days
python -m backend.cli admin clean-archive --days 60 --confirm

# Clear optimized image cache files
python -m backend.cli admin clear-cache --confirm

# Clear existing news items (includes archived when requested)
python -m backend.cli admin clear-news --include-archived --confirm

# Rebuild content from scratch (clear + fetch)
python -m backend.cli admin rebuild-site --count 10 --confirm

# Regenerate translations for existing articles
python -m backend.cli admin regenerate-translations --limit 100
```

Safety guardrails:
- Destructive commands require `--confirm`.
- Dry-run previews are available for applicable commands via `--dry-run`.
- Admin output follows a structured format like: `admin:<command> status=<ok|error|blocked> ...`.

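Because admin output is structured, it can be consumed by monitoring scripts. The sketch below shows one way to split such a line into its parts. `parse_admin_line` is a hypothetical helper, and it assumes the trailing fields are space-separated `key=value` pairs without embedded spaces, which this README does not guarantee.

```python
def parse_admin_line(line):
    """Split 'admin:<command> status=<...> key=value ...' into a dict."""
    head, *pairs = line.split()
    if not head.startswith("admin:"):
        raise ValueError(f"not an admin output line: {line!r}")
    # Tokens without '=' are ignored rather than treated as errors (assumption).
    fields = dict(pair.split("=", 1) for pair in pairs if "=" in pair)
    return {
        "command": head[len("admin:"):],
        "status": fields.pop("status", None),
        "fields": fields,
    }
```

For example, a line such as `admin:clear-cache status=ok removed=12` (the `removed` field is made up for illustration) would yield the command `clear-cache`, the status `ok`, and the remaining fields as strings.
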
## Multilingual Support

ClawFort supports English (`en`), Tamil (`ta`), and Malayalam (`ml`) content delivery.

- New articles are stored in English and translated to Tamil and Malayalam during ingestion.
- Translations are linked to the same base article and served by the existing news endpoints.
- If a requested translation is unavailable, the API falls back to English.

### Typography and Fonts

**Font Strategy:**
- **English (Default)**: Inter font family
- **Tamil**: Baloo 2 (primary), Noto Sans Tamil (fallback), Inter (system fallback)
- **Malayalam**: Baloo Chettan 2 (primary), Noto Sans Malayalam (fallback), Inter (system fallback)

Fonts are applied via CSS based on the `data-lang` attribute on the HTML element, so English text always uses Inter while Tamil and Malayalam text use their own font stacks.

**Mobile Typography Adjustments:**
For Tamil and Malayalam content on mobile viewports (≤640px):
- Hero title: 1.5rem with 1.4 line-height
- Hero summary: 0.9rem with 1.5 line-height
- Hero pills: Smaller padding and font-size to ensure visibility
- Hero image height: 200px (reduced from 300px)
- Hero content padding: Reduced to 1rem

These adjustments prevent content overflow and keep the hero block (including the LATEST pill and time indicator) fully visible on smaller screens while maintaining readability.

Language-aware API usage:

```bash
# Latest hero item in Tamil
curl "http://localhost:8000/api/news/latest?language=ta"

# Feed page in Malayalam
curl "http://localhost:8000/api/news?limit=10&language=ml"
```

Unsupported language codes default to English.

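The two fallback rules described above (unsupported codes default to English, and missing translations fall back to English) can be sketched as follows. This illustrates the documented behavior and is not the backend's actual code: the article dictionary shape, the `translations` key, and the case-insensitive matching are all assumptions.

```python
SUPPORTED_LANGUAGES = ("en", "ta", "ml")
DEFAULT_LANGUAGE = "en"


def resolve_language(requested):
    """Map a requested code to a supported one, defaulting to English."""
    code = (requested or "").strip().lower()
    return code if code in SUPPORTED_LANGUAGES else DEFAULT_LANGUAGE


def localize(article, requested):
    """Return the translated fields, or the English base when unavailable."""
    lang = resolve_language(requested)
    translation = article.get("translations", {}).get(lang)
    if lang == DEFAULT_LANGUAGE or translation is None:
        return {
            "language": DEFAULT_LANGUAGE,
            "headline": article["headline"],
            "summary": article["summary"],
        }
    return {"language": lang, **translation}
```
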
Frontend language selector behavior:
- Landing page includes a language selector (`English`, `Tamil`, `Malayalam`).
- Selected language is persisted in `localStorage` and mirrored in a client cookie.
- Returning users see content in their previously selected language.

## Summary Modal

Each fetched article includes a concise summary artifact and can be opened in a modal from the feed.

Modal structure:

```text
[Relevant Image]

## TL;DR
[bullet points]

## Summary
[Summarized article]

## Source and Citation
[source of the news]

Powered by Perplexity
```

Backend behavior:
- Summary artifacts are generated during ingestion using Perplexity and persisted with article records.
- If summary generation fails, ingestion still succeeds and a fallback summary artifact is stored.
- Summary image enrichment prefers MCP image retrieval when configured, with deterministic fallback behavior.

Umami events for summary modal:
- `summary-modal-open` with `article_id`
- `summary-modal-close` with `article_id`
- `summary-modal-link-out` with `article_id` and `source_url`

## Policy Pages

The footer includes:
- `Terms of Use` (`/terms`)
- `Attribution` (`/attribution`)

Both pages are served as static frontend documents through FastAPI routes.

## Theme Switcher

The header includes an icon-based theme switcher with four modes:
- `system` (default when unset)
- `light`
- `dark`
- `contrast` (high contrast)

Theme persistence behavior:
- Primary: `localStorage` (`clawfort_theme`)
- Fallback: cookie (`clawfort_theme`)

Returning users get their previously selected theme.

## Cookie Consent and Analytics Gate

Analytics loading is consent-gated:
- Consent banner appears when no consent is stored.
- Clicking `Accept` stores consent in `localStorage` and a cookie (`clawfort_cookie_consent=accepted`).
- Umami analytics script loads only after consent.

## Environment Variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `PERPLEXITY_API_KEY` | Yes | — | Perplexity API key for news fetching |
| `IMAGE_QUALITY` | No | `85` | JPEG compression quality (1-100) |
| `OPENROUTER_API_KEY` | No | — | Fallback LLM provider API key |
| `RETENTION_DAYS` | No | `30` | Days to keep news before archiving |
| `UMAMI_SCRIPT_URL` | No | — | Umami analytics script URL |
| `UMAMI_WEBSITE_ID` | No | — | Umami website tracking ID |
| `ROYALTY_IMAGE_PROVIDER` | No | `picsum` | Legacy: single provider selection (deprecated, use `ROYALTY_IMAGE_PROVIDERS`) |
| `ROYALTY_IMAGE_PROVIDERS` | No | `pixabay,unsplash,pexels,wikimedia,picsum` | Comma-separated provider priority chain |
| `ROYALTY_IMAGE_MCP_ENDPOINT` | No | — | MCP endpoint for image retrieval (highest priority when set) |
| `ROYALTY_IMAGE_API_KEY` | No | — | Optional API key for image provider integrations |
| `PIXABAY_API_KEY` | No | — | Pixabay API key (get from pixabay.com/api/docs) |
| `UNSPLASH_ACCESS_KEY` | No | — | Unsplash API access key (get from unsplash.com/developers) |
| `PEXELS_API_KEY` | No | — | Pexels API key (get from pexels.com/api) |
| `SUMMARY_LENGTH_SCALE` | No | `3` | Summary detail level from `1` (short) to `5` (long) |

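The numeric settings in the table (`IMAGE_QUALITY`, `RETENTION_DAYS`, `SUMMARY_LENGTH_SCALE`) have defaults and, in two cases, documented ranges. A minimal sketch of reading such a setting is shown below. `read_int_setting` is a hypothetical helper, and rejecting out-of-range values is an assumption: the backend may clamp or ignore them instead.

```python
import os


def read_int_setting(name, default, low, high, env=None):
    """Read an integer setting, applying the default and a range check."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")
    return value


# Ranges and defaults taken from the table above.
image_quality = read_int_setting("IMAGE_QUALITY", 85, 1, 100, env={})
summary_scale = read_int_setting("SUMMARY_LENGTH_SCALE", 3, 1, 5, env={})
```
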
## Image Provider Configuration

ClawFort retrieves royalty-free images for news articles using a configurable provider chain.

### Provider Priority

Providers are tried in order until one returns a valid image:

1. **MCP Endpoint** (if `ROYALTY_IMAGE_MCP_ENDPOINT` is set): custom endpoint, highest priority
2. **Provider Chain** (from `ROYALTY_IMAGE_PROVIDERS`): comma-separated list, tried in order

Default chain: `pixabay,unsplash,pexels,wikimedia,picsum`

### Supported Providers

| Provider | API Key Variable | Attribution Format |
|----------|------------------|-------------------|
| Pixabay | `PIXABAY_API_KEY` | "Photo by {user} on Pixabay" |
| Unsplash | `UNSPLASH_ACCESS_KEY` | "Photo by {name} on Unsplash" (required by TOS) |
| Pexels | `PEXELS_API_KEY` | "Photo by {photographer} on Pexels" |
| Wikimedia | None (no key needed) | "Wikimedia Commons" |
| Picsum | None (always available) | "Picsum Photos" |

### Configuration Examples

```bash
# Use Unsplash as primary, fall back to Pexels, then Picsum
ROYALTY_IMAGE_PROVIDERS=unsplash,pexels,picsum
UNSPLASH_ACCESS_KEY=your-unsplash-key
PEXELS_API_KEY=your-pexels-key

# Use only Pixabay
ROYALTY_IMAGE_PROVIDERS=pixabay
PIXABAY_API_KEY=your-pixabay-key

# Skip key-based providers, use only sources that need no API key
ROYALTY_IMAGE_PROVIDERS=wikimedia,picsum
```

Providers without configured API keys are automatically skipped.

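The priority and skip rules in this section can be illustrated with a short sketch that computes the effective provider order from the environment. It is not the backend's implementation: `resolve_provider_chain` and the `mcp` label are hypothetical, silently ignoring unknown provider names is an assumption, and the legacy `ROYALTY_IMAGE_PROVIDER` and `ROYALTY_IMAGE_API_KEY` variables are not handled.

```python
DEFAULT_CHAIN = "pixabay,unsplash,pexels,wikimedia,picsum"

# API key variable per provider, from the Supported Providers table.
# None means the provider needs no key.
KEY_VARIABLES = {
    "pixabay": "PIXABAY_API_KEY",
    "unsplash": "UNSPLASH_ACCESS_KEY",
    "pexels": "PEXELS_API_KEY",
    "wikimedia": None,
    "picsum": None,
}


def resolve_provider_chain(env):
    """Return provider names in the order they would be tried."""
    chain = []
    if env.get("ROYALTY_IMAGE_MCP_ENDPOINT"):
        chain.append("mcp")  # highest priority when configured
    configured = env.get("ROYALTY_IMAGE_PROVIDERS") or DEFAULT_CHAIN
    for name in (part.strip().lower() for part in configured.split(",")):
        if name not in KEY_VARIABLES:
            continue  # unknown names are ignored here (assumption)
        key_var = KEY_VARIABLES[name]
        if key_var is None or env.get(key_var):
            chain.append(name)  # keyless, or key is configured
    return chain
```

With an empty environment this yields `["wikimedia", "picsum"]`, because the three key-based providers in the default chain are skipped.
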
## Architecture

- **Backend**: Python (FastAPI) + SQLAlchemy + APScheduler
- **Frontend**: Alpine.js + Tailwind CSS (CDN, no build step)
- **Database**: SQLite with 30-day retention
- **Container**: Single Docker container

## API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/` | Serve frontend |
| `GET` | `/api/news/latest` | Latest news item for hero block |
| `GET` | `/api/news` | Paginated news feed |
| `GET` | `/api/health` | Health check with news count |
| `GET` | `/config` | Frontend config (analytics) |

## SEO and Structured Data Contract

Homepage (`/`) contract:
- Core metadata: `title`, `description`, `robots`, canonical link.
- Social metadata: Open Graph (`og:type`, `og:site_name`, `og:title`, `og:description`, `og:url`, `og:image`) and Twitter (`twitter:card`, `twitter:title`, `twitter:description`, `twitter:image`).
- JSON-LD graph includes:
  - `Newspaper` entity for site-level identity.
  - `NewsArticle` entities for hero and feed articles.

Policy page contract (`/terms`, `/attribution`):
- Page-specific `title` and `description`.
- `robots` metadata.
- Canonical link for the route.
- Open Graph and Twitter preview metadata.

Structured data field baseline for each `NewsArticle`:
- `headline`, `description`, `image`, `datePublished`, `dateModified`, `url`, `mainEntityOfPage`, `inLanguage`, `publisher`, `author`.

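To make the field baseline concrete, the sketch below builds a JSON-LD graph with one `Newspaper` entity and deduplicated `NewsArticle` entities. It is an illustration only: the input article keys (`url`, `updated_at`, `language`), deduplication by URL, and the use of an `Organization` for `publisher` and `author` are all assumptions that this README does not confirm.

```python
import json


def news_article_jsonld(article, site_name="ClawFort"):
    """Map a (hypothetical) article dict to the NewsArticle field baseline."""
    org = {"@type": "Organization", "name": site_name}
    return {
        "@type": "NewsArticle",
        "headline": article["headline"],
        "description": article["summary"],
        "image": article["image_url"],
        "datePublished": article["published_at"],
        "dateModified": article.get("updated_at", article["published_at"]),
        "url": article["url"],
        "mainEntityOfPage": article["url"],
        "inLanguage": article.get("language", "en"),
        "publisher": org,
        "author": org,
    }


def build_graph(articles, site_url, site_name="ClawFort"):
    """Build one Newspaper entity plus one NewsArticle per distinct URL."""
    entities = [{"@type": "Newspaper", "name": site_name, "url": site_url}]
    seen = set()
    for article in articles:
        if article["url"] in seen:
            continue  # hero and feed may contain the same article
        seen.add(article["url"])
        entities.append(news_article_jsonld(article, site_name))
    return {"@context": "https://schema.org", "@graph": entities}
```

The result can be serialized with `json.dumps(build_graph(...))` and embedded in a `<script type="application/ld+json">` tag.
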
## Delivery Performance Header Contract

FastAPI middleware applies route-specific cache and compression behavior:

| Route Class | Cache-Control | Notes |
|-------------|---------------|-------|
| `/static/*` | `public, max-age=604800, immutable` | Long-lived static assets |
| `/api/*` | `public, max-age=60, stale-while-revalidate=120` | Short-lived feed data |
| `/`, `/terms`, `/attribution` | `public, max-age=300, stale-while-revalidate=600` | HTML routes |

Additional headers:
- `Vary: Accept-Encoding` for API responses.
- `X-Content-Type-Options: nosniff` for all responses.
- Gzip compression enabled via `GZipMiddleware` for eligible payloads.

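The header contract can be expressed as a pure function, which is convenient for unit tests that should fail if the table changes unintentionally. The sketch below only mirrors the table and list above. The function names are hypothetical, it is not the actual middleware, and it makes no claim about routes the table does not list (such as `/config`).

```python
HTML_ROUTES = {"/", "/terms", "/attribution"}


def cache_control_for(path):
    """Return the Cache-Control value for a path, or None if unspecified."""
    if path.startswith("/static/"):
        return "public, max-age=604800, immutable"
    if path.startswith("/api/"):
        return "public, max-age=60, stale-while-revalidate=120"
    if path in HTML_ROUTES:
        return "public, max-age=300, stale-while-revalidate=600"
    return None


def expected_headers_for(path):
    """Collect the headers the contract expects for a given path."""
    headers = {"X-Content-Type-Options": "nosniff"}  # all responses
    if path.startswith("/api/"):
        headers["Vary"] = "Accept-Encoding"  # API responses only
    cache_control = cache_control_for(path)
    if cache_control is not None:
        headers["Cache-Control"] = cache_control
    return headers
```
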
## SEO and Performance Verification Checklist

Run after local startup (`python -m uvicorn backend.main:app --reload --port 8000`):

```bash
# HTML route cache checks
curl -I http://localhost:8000/
curl -I http://localhost:8000/terms
curl -I http://localhost:8000/attribution

# API cache + vary checks
curl -I "http://localhost:8000/api/news/latest?language=en"
curl -I "http://localhost:8000/api/news?limit=5&language=en"

# Compression check (expect Content-Encoding: gzip for eligible payloads)
curl -s -H "Accept-Encoding: gzip" -D - "http://localhost:8000/api/news?limit=10&language=en" -o /dev/null
```

Manual acceptance checks:
1. Homepage source contains one canonical link and Open Graph/Twitter metadata fields.
2. Homepage JSON-LD contains one `Newspaper` entity and deduplicated `NewsArticle` entries.
3. Hero/feed/modal images show shimmer placeholders until load/fallback completion.
4. Feed and modal images use `loading="lazy"`, explicit `width`/`height`, and `decoding="async"`.
5. Smooth scrolling behavior is enabled for in-page navigation interactions.

Structured-data validation:
- Validate JSON-LD output using schema-aware validators (e.g., Schema.org validator or equivalent tooling) and confirm `Newspaper` + `NewsArticle` entities pass required field checks.

Regression checks:
- Verify homepage rendering (hero, feed, modal).
- Verify policy-page metadata output.
- Verify cache/compression headers remain unchanged after SEO-related edits.

### GET /api/news

Query parameters:
- `cursor` (int, optional): Last item ID for pagination
- `limit` (int, default 10): Items per page (max 50)
- `exclude_hero` (int, optional): Hero item ID to exclude
- `language` (string, optional): `en`, `ta`, or `ml`; unsupported codes default to English

Response:
```json
{
  "items": [{ "id": 1, "headline": "...", "summary": "...", "source_url": "...", "image_url": "...", "image_credit": "...", "published_at": "...", "created_at": "..." }],
  "next_cursor": 5,
  "has_more": true
}
```

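A client can walk the whole feed by passing each response's `next_cursor` back as `cursor` until `has_more` is false. The following is a minimal sketch under that reading of the response shape. `iter_news` and `http_fetch_page` are hypothetical helper names, and the base URL assumes the local development server from the Quick Start.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def http_fetch_page(params, base_url="http://localhost:8000/api/news"):
    """Fetch one feed page over HTTP and decode the JSON body."""
    with urlopen(f"{base_url}?{urlencode(params)}") as response:
        return json.load(response)


def iter_news(fetch_page, limit=10, language="en"):
    """Yield feed items page by page, following next_cursor."""
    cursor = None
    while True:
        params = {"limit": limit, "language": language}
        if cursor is not None:
            params["cursor"] = cursor
        page = fetch_page(params)
        yield from page["items"]
        if not page.get("has_more"):
            break
        cursor = page["next_cursor"]
```

Against a running server, `for item in iter_news(http_fetch_page, limit=50): ...` iterates the full feed. Passing the fetch function as an argument keeps the loop testable without network access.
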
## Deployment

```bash
docker build -t clawfort .
docker run -d \
  -e PERPLEXITY_API_KEY=pplx-xxx \
  -v clawfort-data:/app/data \
  -v clawfort-images:/app/backend/static/images \
  -p 8000:8000 \
  clawfort
```

Data persists across restarts via Docker volumes.

## Scheduled Jobs

- **Hourly**: Fetch latest AI news from Perplexity API
- **Nightly (3 AM)**: Archive news older than 30 days, delete archived items older than 60 days

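The nightly retention rule can be sketched as a small decision function. This is an illustration, not the scheduler's code: `retention_action` is a hypothetical name, and it assumes both thresholds are measured from the article's publication time, which this README does not specify (the 60-day limit could instead count from the archive date).

```python
from datetime import datetime, timedelta, timezone


def retention_action(published_at, archived, now, retention_days=30, purge_days=60):
    """Decide what the nightly job would do with one news item."""
    age = now - published_at
    if archived and age > timedelta(days=purge_days):
        return "delete"
    if not archived and age > timedelta(days=retention_days):
        return "archive"
    return "keep"


now = datetime(2024, 3, 1, 3, 0, tzinfo=timezone.utc)
fresh = retention_action(now - timedelta(days=10), archived=False, now=now)
stale = retention_action(now - timedelta(days=31), archived=False, now=now)
expired = retention_action(now - timedelta(days=61), archived=True, now=now)
```
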