First deployment

2026-02-13 09:14:04 -05:00
parent 0e21e035f5
commit 679561bcdb
128 changed files with 3479 additions and 120 deletions
--- a/docs/monitoring-dashboard-config.md
+++ b/docs/monitoring-dashboard-config.md
@@ -0,0 +1,26 @@
+# Monitoring Dashboard Configuration
+
+## Objective
+
+Define baseline dashboards and alert thresholds for reliability and freshness checks.
+
+## Dashboard Panels
+
+1. API p95 latency for `/api/news` and `/api/news/latest`
+2. API error rate (`5xx`) by route
+3. Scheduler success/failure count per hour
+4. Feed freshness lag (minutes since latest published item)
+
+## Alert Thresholds
+
+- API latency alert: p95 > 750 ms for 10 minutes
+- API error-rate alert: `5xx` > 3% for 5 minutes
+- Scheduler alert: 2 consecutive failed fetch cycles
+- Freshness alert: latest item older than 120 minutes
+
+## Test Trigger Plan
+
+- Latency trigger: run stress test against `/api/news` with 50 concurrent requests in staging.
+- Error-rate trigger: simulate upstream timeout and confirm 5xx alert path.
+- Scheduler trigger: disable upstream API key in staging and verify consecutive failure alert.
+- Freshness trigger: pause scheduler for >120 minutes in staging and confirm lag alert.
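
The alert thresholds in `docs/monitoring-dashboard-config.md` could be expressed as a small evaluation function, which may also help when checking the test triggers in staging. The following is a minimal sketch, not the project's actual alerting code: the function names, the one-sample-per-minute assumption, and the input shapes are all hypothetical.

```python
from typing import List, Sequence

# Thresholds copied from the "Alert Thresholds" list.
LATENCY_P95_MS = 750.0
LATENCY_WINDOW_MIN = 10
ERROR_RATE = 0.03          # 3% of responses are 5xx
ERROR_WINDOW_MIN = 5
SCHEDULER_FAILURES = 2     # consecutive failed fetch cycles
FRESHNESS_LAG_MIN = 120


def sustained(values: Sequence[float], limit: float, window: int) -> bool:
    """True if the last `window` samples all exceed `limit`.

    Assumes one sample per minute (hypothetical sampling rate).
    """
    return len(values) >= window and all(v > limit for v in values[-window:])


def evaluate(
    p95_ms: Sequence[float],        # per-minute p95 latency samples
    error_rates: Sequence[float],   # per-minute 5xx fraction, 0.0 to 1.0
    fetch_ok: Sequence[bool],       # scheduler cycle results, oldest first
    freshness_lag_min: float,       # minutes since latest published item
) -> List[str]:
    """Return the names of the alerts that would fire."""
    alerts = []
    if sustained(p95_ms, LATENCY_P95_MS, LATENCY_WINDOW_MIN):
        alerts.append("latency")
    if sustained(error_rates, ERROR_RATE, ERROR_WINDOW_MIN):
        alerts.append("error-rate")
    recent = fetch_ok[-SCHEDULER_FAILURES:]
    if len(recent) == SCHEDULER_FAILURES and not any(recent):
        alerts.append("scheduler")
    if freshness_lag_min > FRESHNESS_LAG_MIN:
        alerts.append("freshness")
    return alerts
```

Under these assumptions, ten minutes of p95 above 750 ms fires the latency alert, while nine minutes does not. A freshness lag of exactly 120 minutes stays silent because the threshold is "older than 120 minutes".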
--- a/docs/p15-code-review-findings.md
+++ b/docs/p15-code-review-findings.md
@@ -0,0 +1,23 @@
+# P15 Code Review Findings
+
+Date: 2026-02-13
+
+## High
+
+- owner=backend area=translations finding=Machine translation output is accepted without strict language validation in the runtime flow, allowing occasional script mismatches or gibberish.
+
+## Medium
+
+- owner=frontend area=policy-disclosures finding=Terms and Attribution links previously required route navigation, reducing continuity and causing context loss.
+- owner=backend area=admin-cli finding=Image refetch previously lacked permalink-targeted repair mode, forcing broad batch operations.
+
+## Low
+
+- owner=frontend area=sharing finding=Text-based icon actions in compact surfaces reduced visual consistency on small screens.
+
+## Remediation Status
+
+- translations-quality-gate: fixed-in-progress
+- policy-modal-surface: fixed-in-progress
+- permalink-targeted-refetch: fixed-in-progress
+- icon-consistency: fixed-in-progress
--- a/docs/quality-and-monitoring.md
+++ b/docs/quality-and-monitoring.md
@@ -0,0 +1,64 @@
+# Quality and Monitoring Baseline
+
+## CI Quality Gates
+
+Pipeline file: `.github/workflows/quality-gates.yml`
+
+Stages:
+- `lint-and-test`: Ruff + pytest (coverage threshold enforced).
+- `security-scan`: `pip-audit` dependency vulnerability scan.
+
+Failure policy:
+- Any failed stage blocks merge.
+- Coverage below the enforced threshold blocks merge.
+
+## Coverage and Test Scope
+
+Current baseline suites:
+- API contracts: `tests/test_api_contracts.py`
+- DB lifecycle workflows: `tests/test_db_workflows.py`
+- Accessibility contracts: `tests/test_accessibility_contract.py`
+- Security/performance smoke checks: `tests/test_security_and_performance.py`
+
+## UX Validation Checklist
+
+Run manually on desktop and mobile viewports:
+1. Hero loads with image and CTA visible.
+2. Feed cards render with source and TL;DR CTA.
+3. Modal opens/closes with Escape and backdrop click.
+4. Share controls are visible in light and dark themes.
+5. Floating back-to-top button appears after scrolling and returns the page to the top.
+
+## Production Metrics and Alert Thresholds
+
+| Metric | Target | Alert Threshold |
+|---|---|---|
+| API p95 latency (`/api/news`) | < 350 ms | > 750 ms for 10 min |
+| API error rate (`5xx`) | < 1% | > 3% for 5 min |
+| Scheduler success rate | 100% hourly runs | 2 consecutive failures |
+| Feed freshness lag | < 75 min | > 120 min |
+
+## Alert Runbook
+
+### Incident: Elevated API latency
+1. Check DB file I/O and host CPU for saturation.
+2. Inspect recent release diff for expensive queries.
+3. Roll back latest deploy if regression is confirmed.
+
+### Incident: Scheduler failures
+1. Check API key and upstream provider status.
+2. Run `python -m backend.cli force-fetch` to reproduce the failure.
+3. Review logs for provider fallback exhaustion.
+
+### Incident: Error-rate spike
+1. Check `/api/health` response and DB availability.
+2. Identify top failing routes and common status codes.
+3. Mitigate by rolling back or disabling the relevant feature flag.
+
+## Review/Remediation Log Template
+
+Use this structure for each cycle:
+
+```text
+severity=<high|medium|low> owner=<name> area=<frontend|backend|infra> finding=<summary> status=<open|fixed>
+```
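
Entries that follow the review/remediation log template could be validated mechanically before they are filed. The sketch below is one possible approach under stated assumptions: `parse_entry` is a hypothetical helper, not part of the repository, and it assumes each entry is a single line with fields in the template's order and `finding` as the only field that may contain spaces.

```python
import re
from typing import Dict, Optional

# Mirrors the template:
# severity=<high|medium|low> owner=<name> area=<frontend|backend|infra>
# finding=<summary> status=<open|fixed>
ENTRY = re.compile(
    r"^severity=(?P<severity>high|medium|low) "
    r"owner=(?P<owner>\S+) "
    r"area=(?P<area>frontend|backend|infra) "
    r"finding=(?P<finding>.+) "
    r"status=(?P<status>open|fixed)$"
)


def parse_entry(line: str) -> Optional[Dict[str, str]]:
    """Return the entry's fields as a dict, or None if it does not match."""
    match = ENTRY.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    example = (
        "severity=high owner=backend area=backend "
        "finding=Translation output lacks language validation status=open"
    )
    print(parse_entry(example))
```

A strict pattern like this rejects values outside the template's enumerations, so an entry with a status other than `open` or `fixed` would return `None` rather than being accepted silently.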