didnt-read/IMPLEMENTATION_PLAN.md
2026-01-27 13:24:03 -05:00

# Privacy Policy Analyzer - Implementation Plan
## Overview
A self-hosted web application that analyzes privacy policies with the OpenAI API and presents each one as an A-E grade with an easy-to-understand summary.
## Tech Stack
- **Runtime**: Bun (JavaScript)
- **Database**: PostgreSQL
- **Search**: Meilisearch
- **Cache**: Redis
- **Templating**: EJS
- **AI**: OpenAI API (GPT-4o/GPT-4-turbo)
- **Containerization**: Docker Compose
## Project Structure
```
privacy-policy-analyzer/
├── docker-compose.yml          # Multi-service orchestration
├── Dockerfile                  # Bun app container
├── .env.example                # Environment variables template
├── .env                        # Actual environment variables (gitignored)
├── package.json                # Bun dependencies
├── src/
│   ├── app.js                  # Main application entry
│   ├── config/
│   │   ├── database.js         # PostgreSQL connection
│   │   ├── redis.js            # Redis connection
│   │   ├── meilisearch.js      # Meilisearch client
│   │   └── openai.js           # OpenAI client
│   ├── models/
│   │   ├── Service.js          # Service/site model
│   │   ├── PolicyVersion.js    # Policy version model
│   │   └── Analysis.js         # Analysis results model
│   ├── routes/
│   │   ├── public.js           # Public-facing routes
│   │   └── admin.js            # Admin panel routes
│   ├── controllers/
│   │   ├── publicController.js
│   │   └── adminController.js
│   ├── services/
│   │   ├── aiAnalyzer.js       # OpenAI analysis logic
│   │   ├── policyFetcher.js    # Fetch policy from URL
│   │   ├── scheduler.js        # Cron jobs
│   │   └── searchIndexer.js    # Meilisearch indexing
│   ├── middleware/
│   │   ├── auth.js             # Admin authentication
│   │   └── errorHandler.js     # Global error handling
│   ├── views/
│   │   ├── layouts/
│   │   │   └── main.ejs
│   │   ├── public/
│   │   │   ├── index.ejs       # Service listing
│   │   │   └── service.ejs     # Service detail page
│   │   └── admin/
│   │       ├── login.ejs
│   │       ├── dashboard.ejs
│   │       ├── add-service.ejs
│   │       └── edit-service.ejs
│   └── utils/
│       ├── logger.js
│       └── validators.js
└── migrations/
    └── 001_initial.sql         # Database schema
```
## Database Schema
### Services Table
```sql
CREATE TABLE services (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    url VARCHAR(500) NOT NULL,
    logo_url VARCHAR(500),
    policy_url VARCHAR(500),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```
### Policy Versions Table
```sql
CREATE TABLE policy_versions (
    id SERIAL PRIMARY KEY,
    service_id INTEGER REFERENCES services(id) ON DELETE CASCADE,
    content TEXT NOT NULL,
    content_hash VARCHAR(64) NOT NULL, -- SHA-256 hash for change detection
    fetched_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```
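The `content_hash` column can be filled with a hex-encoded SHA-256 digest, which is always 64 characters long. The sketch below is one possible way to compute it; the function name `hashPolicyContent` and the whitespace normalization are assumptions made for illustration, not part of the plan.

```javascript
// Sketch only: hashPolicyContent and its normalization rules are assumptions.
const { createHash } = require("node:crypto");

function hashPolicyContent(content) {
  // Normalize line endings and trim the ends so that trivial whitespace
  // differences do not register as a policy change (an assumed choice).
  const normalized = content.replace(/\r\n/g, "\n").trim();
  // A SHA-256 digest in hex is 64 characters, matching VARCHAR(64).
  return createHash("sha256").update(normalized, "utf8").digest("hex");
}

console.log(hashPolicyContent("Example policy text").length);
```

Normalizing before hashing trades a little precision for fewer false "policy changed" flags; hashing the raw text is the stricter alternative.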
### Analyses Table
```sql
CREATE TABLE analyses (
    id SERIAL PRIMARY KEY,
    service_id INTEGER REFERENCES services(id) ON DELETE CASCADE,
    policy_version_id INTEGER REFERENCES policy_versions(id) ON DELETE CASCADE,
    overall_score VARCHAR(1) NOT NULL, -- A, B, C, D, or E
    findings JSONB NOT NULL, -- Structured analysis results
    raw_analysis TEXT, -- Full AI response for debugging
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, -- When this analysis was created (used as "last analyzed" date)
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```
**Note**: The `created_at` field in the analyses table represents when the policy was last analyzed. This date must be displayed prominently on every service page so users know the freshness of the analysis.
### Admin Sessions Table
```sql
CREATE TABLE admin_sessions (
    id SERIAL PRIMARY KEY,
    session_token VARCHAR(255) UNIQUE NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    expires_at TIMESTAMP NOT NULL
);
```
## AI Analysis Structure
### Scoring Parameters
The AI will analyze privacy policies based on these weighted categories:
1. **Data Collection (25%)**
- What personal data is collected
- Scope of collection (minimal vs excessive)
- Collection methods (active vs passive)
2. **Data Sharing (25%)**
- Third-party sharing practices
- Purposes for sharing
- Sale of personal data
3. **User Rights (20%)**
- Data access rights
- Deletion rights
- Portability rights
- Opt-out mechanisms
4. **Data Retention (15%)**
- Retention periods
- Deletion policies
- Post-account deletion handling
5. **Tracking & Security (15%)**
- Tracking technologies used
- Security measures mentioned
- Encryption practices
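The plan gives the category weights but not how letter grades combine into the overall score. The sketch below shows one possible combination: it assumes a 5-to-1 point scale (A = 5, E = 1) and rounding to the nearest grade. Both choices, and the name `weightedGrade`, are illustrative assumptions.

```javascript
// Weights from the scoring parameters above, as integer percentages.
const WEIGHTS = {
  data_collection: 25,
  data_sharing: 25,
  user_rights: 20,
  data_retention: 15,
  tracking_security: 15,
};

// Assumed mapping between letter grades and points (not specified in the plan).
const GRADE_POINTS = { A: 5, B: 4, C: 3, D: 2, E: 1 };

function weightedGrade(breakdown) {
  let total = 0;
  for (const [category, weight] of Object.entries(WEIGHTS)) {
    const points = GRADE_POINTS[breakdown[category]];
    if (points === undefined) {
      throw new Error(`Missing or invalid grade for ${category}`);
    }
    total += points * weight;
  }
  // Weights sum to 100, so dividing yields a value between 1 and 5.
  const rounded = Math.round(total / 100);
  return Object.keys(GRADE_POINTS).find((g) => GRADE_POINTS[g] === rounded);
}

console.log(
  weightedGrade({
    data_collection: "A",
    data_sharing: "B",
    user_rights: "B",
    data_retention: "C",
    tracking_security: "C",
  })
);
```

If the model returns its own `overall_score`, a function like this could serve as a consistency check rather than the source of truth.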
### AI Output Schema (JSON)
```json
{
  "overall_score": "A|B|C|D|E",
  "score_breakdown": {
    "data_collection": "A|B|C|D|E",
    "data_sharing": "A|B|C|D|E",
    "user_rights": "A|B|C|D|E",
    "data_retention": "A|B|C|D|E",
    "tracking_security": "A|B|C|D|E"
  },
  "findings": {
    "positive": [
      {
        "category": "user_rights",
        "title": "Clear deletion process",
        "description": "Users can delete their account and data easily",
        "severity": "good"
      }
    ],
    "negative": [
      {
        "category": "data_sharing",
        "title": "Data sold to third parties",
        "description": "Personal data is sold to advertisers and partners",
        "severity": "blocker"
      }
    ],
    "neutral": [
      {
        "category": "general",
        "title": "Policy updated regularly",
        "description": "Privacy policy is reviewed and updated annually",
        "severity": "neutral"
      }
    ]
  },
  "data_types_collected": [
    "name",
    "email",
    "location",
    "device_info"
  ],
  "third_parties": [
    {
      "name": "Google Analytics",
      "purpose": "analytics",
      "data_shared": ["usage_data", "device_info"]
    }
  ],
  "summary": "Brief 2-3 sentence summary of the privacy policy"
}
```
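Even with JSON mode, the model's response should be checked against this schema before it is stored. The following is a minimal hand-rolled sketch of such a check; the function name `validateAnalysis` and the choice of which fields to verify are assumptions, and a schema library such as Zod could replace it.

```javascript
const GRADES = ["A", "B", "C", "D", "E"];
const CATEGORIES = [
  "data_collection",
  "data_sharing",
  "user_rights",
  "data_retention",
  "tracking_security",
];
const SEVERITIES = ["blocker", "bad", "neutral", "good"];

// Returns { ok, errors } instead of throwing so callers can log every problem.
function validateAnalysis(rawJson) {
  let data;
  try {
    data = JSON.parse(rawJson);
  } catch {
    return { ok: false, errors: ["response is not valid JSON"] };
  }
  if (data === null || typeof data !== "object") {
    return { ok: false, errors: ["response is not a JSON object"] };
  }
  const errors = [];
  if (!GRADES.includes(data.overall_score)) {
    errors.push("overall_score must be one of A-E");
  }
  for (const category of CATEGORIES) {
    if (!GRADES.includes(data.score_breakdown?.[category])) {
      errors.push(`score_breakdown.${category} must be one of A-E`);
    }
  }
  for (const group of ["positive", "negative", "neutral"]) {
    const items = data.findings?.[group];
    if (!Array.isArray(items)) {
      errors.push(`findings.${group} must be an array`);
      continue;
    }
    for (const finding of items) {
      if (!SEVERITIES.includes(finding?.severity)) {
        errors.push(`findings.${group} contains an unknown severity`);
      }
    }
  }
  if (typeof data.summary !== "string") {
    errors.push("summary must be a string");
  }
  return { ok: errors.length === 0, errors };
}
```

A failed validation is a natural trigger for the retry logic, with the raw response kept in `raw_analysis` for debugging.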
### Severity Levels
- **blocker**: Critical privacy concerns (red icon)
- **bad**: Significant issues (orange icon)
- **neutral**: Informational (gray icon)
- **good**: Positive privacy practices (green icon)
## Features
### Phase 1: Foundation
1. **Docker Setup**
- Bun application container
- PostgreSQL container with persistent volume
- Meilisearch container
- Redis container
- Docker network for inter-service communication
2. **Database Layer**
- Migration system
- Connection pooling
- Basic CRUD operations for all models
3. **Basic Web Server**
- Bun HTTP server or lightweight framework
- EJS templating engine setup
- Static file serving
- Request logging
### Phase 2: Core Features
1. **Admin Authentication**
- Simple login form
- Session-based authentication (stored in Redis)
- Single admin user (credentials in .env)
- Protected admin routes
2. **Service Management**
- Add new service (name, URL, policy URL)
- Edit service details
- Delete service
- List all services in admin panel
3. **Policy Fetching**
- Fetch policy from URL (with timeout and error handling)
- Support for pasting policy text directly
- Content hash generation for change detection
- Store full policy text in database
4. **AI Analysis**
- Manual trigger from admin panel
- Structured prompt engineering
- JSON mode for consistent output
- Error handling and retry logic
- Store analysis results in database
5. **Public Pages**
- Homepage with service listing (A-E grades displayed, last analyzed dates shown)
- Search functionality via Meilisearch
- Individual service detail page with prominent "last analyzed" date display
- Filter by grade
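The policy fetching step (timeout plus error handling) could look roughly like the sketch below. The function name `fetchPolicy`, the 10-second default, and the injectable `fetchImpl` parameter are assumptions chosen to keep the example testable; it relies only on the standard `fetch`, `URL`, and `AbortSignal.timeout` APIs.

```javascript
// Sketch only: names and defaults are illustrative, not the plan's API.
async function fetchPolicy(url, { timeoutMs = 10000, fetchImpl = fetch } = {}) {
  // new URL() throws on malformed input, which rejects the returned promise.
  const parsed = new URL(url);
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error("Only http(s) policy URLs are supported");
  }
  // AbortSignal.timeout aborts the request if it runs longer than timeoutMs.
  const response = await fetchImpl(parsed.href, {
    signal: AbortSignal.timeout(timeoutMs),
    redirect: "follow",
  });
  if (!response.ok) {
    throw new Error(`Policy fetch failed with HTTP ${response.status}`);
  }
  return response.text();
}
```

The returned text is the raw response body; stripping HTML down to readable policy text would be a separate step.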
### Phase 3: Enhancements
1. **Automated Policy Updates**
- Daily cron job to check all policy URLs
- Compare content hash with latest version
- Flag services with changed policies
- Admin notification of pending re-analysis
2. **Re-analysis Workflow**
- Bulk re-analysis of updated policies
- Historical analysis comparison
- Show policy change history on service page
3. **Search & Discovery**
- Full-text search via Meilisearch
- Filter by data types collected
- Filter by third parties
- Sort by grade, name, last analyzed
4. **Caching**
- Redis caching for public pages
- Cache analysis results
- Cache search results
- TTL-based cache invalidation
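The daily update check reduces to comparing each freshly computed hash with the stored one. A minimal sketch, assuming the caller has already fetched and hashed the policies; the function name `findChangedServices` and the input shapes are hypothetical.

```javascript
// services: [{ id, latestHash }] loaded from policy_versions.
// fetchedHashes: Map of service id -> hash of the newly fetched policy.
function findChangedServices(services, fetchedHashes) {
  const pending = [];
  for (const service of services) {
    const freshHash = fetchedHashes.get(service.id);
    // No entry means the fetch failed; skip rather than flag a false change.
    if (freshHash === undefined) continue;
    if (freshHash !== service.latestHash) pending.push(service.id);
  }
  return pending;
}

const pendingIds = findChangedServices(
  [
    { id: 1, latestHash: "aaa" },
    { id: 2, latestHash: "bbb" },
  ],
  new Map([
    [1, "aaa"],
    [2, "ccc"],
  ])
);
console.log(pendingIds);
```

The returned ids are what the pending-updates admin view would list for re-analysis.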
### Phase 4: Polish
1. **Error Handling**
- Global error handler middleware
- User-friendly error pages
- Graceful degradation when AI is unavailable
2. **Rate Limiting**
- Rate limit on AI analysis endpoint
- Rate limit on policy fetching
- Prevent abuse
3. **UI/UX**
- Clean, simple design
- Responsive layout
- Grade badges with colors
- Expandable finding details
4. **Monitoring**
- Basic logging
- Health check endpoint
- Analysis success/failure metrics
## API Endpoints
### Public Routes
- `GET /` - Homepage with service listing
- `GET /search?q=query` - Search services
- `GET /service/:id` - Service detail page
- `GET /api/health` - Health check
### Admin Routes
- `GET /admin/login` - Login page
- `POST /admin/login` - Authenticate
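- `POST /admin/logout` - Logout (POST, like the other state-changing admin routes, so a stray link or prefetch cannot end the session)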
- `GET /admin/dashboard` - Admin dashboard
- `GET /admin/services/new` - Add service form
- `POST /admin/services` - Create service
- `GET /admin/services/:id/edit` - Edit service form
- `POST /admin/services/:id` - Update service
- `POST /admin/services/:id/delete` - Delete service
- `POST /admin/services/:id/analyze` - Trigger analysis
- `GET /admin/pending-updates` - Services with policy changes
## Environment Variables
```env
# Database
DATABASE_URL=postgresql://user:password@postgres:5432/privacy_analyzer
# Redis
REDIS_URL=redis://redis:6379
# Meilisearch
MEILISEARCH_URL=http://meilisearch:7700
MEILISEARCH_API_KEY=your_master_key
# OpenAI
OPENAI_API_KEY=sk-your-api-key
OPENAI_MODEL=gpt-4o
# Admin
ADMIN_USERNAME=admin
ADMIN_PASSWORD=secure_password_here
SESSION_SECRET=random_session_secret
# App
PORT=3000
NODE_ENV=production
```
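Failing fast on missing configuration avoids confusing runtime errors later. The sketch below validates the variables listed above at startup; the function name `loadConfig`, the returned object shape, and the defaults for optional values are assumptions.

```javascript
const REQUIRED_VARS = [
  "DATABASE_URL",
  "REDIS_URL",
  "MEILISEARCH_URL",
  "MEILISEARCH_API_KEY",
  "OPENAI_API_KEY",
  "ADMIN_USERNAME",
  "ADMIN_PASSWORD",
  "SESSION_SECRET",
];

// env defaults to process.env but can be injected for testing.
function loadConfig(env = process.env) {
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  const port = Number.parseInt(env.PORT ?? "3000", 10);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error("PORT must be an integer between 1 and 65535");
  }
  return {
    databaseUrl: env.DATABASE_URL,
    redisUrl: env.REDIS_URL,
    meilisearchUrl: env.MEILISEARCH_URL,
    meilisearchApiKey: env.MEILISEARCH_API_KEY,
    openaiApiKey: env.OPENAI_API_KEY,
    openaiModel: env.OPENAI_MODEL ?? "gpt-4o", // assumed default, mirrors the template
    adminUsername: env.ADMIN_USERNAME,
    adminPassword: env.ADMIN_PASSWORD,
    sessionSecret: env.SESSION_SECRET,
    port,
    nodeEnv: env.NODE_ENV ?? "development", // assumed default
  };
}
```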
## Docker Compose Configuration
```yaml
version: '3.8'
services:
  app:
    build: .
    ports:
      - "3000:3000"
    # Load OPENAI_API_KEY, MEILISEARCH_API_KEY, and the admin credentials from .env;
    # values listed under `environment` take precedence over env_file entries
    env_file:
      - .env
    environment:
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/privacy_analyzer
      - REDIS_URL=redis://redis:6379
      - MEILISEARCH_URL=http://meilisearch:7700
    depends_on:
      - postgres
      - redis
      - meilisearch
    volumes:
      - ./src:/app/src
  postgres:
    image: postgres:15-alpine
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=privacy_analyzer
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data
    ports:
      - "6379:6379"
  meilisearch:
    image: getmeili/meilisearch:v1.6
    environment:
      - MEILI_MASTER_KEY=your_master_key
    volumes:
      - meilisearch_data:/meili_data
    ports:
      - "7700:7700"
volumes:
  postgres_data:
  redis_data:
  meilisearch_data:
```
## Non-Functional Requirements
### 1. Search Engine Optimization (SEO)
#### In-Page SEO
- **Title Tags**: Dynamic `<title>` for each page
- Homepage: "Privacy Policy Analyzer | Compare Website Privacy Practices"
- Service page: "{Service Name} Privacy Policy Analysis | Grade {A-E}"
- Search results: "Search Results for '{query}' | Privacy Policy Analyzer"
- **Meta Descriptions**: 150-160 character descriptions for each page
- **Open Graph Tags**: og:title, og:description, og:image, og:url for social sharing
- **Twitter Cards**: Summary cards with large images
- **Canonical URLs**: Prevent duplicate content issues
- **Structured Data (Schema.org)**:
- Organization schema for the site
- Review/Rating schema for service pages
- BreadcrumbList for navigation
#### Technical SEO
- **Sitemap.xml**: Auto-generated daily, includes all public service pages
- Lastmod timestamps from analysis `created_at` dates
- Priority levels (homepage: 1.0, service pages: 0.8, search: 0.5)
- **Robots.txt**: Allow public pages, disallow admin routes
- **URL Structure**: Clean, descriptive URLs
- `/service/facebook`
- `/search?q=google`
- `/grade/A` (filter by grade)
- **Performance**: Fast page load times (affects SEO rankings)
- **Mobile-First**: Responsive design, because search engines index the mobile rendering of each page first
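The sitemap rules above (lastmod from the latest analysis, fixed priorities) can be sketched as a pure function. The name `buildSitemap`, the input shape, and the use of numeric ids in the URL (following the `/service/:id` route) are assumptions.

```javascript
// Escape the five XML special characters so URLs are safe inside <loc>.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// services: [{ id, lastAnalyzedAt: Date }] where lastAnalyzedAt is the
// created_at of the most recent analysis.
function buildSitemap(baseUrl, services) {
  const entries = [
    `  <url><loc>${escapeXml(baseUrl + "/")}</loc><priority>1.0</priority></url>`,
  ];
  for (const service of services) {
    const loc = escapeXml(`${baseUrl}/service/${encodeURIComponent(service.id)}`);
    // Sitemap lastmod accepts a plain YYYY-MM-DD date.
    const lastmod = service.lastAnalyzedAt.toISOString().slice(0, 10);
    entries.push(
      `  <url><loc>${loc}</loc><lastmod>${lastmod}</lastmod><priority>0.8</priority></url>`
    );
  }
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    "</urlset>",
  ].join("\n");
}
```

The daily scheduler could call this and cache the result, so the sitemap route never queries the database directly.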
#### Last Updated Display
- **Required on all service pages**: Display the `created_at` date from the latest analysis
- **Format**: "Last analyzed: January 27, 2026" or relative time ("Last analyzed: 3 days ago")
- **Location**: Prominently displayed near the service name/grade
- **Purpose**: Users must know when the analysis was performed to assess freshness
- **Example placement**:
```html
<div class="service-header">
  <h1>Facebook Privacy Policy Analysis</h1>
  <span class="grade grade-e">Grade E</span>
  <p class="last-updated">Last analyzed: January 27, 2026</p>
</div>
```
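Both display formats can be produced with the built-in `Intl` APIs. This is a sketch under assumptions: the function names are hypothetical, dates are rendered in UTC, and the locale is fixed to English.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Absolute form, e.g. "Last analyzed: January 27, 2026".
function formatLastAnalyzed(createdAt) {
  const formatter = new Intl.DateTimeFormat("en-US", {
    dateStyle: "long",
    timeZone: "UTC", // assumed; a per-user time zone is also possible
  });
  return `Last analyzed: ${formatter.format(createdAt)}`;
}

// Relative form in whole days, e.g. "Last analyzed: 3 days ago".
function formatLastAnalyzedRelative(createdAt, now = new Date()) {
  const days = Math.floor((now.getTime() - createdAt.getTime()) / MS_PER_DAY);
  const formatter = new Intl.RelativeTimeFormat("en", { numeric: "auto" });
  return `Last analyzed: ${formatter.format(-days, "day")}`;
}

console.log(formatLastAnalyzed(new Date("2026-01-27T13:24:03Z")));
```

Because public pages are cached, the relative form goes stale for as long as the cache TTL; the absolute form does not.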
#### Content SEO
- **Semantic HTML5**: Proper use of `<header>`, `<nav>`, `<main>`, `<article>`, `<footer>`
- **Heading Hierarchy**: Single H1 per page, logical H2-H6 structure
- **Alt Text**: Descriptive alt text for all images (logo, grade badges)
- **Internal Linking**: Link between related services
- **Keywords**: Focus on "privacy policy analyzer", "{service} privacy", "privacy grade"
### 2. Performance Benchmarking
#### Target Metrics
| Metric | Target | Maximum |
|--------|--------|---------|
| First Contentful Paint (FCP) | < 1.0s | 1.8s |
| Largest Contentful Paint (LCP) | < 2.5s | 4.0s |
| Time to Interactive (TTI) | < 3.8s | 7.3s |
| Cumulative Layout Shift (CLS) | < 0.1 | 0.25 |
| Total Blocking Time (TBT) | < 200ms | 600ms |
| Interaction to Next Paint (INP) | < 200ms | 500ms |
#### Optimization Strategies
- **Redis Caching**:
- Public pages: 1 hour TTL
- Analysis results: 24 hour TTL
- API responses: 5 minute TTL
- Meilisearch queries: 10 minute TTL
- **Compression**: Brotli + Gzip for all text responses
- **CDN**: Serve static assets (CSS, JS, images) via CDN
- **Lazy Loading**: Load images and heavy content on-demand
- **Database Optimization**:
- Indexed columns: services.name, analyses.overall_score, policy_versions.service_id
- Query optimization with EXPLAIN ANALYZE
- Connection pooling (max 20 connections)
- **Asset Optimization**:
- Minified CSS/JS
- Optimized images (WebP format)
- Critical CSS inline
- Async/defer for non-critical scripts
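The caching strategy above is a cache-aside pattern with per-type TTLs. The sketch below uses an in-memory `Map` as a stand-in for Redis so it runs on its own; the class `TtlCache`, the helper `getOrCompute`, and the key names are assumptions, and a real Redis client would make these calls asynchronous.

```javascript
// TTLs from the plan, in seconds.
const TTL_SECONDS = {
  publicPage: 60 * 60,
  analysis: 24 * 60 * 60,
  apiResponse: 5 * 60,
  search: 10 * 60,
};

// In-memory stand-in for Redis; `now` is injectable so expiry can be tested.
class TtlCache {
  constructor(now = () => Date.now()) {
    this.now = now;
    this.entries = new Map();
  }

  get(key) {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: evict lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key, value, ttlSeconds) {
    this.entries.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
  }
}

// Cache-aside: return the cached value, or compute, store, and return it.
function getOrCompute(cache, key, ttlSeconds, compute) {
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  const value = compute();
  cache.set(key, value, ttlSeconds);
  return value;
}

const cache = new TtlCache();
const html = getOrCompute(cache, "page:/service/1", TTL_SECONDS.publicPage, () => "<html>...</html>");
console.log(html.length);
```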
#### Monitoring
- **Lighthouse CI**: Automated testing in CI/CD pipeline
- **Real User Monitoring (RUM)**: Track actual user performance
- **Uptime Monitoring**: Pingdom or similar for availability
- **Alerting**: Notify when response time > 2s or error rate > 1%
### 3. Security Standards
#### OWASP Top 10 (2017) Mitigation
1. **Injection**: Parameterized queries for all database operations
2. **Broken Authentication**:
- bcrypt for password hashing (cost factor 12)
- Secure session tokens (128-bit random)
- Session expiration (24 hours)
- Rate limiting on login (5 attempts per 15 minutes)
3. **Sensitive Data Exposure**:
- HTTPS only (HSTS header)
- Secure cookies (HttpOnly, Secure, SameSite=Strict)
- No sensitive data in URLs
- Encrypted env vars
4. **XML External Entities (XXE)**: Not applicable (no XML parsing)
5. **Broken Access Control**:
- Authentication middleware on all admin routes
- Principle of least privilege
- No directory traversal
6. **Security Misconfiguration**:
- Remove default passwords
- Disable unnecessary features
- Security headers (see below)
7. **Cross-Site Scripting (XSS)**:
- EJS auto-escaping enabled
- Content Security Policy (CSP)
- Input validation and sanitization
8. **Insecure Deserialization**: Not applicable
9. **Using Components with Known Vulnerabilities**:
- Regular dependency audits (`bun audit`)
- Automated security updates
- Container image scanning
10. **Insufficient Logging and Monitoring**:
- Log all authentication attempts
- Log all AI analysis requests
- Log errors with context
- Never log sensitive data
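The session requirements under item 2 (128-bit random tokens, 24-hour expiry) can be sketched with Node-compatible crypto calls, which Bun also provides. The function names and the session object shape are assumptions; the constant-time comparison is an added precaution, not something the plan specifies.

```javascript
const { randomBytes, timingSafeEqual } = require("node:crypto");

const SESSION_TTL_MS = 24 * 60 * 60 * 1000; // 24-hour expiration

// 16 random bytes = 128 bits, hex-encoded to a 32-character token.
function createSession(now = Date.now()) {
  return {
    token: randomBytes(16).toString("hex"),
    expiresAt: new Date(now + SESSION_TTL_MS),
  };
}

// Compare tokens in constant time to avoid leaking how many characters match.
function tokensMatch(expected, presented) {
  const a = Buffer.from(expected, "utf8");
  const b = Buffer.from(presented, "utf8");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}

function isSessionValid(session, presentedToken, now = Date.now()) {
  return session.expiresAt.getTime() > now && tokensMatch(session.token, presentedToken);
}

const session = createSession();
console.log(isSessionValid(session, session.token));
```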
#### Security Headers
```javascript
// Express-style middleware that adds security headers to every response
app.use((req, res, next) => {
  res.setHeader('Strict-Transport-Security', 'max-age=31536000; includeSubDomains');
  // 'unsafe-inline' is left out of script-src so the CSP still blocks injected scripts;
  // it stays in style-src to allow the inlined critical CSS
  res.setHeader('Content-Security-Policy', "default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline'");
  res.setHeader('X-Content-Type-Options', 'nosniff');
  res.setHeader('X-Frame-Options', 'DENY');
  // The legacy browser XSS auditor is deprecated; '0' turns it off and leaves XSS defense to the CSP
  res.setHeader('X-XSS-Protection', '0');
  res.setHeader('Referrer-Policy', 'strict-origin-when-cross-origin');
  res.setHeader('Permissions-Policy', 'geolocation=(), microphone=(), camera=()');
  next();
});
```
#### Additional Security Measures
- **Rate Limiting**:
- Public API: 100 requests per 15 minutes per IP
- Admin endpoints: 30 requests per 15 minutes per IP
- AI analysis: 10 requests per hour per admin session
- **Input Validation**:
- Validate all user inputs with Joi or Zod
- Sanitize HTML if allowing rich text
- Max length limits on all text fields
- **CORS**: Restrict to specific origins
- **Dependency Scanning**: Run `bun audit` before each deploy
- **Container Security**:
- Non-root user in Docker
- Read-only filesystem where possible
- Minimal base image (distroless or alpine)
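The rate limits above can be enforced with a simple fixed-window counter. This in-memory sketch is illustrative only: the class name `FixedWindowLimiter` and the key format are assumptions, and with more than one app instance the counters would need to live in Redis instead.

```javascript
// Fixed-window limiter: at most `limit` calls per key within each window.
class FixedWindowLimiter {
  constructor(limit, windowMs, now = () => Date.now()) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now; // injectable clock for testing
    this.windows = new Map();
  }

  allow(key) {
    const t = this.now();
    const current = this.windows.get(key);
    // Start a new window on first use or once the old window has elapsed.
    if (current === undefined || t - current.start >= this.windowMs) {
      this.windows.set(key, { start: t, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}

const MINUTE = 60 * 1000;
const limiters = {
  publicApi: new FixedWindowLimiter(100, 15 * MINUTE), // per IP
  admin: new FixedWindowLimiter(30, 15 * MINUTE), // per IP
  aiAnalysis: new FixedWindowLimiter(10, 60 * MINUTE), // per admin session
};

console.log(limiters.publicApi.allow("203.0.113.7"));
```

A fixed window permits short bursts at window boundaries; a sliding window or token bucket is the usual refinement if that matters.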
### 4. WCAG 2.1 AA Compliance
#### Perceivable (1)
- **1.1 Text Alternatives**:
- Alt text for all images (service logos, grade badges, icons)
- Decorative images have empty alt (alt="")
- **1.2 Time-based Media**: Not applicable (no video/audio)
- **1.3 Adaptable**:
- Semantic HTML structure
- Proper heading hierarchy (H1 → H2 → H3)
- ARIA labels where needed
- Table headers with proper scope attributes
- Form labels associated with inputs
- **1.4 Distinguishable**:
- Color contrast ratio ≥ 4.5:1 for normal text
- Color contrast ratio ≥ 3:1 for large text (18pt+) and UI components
- Text resizing up to 200% without loss of content
- No images of text (use actual text)
- Focus indicators visible (2px solid outline)
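The contrast thresholds in 1.4 can be checked programmatically with the relative-luminance formula defined by WCAG 2.1. The helper names below are hypothetical, and the sketch accepts only six-digit hex colors.

```javascript
// Relative luminance of an sRGB color given as "#rrggbb" (WCAG 2.1 definition).
function relativeLuminance(hex) {
  const match = /^#?([0-9a-f]{6})$/i.exec(hex);
  if (!match) throw new Error(`Expected a six-digit hex color, got "${hex}"`);
  const value = parseInt(match[1], 16);
  const [r, g, b] = [(value >> 16) & 255, (value >> 8) & 255, value & 255].map((channel) => {
    const c = channel / 255;
    // Linearize the gamma-encoded sRGB channel.
    return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
  });
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// Contrast ratio is (lighter + 0.05) / (darker + 0.05), ranging from 1 to 21.
function contrastRatio(foreground, background) {
  const a = relativeLuminance(foreground);
  const b = relativeLuminance(background);
  return (Math.max(a, b) + 0.05) / (Math.min(a, b) + 0.05);
}

// AA requires 4.5:1 for normal text and 3:1 for large text and UI components.
function meetsAA(foreground, background, { largeText = false } = {}) {
  return contrastRatio(foreground, background) >= (largeText ? 3 : 4.5);
}

console.log(meetsAA("#767676", "#ffffff"));
```

Running this over the grade badge colors in a unit test would catch contrast regressions before a manual audit does.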
#### Operable (2)
- **2.1 Keyboard Accessible**:
- All functionality available via keyboard
- Logical tab order
- No keyboard traps
- Skip to main content link
- **2.2 Enough Time**: Not applicable (no time limits)
- **2.3 Seizures and Physical Reactions**:
- No content that flashes more than 3 times per second
- **2.4 Navigable**:
- Descriptive page titles
- Breadcrumb navigation
- Multiple ways to find pages (search, browse by grade)
- Focus order matches visual order
- Link text describes destination (no "click here")
- **2.5 Input Modalities**:
- Touch targets minimum 44x44px (stricter than AA requires; this is the Level AAA target size from WCAG 2.1 criterion 2.5.5)
- No motion-based interactions required
#### Understandable (3)
- **3.1 Readable**:
- Primary language declared (lang="en")
- Simple, clear language
- Abbreviations explained on first use
- **3.2 Predictable**:
- Consistent navigation across pages
- No unexpected changes on focus/input
- Error prevention for destructive actions
- **3.3 Input Assistance**:
- Form labels and instructions
- Error messages identify field and suggest fix
- Confirmation for important actions (delete)
#### Robust (4)
- **4.1 Compatible**:
- Valid HTML5
- ARIA roles, states, and properties used correctly
- Status messages announced to screen readers
#### Implementation Checklist
- [ ] All images have alt text
- [ ] Color contrast verified (use WebAIM Contrast Checker)
- [ ] Keyboard navigation tested
- [ ] Screen reader tested (NVDA, VoiceOver, JAWS)
- [ ] Focus indicators visible
- [ ] Forms have labels and error handling
- [ ] Page titles are descriptive
- [ ] Semantic HTML5 structure
- [ ] ARIA landmarks (banner, main, navigation, contentinfo)
- [ ] Skip link implemented
#### Accessibility Testing Tools
- **Automated**: axe-core, Lighthouse, WAVE
- **Manual**: Keyboard-only navigation, screen reader testing
- **Browser**: Firefox Accessibility Inspector, Chrome DevTools
## Implementation Order
### Phase 1: Foundation
1. Create project structure and Docker Compose setup
2. Set up database and migrations
3. Create models and basic CRUD
4. Implement admin authentication with bcrypt
5. Add security headers middleware
### Phase 2: Core Features
6. Build admin panel UI (add/edit services)
7. Implement policy fetching with validation
8. Integrate OpenAI analysis with rate limiting
9. Build public pages (listing and detail) with SEO tags
10. Add sitemap.xml generation
### Phase 3: Enhancements
11. Add Meilisearch indexing
12. Implement Redis caching layer
13. Add cron jobs for policy updates
14. Optimize assets (minification, compression)
### Phase 4: Non-Functional Requirements
15. Implement accessibility features (WCAG 2.1 AA)
16. Add structured data (Schema.org)
17. Performance testing and optimization
18. Security audit and penetration testing
19. Final accessibility audit
20. Documentation
## Future Enhancements (Post-MVP)
- GDPR/CCPA compliance badges
- Browser extension for quick checks
- Policy comparison tool
- RSS feed for policy changes
- API for third-party integrations
- Multi-language support
- Export reports as PDF
- Email notifications for policy changes
## Notes
- Keep AI prompts versioned for reproducibility
- Log all AI analysis attempts (success and failure)
- Consider rate limiting on OpenAI API calls
- Store raw AI responses for debugging
- Implement graceful degradation if AI service is down
- Regular backups of PostgreSQL database
- Monitor Meilisearch disk usage