Files
headroom/openspec/changes/headroom-foundation/design.md

560 lines
19 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
## Context
Headroom is a greenfield web application to replace manual spreadsheet-based capacity planning for engineering teams. The current spreadsheet approach is error-prone, lacks validation, provides no audit trail, and wastes manager time (2+ hours monthly on allocation work).
**Current State:**
- No existing system to migrate from
- Team uses spreadsheets for capacity planning and resource allocation
- No automation, no validation, no visibility into team headroom
**Constraints:**
- MVP must be production-ready with >70% test coverage
- Must run containerized (Docker Compose) from day 1
- Existing Nginx Proxy Manager in environment (no Caddy/Traefik)
- Must support 10-15 developers across 10-12 concurrent projects
- Monthly capacity planning cycle (not real-time)
**Stakeholders:**
- Engineering managers (primary users)
- Team members (log hours, view allocations)
- Top brass (view reports only)
- Superuser/admin (system configuration)
---
## Goals / Non-Goals
**Goals:**
- Automate capacity calculations (holidays, PTO, availability)
- Validate allocations against capacity and approved estimates
- Prevent billing errors (over/under-allocation detection)
- Provide clear visibility into team headroom
- Track planned vs actual hours for utilization analysis
- Generate 5 core reports with customizable filters
- Reduce manager allocation time from 2+ hours to <30 minutes per month
- Enforce role-based access control (4 personas)
- Maintain >70% test coverage with comprehensive E2E tests
**Non-Goals:**
- Real-time notifications (deferred to Phase 2, polling is acceptable for MVP)
- PDF/CSV report exports (deferred to Phase 2, on-screen only for MVP)
- Time-tracking tool integration (manual entry only for MVP)
- Multi-tenancy (single-tenant MVP, add tenant_id later)
- Mobile app (desktop web only)
- AI-powered forecasting (rule-based validation sufficient)
---
## Decisions
### Decision 1: Two-Container Architecture (Laravel API + SvelteKit Frontend)
**Choice:** Separate Laravel API backend and SvelteKit frontend in different containers.
**Rationale:**
- Clean separation of concerns (API vs UI)
- Easier to scale independently in future
- SvelteKit is modern and great for dashboards, worth learning curve
- Laravel provides robust API development (owner has PHP background)
**Alternatives Considered:**
- Laravel + Blade templates: Rejected (less interactive UI, harder for dashboards)
- Laravel + Vue (Inertia): Rejected (owner preferred Svelte over Vue)
- SvelteKit full-stack: Rejected (owner has PHP background, prefer Laravel for API)
**Implementation:**
- Frontend: SvelteKit (port 5173), Tailwind CSS + DaisyUI, Recharts, TanStack Table
- Backend: Laravel 12 (latest) (port 3000), PostgreSQL (latest), Redis (latest)
- Communication: REST API with Laravel API Resources for consistent JSON
- Reverse proxy: Existing Nginx Proxy Manager routes `/api/*` → Laravel, `/*` → SvelteKit
---
### Decision 2: PostgreSQL from Day 1
**Choice:** Use PostgreSQL in production and development (no SQLite).
**Rationale:**
- Avoid migration pain later (SQLite → PostgreSQL is error-prone)
- Production-grade features (JSON operators, full-text search, advanced indexing)
- Better for reporting queries (complex aggregations, window functions)
- Docker volume mount preserves portability
**Alternatives Considered:**
- SQLite for local dev, PostgreSQL for prod: Rejected (migration pain, feature parity issues)
**Implementation:**
- PostgreSQL (latest, Alpine container)
- Volume-mounted to `./data/postgres` for backup/portability
- Migrations from day 1 (Laravel migrations)
- UUIDs for primary keys (prevents ID enumeration, easier distributed systems later)
---
### Decision 3: Redis Caching from Day 1
**Choice:** Implement query and response caching with Redis from the start.
**Rationale:**
- Owner insisted: "No need to sweat on refactoring it everywhere"
- Prevents technical debt accumulation
- Expensive queries (capacity calculations, reports) benefit immediately
- Easy automatic cache invalidation with Laravel
**Alternatives Considered:**
- Defer caching to Phase 2: Rejected (owner's preference for avoiding future refactoring)
**Implementation:**
- Redis (latest, Alpine container)
- Cache keys pattern: `allocations:month:{YYYY-MM}`, `reports:forecast:{from}:{to}:{hash}`
- TTL: 1 hour (allocations), 15 min (reports), 24 hours (master data)
- Automatic invalidation on mutations (create/update/delete triggers cache flush)
- Laravel cache facade for consistency
---
### Decision 4: JWT Authentication (Token-Based)
**Choice:** JWT tokens instead of session-based authentication.
**Rationale:**
- Stateless (better for API-first architecture)
- Suitable for SPA frontend
- Easier to add mobile app later (future-proofing)
- Industry standard for REST APIs
**Alternatives Considered:**
- Laravel sessions: Rejected (owner preferred JWT for future mobile support)
**Implementation:**
- tymon/jwt-auth package
- Access token: 60 minute TTL
- Refresh token: 7 day TTL (stored in Redis, one-time use with rotation)
- Token claims: user UUID, role, permissions array
- Refresh endpoint rotates tokens on each use
---
### Decision 5: SvelteKit Frontend Stack
**Choice:** SvelteKit + Tailwind CSS + DaisyUI + Recharts + TanStack Table + Superforms + Zod
**Rationale:**
- **DaisyUI**: Fast development, opinionated but speeds up dashboard creation
- **Recharts**: Good balance of power and simplicity for charts
- **TanStack Table**: Industry standard for data grids, powerful filtering/sorting
- **Superforms + Zod**: Type-safe validation, seamless SvelteKit Form Actions integration
**Alternatives Considered:**
- Shadcn/ui: Rejected (DaisyUI faster for MVP)
- Chart.js: Rejected (Recharts more powerful)
- Custom table component: Rejected (TanStack is proven, owner unfamiliar but trusts recommendation)
**Implementation:**
- Svelte stores for minimal UI state only (filters, modals)
- Fetch API for HTTP (no Axios, native is sufficient)
- Vitest for unit tests, Playwright for E2E tests
---
### Decision 6: Allocation Validation Strategy
**Choice:** Soft validation with visual indicators (GREEN/YELLOW/RED), not hard blocks.
**Rationale:**
- Managers sometimes need flexibility to over-allocate temporarily
- Hard blocks would frustrate workflow
- Visual warnings catch errors while allowing override
- "This money is my salary!" — both over and under-allocation must be flagged
**Validation Rules:**
- GREEN: Allocation = Approved estimate (100%, within ±5% tolerance)
- YELLOW: Under-allocation (<95% of approved estimate) — will undercharge
- RED: Over-allocation (>105% of approved estimate) — will overcharge
- Person capacity: YELLOW warning at >100%, RED alert at >120%
**Implementation:**
- API returns validation status with each allocation response
- Frontend displays color-coded indicators in allocation matrix
- Tooltip shows exact variance ("Over by 20 hours, will overcharge client")
---
### Decision 7: Monthly Aggregate Actuals (Not Daily)
**Choice:** Track actual hours as monthly totals, allowing incremental weekly updates.
**Rationale:**
- Monthly planning cycle doesn't require daily granularity
- Simplifies data model and UI
- Team members can update weekly and system accumulates
- No time-tracking integration for MVP (manual entry)
**Alternatives Considered:**
- Daily time logging: Rejected (over-engineering for MVP, adds complexity)
- Weekly buckets: Rejected (monthly is sufficient given monthly planning cycle)
**Implementation:**
- Actuals table: project_id, team_member_id, month (YYYY-MM), hours_logged
- UI allows replacing or incrementing monthly total
- Utilization calculated as: (Actual hours / Capacity) × 100%
---
### Decision 8: Defer Real-Time Notifications to Phase 2
**Choice:** No WebSocket notifications in MVP, users refresh to see changes.
**Rationale:**
- Allocations are planned monthly, not time-critical
- WebSocket setup adds 6 hours of dev time
- Polling every 30s is acceptable alternative but also deferred
- Focus MVP on core allocation/reporting functionality
**Alternatives Considered:**
- WebSocket + 1 notification PoC (6 hours): Rejected (not critical for monthly planning)
- Polling-based notifications (2 hours): Rejected (also deferred, users can refresh)
**Implementation (Phase 2):**
- Laravel Broadcasting with Redis adapter
- SvelteKit WebSocket client
- Events: AllocationCreated, AllocationUpdated, EstimateApproved
---
### Decision 9: Database Schema Design
**Choice:** Normalized schema with master data tables, JSON for forecasted effort, UUIDs for primary keys.
**Key Tables:**
- `team_members`: id (UUID), name, role_id (FK), hourly_rate, active
- `projects`: id (UUID), code (unique), title, status_id (FK), type_id (FK), approved_estimate, forecasted_effort (JSON)
- `allocations`: id (UUID), project_id (FK), team_member_id (FK), month (YYYY-MM), allocated_hours
- `actuals`: id (UUID), project_id (FK), team_member_id (FK), month (YYYY-MM), hours_logged
- `roles`, `project_statuses`, `project_types`: Master data tables
- `holidays`, `ptos`: Calendar data
**Design Rationale:**
- **UUIDs**: Prevent ID enumeration attacks, easier distributed systems later
- **Normalized master data**: Roles/statuses/types in separate tables for dynamic configuration
- **Month as string (YYYY-MM)**: Simplifies queries, index-friendly, human-readable
- **JSON for forecasted effort**: Flexible structure `{"2026-02": 40, "2026-03": 60}`, easy to extend
- **Soft deletes for projects**: `deleted_at` timestamp for audit trail
- **Active flag for team members**: Preserve historical allocations when person leaves
**Indexes:**
- `allocations`: composite index on (project_id, month), (team_member_id, month)
- `actuals`: composite index on (project_id, month), (team_member_id, month)
- `team_members`: index on (role_id, active)
- `projects`: index on (status_id, type_id), unique on (code)
---
### Decision 10: API Design Pattern
**Choice:** REST API with Laravel API Resources for consistent JSON responses.
**Rationale:**
- REST is simpler than GraphQL for this use case
- Laravel API Resources provide consistent transformation layer
- Standard HTTP verbs (GET, POST, PUT, DELETE)
- Easy to document with Laravel Scribe (SwaggerUI)
**Endpoint Structure:**
```
/api/auth/login (POST)
/api/auth/logout (POST)
/api/auth/refresh (POST)
/api/team-members (GET, POST)
/api/team-members/:id (GET, PUT, DELETE)
/api/projects (GET, POST)
/api/projects/:id (GET, PUT, DELETE)
/api/allocations?month=YYYY-MM (GET, POST)
/api/allocations/bulk (POST)
/api/allocations/:id (PUT, DELETE)
/api/actuals?month=YYYY-MM (GET, POST)
/api/actuals/bulk (POST)
/api/actuals/:id (PUT)
/api/reports/forecast?from=YYYY-MM&to=YYYY-MM (GET)
/api/reports/utilization?month=YYYY-MM (GET)
/api/reports/cost?month=YYYY-MM (GET)
/api/reports/allocation?month=YYYY-MM (GET)
/api/reports/variance?month=YYYY-MM (GET)
/api/master-data/roles (GET)
/api/master-data/statuses (GET)
/api/master-data/types (GET)
```
**Response Format (Laravel API Resources):**
```json
{
"data": { /* resource */ },
"meta": { /* pagination, counts */ },
"links": { /* HATEOAS links */ }
}
```
**Error Format:**
```json
{
"message": "Validation failed",
"errors": {
"allocated_hours": ["Must be greater than 0"]
}
}
```
---
### Decision 11: Testing Strategy
**Choice:** >70% code coverage with unit + E2E + regression tests on every change.
**Test Layers:**
- **Backend Unit (PHPUnit)**: Model methods, service classes, utilities
- **Backend Feature (Pest)**: API endpoints, authentication, authorization
- **Frontend Unit (Vitest)**: Svelte components, stores, utilities
- **E2E (Playwright)**: Critical user flows (login → allocate → view reports)
**Coverage Targets:**
- Backend: >80% (easier to test server-side logic)
- Frontend: >70% (UI testing is harder)
- Overall: >70% (enforced in `/opsx-verify`)
**Test Data Strategy:**
- Database seeders for test data (Laravel seeders)
- Factories for model generation (Laravel factories)
- Test fixtures for E2E tests (Playwright fixtures)
**Regression Test Approach (MVP):**
- Run full test suite on every change
- E2E tests cover happy paths + critical error cases
- Phase 2: Issue-driven loop (E2E failure → create GitHub issue → fix → retest)
**Implementation:**
- Pre-commit hooks: Run linters + unit tests
- CI/CD: Run full suite (unit + E2E) before merge
- `openspec verify` command: Check coverage, run tests, lint
---
## Risks / Trade-offs
### Risk: SvelteKit Learning Curve
**Impact:** Owner and associate unfamiliar with Svelte, may slow initial development.
**Mitigation:**
- SvelteKit has excellent documentation
- Simpler than React/Vue (less boilerplate)
- TanStack Table is framework-agnostic (owner unfamiliar but AI will guide)
- Start with simple components, iterate
---
### Risk: Two-Container Complexity
**Impact:** More moving parts than single monolith, deployment overhead.
**Mitigation:**
- Docker Compose handles orchestration
- Code-mounted volumes for hot reload (no rebuild needed)
- Owner comfortable with Docker
- Cleaner architecture worth the overhead
---
### Risk: Over-Allocation Soft Validation
**Impact:** Managers could ignore RED flags and over-allocate anyway.
**Mitigation:**
- Visual warnings are prominent (RED color, tooltip with exact impact)
- Reports show over-allocation clearly
- Phase 2: Add email notifications when allocations exceed threshold
- Manager discipline assumed (this is their job)
---
### Risk: Manual Time Entry Accuracy
**Impact:** Team members may forget to log hours or log inaccurately.
**Mitigation:**
- Utilization reports highlight under-logging (planned > actual consistently)
- Manager can follow up with team members showing low actuals
- Phase 2: Integrate with time-tracking tools (Jira, Harvest, Toggl)
- Incremental weekly updates reduce forgetting
---
### Risk: Cache Invalidation Bugs
**Impact:** Stale data shown if cache invalidation logic fails.
**Mitigation:**
- Comprehensive test coverage for cache invalidation logic
- Redis TTL ensures eventual consistency (max 1 hour stale)
- Cache keys are scoped (project, month, person)
- Invalidation triggered on all mutations (create/update/delete)
- Manual cache flush available for admins
---
### Risk: JWT Token Security
**Impact:** Stolen tokens could allow unauthorized access.
**Mitigation:**
- Refresh token rotation (one-time use)
- Short access token TTL (60 minutes)
- Refresh tokens stored in Redis (revocable)
- HTTPS enforced via Nginx Proxy Manager
- Logout invalidates refresh token
- Token includes user role for authorization checks
---
### Trade-off: No Real-Time Notifications
**Benefit:** Saves 6 hours of dev time, keeps MVP scope tight.
**Cost:** Users must manually refresh to see allocation changes.
**Justification:** Allocations are monthly planning activity, not time-critical. Acceptable for MVP.
---
### Trade-off: No PDF/CSV Exports
**Benefit:** Faster MVP, avoids report formatting complexity.
**Cost:** Users cannot export reports for offline viewing or stakeholder sharing.
**Justification:** On-screen reports are primary value, exports are nice-to-have for Phase 2.
---
### Trade-off: Manual Time Entry
**Benefit:** Avoids vendor lock-in, no integration complexity.
**Cost:** Team members must manually enter hours monthly.
**Justification:** Monthly aggregate is low overhead (~5 minutes per person per month).
---
## Migration Plan
**Deployment Steps:**
1. **Initial Setup:**
- Run `docker-compose up` (creates 4 containers)
- Laravel migrations create database schema
- Database seeders populate master data (roles, statuses, types)
- Create superuser account via Laravel seeder
2. **Data Import (Optional):**
- If team has historical spreadsheet data, create import script
- Import team members (name, role, hourly rate)
- Import active projects (code, title, approved estimate)
- Do NOT import historical allocations (start fresh)
3. **User Onboarding:**
- Train managers on allocation workflow (1 hour session)
- Demo: capacity planning → project setup → allocation → reports
- Provide Quick Start guide (Markdown doc)
4. **Go-Live:**
- Managers create February 2026 capacity plan (holidays, PTO, availability)
- Managers allocate resources for February
- Team members log February actuals mid-month (incremental updates)
- Month-end: Review utilization reports, adjust March allocations
**Rollback Strategy:**
- MVP is greenfield (no data migration to revert)
- If critical bug discovered, roll back to previous container image
- Docker Compose down/up with previous image tag
- PostgreSQL data persisted in volume (safe across container restarts)
- Zero-downtime rollback: Blue/green deployment (Phase 2, not needed for MVP)
**Monitoring (Phase 2):**
- Application logs (Laravel log files)
- Database performance (PostgreSQL slow query log)
- Cache hit rate (Redis INFO stats)
- API response times (Laravel Telescope or custom middleware)
---
## Open Questions
### Question 1: Hourly Rate Visibility
Should developers see their own hourly rate, or only managers/top brass?
**Options:**
- A) Developers can see their own rate (transparency)
- B) Developers cannot see rates (only allocations)
**Recommendation:** A (transparency fosters trust, rate is not secret in most orgs)
**Decision:** To be finalized with owner before implementation.
---
### Question 2: Hours Per Day Configuration
Is "1.0 availability = 8 hours" globally configured, or per-project?
**Options:**
- A) Global setting (e.g., 1.0 = 8 hours for everyone)
- B) Per-team member (some people work 6-hour days)
- C) Per-project (different billing rates for different project types)
**Recommendation:** A (global setting, simplest for MVP)
**Decision:** Owner mentioned "configurable per project" but likely meant per team. Clarify.
---
### Question 3: PTO Approval Workflow
Is PTO auto-approved, or does it require manager approval?
**Options:**
- A) Auto-approved (capacity reduced immediately)
- B) Requires approval (pending state until manager approves)
**Recommendation:** B (manager approval, prevents abuse)
**Decision:** Owner likely expects approval workflow. Confirm.
---
### Question 4: Support Projects in Revenue Forecast
Should "Support" type projects appear in revenue forecasts?
**Options:**
- A) Exclude from revenue (they're ongoing ops, not billable)
- B) Include in revenue (still billable internally)
**Recommendation:** Ask owner's preference (may vary by org)
**Decision:** To be confirmed during implementation.
---
### Question 5: Allocation Tolerance Threshold
What's the tolerance for "within estimate" (GREEN indicator)?
**Current assumption:** ±5% (e.g., 100-hour project allocated 95-105 hours is GREEN)
**Confirm:** Is 5% the right threshold, or should it be configurable?
**Decision:** Start with 5%, make configurable in Phase 2 if needed.
---
**End of Design Document**
**Next Steps:**
1. Review open questions with owner
2. Finalize database schema (ERD diagram)
3. Create tasks.md (implementation checklist)
4. Begin Sprint 1: Docker Compose setup + database migrations