# agents.md

## Project: Local AI Video Generation from Text Storyboards (Windows + RTX 5070 12GB)

### 0) Who this is for

The owner (user) is not an ML expert. The system must:

- be reproducible (conda + requirements)
- have guardrails (configs, logs, validation)
- be test-driven (pytest)
- maintain docs (developer + user)

---

## 1) High-Level Goal

Build a local pipeline that converts **text-only storyboards** into **15–30 second videos** by:

1) converting the storyboard into a shot plan
2) generating shot clips (T2V, or I2V when possible)
3) assembling clips into a final MP4
4) upscaling to 2K/4K if desired

This is a **shot-based** system, not "one prompt makes a whole movie".

---

## 2) Hard Constraints (Hardware & OS)

Target system:

- Windows 11
- NVIDIA RTX 5070 (12GB VRAM)
- Must use GPU.
- 32GB RAM
- 2TB SSD
- Anaconda available

Design must be stable under 12GB VRAM using:

- fp16/bf16
- attention slicing
- xFormers / SDPA where supported
- optional CPU offload

---

## 3) Output Targets (Realistic)

- Native generation: 720p–1080p (preferred)
- Final delivery: 1080p required; 2K/4K via upscaling
- Duration: 15–30s per video (may be segmented)
- FPS: 24 default
- Output: MP4 (H.264/H.265)

---

## 4) CUDA 13.1 Reality & PyTorch Plan (Critical)

The user has CUDA Toolkit 13.1 installed. Current PyTorch builds generally ship with, and target, CUDA 12.x runtimes. We must NOT assume PyTorch will build or run against the local CUDA 13.1 toolkit.

**Plan:**

- Use **PyTorch prebuilt binaries that bundle the CUDA runtime** (e.g., cu121 / cu124).
- Rely on NVIDIA driver compatibility rather than the local CUDA toolkit version.
- Avoid compiling custom CUDA extensions unless necessary.

Implementation notes:

- Prefer installing PyTorch via conda or pip using official CUDA 12.x builds.
- If xFormers causes build issues, use PyTorch SDPA and disable xFormers.
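The bundled-runtime assumption above can be sanity-checked at environment-setup time. A minimal sketch; the helper name `check_gpu_environment` is ours (not a library API), and it degrades gracefully when torch is not yet installed:

```python
import importlib.util


def check_gpu_environment() -> dict:
    """Report the CUDA runtime bundled with PyTorch plus basic GPU info.

    Note: torch.version.cuda is the runtime PyTorch was *built* against
    (e.g. "12.1" for a cu121 wheel), not the locally installed toolkit,
    so the CUDA 13.1 toolkit on disk is irrelevant here.
    """
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}
    import torch

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "bundled_cuda_runtime": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        props = torch.cuda.get_device_properties(0)
        info["device_name"] = props.name
        info["vram_gb"] = round(props.total_memory / 1024**3, 1)
    return info


if __name__ == "__main__":
    for key, value in check_gpu_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is `False` on this machine, the fix is almost always an install of the wrong wheel (CPU-only, or a mismatched cu12x build), not the local toolkit version.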
---

## 5) Approved Stack (Do Not Deviate)

### Core

- Python 3.10 or 3.11 (conda env)
- PyTorch (CUDA 12.x build: cu121 or cu124)
- diffusers + transformers + accelerate + safetensors
- ffmpeg for assembly
- opencv-python for frame IO (if needed)
- pydantic for config/schema validation
- rich / loguru for logs

### Testing

- pytest
- pytest-cov
- snapshot-style tests where feasible (metadata + shapes, not visual perfection)

### Docs

- /docs/developer.md (developer documentation)
- /docs/user.md (user manual)
- Keep docs updated alongside code changes.

---

## 6) Video Models (Pragmatic Choices)

### Primary (target)

- WAN 2.x family (T2V; optional I2V if supported in the chosen pipeline)

Goal: best possible quality on consumer VRAM with chunking.

### Secondary / fallback

- Stable Video Diffusion (SVD) if WAN is unstable
- LTX-Video (only if it fits and is stable in our stack)

All model backends must implement the same interface:

- `generate_shot(shot_spec) -> video_file + metadata`

---

## 7) Canonical Input: Storyboard JSON

Storyboard source is text-only (often AI-generated). We will store and validate it as JSON.
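Validation can be sketched with pydantic, per the approved stack. The field names below are illustrative assumptions only; the canonical schema is whatever `templates/storyboard.template.json` defines:

```python
import json
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class Shot(BaseModel):
    """One shot in the storyboard. Field names here are illustrative."""
    id: str
    prompt: str
    duration_s: float = Field(gt=0, le=10)  # per-shot clip length, seconds
    camera: Optional[str] = None            # e.g. "slow pan left"
    seed: Optional[int] = None              # None -> assigned by the runner


class Storyboard(BaseModel):
    """Top-level document with global defaults that shots inherit."""
    title: str
    fps: int = 24
    resolution: str = "1280x720"
    style: str = ""                         # global style prompt fragment
    shots: List[Shot]


def load_storyboard(text: str) -> Storyboard:
    """Parse and validate storyboard JSON; raises ValidationError on bad input."""
    return Storyboard(**json.loads(text))
```

A bad document (say, a negative `duration_s`) raises a `ValidationError` with field-level messages, which is exactly the guardrail behaviour section 0 asks for.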
A template exists at: `templates/storyboard.template.json`

We will later build a utility script:

- input: plain text fields or a simple text format
- output: valid storyboard JSON

---

## 8) Pipeline Modules (Required)

### A) Storyboard parsing & validation

- Load storyboard JSON
- Validate the schema
- Expand defaults (fps, resolution, global style)
- Produce a normalized shot list

### B) Prompt compilation

- Merge global style + shot prompt + camera notes
- Produce positive + negative prompts
- Keep output deterministic via seeds

### C) Generation runner (per shot)

- For each shot: generate a clip
- Support:
  - seed control
  - chunking (e.g., generate 4–6 seconds, then continue)
  - optional init-frame handoff between shots

### D) Assembly

- Use ffmpeg concat to build the final video
- Optionally add:
  - transitions
  - temp audio
  - burned-in shot IDs for debugging mode

### E) Upscaling (optional)

- Upscale the final video to 2K/4K (post step)
- Keep this modular so the user can skip it.

---

## 9) Determinism & Logging (Must Have)

For each shot and final render, save:

- prompts (positive/negative)
- seed(s)
- model + revision/hash info if available
- inference params (steps, cfg, sampler, resolution, fps, frames)
- timing + VRAM notes if possible

Every run produces a folder:

- `outputs/<project>/<run_id>/`
  - `shots/`
  - `assembled/`
  - `metadata/`

---

## 10) Testing Rules (Hard Requirement)

- Tests must be written alongside features.
- Whenever a file/function is modified, corresponding tests MUST be updated.
- Prefer tests that verify:
  - schema validation works
  - prompt compiler output is stable
  - the shot planner expands durations into frame counts correctly
  - assembly command lines are correct
  - metadata is generated correctly

Do not require "visual quality" assertions. Test structure and determinism.
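The duration-to-frame-count rule and the assembly command line are exactly the kind of deterministic logic these rules target. A sketch, assuming hypothetical helpers `duration_to_frames` and `build_concat_command` in the eventual `src/` layout:

```python
import math


def duration_to_frames(duration_s: float, fps: int = 24) -> int:
    """Convert a shot duration to a whole frame count, rounding up
    so a shot is never shorter than requested."""
    return math.ceil(duration_s * fps)


def build_concat_command(list_file: str, out_file: str) -> list:
    """Build the ffmpeg concat-demuxer command as an argv list
    (safe to pass to subprocess without shell quoting)."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",
        "-i", list_file,
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        out_file,
    ]


# Tests in the style section 10 asks for: structure, not visuals.
def test_duration_to_frames():
    assert duration_to_frames(5.0, fps=24) == 120
    assert duration_to_frames(4.1, fps=24) == 99  # ceil(98.4)
    assert duration_to_frames(0.5, fps=24) == 12


def test_concat_command_shape():
    cmd = build_concat_command("shots.txt", "final.mp4")
    assert cmd[0] == "ffmpeg"
    assert cmd[cmd.index("-f") + 1] == "concat"
    assert cmd[-1] == "final.mp4"
```

Both tests run under plain pytest with no GPU, no models, and no ffmpeg binary present, so they belong in the fast default test suite.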
---

## 11) Documentation Rules (Hard Requirement)

Maintain these continuously:

- docs/developer.md
  - architecture
  - install steps
  - how to run tests
  - how to add a new model backend
- docs/user.md
  - quickstart
  - how to create storyboard JSON
  - how to run generation
  - where outputs go
  - troubleshooting (VRAM, drivers, ffmpeg)

Docs must be updated whenever CLI flags, file formats, or workflows change.

---

## 12) Project Files to Maintain

Required:

- requirements.txt (pip deps)
- environment.yml (conda env)
- templates/storyboard.template.json
- docs/developer.md
- docs/user.md
- src/ (implementation)
- tests/ (pytest)

---

## 13) Definition of Done

A feature is "done" only if:

- implemented
- tests added/updated
- docs updated
- reproducible install instructions remain valid

End of file.