agents.md
Project: Local AI Video Generation from Text Storyboards (Windows + RTX 5070 12GB)
0) Who this is for
The owner (user) is not an ML expert. The system must:
- be reproducible (conda + requirements)
- have guardrails (configs, logs, validation)
- be test-driven (pytest)
- maintain docs (developer + user)
1) High-Level Goal
Build a local pipeline that converts text-only storyboards into 15-30 second videos by:
- converting storyboard -> shot plan
- generating shot clips (T2V or I2V when possible)
- assembling clips into a final MP4
- upscaling to 2K/4K if desired
This is a shot-based system, not "one prompt makes a whole movie".
2) Hard Constraints (Hardware & OS)
Target system:
- Windows 11
- NVIDIA RTX 5070 (12GB VRAM) - Must use GPU.
- 32GB RAM
- 2TB SSD
- Anaconda available
Design must be stable under 12GB VRAM using:
- fp16/bf16
- attention slicing
- xFormers / SDPA where supported
- optional CPU offload
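The VRAM savers above can be applied generically. The sketch below is duck-typed so it works with any pipeline object; the method names (`enable_attention_slicing`, `enable_model_cpu_offload`, `enable_vae_slicing`) match diffusers' public pipeline API, and any backend that lacks one simply skips it:

```python
def apply_low_vram_settings(pipe):
    """Enable every memory saver the pipeline exposes; calls are skipped
    for backends that lack the method, so this is safe across models."""
    for name in ("enable_attention_slicing",   # lower peak VRAM, slower attention
                 "enable_model_cpu_offload",   # park idle submodules in system RAM
                 "enable_vae_slicing"):        # decode frames in slices
        fn = getattr(pipe, name, None)
        if callable(fn):
            fn()
    return pipe
```

Note that `enable_model_cpu_offload` requires accelerate (already in the approved stack) and trades throughput for headroom, so keep it optional as listed above.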
3) Output Targets (Realistic)
- Native generation: 720p-1080p (preferred)
- Final delivery: 1080p required; 2K/4K via upscaling
- Duration: 15-30s per video (may be segmented)
- FPS: 24 default
- Output: MP4 (H.264/H.265)
4) CUDA 13.1 Reality & PyTorch Plan (Critical)
User has CUDA Toolkit 13.1 installed. Current PyTorch builds generally ship with and target CUDA 12.x runtimes. We must NOT assume PyTorch will build/run against the locally installed CUDA 13.1 toolkit.
Plan:
- Use PyTorch prebuilt binaries that bundle CUDA runtime (cu121/cu124/cu128).
- Rely on NVIDIA driver compatibility rather than local CUDA toolkit version.
- Avoid compiling custom CUDA extensions unless necessary.
Implementation notes:
- For RTX 5070 (sm_120), use CUDA 12.8 wheels via pip:
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
- Prefer conda for Python, ffmpeg, and general deps; use pip for torch if sm_120 support is required.
- If xFormers causes build issues, use PyTorch SDPA and disable xFormers.
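The wheel-selection rule above can be encoded as a small helper for install-time sanity checks. This is a sketch: the capability thresholds follow the notes above (compute capability 12.0 = sm_120, Blackwell, e.g. RTX 5070, which needs cu128; earlier sm_8x/sm_9x consumer GPUs also run on cu121/cu124):

```python
def wheel_for_capability(major: int, minor: int) -> str:
    """Pick the PyTorch CUDA wheel index tag for a GPU's compute capability.
    sm_120 (Blackwell) requires cu128; cu124 is a safe default for older cards."""
    if (major, minor) >= (12, 0):
        return "cu128"
    return "cu124"
```

At runtime the capability pair comes from `torch.cuda.get_device_capability(0)`, so a setup script can verify the installed wheel matches the GPU before generation starts.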
5) Approved Stack (Do Not Deviate)
Core
- Python 3.10 or 3.11 (conda env)
- PyTorch (CUDA 12.x build, cu121/cu124/cu128)
- diffusers + transformers + accelerate + safetensors
- ffmpeg for assembly
- opencv-python for frame IO (if needed)
- pydantic for config/schema validation
- rich / loguru for logs
- ftfy for text normalization (required by WAN)
Testing
- pytest
- pytest-cov
- snapshot-ish tests where feasible (metadata + shapes, not visual perfection)
Docs
- /docs/developer.md (developer documentation)
- /docs/user.md (user manual)
- Keep docs updated alongside code changes.
6) Video Models (Pragmatic Choices)
Primary (target)
- WAN 2.x family (T2V; optional I2V if supported in the chosen pipeline)
Goal: best possible quality on consumer VRAM with chunking.
Secondary / fallback
- Stable Video Diffusion (SVD) if WAN is unstable
- LTX-Video (only if it fits and is stable in our stack)
All model backends must implement the same interface:
- generate_shot(shot_spec) -> video_file + metadata
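A minimal sketch of that shared interface, using a `Protocol` so backends need no common base class. `ShotResult` and its fields are illustrative names, not final:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol, runtime_checkable


@dataclass
class ShotResult:
    video_file: Path   # rendered clip for this shot
    metadata: dict     # prompts, seeds, inference params, timings


@runtime_checkable
class VideoBackend(Protocol):
    def generate_shot(self, shot_spec: dict) -> ShotResult:
        """Render one shot and return the clip path plus its metadata."""
        ...
```

Any of the WAN / SVD / LTX wrappers then satisfies `VideoBackend` by implementing `generate_shot`, which keeps the runner and tests backend-agnostic.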
7) Canonical Input: Storyboard JSON
Storyboard source is text-only (often AI-generated). We will store and validate it as JSON.
A template exists at: templates/storyboard.template.json
We will later build a utility script:
- input: plain text fields or a simple text format
- output: valid storyboard JSON
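For orientation, a hypothetical storyboard might look like the fragment below. The field names here are illustrative only; the authoritative shape is whatever templates/storyboard.template.json defines:

```json
{
  "title": "Example",
  "fps": 24,
  "resolution": "1280x720",
  "global_style": "cinematic, soft light",
  "shots": [
    {
      "id": "s01",
      "duration_s": 4.0,
      "prompt": "a red kite rising over dunes",
      "camera": "slow tilt up"
    }
  ]
}
```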
8) Pipeline Modules (Required)
A) Storyboard parsing & validation
- Load storyboard JSON
- Validate schema
- Expand defaults (fps, resolution, global style)
- Produce normalized shot list
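Since pydantic is in the approved stack, validation plus default expansion can look roughly like this. Field names are illustrative assumptions, not the final schema:

```python
from pydantic import BaseModel


class Shot(BaseModel):
    id: str
    duration_s: float
    prompt: str
    camera: str = ""          # optional camera notes


class Storyboard(BaseModel):
    title: str
    fps: int = 24             # defaults expanded here, per module A
    resolution: str = "1280x720"
    global_style: str = ""
    shots: list[Shot]
```

Loading raw JSON through `Storyboard(**data)` both validates the schema and fills in the global defaults, yielding the normalized shot list in one step.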
B) Prompt compilation
- Merge global style + shot prompt + camera notes
- Produce positive + negative prompts
- Keep deterministic via seeds
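A sketch of the compiler: pure functions, so identical inputs always yield identical prompts and seeds. The merge order and the negative-prompt default are placeholders to tune later:

```python
def compile_prompts(global_style: str, shot_prompt: str, camera: str = "",
                    negative: str = "blurry, low quality, artifacts") -> dict:
    """Merge shot prompt, camera notes, and global style into one positive
    prompt; empty parts are dropped so output stays clean."""
    positive = ", ".join(p for p in (shot_prompt, camera, global_style) if p)
    return {"positive": positive, "negative": negative}


def shot_seed(base_seed: int, shot_index: int) -> int:
    """Derive a stable per-shot seed so reruns reproduce every shot."""
    return (base_seed + shot_index) % (2 ** 32)
```

Because both functions are deterministic, the "prompt compiler output is stable" tests in section 10 reduce to plain equality assertions.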
C) Generation runner (per shot)
- For each shot: generate clip
- Support:
- seed control
- chunking (e.g., generate 4-6 seconds then continue)
- optional init frame handoff between shots
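The chunking step can be planned ahead of generation as a list of per-call frame counts. A minimal sketch, assuming the backend accepts an arbitrary frame count per call:

```python
def plan_chunks(duration_s: float, fps: int = 24, chunk_s: float = 4.0) -> list[int]:
    """Split a shot's duration into per-call frame counts that fit in VRAM.
    chunk_s is the clip length one generation call can handle (e.g. 4-6 s)."""
    remaining = round(duration_s * fps)
    chunk = max(1, round(chunk_s * fps))
    counts = []
    while remaining > 0:
        counts.append(min(chunk, remaining))
        remaining -= counts[-1]
    return counts
```

The runner then iterates over these counts, passing each chunk's last frame as the init frame for the next call when the backend supports I2V continuation.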
D) Assembly
- Use ffmpeg concat to build final video
- Optionally add:
- transitions
- temp audio
- burn-in shot IDs for debugging mode
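Building the ffmpeg concat invocation in Python keeps it testable, matching the "assembly command lines are correct" rule in section 10. File names and the codec default are illustrative:

```python
from pathlib import Path


def write_concat_list(clips: list[Path], list_file: Path) -> None:
    """Write the concat-demuxer input file: one `file 'path'` line per clip."""
    list_file.write_text("\n".join(f"file '{c.as_posix()}'" for c in clips))


def concat_command(list_file: Path, out_file: Path, codec: str = "libx264") -> list[str]:
    """Build (but do not run) the ffmpeg concat command for the final MP4."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",   # concat demuxer, allow relative paths
        "-i", str(list_file),
        "-c:v", codec, "-pix_fmt", "yuv420p",
        str(out_file),
    ]
```

Tests assert on the returned argument list and the list-file contents; only an integration test needs ffmpeg actually installed.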
E) Upscaling (optional)
- Upscale final to 2K/4K (post step)
- Keep this modular so user can skip.
9) Determinism & Logging (Must Have)
For each shot and final render, save:
- prompts (positive/negative)
- seed(s)
- model + revision/hash info if available
- inference params (steps, cfg, sampler, resolution, fps, frames)
- timing + VRAM notes if possible
Every run produces a folder:
- outputs/<run_id>/
  - shots/
  - assembled/
  - metadata/
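Persisting the metadata listed above can be a single helper. A sketch, assuming metadata arrives as a plain dict; `sort_keys` keeps files diff-stable across reruns:

```python
import json
from pathlib import Path


def save_shot_metadata(run_dir: Path, shot_id: str, meta: dict) -> Path:
    """Write one shot's metadata (prompts, seeds, params, timings) as JSON
    under the run folder's metadata/ subdirectory."""
    out_dir = run_dir / "metadata"
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{shot_id}.json"
    path.write_text(json.dumps(meta, indent=2, sort_keys=True))
    return path
```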
10) Testing Rules (Hard Requirement)
- Tests must be written alongside features.
- Whenever a file/function is modified, corresponding tests MUST be updated.
- Prefer tests that verify:
- schema validation works
- prompt compiler output is stable
- shot planner expands durations -> frame counts
- assembly command lines are correct
- metadata is generated correctly
Do not require visual quality assertions. Test structure and determinism.
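An illustrative pytest in the preferred style, checking the duration-to-frame-count expansion. `frames_for` is a hypothetical helper standing in for the real shot planner:

```python
def frames_for(duration_s: float, fps: int) -> int:
    """Hypothetical planner helper: expand a duration to a frame count."""
    return round(duration_s * fps)


def test_duration_expands_to_frame_count():
    # Structural, deterministic assertions only; no visual checks.
    assert frames_for(4.0, 24) == 96
    assert frames_for(2.5, 24) == 60
```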
11) Documentation Rules (Hard Requirement)
Maintain these continuously:
- docs/developer.md
- architecture
- install steps
- how to run tests
- how to add a new model backend
- docs/user.md
- quickstart
- how to create storyboard JSON
- how to run generation
- where outputs go
- troubleshooting (VRAM, drivers, ffmpeg)
Docs must be updated whenever CLI flags, file formats, or workflows change.
12) Project Files to Maintain
Required:
- requirements.txt (pip deps)
- environment.yml (conda env)
- templates/storyboard.template.json
- docs/developer.md
- docs/user.md
- src/ (implementation)
- tests/ (pytest)
13) Definition of Done
A feature is "done" only if:
- implemented
- tests added/updated
- docs updated
- reproducible install instructions remain valid
End of file.