# agents.md
## Project: Local AI Video Generation from Text Storyboards (Windows + RTX 5070 12GB)
### 0) Who this is for
The owner (user) is not an ML expert. The system must:
- be reproducible (conda + requirements)
- have guardrails (configs, logs, validation)
- be test-driven (pytest)
- maintain docs (developer + user)
---
## 1) High-Level Goal
Build a local pipeline that converts **text-only storyboards** into **15–30 second videos** by:
1) converting storyboard -> shot plan
2) generating shot clips (T2V or I2V when possible)
3) assembling clips into a final MP4
4) upscaling to 2K/4K if desired
This is a **shot-based** system, not “one prompt makes a whole movie”.
---
## 2) Hard Constraints (Hardware & OS)
Target system:
- Windows 11
- NVIDIA RTX 5070 (12GB VRAM) - Must use GPU.
- 32GB RAM
- 2TB SSD
- Anaconda available
Design must be stable under 12GB VRAM using:
- fp16/bf16
- attention slicing
- xFormers / SDPA where supported
- optional CPU offload
---
## 3) Output Targets (Realistic)
- Native generation: 720p–1080p (preferred)
- Final delivery: 1080p required; 2K/4K via upscaling
- Duration: 15–30s per video (may be segmented)
- FPS: 24 default
- Output: MP4 (H.264/H.265)
---
## 4) CUDA 13.1 Reality & PyTorch Plan (Critical)
User has CUDA Toolkit 13.1 installed. Current PyTorch builds generally ship with and target CUDA 12.x runtimes.
We must NOT assume PyTorch will build or run against the local CUDA 13.1 toolkit.
**Plan:**
- Use **PyTorch prebuilt binaries that bundle CUDA runtime** (e.g., cu121 / cu124).
- Rely on NVIDIA driver compatibility rather than local CUDA toolkit version.
- Avoid compiling custom CUDA extensions unless necessary.
Implementation notes:
- Prefer installing PyTorch via conda or pip using official CUDA 12.x builds.
- If xFormers causes build issues, use PyTorch SDPA and disable xFormers.
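A quick runtime sanity check helps confirm the plan above: the prebuilt wheel bundles its own CUDA 12.x runtime, so only the NVIDIA driver matters, not the locally installed toolkit. This is a minimal sketch using standard `torch` APIs:

```python
# Sanity check: confirm the prebuilt PyTorch wheel (bundled CUDA 12.x
# runtime) can reach the GPU through the NVIDIA driver. The local
# CUDA 13.1 toolkit is irrelevant here -- only the driver matters.
import torch

def report_cuda() -> dict:
    info = {
        "torch": torch.__version__,
        "cuda_runtime": torch.version.cuda,  # version bundled in the wheel, e.g. "12.4"
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device"] = torch.cuda.get_device_name(0)
        info["vram_gb"] = torch.cuda.get_device_properties(0).total_memory / 1e9
    return info

if __name__ == "__main__":
    print(report_cuda())
```

If `cuda_available` is `False` here, fix the install (driver or wheel index) before touching any model code.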
---
## 5) Approved Stack (Do Not Deviate)
### Core
- Python 3.10 or 3.11 (conda env)
- PyTorch (CUDA 12.x build: cu121 or cu124)
- diffusers + transformers + accelerate + safetensors
- ffmpeg for assembly
- opencv-python for frame IO (if needed)
- pydantic for config/schema validation
- rich / loguru for logs
### Testing
- pytest
- pytest-cov
- snapshot-ish tests where feasible (metadata + shapes, not visual perfection)
### Docs
- /docs/developer.md (developer documentation)
- /docs/user.md (user manual)
- Keep docs updated alongside code changes.
---
## 6) Video Models (Pragmatic Choices)
### Primary (target)
- WAN 2.x family (T2V; optional I2V if supported in chosen pipeline)
Goal: best possible quality on consumer VRAM with chunking.
### Secondary / fallback
- Stable Video Diffusion (SVD) if WAN is unstable
- LTX-Video (only if it fits and is stable in our stack)
All model backends must implement the same interface:
- generate_shot(shot_spec) -> video_file + metadata
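The shared interface could be sketched as a `Protocol` so backends stay swappable. All names below (`ShotSpec`, `ShotResult`, the field list) are illustrative assumptions, not a fixed API:

```python
# Sketch of the common backend interface: every model backend returns a
# clip path plus metadata, so the runner and assembler never need to know
# which model produced a shot. Field names here are assumptions.
from dataclasses import dataclass, field
from pathlib import Path
from typing import Protocol

@dataclass
class ShotSpec:
    shot_id: str
    prompt: str
    negative_prompt: str = ""
    seed: int = 0
    duration_s: float = 4.0
    fps: int = 24
    width: int = 1280
    height: int = 720

@dataclass
class ShotResult:
    video_file: Path
    metadata: dict = field(default_factory=dict)

class VideoBackend(Protocol):
    def generate_shot(self, shot_spec: ShotSpec) -> ShotResult: ...
```

A WAN, SVD, or LTX-Video wrapper then only has to implement `generate_shot`; everything downstream is backend-agnostic.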
---
## 7) Canonical Input: Storyboard JSON
Storyboard source is text-only (often AI-generated). We will store and validate it as JSON.
A template exists at: `templates/storyboard.template.json`
We will later build a utility script:
- input: plain text fields or a simple text format
- output: valid storyboard JSON
---
## 8) Pipeline Modules (Required)
### A) Storyboard parsing & validation
- Load storyboard JSON
- Validate schema
- Expand defaults (fps, resolution, global style)
- Produce normalized shot list
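Since pydantic is in the approved stack, validation could look like the sketch below. The field names are assumptions until `templates/storyboard.template.json` is finalized:

```python
# Hypothetical pydantic schema for storyboard validation. Field names and
# defaults are placeholders, not the final template.
from pydantic import BaseModel, Field

class Shot(BaseModel):
    id: str
    prompt: str
    camera: str = ""
    duration_s: float = Field(4.0, gt=0)

class Storyboard(BaseModel):
    title: str
    fps: int = 24                  # expanded default
    resolution: str = "1280x720"   # expanded default
    global_style: str = ""
    shots: list[Shot]
```

Loading a storyboard is then `Storyboard.model_validate(json.loads(text))` (or `parse_obj` on pydantic v1), and schema errors surface before any GPU time is spent.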
### B) Prompt compilation
- Merge global style + shot prompt + camera notes
- Produce positive + negative prompts
- Keep deterministic via seeds
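The merge step could be as small as a pure function, which keeps it trivially deterministic and testable. The ordering and default negative prompt below are assumptions:

```python
# Minimal prompt compiler sketch: merges shot prompt, camera notes, and
# global style into one positive prompt. A pure function with no state,
# so identical inputs always yield identical outputs.
def compile_prompt(global_style: str, shot_prompt: str, camera: str = "",
                   base_negative: str = "blurry, low quality") -> tuple[str, str]:
    parts = [p.strip() for p in (shot_prompt, camera, global_style) if p and p.strip()]
    positive = ", ".join(parts)
    return positive, base_negative
```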
### C) Generation runner (per shot)
- For each shot: generate clip
- Support:
  - seed control
  - chunking (e.g., generate 4–6 seconds then continue)
  - optional init frame handoff between shots
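The chunking rule above can be sketched as a pure duration-to-frames planner (chunk size and fps defaults are assumptions):

```python
# Sketch of duration -> frame-count chunking: a 15 s shot at 24 fps is
# split into several 4-6 s segments so each generation call stays within
# the 12 GB VRAM budget.
def plan_chunks(duration_s: float, fps: int = 24, max_chunk_s: float = 6.0) -> list[int]:
    """Return per-chunk frame counts that together cover the full duration."""
    total_frames = round(duration_s * fps)
    max_frames = int(max_chunk_s * fps)
    chunks = []
    while total_frames > 0:
        n = min(total_frames, max_frames)
        chunks.append(n)
        total_frames -= n
    return chunks
```

The runner would generate each chunk in turn, optionally passing the last frame of one chunk as the init frame of the next.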
### D) Assembly
- Use ffmpeg concat to build final video
- Optionally add:
  - transitions
  - temp audio
  - burn-in shot IDs for debugging mode
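Building (not yet executing) the ffmpeg concat command keeps assembly testable without ffmpeg installed. The ffmpeg concat demuxer expects a list file with one `file '<path>'` line per clip:

```python
# Sketch of ffmpeg concat assembly: write the list file the concat
# demuxer expects, and build the command as an argv list so tests can
# assert on it without invoking ffmpeg.
from pathlib import Path

def write_concat_list(clips: list[Path], list_file: Path) -> None:
    list_file.write_text("".join(f"file '{c.as_posix()}'\n" for c in clips))

def concat_cmd(list_file: Path, out_file: Path) -> list[str]:
    # -c copy avoids re-encoding when all clips share codec/resolution/fps
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", str(list_file), "-c", "copy", str(out_file)]
```

Running it is then `subprocess.run(concat_cmd(...), check=True)`; transitions or burn-ins would swap `-c copy` for a filter graph.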
### E) Upscaling (optional)
- Upscale final to 2K/4K (post step)
- Keep this modular so user can skip.
---
## 9) Determinism & Logging (Must Have)
For each shot and final render, save:
- prompts (positive/negative)
- seed(s)
- model + revision/hash info if available
- inference params (steps, cfg, sampler, resolution, fps, frames)
- timing + VRAM notes if possible
Every run produces a folder:
- outputs/<project>/<timestamp>/
  - shots/
  - assembled/
  - metadata/
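Persisting that record could be one small helper; the exact keys are assumptions, but the principle is that everything needed to reproduce a shot lands in a JSON file next to the clip:

```python
# Sketch of per-shot metadata persistence. Keys beyond shot_id are
# whatever the caller passes (seed, steps, cfg, sampler, ...), so the
# schema can grow without changing this helper.
import json
from datetime import datetime
from pathlib import Path

def save_shot_metadata(out_dir: Path, shot_id: str, **params) -> Path:
    out_dir.mkdir(parents=True, exist_ok=True)
    record = {"shot_id": shot_id, "saved_at": datetime.now().isoformat(), **params}
    path = out_dir / f"{shot_id}.json"
    path.write_text(json.dumps(record, indent=2, default=str))
    return path
```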
---
## 10) Testing Rules (Hard Requirement)
- Tests must be written alongside features.
- Whenever a file/function is modified, corresponding tests MUST be updated.
- Prefer tests that verify:
  - schema validation works
  - prompt compiler output is stable
  - shot planner expands durations -> frame counts
  - assembly command lines are correct
  - metadata is generated correctly
Do not require “visual quality” assertions. Test structure and determinism.
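An illustrative pytest module in this spirit (the helper here stands in for a real `src/` function and is not the final implementation):

```python
# Illustrative pytest style: assert on structure and determinism, never
# on visual quality. seconds_to_frames is a stand-in for a src/ helper.
def seconds_to_frames(duration_s: float, fps: int = 24) -> int:
    return round(duration_s * fps)

def test_duration_expands_to_frames():
    assert seconds_to_frames(15, 24) == 360
    assert seconds_to_frames(30, 24) == 720

def test_identical_inputs_identical_outputs():
    # determinism: the same inputs must always yield the same plan
    assert seconds_to_frames(4.5) == seconds_to_frames(4.5)
```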
---
## 11) Documentation Rules (Hard Requirement)
Maintain these continuously:
- docs/developer.md
  - architecture
  - install steps
  - how to run tests
  - how to add a new model backend
- docs/user.md
  - quickstart
  - how to create storyboard JSON
  - how to run generation
  - where outputs go
  - troubleshooting (VRAM, drivers, ffmpeg)
Docs must be updated whenever CLI flags, file formats, or workflows change.
---
## 12) Project Files to Maintain
Required:
- requirements.txt (pip deps)
- environment.yml (conda env)
- templates/storyboard.template.json
- docs/developer.md
- docs/user.md
- src/ (implementation)
- tests/ (pytest)
---
## 13) Definition of Done
A feature is “done” only if:
- implemented
- tests added/updated
- docs updated
- reproducible install instructions remain valid
End of file.