video-gen/AGENTS.md
Project: Local AI Video Generation from Text Storyboards (Windows + RTX 5070 12GB)

0) Who this is for

The owner (user) is not an ML expert. The system must:

  • be reproducible (conda + requirements)
  • have guardrails (configs, logs, validation)
  • be test-driven (pytest)
  • maintain docs (developer + user)

1) High-Level Goal

Build a local pipeline that converts text-only storyboards into 15-30 second videos by:

  1. converting storyboard -> shot plan
  2. generating shot clips (T2V or I2V when possible)
  3. assembling clips into a final MP4
  4. upscaling to 2K/4K if desired

This is a shot-based system, not "one prompt makes a whole movie".


2) Hard Constraints (Hardware & OS)

Target system:

  • Windows 11
  • NVIDIA RTX 5070 (12GB VRAM) - Must use GPU.
  • 32GB RAM
  • 2TB SSD
  • Anaconda available

Design must be stable under 12GB VRAM using:

  • fp16/bf16
  • attention slicing
  • xFormers / SDPA where supported
  • optional CPU offload

3) Output Targets (Realistic)

  • Native generation: 720p-1080p (preferred)
  • Final delivery: 1080p required; 2K/4K via upscaling
  • Duration: 15-30s per video (may be segmented)
  • FPS: 24 default
  • Output: MP4 (H.264/H.265)

4) CUDA 13.1 Reality & PyTorch Plan (Critical)

User has CUDA Toolkit 13.1 installed. Current PyTorch builds generally ship with and target CUDA 12.x runtimes. We must NOT assume PyTorch will build or run against the locally installed CUDA 13.1 toolkit.

Plan:

  • Use PyTorch prebuilt binaries that bundle CUDA runtime (cu121/cu124/cu128).
  • Rely on NVIDIA driver compatibility rather than local CUDA toolkit version.
  • Avoid compiling custom CUDA extensions unless necessary.

Implementation notes:

  • For RTX 5070 (sm_120), use CUDA 12.8 wheels via pip: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
  • Prefer conda for Python, ffmpeg, and general deps; use pip for torch if sm_120 support is required.
  • If xFormers causes build issues, use PyTorch SDPA and disable xFormers.
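A quick sanity check (hypothetical helper, not part of the pipeline) that reports what the installed torch wheel actually sees, and degrades gracefully when torch is absent:

```python
import importlib.util


def torch_cuda_report() -> str:
    """Summarize the installed PyTorch/CUDA situation in one line."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch
    if not torch.cuda.is_available():
        return (f"torch {torch.__version__} "
                f"(bundled CUDA {torch.version.cuda}): no GPU visible")
    cap = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    return (f"torch {torch.__version__} / CUDA {torch.version.cuda} / "
            f"{name} sm_{cap[0]}{cap[1]}")
```

For the RTX 5070 the report should end in `sm_120`; if it does not, the wrong wheel (pre-cu128) is installed.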

5) Approved Stack (Do Not Deviate)

Core

  • Python 3.10 or 3.11 (conda env)
  • PyTorch (CUDA 12.x build, cu121/cu124/cu128)
  • diffusers + transformers + accelerate + safetensors
  • ffmpeg for assembly
  • opencv-python for frame IO (if needed)
  • pydantic for config/schema validation
  • rich / loguru for logs
  • ftfy for text normalization (required by WAN)

Testing

  • pytest
  • pytest-cov
  • snapshot-ish tests where feasible (metadata + shapes, not visual perfection)

Docs

  • /docs/developer.md (developer documentation)
  • /docs/user.md (user manual)
  • Keep docs updated alongside code changes.

6) Video Models (Pragmatic Choices)

Primary (target)

  • WAN 2.x family (T2V; optional I2V if supported in the chosen pipeline).
    Goal: best possible quality on consumer VRAM, using chunking.

Secondary / fallback

  • Stable Video Diffusion (SVD) if WAN is unstable
  • LTX-Video (only if it fits and is stable in our stack)

All model backends must implement the same interface:

  • generate_shot(shot_spec) -> video_file + metadata
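One way to pin that interface down so all backends stay swappable. Everything beyond the `generate_shot(shot_spec)` signature from the bullet above (field names, defaults) is illustrative:

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Protocol, runtime_checkable


@dataclass
class ShotSpec:
    shot_id: str
    prompt: str
    negative_prompt: str = ""
    seed: int = 0
    duration_s: float = 4.0
    fps: int = 24
    width: int = 1280
    height: int = 720


@dataclass
class ShotResult:
    video_file: Path
    metadata: dict = field(default_factory=dict)


@runtime_checkable
class VideoBackend(Protocol):
    """Every model backend (WAN, SVD, LTX-Video) implements this."""

    def generate_shot(self, shot_spec: ShotSpec) -> ShotResult: ...
```

A `Protocol` keeps backends duck-typed: a WAN wrapper and an SVD wrapper need no shared base class, only a matching `generate_shot`.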

7) Canonical Input: Storyboard JSON

Storyboard source is text-only (often AI-generated). We will store and validate it as JSON.

A template exists at: templates/storyboard.template.json

We will later build a utility script:

  • input: plain text fields or a simple text format
  • output: valid storyboard JSON

8) Pipeline Modules (Required)

A) Storyboard parsing & validation

  • Load storyboard JSON
  • Validate schema
  • Expand defaults (fps, resolution, global style)
  • Produce normalized shot list

B) Prompt compilation

  • Merge global style + shot prompt + camera notes
  • Produce positive + negative prompts
  • Keep deterministic via seeds

C) Generation runner (per shot)

  • For each shot: generate clip
  • Support:
    • seed control
    • chunking (e.g., generate 4-6 seconds then continue)
    • optional init frame handoff between shots

D) Assembly

  • Use ffmpeg concat to build final video
  • Optionally add:
    • transitions
    • temp audio
    • burn-in shot IDs for debugging mode

E) Upscaling (optional)

  • Upscale final to 2K/4K (post step)
  • Keep this modular so user can skip.

9) Determinism & Logging (Must Have)

For each shot and final render, save:

  • prompts (positive/negative)
  • seed(s)
  • model + revision/hash info if available
  • inference params (steps, cfg, sampler, resolution, fps, frames)
  • timing + VRAM notes if possible

Every run produces a folder:

  • outputs/&lt;run_id&gt;/
    • shots/
    • assembled/
    • metadata/

10) Testing Rules (Hard Requirement)

  • Tests must be written alongside features.
  • Whenever a file/function is modified, corresponding tests MUST be updated.
  • Prefer tests that verify:
    • schema validation works
    • prompt compiler output is stable
    • shot planner expands durations -> frame counts
    • assembly command lines are correct
    • metadata is generated correctly

Do not require visual quality assertions. Test structure and determinism.


11) Documentation Rules (Hard Requirement)

Maintain these continuously:

  • docs/developer.md
    • architecture
    • install steps
    • how to run tests
    • how to add a new model backend
  • docs/user.md
    • quickstart
    • how to create storyboard JSON
    • how to run generation
    • where outputs go
    • troubleshooting (VRAM, drivers, ffmpeg)

Docs must be updated whenever CLI flags, file formats, or workflows change.


12) Project Files to Maintain

Required:

  • requirements.txt (pip deps)
  • environment.yml (conda env)
  • templates/storyboard.template.json
  • docs/developer.md
  • docs/user.md
  • src/ (implementation)
  • tests/ (pytest)

13) Definition of Done

A feature is "done" only if:

  • implemented
  • tests added/updated
  • docs updated
  • reproducible install instructions remain valid

End of file.