video-gen/AGENTS.md
Project: Local AI Video Generation from Text Storyboards (Windows + RTX 5070 12GB)

0) Who this is for

The owner (user) is not an ML expert. The system must:

  • be reproducible (conda + requirements)
  • have guardrails (configs, logs, validation)
  • be test-driven (pytest)
  • maintain docs (developer + user)

1) High-Level Goal

Build a local pipeline that converts text-only storyboards into 15-30 second videos by:

  1. converting storyboard -> shot plan
  2. generating shot clips (T2V or I2V when possible)
  3. assembling clips into a final MP4
  4. upscaling to 2K/4K if desired

This is a shot-based system, not "one prompt makes a whole movie".


2) Hard Constraints (Hardware & OS)

Target system:

  • Windows 11
  • NVIDIA RTX 5070 (12GB VRAM) - Must use GPU.
  • 32GB RAM
  • 2TB SSD
  • Anaconda available

Design must be stable under 12GB VRAM using:

  • fp16/bf16
  • attention slicing
  • xFormers / SDPA where supported
  • optional CPU offload

3) Output Targets (Realistic)

  • Native generation: 720p-1080p (preferred)
  • Final delivery: 1080p required; 2K/4K via upscaling
  • Duration: 15-30s per video (may be segmented)
  • FPS: 24 default
  • Output: MP4 (H.264/H.265)

4) CUDA 13.1 Reality & PyTorch Plan (Critical)

User has CUDA Toolkit 13.1 installed. Current PyTorch builds generally ship with and target CUDA 12.x runtimes. We must NOT assume PyTorch will build or run against the locally installed CUDA 13.1 toolkit.

Plan:

  • Use PyTorch prebuilt binaries that bundle CUDA runtime (cu121/cu124/cu128).
  • Rely on NVIDIA driver compatibility rather than local CUDA toolkit version.
  • Avoid compiling custom CUDA extensions unless necessary.

Implementation notes:

  • For RTX 5070 (sm_120), use CUDA 12.8 wheels via pip: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
  • Prefer conda for Python, ffmpeg, and general deps; use pip for torch if sm_120 support is required.
  • If xFormers causes build issues, use PyTorch SDPA and disable xFormers.
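A quick sanity check (hypothetical helper, not part of the pipeline) that reports what the installed torch wheel actually sees, and degrades gracefully when torch is absent:

```python
import importlib.util


def torch_cuda_report() -> str:
    """Summarize the installed PyTorch/CUDA situation in one line."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch
    if not torch.cuda.is_available():
        return (f"torch {torch.__version__} "
                f"(bundled CUDA {torch.version.cuda}): no GPU visible")
    cap = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    return (f"torch {torch.__version__} / CUDA {torch.version.cuda} / "
            f"{name} sm_{cap[0]}{cap[1]}")
```

For the RTX 5070 the report should end in `sm_120`; if it does not, the wrong wheel (pre-cu128) is installed.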

5) Approved Stack (Do Not Deviate)

Core

  • Python 3.10 or 3.11 (conda env)
  • PyTorch (CUDA 12.x build, cu121/cu124/cu128)
  • diffusers + transformers + accelerate + safetensors
  • ffmpeg for assembly
  • opencv-python for frame IO (if needed)
  • pydantic for config/schema validation
  • rich / loguru for logs
  • ftfy for text normalization (required by WAN)

Testing

  • pytest
  • pytest-cov
  • snapshot-ish tests where feasible (metadata + shapes, not visual perfection)

Docs

  • /docs/developer.md (developer documentation)
  • /docs/user.md (user manual)
  • Keep docs updated alongside code changes.

6) Video Models (Pragmatic Choices)

Primary (target)

  • WAN 2.x family (T2V; optional I2V if supported in the chosen pipeline).
    Goal: best possible quality on consumer VRAM, using chunking.

Secondary / fallback

  • Stable Video Diffusion (SVD) if WAN is unstable
  • LTX-Video (only if it fits and is stable in our stack)

All model backends must implement the same interface:

  • generate_shot(shot_spec) -> video_file + metadata
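One way to pin that interface down so all backends stay swappable. Everything beyond the `generate_shot(shot_spec)` signature from the bullet above (field names, defaults) is illustrative:

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Protocol, runtime_checkable


@dataclass
class ShotSpec:
    shot_id: str
    prompt: str
    negative_prompt: str = ""
    seed: int = 0
    duration_s: float = 4.0
    fps: int = 24
    width: int = 1280
    height: int = 720


@dataclass
class ShotResult:
    video_file: Path
    metadata: dict = field(default_factory=dict)


@runtime_checkable
class VideoBackend(Protocol):
    """Every model backend (WAN, SVD, LTX-Video) implements this."""

    def generate_shot(self, shot_spec: ShotSpec) -> ShotResult: ...
```

A `Protocol` keeps backends duck-typed: a WAN wrapper and an SVD wrapper need no shared base class, only a matching `generate_shot`.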

7) Canonical Input: Storyboard JSON

Storyboard source is text-only (often AI-generated). We will store and validate it as JSON.

A template exists at: templates/storyboard.template.json

We will later build a utility script:

  • input: plain text fields or a simple text format
  • output: valid storyboard JSON

8) Pipeline Modules (Required)

A) Storyboard parsing & validation

  • Load storyboard JSON
  • Validate schema
  • Expand defaults (fps, resolution, global style)
  • Produce normalized shot list

B) Prompt compilation

  • Merge global style + shot prompt + camera notes
  • Produce positive + negative prompts
  • Keep deterministic via seeds

C) Generation runner (per shot)

  • For each shot: generate clip
  • Support:
    • seed control
    • chunking (e.g., generate 4-6 seconds then continue)
    • optional init frame handoff between shots

D) Assembly

  • Use ffmpeg concat to build final video
  • Optionally add:
    • transitions
    • temp audio
    • burn-in shot IDs for debugging mode

E) Upscaling (optional)

  • Upscale final to 2K/4K (post step)
  • Keep this modular so user can skip.

9) Determinism & Logging (Must Have)

For each shot and final render, save:

  • prompts (positive/negative)
  • seed(s)
  • model + revision/hash info if available
  • inference params (steps, cfg, sampler, resolution, fps, frames)
  • timing + VRAM notes if possible

Every run produces a folder:

  • outputs/&lt;run_id&gt;/
    • shots/
    • assembled/
    • metadata/

10) Testing Rules (Hard Requirement)

  • Tests must be written alongside features.
  • Whenever a file/function is modified, corresponding tests MUST be updated.
  • Prefer tests that verify:
    • schema validation works
    • prompt compiler output is stable
    • shot planner expands durations -> frame counts
    • assembly command lines are correct
    • metadata is generated correctly

Do not require visual quality assertions. Test structure and determinism.


11) Documentation Rules (Hard Requirement)

Maintain these continuously:

  • docs/developer.md
    • architecture
    • install steps
    • how to run tests
    • how to add a new model backend
  • docs/user.md
    • quickstart
    • how to create storyboard JSON
    • how to run generation
    • where outputs go
    • troubleshooting (VRAM, drivers, ffmpeg)

Docs must be updated whenever CLI flags, file formats, or workflows change.


12) Project Files to Maintain

Required:

  • requirements.txt (pip deps)
  • environment.yml (conda env)
  • templates/storyboard.template.json
  • docs/developer.md
  • docs/user.md
  • src/ (implementation)
  • tests/ (pytest)

13) Definition of Done

A feature is "done" only if:

  • implemented
  • tests added/updated
  • docs updated
  • reproducible install instructions remain valid

End of file.