Merge pull request #384 from Stability-AI/vikram/sv4d

Adds SV4D code
Stability-AI · Jul 24, 2024 · 31fe459
2 parents fbdc58c + abe9ed3

Showing 16 changed files with 3,174 additions and 23 deletions.
diff --git a/README.md b/README.md
@@ -4,6 +4,30 @@
 
 ## News
 
+
+**July 24, 2024**
+- We are releasing **[Stable Video 4D (SV4D)](https://huggingface.co/stabilityai/sv4d)**, a video-to-4D diffusion model for novel-view video synthesis, for research purposes:
+    - **SV4D** was trained to generate 40 frames (5 video frames x 8 camera views) at 576x576 resolution, given 5 context frames (the input video), and 8 reference views (synthesised from the first frame of the input video, using a multi-view diffusion model like SV3D) of the same size, ideally white-background images with one object.
+    - To generate longer novel-view videos (21 frames), we propose a novel sampling method using SV4D, by first sampling 5 anchor frames and then densely sampling the remaining frames while maintaining temporal consistency.
+    - Please check our [project page](), [tech report]() and [video summary]() for more details.
+
+**QUICKSTART** : `python scripts/sampling/simple_video_sample_4d.py --input_path assets/test_video1.mp4 --output_folder outputs/sv4d` (after downloading [SV4D](https://huggingface.co/stabilityai/sv4d) and [SV3D_u](https://huggingface.co/stabilityai/sv3d) from HuggingFace)
+
+To run **SV4D** on a single input video of 21 frames:
+- Download SV3D models (`sv3d_u.safetensors` and `sv3d_p.safetensors`) from [here](https://huggingface.co/stabilityai/sv3d) and SV4D model (`sv4d.safetensors`) from [here](https://huggingface.co/stabilityai/sv4d) to `checkpoints/`
+- Run `python scripts/sampling/simple_video_sample_4d.py --input_path <path/to/video>`
+    - `input_path` : The input video `<path/to/video>` can be
+      - a single video file in `gif` or `mp4` format, such as `assets/test_video1.mp4`, or
+      - a folder containing images of video frames in `.jpg`, `.jpeg`, or `.png` format, or
+      - a file name pattern matching images of video frames.
+    - `num_steps` : The default is 20; increase it to 50 for better quality at the cost of longer sampling time.
+    - `sv3d_version` : To specify the SV3D model to generate reference multi-views, set `--sv3d_version=sv3d_u` for SV3D_u or `--sv3d_version=sv3d_p` for SV3D_p.
+    - `elevations_deg` : To generate novel-view videos at a specified elevation (default elevation is 10) using SV3D_p (default is SV3D_u), run `python scripts/sampling/simple_video_sample_4d.py --input_path assets/test_video1.mp4 --sv3d_version sv3d_p --elevations_deg 30.0`
+    - **Background removal** : For input videos with a plain background, (optionally) use [rembg](https://github.com/danielgatis/rembg) to remove the background and crop the video frames by setting `--remove_bg=True`. To obtain higher-quality outputs on real-world input videos (with a noisy background), try segmenting the foreground object using [Clipdrop](https://clipdrop.co/) before running SV4D.
+
+  ![tile](assets/sv4d.gif)
+
+
 **March 18, 2024**
 - We are releasing **[SV3D](https://huggingface.co/stabilityai/sv3d)**, an image-to-video model for novel multi-view synthesis, for research purposes:
     - **SV3D** was trained to generate 21 frames at resolution 576x576, given 1 context frame of the same size, ideally a white-background image with one object.

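The README hunk above describes three accepted forms for `input_path` and a handful of command-line options. The following is a minimal sketch of how a caller might classify an input and assemble the command line before launching the script. The helper names `classify_input` and `build_command` are hypothetical and are not part of this PR. The sketch also assumes that `num_steps` is passed as `--num_steps`, in the same style as the other flags shown in the README. It does not reproduce the script's own input handling.

```python
import glob
import os

# Extensions taken from the README text; the helpers below are illustrative only.
VIDEO_EXTS = {".gif", ".mp4"}
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}
SCRIPT = "scripts/sampling/simple_video_sample_4d.py"


def _ext(path):
    """Lower-cased file extension, including the leading dot."""
    return os.path.splitext(path)[1].lower()


def classify_input(input_path):
    """Return (kind, paths), where kind is 'video', 'folder', or 'pattern'."""
    if os.path.isfile(input_path) and _ext(input_path) in VIDEO_EXTS:
        return "video", [input_path]
    if os.path.isdir(input_path):
        frames = sorted(
            os.path.join(input_path, name)
            for name in os.listdir(input_path)
            if _ext(name) in IMAGE_EXTS
        )
        return "folder", frames
    # Anything else is treated as a file name pattern over image frames.
    frames = sorted(p for p in glob.glob(input_path) if _ext(p) in IMAGE_EXTS)
    return "pattern", frames


def build_command(input_path, output_folder=None, num_steps=None,
                  sv3d_version=None, elevations_deg=None, remove_bg=False):
    """Assemble an argument list using only the options named in the README."""
    if sv3d_version not in (None, "sv3d_u", "sv3d_p"):
        raise ValueError("sv3d_version must be 'sv3d_u' or 'sv3d_p'")
    cmd = ["python", SCRIPT, "--input_path", input_path]
    if output_folder is not None:
        cmd += ["--output_folder", output_folder]
    if num_steps is not None:
        # Assumption: num_steps uses the same --name form as the other flags.
        cmd += ["--num_steps", str(num_steps)]
    if sv3d_version is not None:
        cmd += ["--sv3d_version", sv3d_version]
    if elevations_deg is not None:
        cmd += ["--elevations_deg", str(elevations_deg)]
    if remove_bg:
        cmd += ["--remove_bg=True"]
    return cmd


if __name__ == "__main__":
    kind, frames = classify_input("assets/test_video1.mp4")
    print(kind, len(frames))
    print(" ".join(build_command("assets/test_video1.mp4",
                                 output_folder="outputs/sv4d")))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the repository root once the checkpoints are in `checkpoints/`.
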
diff --git a/assets/hiphop_parrot.mp4 b/assets/hiphop_parrot.mp4
diff --git a/assets/sv4d.gif b/assets/sv4d.gif
diff --git a/assets/test_video1.mp4 b/assets/test_video1.mp4
diff --git a/assets/test_video2.mp4 b/assets/test_video2.mp4