Report v1.1.0

In April 2024, we launched Open-Sora-Plan v1.0.0, featuring a simple and efficient design along with remarkable performance in text-to-video generation. Its data and models have already been adopted as a foundation in numerous research projects.

Today, we are excited to present Open-Sora-Plan v1.1.0, which significantly improves video generation quality and duration.

Compared to the previous version, Open-Sora-Plan v1.1.0 includes the following improvements:

  1. Better compressed visual representations. We optimized the CausalVideoVAE architecture, which now delivers stronger performance and higher inference efficiency.
  2. Higher-quality, longer video generation. We used higher-quality visual data and captions generated by ShareGPT4Video, enabling the model to better understand the workings of the world.

Along with performance improvements, Open-Sora-Plan v1.1.0 maintains the minimalist design and data efficiency of v1.0.0. Remarkably, we found that v1.1.0 exhibits performance similar to that of Sora's base model, suggesting that our version's evolution follows the same scaling behavior demonstrated by Sora.

Open-Source Release

We open-source Open-Sora-Plan to facilitate future development of video generation in the community. Code, data, and models will be made publicly available.

  • Demo: Hugging Face demo here.
  • Code: All training scripts and sample scripts.
  • Model: Both the Diffusion Model and the CausalVideoVAE here.
  • Data: Both raw videos and captions here.

Gallery

221×512×512 Text-to-Video Generation

Samples at 221×512×512 (9.2s). Prompts (abbreviated as in the gallery):

  • This close-up shot of a Victoria crowned pigeon showcases its striking blue plumage ...
  • a cat wearing sunglasses and working as a lifeguard at pool.
  • Photorealistic closeup video of two pirate ships battling each other as they sail ...
  • A movie trailer featuring the adventures of the 30 year old spaceman wearing a red wool ...
  • A snowy forest landscape with a dirt road running through it. The road is flanked by ...
  • Drone shot along the Hawaii jungle coastline, sunny day. Kayaks in the water.
  • Alpacas wearing knit wool sweaters, graffiti background, sunglasses.
  • The camera rotates around a large stack of vintage televisions all showing different ...
  • A drone camera circles around a beautiful historic church built on a rocky outcropping ...
  • Aerial view of Santorini during the blue hour, showcasing the stunning architecture ...
  • A robot dog explores the surface of Mars, kicking up red dust as it investigates ...
  • An aerial shot of a lighthouse standing tall on a rocky cliff, its beacon cutting ...
  • 3D animation of a small, round, fluffy creature with big, expressive eyes explores ...
  • A corgi vlogging itself in tropical Maui.
  • A single drop of liquid metal falls from a floating orb, landing on a mirror-like ...
  • The video presents an abstract composition centered around a hexagonal shape adorned ...

65×512×512 Text-to-Video Generation

Samples at 65×512×512 (2.7s). Prompts (abbreviated as in the gallery):

  • Extreme close-up of chicken and green pepper kebabs grilling on a barbeque with flames.
  • 3D animation of a small, round, fluffy creature with big, expressive eyes explores a ...
  • A corgi vlogging itself in tropical Maui.
  • In a studio, there is a painting depicting a ship sailing through the rough sea.
  • A robot dog trots down a deserted alley at night, its metallic paws clinking softly ...
  • A solitary spider weaves its web in a quiet corner. The web shimmers and glows with ...
  • A lone surfer rides a massive wave, skillfully maneuvering through the surf. The water ...
  • A solitary cheetah sprints across the savannah, its powerful muscles propelling it ...
  • A solitary astronaut plants a flag on an alien planet covered in crystal formations ...
  • At dawn's first light, a spaceship slowly exits the edge of the galaxy against a ...
  • A dapper puppy in a miniature suit, basking in the afternoon sun, adjusting his tie ...
  • A wise old elephant painting abstract art with its trunk, each stroke a burst of color ...
  • In an ornate, historical hall, a massive tidal wave peaks and begins to crash. Two ...
  • A Shiba Inu dog wearing a beret and black turtleneck.
  • A painting of a boat on water comes to life, with waves crashing and the boat becoming ...
  • Many spotted jellyfish pulsating under water. Their bodies are transparent and glowing ...
  • An animated hedgehog with distinctive spiky hair and large eyes is seen exploring a ...
  • An animated rabbit in a playful pink snowboarding outfit is carving its way down a ...
  • A person clad in a space suit with a helmet and equipped with a chest light and arm ...

65×512×512 Video Editing

Generated 65×512×512 (2.7s) | Edited 65×512×512 (2.7s)

512×512 Text-to-Image Generation

Detailed Technical Report

CausalVideoVAE

Model Structure

As the number of frames increases, the encoder overhead of CausalVideoVAE gradually rises. When training with 257 frames, 80GB of VRAM is insufficient for the VAE to encode the video. Therefore, we reduced the number of CausalConv3D layers, retaining only the last two stages of CausalConv3D in the encoder. This change significantly lowers the overhead while maintaining nearly the same performance. Note that we only modified the encoder; the decoder still retains all CausalConv3D layers, as training the Diffusion Model does not require the decoder.
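For readers less familiar with the building block, below is a minimal sketch of a causal 3D convolution in PyTorch. This is an illustration of the general idea under our own assumptions, not the repository's exact implementation: all temporal padding is applied on the past side, so each frame only sees itself and earlier frames, and the first frame can be encoded exactly like an image.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """Minimal causal 3D convolution: temporal padding is applied only toward
    the past, so the output at frame t never depends on frames > t."""

    def __init__(self, in_ch, out_ch, kernel_size=(3, 3, 3), stride=(1, 1, 1)):
        super().__init__()
        kt, kh, kw = kernel_size
        self.time_pad = kt - 1                              # all temporal padding at the front
        self.space_pad = (kw // 2, kw // 2, kh // 2, kh // 2)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size, stride=stride)

    def forward(self, x):                                   # x: [B, C, T, H, W]
        x = F.pad(x, self.space_pad)                        # symmetric spatial padding
        x = F.pad(x, (0, 0, 0, 0, self.time_pad, 0))        # causal (left-only) temporal padding
        return self.conv(x)
```

In this scheme, the earlier encoder stages can operate per frame with plain 2D convolutions, while only the last two stages use blocks built on a causal 3D convolution like this, which is what reduces the encoder overhead reported below.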

[Figure: CausalVideoVAE model structure]

We compare the computational overhead of the two versions by testing the forward inference of the encoder on the H100.

| Version | 129×256×256 Peak Mem. / Speed | 257×256×256 Peak Mem. / Speed | 513×256×256 Peak Mem. / Speed |
| --- | --- | --- | --- |
| v1.0.0 | 22G / 2.9 it/s | OOM / - | OOM / - |
| v1.1.0 | 18G / 4.9 it/s | 34G / 2.5 it/s | 61G / 1.2 it/s |

Temporal Module

[Figure: temporal module in v1.0.0 vs. v1.1.0]

In v1.0.0, our temporal module contained only TemporalAvgPool, which discards high-frequency information in the video, such as details and edges. To address this issue, we improved the module in v1.1.0. As shown in the figure, we introduced convolution and added learnable weights, allowing the different branches to decouple different features: when we omit CausalConv3D, the reconstructed video is very blurry; conversely, when we omit TemporalAvgPool, the video becomes very sharp.
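Below is a rough sketch of how such a two-branch temporal downsampling block could be wired, purely as an illustration under our own assumptions (the real block's exact layout may differ): a causal stride-2 convolution branch is blended with a temporal-average-pooling branch through a learnable scalar passed through a sigmoid, i.e. the "mixed factor" discussed in the training details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalDownBlock(nn.Module):
    """Blend a causal-conv branch (keeps detail / high frequencies) with a
    temporal-average-pool branch (smooth / low frequencies) using a learnable
    'mixed factor' passed through a sigmoid. Illustrative sketch only."""

    def __init__(self, channels, mixed_factor_init=0.5):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size=(3, 3, 3),
                              stride=(2, 1, 1), padding=(0, 1, 1))
        self.pool = nn.AvgPool3d(kernel_size=(2, 1, 1), stride=(2, 1, 1))
        self.mixed_factor = nn.Parameter(torch.tensor(float(mixed_factor_init)))

    def forward(self, x):                                   # x: [B, C, T, H, W], T = 1 + 2k
        # causal branch: pad two past frames so the stride-2 conv stays causal
        conv_out = self.conv(F.pad(x, (0, 0, 0, 0, 2, 0)))  # -> [B, C, 1 + k, H, W]
        # pooling branch: keep the image frame, average the rest in pairs
        first, rest = x[:, :, :1], x[:, :, 1:]
        pool_out = torch.cat([first, self.pool(rest)], dim=2)
        alpha = torch.sigmoid(self.mixed_factor)
        # which branch alpha weights is an assumption made for this sketch
        return alpha * pool_out + (1.0 - alpha) * conv_out
```

With the mixed factor reinitialized to 0.5, sigmoid(0.5) ≈ 0.62, so the two branches start roughly balanced at the beginning of the third VAE training stage.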

| | SSIM↑ | LPIPS↓ | PSNR↑ |
| --- | --- | --- | --- |
| Base | 0.850 | 0.091 | 28.047 |
| + Frames | 0.868 | 0.070 | 28.829 |
| + Reset mixed factor | 0.873 | 0.070 | 29.140 |

Training Details

Similar to v1.0.0, we initialized from the Latent Diffusion VAE and used tail initialization. For CausalVideoVAE, we trained for 100k steps in the first stage with a video shape of 9×256×256. Subsequently, we increased the frame count from 9 to 25 and found that this significantly improved the model's performance. It is important to clarify that we enabled the mixed factor during both the first and second stages; the value α = sigmoid(mixed factor) reached 0.88 at the end of training, indicating the model's tendency to retain low-frequency information. In the third stage, we reinitialized the mixed factor to 0.5 (sigmoid(0.5) = 0.6225), which further enhanced the model's capabilities.
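Our reading of "tail initialization", sketched below as an assumption rather than the exact code: the pretrained 2D kernels are copied into the last temporal slice of each causal 3D kernel and the remaining slices are zeroed, so that immediately after initialization the CausalVideoVAE behaves like the image VAE applied frame by frame.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def tail_init_from_2d(conv3d: nn.Conv3d, conv2d: nn.Conv2d):
    """Copy a pretrained 2D kernel into the *last* temporal slice of a causal
    3D kernel and zero the rest, so the 3D conv initially acts per-frame."""
    w3d, w2d = conv3d.weight, conv2d.weight     # [out, in, kT, kH, kW] and [out, in, kH, kW]
    assert w3d.shape[:2] == w2d.shape[:2] and w3d.shape[-2:] == w2d.shape[-2:]
    w3d.zero_()
    w3d[:, :, -1] = w2d                         # under causal padding, the last tap sees the current frame
    if conv3d.bias is not None and conv2d.bias is not None:
        conv3d.bias.copy_(conv2d.bias)
```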

Loss Function

We found that using GAN loss helps retain high-frequency information and alleviates grid artifacts. Additionally, we observed that switching from 2D GAN to 3D GAN provides further improvements.

| GAN Loss / Steps | SSIM↑ | LPIPS↓ | PSNR↑ |
| --- | --- | --- | --- |
| 2D / 80k | 0.879 | 0.068 | 29.480 |
| 3D / 80k | 0.882 | 0.067 | 29.890 |
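For readers unfamiliar with the 2D-vs-3D distinction: a 3D discriminator judges spatio-temporal patches rather than per-frame patches, so the adversarial loss also penalizes temporal artifacts such as flicker. A minimal PatchGAN-style sketch (an illustration, not our exact discriminator):

```python
import torch.nn as nn

def conv3d_block(in_ch, out_ch, stride):
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.GroupNorm(8, out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

class PatchDiscriminator3D(nn.Module):
    """Outputs a grid of real/fake logits over spatio-temporal patches."""

    def __init__(self, in_ch=3, base=64):
        super().__init__()
        self.net = nn.Sequential(
            conv3d_block(in_ch, base, stride=(1, 2, 2)),
            conv3d_block(base, base * 2, stride=(2, 2, 2)),
            conv3d_block(base * 2, base * 4, stride=(2, 2, 2)),
            nn.Conv3d(base * 4, 1, kernel_size=3, padding=1),   # patch logits
        )

    def forward(self, x):                                       # x: [B, C, T, H, W]
        return self.net(x)
```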

Inference Tricks

Encoding long, high-resolution videos in a single pass exceeds available GPU memory, so tiled inference is needed. We therefore introduced temporal rollback tiled convolution, a tiling approach specifically designed for CausalVideoVAE: all windows except the first one discard their first frame, because the first frame in a window is treated as an image, while the remaining frames are treated as video frames.

[Figure: temporal rollback tiled convolution]
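A rough sketch of this tiling scheme under our stated assumptions: `encode_window` is a hypothetical stand-in for the CausalVideoVAE encoder, mapping 1 + 4k frames to 1 + k latent frames (4× temporal compression); every window after the first starts on the last frame of the previous window (the "rollback") and then drops the duplicated image latent.

```python
import torch

def rollback_tiled_encode(encode_window, video, window_latents=16):
    """Encode a long causal video in temporal tiles.

    encode_window:  hypothetical encoder, [B, C, 1 + 4k, H, W] -> [B, c, 1 + k, h, w]
    video:          [B, C, T, H, W] with T = 1 + 4N
    window_latents: k, latent frames produced per window after the image latent
    """
    T = video.shape[2]
    stride = 4 * window_latents                  # frames advanced per window
    latents, start = [], 0
    while start == 0 or start < T - 1:
        end = min(start + stride + 1, T)         # each window spans at most 1 + 4k frames
        z = encode_window(video[:, :, start:end])
        if start == 0:
            latents.append(z)                    # first window keeps its image latent
        else:
            latents.append(z[:, :, 1:])          # later windows drop the duplicated image latent
        start += stride                          # next window rolls back onto this window's last frame
    return torch.cat(latents, dim=2)
```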

We tested the speed on the H100 with a window size of 65×256×256.

| Version | 129×256×256 Peak Mem. / Speed | 257×256×256 Peak Mem. / Speed | 513×256×256 Peak Mem. / Speed |
| --- | --- | --- | --- |
| 4×8×8 | 10G / 1.3 s/it | 10G / 2.6 s/it | 10G / 5.3 s/it |

Data Construction

Since Open-Sora-Plan supports joint training of images and videos, our data collection is divided into two parts: images and videos. The images do not need to originate from videos; they are independent datasets. We spent approximately 32×240 H100 hours (about 7,680 GPU-hours) generating image and video captions, and all of this is open source!

Image-Text Collection Pipeline

We obtained 11 million image-text pairs from Pixart-Alpha, with captions generated by LLaVA. Additionally, we utilized the high-quality OCR dataset Anytext-3M, which pairs each image with its corresponding OCR characters. However, these captions were insufficient to describe the entire image, so we used InternVL-1.5 for supplementary descriptions. Since T5 only supports English, we filtered for English data, which constitutes about half of the complete dataset. Furthermore, we selected high-quality images containing people from Laion-5B to improve the quality of human generation; the selection criteria were high resolution, high aesthetic score, and the absence of watermarks.
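As a concrete illustration of this kind of filtering, the sketch below applies such criteria to LAION-style metadata; the field names and thresholds are assumptions made for the example, not the exact values we used.

```python
# Hypothetical filter over LAION-style metadata; field names and thresholds
# are illustrative assumptions, not the exact criteria used for Human-160k.
def keep_image(meta) -> bool:
    return (
        meta["width"] >= 512 and meta["height"] >= 512   # high resolution
        and meta["aesthetic_score"] >= 6.0               # high aesthetic quality
        and meta["pwatermark"] < 0.5                     # likely watermark-free
        and meta["contains_person"]                      # images containing people
    )

example = {"width": 1024, "height": 768, "aesthetic_score": 6.5,
           "pwatermark": 0.1, "contains_person": True}
print(keep_image(example))   # True
```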

Here, we open-source the prompts used for InternVL-1.5:

```
# for anytext-3m
Combine this rough caption: "{}", analyze the image in a comprehensive and detailed manner. "{}" can be recognized in the image.
# for human-160k
Analyze the image in a comprehensive and detailed manner.
```

| Name | Image Source | Text Captioner | Num pairs |
| --- | --- | --- | --- |
| SAM-11M | SAM | LLaVA | 11,185,255 |
| Anytext-3M-en | Anytext | InternVL-1.5 | 1,886,137 |
| Human-160k | Laion | InternVL-1.5 | 162,094 |

Video-Text Collection Pipeline

In v1.0.0, we sampled one frame from each video to generate captions. However, as video length increased, a single frame could not adequately describe the entire video's content or temporal movements. Therefore, we used a video captioner to generate captions for the entire video clip. Specifically, we used ShareGPT4Video, which effectively covers temporal information and describes the entire video content. The v1.1.0 video dataset comprises approximately 3k hours, compared to only 300 hours in v1.0.0. As before, we have open-sourced all text annotations and videos (both under the CC0 license), which can be found here.

| Name | Hours | Num frames | Num pairs |
| --- | --- | --- | --- |
| Mixkit | 42.0h | 65 | 54,735 |
| | | 513 | 1,997 |
| Pixabay | 353.3h | 65 | 601,513 |
| | | 513 | 51,483 |
| Pexels | 2561.9h | 65 | 3,832,666 |
| | | 513 | 271,782 |

Training Diffusion Model

Similar to our previous work, we employed a multi-stage cascaded training method. Below is our training card:

Stage 1

We initially believed that the performance of the diffusion model would keep improving with longer training. Surprisingly, however, by observing the logs we found that videos generated at 50k steps were of higher quality than those at 70k-100k steps, and extensive sampling confirmed that checkpoints at 40k-60k steps outperformed those at 80k-100k steps. Quantitatively, 50k steps correspond to approximately 2 epochs of training. It is currently unclear whether this is due to overfitting on a small dataset or to the limited capacity of the 2+1D model.

Stage 2

In the second stage, we used Huawei Ascend computing power for training; this stage's training and inference were fully supported by Huawei. We conducted sequence parallel training and inference on a large-scale cluster, distributing one sample across eight ranks. Models trained on Huawei Ascend can also be loaded onto GPUs and generate videos of the same quality.
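To make "distributing one sample across eight ranks" concrete, here is a minimal sketch of one simple form of sequence parallelism, written as an illustration under our own assumptions rather than the production implementation: each rank holds a slice of the token sequence, and keys/values are all-gathered so every rank can attend over the full sequence.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def sequence_parallel_attention(q, k, v, group=None):
    """q, k, v: [B, local_seq, heads, dim]; each rank owns a slice of the sequence.
    Note: dist.all_gather is not autograd-aware; training would use a
    differentiable gather (e.g. torch.distributed.nn.all_gather)."""
    world = dist.get_world_size(group)
    k_parts = [torch.empty_like(k) for _ in range(world)]
    v_parts = [torch.empty_like(v) for _ in range(world)]
    dist.all_gather(k_parts, k.contiguous(), group=group)   # rebuild full-length K
    dist.all_gather(v_parts, v.contiguous(), group=group)   # rebuild full-length V
    k_full, v_full = torch.cat(k_parts, dim=1), torch.cat(v_parts, dim=1)
    # local queries attend over the full sequence
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2), k_full.transpose(1, 2), v_full.transpose(1, 2)
    )
    return out.transpose(1, 2)                               # back to [B, local_seq, heads, dim]
```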

Stage 3

In the third stage, we further increased the frame count to 513 frames, approximately 21 seconds at 24 FPS. However, this stage presents several challenges, such as ensuring temporal consistency of the 2+1D model over long durations and determining whether the current amount of data is sufficient. We are still training the model for this stage and continuously monitoring its progress.

| Name | Stage 1 | Stage 2 | Stage 3 |
| --- | --- | --- | --- |
| Training Video Size | 65×512×512 | 221×512×512 | 513×512×512 |
| Compute (#Num × #Hours) | 80 H100 × 72 | 512 Ascend × 72 | Under training |
| Checkpoint | HF | HF | Under training |
| Log | wandb | - | - |
| Training Data | ~3k hours of videos + 13M images (shared across stages) | | |

Video Editing

The recently proposed ReVideo achieves accurate video editing by modifying the first frame and applying motion control within the edited area. Although it achieves excellent editing performance, the editable length is limited by its base model, SVD. Open-Sora-Plan, as a foundation model for long-video generation, can compensate for this limitation. Currently, we are collaborating with the ReVideo team to use Open-Sora-Plan as the base model for long video editing. Some preliminary results are shown here.

The initial version still needs improvement in several aspects. In the future, we will continue to explore integration with ReVideo to develop improved long-video editing models.

Failed Case and Discussion

Despite the promising results of v1.1.0, there remains a gap between our model and Sora. Here, we present some failure cases and discuss them.

CausalVideoVAE

Despite the significant performance improvement of the VAE in v1.1.0 over v1.0.0, we still encounter failures in challenging cases, such as sand dunes and leaves. The video on the left shows a reconstruction with 4× temporal downsampling, while the video on the right uses 2× temporal downsampling. Both exhibit jitter when reconstructing fine-grained features, which indicates that reducing temporal downsampling alone cannot fully resolve the jitter issue.


Diffusion Model

Semantic distortion

On the left is a video generated by v1.1.0 showing a puppy in the snow. In this video, the puppy's head exhibits semantic distortion, indicating that the model struggles to correctly identify which head belongs to which dog. On the right is a video generated by Sora's base model. We observe that Sora's early base model also experienced semantic distortion issues. This suggests that we may achieve better results by scaling up the model and increasing the amount of training data.

Prompt: A litter of golden retriever puppies playing in the snow. Their heads pop out of the snow, covered in.

Ours | Sora Base ×1 | Sora Base ×4 | Sora Base ×32

Limited dynamics

The primary difference between videos and images lies in their dynamic nature, where objects undergo a series of changes across consecutive frames. However, the videos generated by v1.1.0 still contain many instances of limited dynamics. Upon reviewing a large number of training videos, we found that while web-crawled videos have high visual quality, they are often filled with meaningless close-up shots. These close-ups typically show minimal movement or are even static. On the left, we present a generated video of a bird, while on the right is a training video we found, which is almost static. There are many similar videos in the dataset from stock footage sites.

Prompt: This close-up shot of a Victoria crowned pigeon showcases its striking blue plumage and red chest. Its crest is made of delicate, lacy feathers, while its eye is a striking red color. The bird's head is tilted slightly to the side, giving the impression of it looking regal and majestic. The background is blurred, drawing attention to the bird's striking appearance.

Ours | Raw video

Negative prompt

We found that using negative prompts can significantly improve video quality, even though we did not explicitly tag the training data with different labels. On the left is a video sampled using a negative prompt, while on the right is a video generated without a negative prompt. This suggests that we may need to incorporate more prior knowledge into the training data. For example, when a video has a watermark, we should note "watermark" in the corresponding caption. When a video's bitrate is too low, we should add more tags to distinguish it from high-quality videos, such as "low quality" or "blurry." We believe that explicitly injecting these priors can help the model differentiate between the vast amounts of pretraining data (low quality) and the smaller amounts of fine-tuning data (high quality), thereby generating higher quality videos.

Prompt: A litter of golden retriever puppies playing in the snow. Their heads pop out of the snow, covered in. Negative Prompt: distorted, discontinuous, ugly, blurry, low resolution, motionless, static, low quality

With Negative Prompt | Without Negative Prompt
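Mechanically, a negative prompt usually just replaces the empty/unconditional text in classifier-free guidance. A minimal sketch of that sampling step, with `model` and `text_embeds` as hypothetical stand-ins for the diffusion backbone and the T5 text encoder:

```python
def guided_noise(model, latents, t, text_embeds, prompt, negative_prompt, guidance_scale=7.5):
    """Classifier-free guidance where the unconditional branch is conditioned on
    a negative prompt instead of an empty string."""
    cond = model(latents, t, text_embeds(prompt))
    neg = model(latents, t, text_embeds(negative_prompt))
    # push the prediction toward the prompt and away from the negative prompt
    return neg + guidance_scale * (cond - neg)
```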

Future Work

In our future work, we will focus on two main areas: (1) data scaling and (2) model design. Once we have a robust baseline model, we will extend it to handle variable durations and conditional control models.

Data Scaling

Data source

As mentioned earlier, our dataset is entirely sourced from stock footage websites. Although these videos are of high quality, many consist of close-up shots of specific areas, resulting in very little motion. We believe this is one of the main reasons for the limited dynamics observed. Therefore, we will continue to collect data from more diverse sources to address this issue.

Data volume

In v1.1.0, our dataset comprises only ~3k hours of video. We are actively collecting more data and anticipate that the video dataset for the next version will reach ~100k hours. We welcome recommendations from the open-source community for additional datasets.

Model Design

CausalVideoVAE

In our internal testing, we found that even without temporal downsampling, the jitter in reconstructing fine-grained features cannot be completely resolved. Therefore, we need to reconsider how to mitigate video jitter as much as possible while still supporting both images and videos. We will introduce a more powerful CausalVideoVAE in the next version.

Diffusion Model

In v1.1.0, we found that 2+1D models can generate high-quality videos at short durations. However, for long videos, they tend to exhibit discontinuities and inconsistencies. Therefore, we will explore more possibilities in model architecture to address this issue.