Release v1.2.0

@LinB203 LinB203 released this 25 Jul 06:28
adb2a20

v1.2.0 is here! This release uses a 3D full attention architecture instead of 2+1D, and with it we are releasing a true 3D video diffusion model trained on 4s 720p video.

  • The architecture shifts from a 2+1D model to 3D full attention; 2+1D is no longer supported.
  • Instead of joint image-video training, the image weights are trained first and then used to initialize the video model.
  • All data annotations are released; the data are filtered by aesthetics and motion.
  • CausalVideoVAE performance is improved, with results reported on the validation sets of WebVid and Panda70M.

Although the 3D attention architecture excels in spatio-temporal consistency, it is so expensive to train that it is difficult to scale up. We hope to collaborate with the open-source community to optimize the 3D DiT architecture. For further details, please refer to our report.
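
To give a rough sense of why full 3D attention is costlier than 2+1D, the sketch below counts query-key pairs per attention layer for a latent token grid of T frames by H by W positions. This is a back-of-the-envelope illustration, not the project's code: the function names and the grid size (8 x 16 x 16) are hypothetical, and the count ignores heads, channels, and any patching or compression details.

```python
def attention_pairs_2plus1d(t: int, h: int, w: int) -> int:
    """Query-key pairs for factorized (2+1D) attention on a t x h x w token grid."""
    spatial = t * (h * w) ** 2   # every frame attends only within itself
    temporal = (h * w) * t ** 2  # every spatial position attends only across frames
    return spatial + temporal


def attention_pairs_3d(t: int, h: int, w: int) -> int:
    """Query-key pairs for full 3D attention: every token attends to every token."""
    n = t * h * w
    return n * n


if __name__ == "__main__":
    # Hypothetical latent grid, chosen only for illustration.
    t, h, w = 8, 16, 16
    factorized = attention_pairs_2plus1d(t, h, w)
    full = attention_pairs_3d(t, h, w)
    print(f"2+1D pairs: {factorized}")
    print(f"3D pairs:   {full}")
    print(f"ratio:      {full / factorized:.2f}x")
```

Under these assumptions the ratio simplifies to T·HW / (HW + T), so the gap widens as clips get longer or resolution increases, which is consistent with the scaling difficulty described above.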