SAT CogView3 & CogView-3-Plus

This folder contains the inference code using the SAT weights, as well as fine-tuning code for SAT weights.

The code is the framework used by the team during model training. There are few comments, so it requires careful study.

Step-by-step guide to running the model

1. Environment setup

Ensure you have installed the dependencies required by this folder:

pip install -r requirements.txt

2. Download model weights

The following links are for different model weights:

CogView-3-Plus-3B

transformer: https://cloud.tsinghua.edu.cn/d/f913eabd3f3b4e28857c
vae: https://cloud.tsinghua.edu.cn/d/af4cc066ce8a4cf2ab79

CogView-3-Base-3B

transformer:
- cogview3-base: https://cloud.tsinghua.edu.cn/d/242b66daf4424fa99bf0
- cogview3-base-distill-4step: https://cloud.tsinghua.edu.cn/d/d10032a94db647f5aa0e
- cogview3-base-distill-8step: https://cloud.tsinghua.edu.cn/d/1598d4fe4ebf4afcb6ae
These three versions are interchangeable. Choose the one that suits your needs and run it with the corresponding configuration file.
vae: https://cloud.tsinghua.edu.cn/d/c8b9497fc5124d71818a/

CogView-3-Base-3B-Relay

transformer:
- cogview3-relay: https://cloud.tsinghua.edu.cn/d/134951acced949c1a9e1/
- cogview3-relay-distill-2step: https://cloud.tsinghua.edu.cn/d/6a902976fcb94ac48402
- cogview3-relay-distill-1step: https://cloud.tsinghua.edu.cn/d/4d50ec092c64418f8418/
These three versions are interchangeable. Choose the one that suits your needs and run it with the corresponding configuration file.
vae: Same as CogView-3-Base-3B

Next, arrange the model files into the following format:

.cogview3-plus-3b
├── transformer
│   ├── 1
│   │   └── mp_rank_00_model_states.pt
│   └── latest
└── vae
    └── imagekl_ch16.pt

Clone the T5 model. This model is not used for training or fine-tuning but is necessary. You can download the T5 model separately, but it must be in safetensors format, not bin format (otherwise an error may occur).

Since we have uploaded the T5 model in safetensors format in CogVideoX, a simple way is to clone the model from the CogVideoX-2B model and move it to the corresponding folder.

git clone https://huggingface.co/THUDM/CogVideoX-2b.git
# git clone https://www.modelscope.cn/ZhipuAI/CogVideoX-2b.git
mkdir t5-v1_1-xxl
mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl

With this setup, you will have a safetensor format T5 file, ensuring no errors during Deepspeed fine-tuning.

├── added_tokens.json
├── config.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── special_tokens_map.json
├── spiece.model
└── tokenizer_config.json

0 directories, 8 files

3. Modify the files in `configs`.

Here is an example using CogView3-Base, with explanations for some of the parameters:

args:
  mode: inference
  relay_model: False # Set to True when using CogView-3-Relay
  load: "cogview3_base/transformer" # Path to the transformer folder
  batch_size: 8 # Number of images per inference
  grid_num_columns: 2 # Number of columns in grid.png output
  input_type: txt # Input can be from command line or TXT file
  input_file: configs/test.txt # Not needed for command line input
  fp16: True # Set to bf16 for CogView-3-Plus inference
  # bf16: True
  sampling_image_size: 512 # Fixed size, supports 512x512 resolution images
  # For CogView-3-Plus, use the following:
  # sampling_image_size_x: 1024 (width)
  # sampling_image_size_y: 1024 (height)

  output_dir: "outputs/cogview3_base-512x512"
  # This section is for CogView-3-Relay. Set the input_dir to the folder with base model generated images.
  # input_dir: "outputs/cogview3_base-512x512" 
  deepspeed_config: { }

model:
  conditioner_config:
  target: sgm.modules.GeneralConditioner
  params:
    emb_models:
      - is_trainable: False
        input_key: txt
        target: sgm.modules.encoders.modules.FrozenT5Embedder
        params:
          model_dir: "google/t5-v1_1-xxl" # Path to T5 safetensors
          max_length: 225 # Maximum prompt length

  first_stage_config:
    target: sgm.models.autoencoder.AutoencodingEngine
    params:
      ckpt_path: "cogview3_base/vae/imagekl_ch16.pt" # Path to VAE PT file
      monitor: val/rec_loss

4. Running the model

Different models require different code for inference. Here are the inference commands for each model:

CogView-3Plus

python sample_dit.py --base configs/cogview3_plus.yaml

CogView-3-Base

Original model

python sample_unet.py --base configs/cogview3_base.yaml

Distilled model

python sample_unet.py --base configs/cogview3_base_distill_4step.yaml

CogView-3-Relay

Original model

python sample_unet.py --base configs/cogview3_relay.yaml

Distilled model

python sample_unet.py --base configs/cogview3_relay_distill_1step.yaml

The output image format will be a folder. The folder name will consist of the sequence number and the first 15 characters of the prompt, containing multiple images. The number of images is based on the batch parameter. The structure should look like this:

.
├── 000000000.png
├── 000000001.png
├── 000000002.png
├── 000000003.png
├── 000000004.png
├── 000000005.png
├── 000000006.png
├── 000000007.png
└── grid.png

1 directory, 9 files

In this example, the batch size is 8, so there are 8 images along with one grid.png.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

SAT CogView3 & CogView-3-Plus

Step-by-step guide to running the model

1. Environment setup

2. Download model weights

CogView-3-Plus-3B

CogView-3-Base-3B

CogView-3-Base-3B-Relay

3. Modify the files in `configs`.

4. Running the model

CogView-3Plus

CogView-3-Base

CogView-3-Relay

Files

README.md

Latest commit

History

README.md

File metadata and controls

SAT CogView3 & CogView-3-Plus

Step-by-step guide to running the model

1. Environment setup

2. Download model weights

CogView-3-Plus-3B

CogView-3-Base-3B

CogView-3-Base-3B-Relay

3. Modify the files in configs.

4. Running the model

CogView-3Plus

CogView-3-Base

CogView-3-Relay

3. Modify the files in `configs`.