This folder contains the inference code that uses the SAT weights, as well as the fine-tuning code for those weights. The code is the framework the team used during model training. It contains few comments, so it requires careful study.
Ensure you have installed the dependencies required by this folder:

```
pip install -r requirements.txt
```
The following links point to the different model weights:

CogView-3-Plus-3B:

- transformer: https://cloud.tsinghua.edu.cn/d/f913eabd3f3b4e28857c
- vae: https://cloud.tsinghua.edu.cn/d/af4cc066ce8a4cf2ab79

CogView-3-Base-3B:

- transformer:
  - cogview3-base: https://cloud.tsinghua.edu.cn/d/242b66daf4424fa99bf0
  - cogview3-base-distill-4step: https://cloud.tsinghua.edu.cn/d/d10032a94db647f5aa0e
  - cogview3-base-distill-8step: https://cloud.tsinghua.edu.cn/d/1598d4fe4ebf4afcb6ae

  These three versions are interchangeable. Choose the one that suits your needs and run it with the corresponding configuration file.

CogView-3-Relay:

- transformer:
  - cogview3-relay: https://cloud.tsinghua.edu.cn/d/134951acced949c1a9e1/
  - cogview3-relay-distill-2step: https://cloud.tsinghua.edu.cn/d/6a902976fcb94ac48402
  - cogview3-relay-distill-1step: https://cloud.tsinghua.edu.cn/d/4d50ec092c64418f8418/

  These three versions are interchangeable. Choose the one that suits your needs and run it with the corresponding configuration file.
- vae: Same as CogView-3-Base-3B
Next, arrange the model files into the following structure:

```
.cogview3-plus-3b
├── transformer
│   ├── 1
│   │   └── mp_rank_00_model_states.pt
│   └── latest
└── vae
    └── imagekl_ch16.pt
```
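If you prefer to script this step, here is a minimal sketch. The `downloads/` source paths and the exact root folder name are assumptions; adjust them to wherever your files actually landed:

```python
import shutil
from pathlib import Path

# Target layout from the tree above; the folder name must match the
# `load` path you later put in the config.
root = Path("cogview3-plus-3b")
(root / "transformer" / "1").mkdir(parents=True, exist_ok=True)
(root / "vae").mkdir(parents=True, exist_ok=True)

# Source paths are assumptions -- point them at your actual downloads.
shutil.move("downloads/mp_rank_00_model_states.pt",
            str(root / "transformer" / "1" / "mp_rank_00_model_states.pt"))
shutil.move("downloads/imagekl_ch16.pt", str(root / "vae" / "imagekl_ch16.pt"))

# DeepSpeed-style checkpoints use a `latest` file holding the checkpoint tag;
# if your download did not include one, this recreates it.
(root / "transformer" / "latest").write_text("1")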
Clone the T5 model. This model is not itself trained or fine-tuned, but it is required. You can download the T5 model separately, but it must be in safetensors format, not bin format (otherwise an error may occur). Since we have uploaded the T5 model in safetensors format for CogVideoX, a simple way is to clone it from the CogVideoX-2b model and move it to the corresponding folder.
```
git clone https://huggingface.co/THUDM/CogVideoX-2b.git
# git clone https://www.modelscope.cn/ZhipuAI/CogVideoX-2b.git
mkdir t5-v1_1-xxl
mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl
```
With this setup, you will have the T5 weights in safetensors format, ensuring no errors during DeepSpeed fine-tuning. The folder should look like this:

```
├── added_tokens.json
├── config.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── special_tokens_map.json
├── spiece.model
└── tokenizer_config.json

0 directories, 8 files
```
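To verify the move worked before committing to a long run, here is a quick sanity check with `transformers`. It is a sketch, assuming the `t5-v1_1-xxl` folder created above; the model is large, so expect a slow first load:

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Load from the local folder assembled above; use_safetensors=True makes
# the load fail loudly if only .bin weights are present.
tokenizer = AutoTokenizer.from_pretrained("t5-v1_1-xxl")
encoder = T5EncoderModel.from_pretrained(
    "t5-v1_1-xxl", torch_dtype=torch.float16, use_safetensors=True
)

tokens = tokenizer("a red apple on a wooden table", return_tensors="pt")
with torch.no_grad():
    out = encoder(**tokens)
print(out.last_hidden_state.shape)  # hidden size is 4096 for t5-v1_1-xxl
```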
Here is an example using CogView3-Base, with explanations for some of the parameters:
```yaml
args:
  mode: inference
  relay_model: False # Set to True when using CogView-3-Relay
  load: "cogview3_base/transformer" # Path to the transformer folder
  batch_size: 8 # Number of images per inference
  grid_num_columns: 2 # Number of columns in the grid.png output
  input_type: txt # Input can come from the command line or a TXT file
  input_file: configs/test.txt # Not needed for command-line input
  fp16: True # Use bf16: True instead for CogView-3-Plus inference
  # bf16: True
  sampling_image_size: 512 # Fixed size; supports 512x512 images
  # For CogView-3-Plus, use the following instead:
  # sampling_image_size_x: 1024 (width)
  # sampling_image_size_y: 1024 (height)
  output_dir: "outputs/cogview3_base-512x512"
  # This section is for CogView-3-Relay. Set input_dir to the folder of images generated by the base model.
  # input_dir: "outputs/cogview3_base-512x512"
  deepspeed_config: { }

model:
  conditioner_config:
    target: sgm.modules.GeneralConditioner
    params:
      emb_models:
        - is_trainable: False
          input_key: txt
          target: sgm.modules.encoders.modules.FrozenT5Embedder
          params:
            model_dir: "google/t5-v1_1-xxl" # Path to the T5 safetensors folder
            max_length: 225 # Maximum prompt length

  first_stage_config:
    target: sgm.models.autoencoder.AutoencodingEngine
    params:
      ckpt_path: "cogview3_base/vae/imagekl_ch16.pt" # Path to the VAE .pt file
      monitor: val/rec_loss
```
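Before launching, it can save time to check that every path in the config actually exists. Here is a minimal sketch using `omegaconf` (which the `sgm` config stack is built on); the config filename is the one from this example:

```python
import os
from omegaconf import OmegaConf

# The sgm config stack is OmegaConf-based, so the YAML loads directly.
config = OmegaConf.load("configs/cogview3_base.yaml")

# Paths the run will try to open; the keys match the example above.
paths = {
    "transformer": config.args.load,
    "vae checkpoint": config.model.first_stage_config.params.ckpt_path,
    "prompt file": config.args.input_file,
}
for name, path in paths.items():
    status = "ok" if os.path.exists(path) else "MISSING"
    print(f"{status:8s} {name}: {path}")
```

When `input_type` is `txt`, `input_file` is a plain-text prompt file; one prompt per line is the natural layout, but check `configs/test.txt` in the repo for the exact format.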
Different models require different code for inference. Here are the inference commands for each model:

CogView-3-Plus:

```
python sample_dit.py --base configs/cogview3_plus.yaml
```

CogView-3-Base:

- Original model:

  ```
  python sample_unet.py --base configs/cogview3_base.yaml
  ```

- Distilled model:

  ```
  python sample_unet.py --base configs/cogview3_base_distill_4step.yaml
  ```

CogView-3-Relay:

- Original model:

  ```
  python sample_unet.py --base configs/cogview3_relay.yaml
  ```

- Distilled model:

  ```
  python sample_unet.py --base configs/cogview3_relay_distill_1step.yaml
  ```
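Since the relay model refines images the base model has already generated (via `input_dir`, see the config comments above), the two runs naturally chain. Here is a sketch of that pipeline; rewriting a derived config is this sketch's own mechanism, and the derived filename is hypothetical:

```python
import subprocess
from omegaconf import OmegaConf

# Step 1: generate 512x512 images with the base model.
subprocess.run(
    ["python", "sample_unet.py", "--base", "configs/cogview3_base.yaml"],
    check=True,
)

# Step 2: point the relay config's input_dir at the base model's output_dir,
# save a derived config (hypothetical filename), and run the relay model.
relay = OmegaConf.load("configs/cogview3_relay.yaml")
relay.args.input_dir = "outputs/cogview3_base-512x512"
OmegaConf.save(relay, "configs/cogview3_relay_chained.yaml")
subprocess.run(
    ["python", "sample_unet.py", "--base", "configs/cogview3_relay_chained.yaml"],
    check=True,
)
```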
The output is written to a folder. The folder name consists of the sequence number and the first 15 characters of the prompt, and it contains multiple images; the number of images matches the `batch_size` parameter. The structure looks like this:
```
.
├── 000000000.png
├── 000000001.png
├── 000000002.png
├── 000000003.png
├── 000000004.png
├── 000000005.png
├── 000000006.png
├── 000000007.png
└── grid.png

1 directory, 9 files
```
In this example, the batch size is 8, so there are 8 images along with one `grid.png`.
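`grid.png` is simply the batch tiled `grid_num_columns` across. If you ever need to rebuild it, or lay the batch out differently, here is a sketch with Pillow; the sample folder name is hypothetical, following the "sequence number plus first 15 prompt characters" convention above:

```python
from pathlib import Path
from PIL import Image

# Hypothetical sample folder: "<sequence number>_<first 15 prompt chars>".
folder = Path("outputs/cogview3_base-512x512/000000_a_red_apple_on")
images = [Image.open(p) for p in sorted(folder.glob("0*.png"))]

cols = 2  # matches grid_num_columns in the config
rows = (len(images) + cols - 1) // cols
w, h = images[0].size

grid = Image.new("RGB", (cols * w, rows * h))
for i, img in enumerate(images):
    grid.paste(img, ((i % cols) * w, (i // cols) * h))
grid.save(folder / "grid_rebuilt.png")  # avoid clobbering the original grid.png
```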