Skip to content

Latest commit

 

History

History
187 lines (136 loc) · 5.68 KB

README.md

File metadata and controls

187 lines (136 loc) · 5.68 KB

SAT CogView3 & CogView-3-Plus

Read this in Chinese

This folder contains the inference code using the SAT weights, as well as fine-tuning code for SAT weights.

The code is the framework used by the team during model training. There are few comments, so it requires careful study.

Step-by-step guide to running the model

1. Environment setup

Ensure you have installed the dependencies required by this folder:

pip install -r requirements.txt

2. Download model weights

The following links are for different model weights:

CogView-3-Plus-3B

CogView-3-Base-3B

CogView-3-Base-3B-Relay

Next, arrange the model files into the following format:

.cogview3-plus-3b
├── transformer
│   ├── 1
│   │   └── mp_rank_00_model_states.pt
│   └── latest
└── vae
    └── imagekl_ch16.pt

Clone the T5 model. This model is not used for training or fine-tuning but is necessary. You can download the T5 model separately, but it must be in safetensors format, not bin format (otherwise an error may occur).

Since we have uploaded the T5 model in safetensors format in CogVideoX, a simple way is to clone the model from the CogVideoX-2B model and move it to the corresponding folder.

git clone https://huggingface.co/THUDM/CogVideoX-2b.git
# git clone https://www.modelscope.cn/ZhipuAI/CogVideoX-2b.git
mkdir t5-v1_1-xxl
mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl

With this setup, you will have a safetensor format T5 file, ensuring no errors during Deepspeed fine-tuning.

├── added_tokens.json
├── config.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── special_tokens_map.json
├── spiece.model
└── tokenizer_config.json

0 directories, 8 files

3. Modify the files in configs.

Here is an example using CogView3-Base, with explanations for some of the parameters:

args:
  mode: inference
  relay_model: False # Set to True when using CogView-3-Relay
  load: "cogview3_base/transformer" # Path to the transformer folder
  batch_size: 8 # Number of images per inference
  grid_num_columns: 2 # Number of columns in grid.png output
  input_type: txt # Input can be from command line or TXT file
  input_file: configs/test.txt # Not needed for command line input
  fp16: True # Set to bf16 for CogView-3-Plus inference
  # bf16: True
  sampling_image_size: 512 # Fixed size, supports 512x512 resolution images
  # For CogView-3-Plus, use the following:
  # sampling_image_size_x: 1024 (width)
  # sampling_image_size_y: 1024 (height)

  output_dir: "outputs/cogview3_base-512x512"
  # This section is for CogView-3-Relay. Set the input_dir to the folder with base model generated images.
  # input_dir: "outputs/cogview3_base-512x512" 
  deepspeed_config: { }

model:
  conditioner_config:
  target: sgm.modules.GeneralConditioner
  params:
    emb_models:
      - is_trainable: False
        input_key: txt
        target: sgm.modules.encoders.modules.FrozenT5Embedder
        params:
          model_dir: "google/t5-v1_1-xxl" # Path to T5 safetensors
          max_length: 225 # Maximum prompt length

  first_stage_config:
    target: sgm.models.autoencoder.AutoencodingEngine
    params:
      ckpt_path: "cogview3_base/vae/imagekl_ch16.pt" # Path to VAE PT file
      monitor: val/rec_loss

4. Running the model

Different models require different code for inference. Here are the inference commands for each model:

CogView-3Plus

python sample_dit.py --base configs/cogview3_plus.yaml

CogView-3-Base

  • Original model
python sample_unet.py --base configs/cogview3_base.yaml
  • Distilled model
python sample_unet.py --base configs/cogview3_base_distill_4step.yaml

CogView-3-Relay

  • Original model
python sample_unet.py --base configs/cogview3_relay.yaml
  • Distilled model
python sample_unet.py --base configs/cogview3_relay_distill_1step.yaml 

The output image format will be a folder. The folder name will consist of the sequence number and the first 15 characters of the prompt, containing multiple images. The number of images is based on the batch parameter. The structure should look like this:

.
├── 000000000.png
├── 000000001.png
├── 000000002.png
├── 000000003.png
├── 000000004.png
├── 000000005.png
├── 000000006.png
├── 000000007.png
└── grid.png

1 directory, 9 files

In this example, the batch size is 8, so there are 8 images along with one grid.png.