CAE is a novel masked image modeling (MIM) approach for self-supervised representation pretraining. The goal is to pretrain an encoder by solving the pretext task of estimating the masked patches from the visible patches in an image. We first feed the visible patches into the encoder to extract their representations. Then, we predict the representations of the masked patches from those of the visible patches, in the encoded representation space. We introduce an alignment constraint that encourages the predicted representations of the masked patches to be aligned with the masked patch representations computed by the encoder. In other words, the predicted representations are expected to lie in the encoded representation space, which empirically benefits representation learning. Finally, the predicted masked patch representations are mapped to the targets of the pretext task through a decoder.
In contrast to previous MIM methods (e.g., BEiT) that couple the encoding and pretext task completion roles, our approach benefits from separating the representation learning (encoding) role from the pretext task completion role, which improves representation learning capacity and in turn helps more on downstream tasks. In addition, we explain why contrastive pretraining and supervised pretraining perform similarly and why MIM potentially performs better. We demonstrate the effectiveness of CAE through superior transfer performance on downstream tasks: semantic segmentation, object detection, and instance segmentation.
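The pretraining pipeline can be summarized in a short sketch. This is illustrative only: the module names (`encoder`, `regressor`, `decoder`, `tokenizer`), tensor shapes, and the exact loss formulation are assumptions, not the code in this repository.

```python
import paddle
import paddle.nn.functional as F

def cae_pretrain_step(encoder, regressor, decoder, tokenizer,
                      images, patches, visible_idx, masked_idx):
    """One conceptual CAE pretraining step (illustrative sketch).

    `patches` is assumed to be [batch, num_patches, dim] patch embeddings;
    all modules are placeholders rather than the classes in this repo.
    """
    # 1. Encode only the visible patches.
    z_visible = encoder(paddle.index_select(patches, visible_idx, axis=1))

    # 2. Latent regressor: predict the representations of the masked
    #    patches from the encoded visible-patch representations.
    z_masked_pred = regressor(z_visible, masked_idx)

    # 3. Alignment constraint: the predictions should match what the
    #    encoder itself produces for the masked patches (the target is
    #    computed without gradient here for illustration).
    with paddle.no_grad():
        z_masked_target = encoder(paddle.index_select(patches, masked_idx, axis=1))
    loss_align = F.mse_loss(z_masked_pred, z_masked_target)

    # 4. The decoder maps the predicted masked representations to the
    #    pretext targets, e.g. discrete visual tokens from a DALL-E style
    #    tokenizer with an 8k vocabulary.
    logits = decoder(z_masked_pred)                            # [B, M, vocab]
    with paddle.no_grad():
        all_tokens = tokenizer(images)                         # [B, N] token ids
        target_tokens = paddle.index_select(all_tokens, masked_idx, axis=1)
    loss_pretext = F.cross_entropy(
        logits.reshape([-1, logits.shape[-1]]), target_tokens.reshape([-1]))

    return loss_pretext + loss_align
```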
This is a PaddlePaddle/GPU re-implementation of the paper Context Autoencoder for Self-Supervised Representation Learning.
PaddlePaddle 2.4 is required to use some of the newer features. For installation instructions, refer to installation.md.
Prepare the data in the following directory structure:
dataset/
└── ILSVRC2012
    ├── train
    └── val
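A quick sanity check that the layout matches what the training scripts expect (a small standalone helper, not part of this repository):

```python
import os

# Hypothetical helper: verify the ImageNet folder layout shown above.
root = "dataset/ILSVRC2012"
for split in ("train", "val"):
    split_dir = os.path.join(root, split)
    classes = [d for d in os.listdir(split_dir)
               if os.path.isdir(os.path.join(split_dir, d))]
    print(f"{split}: {len(classes)} class folders")  # expect 1000 for ImageNet-1k
```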
To pre-train ViT-Base (the recommended default) with multi-node distributed training, run the following on 4 nodes with 8 GPUs each:
# unset PADDLE_TRAINER_ENDPOINTS
# export PADDLE_NNODES=4
# export PADDLE_MASTER="xxx.xxx.xxx.xxx:12538"
# export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
# export PADDLE_JOB_ID=CAE
tmp_my_name=ep800_fp16o1
my_name=${tmp_my_name%.*}
OUTPUT_DIR='./output/'$my_name
echo $OUTPUT_DIR
DATA_PATH='./dataset/ILSVRC2012/'
TOKENIZER_PATH=dalle-weights
export FLAGS_cudnn_exhaustive_search=True
export FLAGS_gemm_use_half_precision_compute_type=False
python -m paddle.distributed.launch \
--nnodes=$PADDLE_NNODES \
--master=$PADDLE_MASTER \
--devices=$CUDA_VISIBLE_DEVICES \
main_pretrain.py \
--data_path ${DATA_PATH} \
--output_dir ${OUTPUT_DIR} \
--model cae_base_patch16_224_8k_vocab --discrete_vae_weight_path ${TOKENIZER_PATH} \
--batch_size 64 --lr 1.5e-3 --warmup_epochs 10 --epochs 800 \
--clip_grad 3.0 --layer_scale_init_value 0.1 \
--imagenet_default_mean_and_std \
--color_jitter 0 \
--drop_path 0 \
--sincos_pos_emb \
--mask_generator block \
--num_mask_patches 75 \
--decoder_layer_scale_init_value 0.1 \
--no_auto_resume \
--save_ckpt_freq 50 \
--exp_name $my_name \
--regressor_depth 4 \
--seed 0 \
--log_dir vdl \
--num_decoder_self_attention 4 \
--dual_loss_weight 2 \
--dual_path_ema 0
Note: the dalle-weights (DALL-E tokenizer weights) can be downloaded from encoder_weight and decoder_weight.
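Once pretraining finishes, the saved checkpoint can be inspected with `paddle.load`. The sketch below is illustrative only; the exact keys stored in the checkpoint depend on the training script.

```python
import paddle

# Illustrative only: the path follows the naming produced by the
# pretraining command above; the stored keys are an assumption.
state = paddle.load("output/ep800_fp16o1/ep800_fp16o1_checkpoint-799.pd")
if isinstance(state, dict):
    print(list(state.keys()))  # e.g. model weights, optimizer state, epoch
```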
To fine-tune ViT-Base (the recommended default) with multi-node distributed training, run the following on 4 nodes with 8 GPUs each:
# unset PADDLE_TRAINER_ENDPOINTS
# export PADDLE_NNODES=4
# export PADDLE_MASTER="xxx.xxx.xxx.xxx:12538"
# export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
# export PADDLE_JOB_ID=CAE
tmp_my_name=finetune_ep100_fp16o1
my_name=${tmp_my_name%.*}
OUTPUT_DIR='./output/'$my_name
echo $OUTPUT_DIR
DATA_PATH='./dataset/ILSVRC2012/'
MODEL_PATH='output/ep800_fp16o1/ep800_fp16o1_checkpoint-799.pd'
export FLAGS_cudnn_exhaustive_search=True
export FLAGS_gemm_use_half_precision_compute_type=False
python -m paddle.distributed.launch \
--nnodes=$PADDLE_NNODES \
--master=$PADDLE_MASTER \
--devices=$CUDA_VISIBLE_DEVICES \
main_finetune.py \
--data_path ${DATA_PATH} \
--output_dir ${OUTPUT_DIR} \
--model cae_base_patch16_224 \
--finetune $MODEL_PATH \
--nb_classes 1000 \
--batch_size 128 \
--lr 8e-3 \
--accum_iter 1 \
--warmup_epochs 5 \
--epochs 100 \
--layer_decay 0.65 \
--drop_path 0.1 \
--weight_decay 0.05 \
--mixup 0.8 \
--cutmix 1.0 \
--sin_pos_emb \
--dist_eval \
--no_auto_resume \
--exp_name $my_name
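The `--layer_decay 0.65` flag applies layer-wise learning-rate decay during fine-tuning: later transformer blocks receive learning rates close to the base LR, while earlier blocks receive exponentially smaller ones. A rough illustration of the resulting per-layer scales, assuming the common `scale = layer_decay ** (num_layers - layer_id)` convention (the exact parameter grouping used by the training script may differ):

```python
# Illustrative per-layer LR scales for a 12-block ViT-Base.
base_lr = 8e-3
layer_decay = 0.65
num_layers = 12

for layer_id in range(num_layers + 1):  # 0 = patch embedding, 12 = last block
    scale = layer_decay ** (num_layers - layer_id)
    print(f"layer {layer_id:2d}: lr = {base_lr * scale:.2e}")
```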
To run linear probing of ViT-Base (the recommended default) with multi-node distributed training, run the following on 4 nodes with 8 GPUs each:
# unset PADDLE_TRAINER_ENDPOINTS
# export PADDLE_NNODES=4
# export PADDLE_MASTER="xxx.xxx.xxx.xxx:12538"
# export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
# export PADDLE_JOB_ID=CAE
tmp_my_name=linprobe_ep90_fp16o1
my_name=${tmp_my_name%.*}
OUTPUT_DIR='./output/'$my_name
echo $OUTPUT_DIR
DATA_PATH='./dataset/ILSVRC2012/'
MODEL_PATH='output/ep800_fp16o1/ep800_fp16o1_checkpoint-799.pd'
export FLAGS_cudnn_exhaustive_search=True
export FLAGS_gemm_use_half_precision_compute_type=False
python -m paddle.distributed.launch \
--nnodes=$PADDLE_NNODES \
--master=$PADDLE_MASTER \
--devices=$CUDA_VISIBLE_DEVICES \
main_linprobe.py \
--data_path ${DATA_PATH} \
--output_dir ${OUTPUT_DIR} \
--model cae_base_patch16_224 \
--finetune $MODEL_PATH \
--nb_classes 1000 \
--batch_size 512 \
--epochs 90 \
--blr 0.1 \
--weight_decay 0.0 \
--dist_eval \
--log_dir $OUTPUT_DIR \
--enable_linear_eval \
--use_cls \
--save_freq 50 \
--disable_rel_pos_bias \
--linear_type standard \
--exp_name $my_name
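Linear probing freezes the pretrained backbone and trains only a linear classifier on top of the extracted features (the `--use_cls` flag suggests the class token is used as the feature). The sketch below shows the general idea, not this repository's exact head:

```python
import paddle.nn as nn

def build_linear_probe(backbone, feat_dim=768, num_classes=1000):
    """Freeze the backbone; train only the linear head (illustrative sketch).

    `backbone` and the 768-dim feature are assumptions for ViT-Base.
    """
    for p in backbone.parameters():
        p.stop_gradient = True  # freeze all pretrained weights
    # BatchNorm on the features followed by a single Linear layer is a
    # common MAE-style linear-probe head; this repo's head may differ.
    return nn.Sequential(nn.BatchNorm1D(feat_dim), nn.Linear(feat_dim, num_classes))
```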
Model | Phase | Dataset | Epochs | GPUs | Img/sec | Top-1 acc (%) | Official | Checkpoint | Log |
---|---|---|---|---|---|---|---|---|---|
cae_base_patch16_224_8k_vocab | pretrain | ImageNet2012 | 800 | A100*N4C32 | 4936 | - | - | download | download |
cae_base_patch16_224 | finetune | ImageNet2012 | 100 | A100*N4C32 | 1729 | 83.62 | 83.61 | download | download |
cae_base_patch16_224 | linprobe | ImageNet2012 | 90 | A100*N4C32 | 19713 | 68.32 | 68.32 | download | download |
@article{chen2022context,
title={Context autoencoder for self-supervised representation learning},
author={Chen, Xiaokang and Ding, Mingyu and Wang, Xiaodi and Xin, Ying and Mo, Shentong and Wang, Yunhao and Han, Shumin and Luo, Ping and Zeng, Gang and Wang, Jingdong},
journal={arXiv preprint arXiv:2202.03026},
year={2022}
}