
Memory leak #227

Open
zhuhui-in opened this issue Jun 2, 2024 · 7 comments

Comments

@zhuhui-in

When training the protein-ligand complex model with the script and data provided by the authors, I found that memory usage gradually increases with each epoch and eventually exceeds the machine's limit. Has anyone else run into the same problem, and how did you solve it?

@Naplessss
Collaborator

docking v1 or v2?

@zhuhui-in
Author

zhuhui-in commented Jun 3, 2024

docking v1 or v2?

v1: pretrained on PDBbind for 50 epochs; memory gradually grows to 50 GB+

@ZhouGengmo
Collaborator

Could you share your training script, and let us know which parts of the code you modified?

@zhuhui-in
Author

The training script is below (I changed batch size to 4 and --all-gather-list-size to 4096000); the data is the PDBbind dataset provided by the authors. Pretraining for 50 epochs, memory gradually grows to 50 GB; the memory trend is shown in the chart below. When training Uni-Mol with more data, memory exceeds the machine's limit. Is there a way to have the model load the data in batches?

data_path="./protein_ligand_binding_pose_prediction"  # replace to your data path
save_dir="./save_pose"  # replace to your save path
n_gpu=4
MASTER_PORT=10086
finetune_mol_model="./weights/mol_checkpoint.pt"
finetune_pocket_model="./weights/pocket_checkpoint.pt"
lr=3e-4
batch_size=4
epoch=50
dropout=0.2
warmup=0.06
update_freq=1
dist_threshold=8.0
recycling=3

export NCCL_ASYNC_ERROR_HANDLING=1
export OMP_NUM_THREADS=1
python -m torch.distributed.launch --nproc_per_node=$n_gpu --master_port=$MASTER_PORT $(which unicore-train) $data_path --user-dir ./unimol --train-subset train --valid-subset valid \
       --num-workers 8 --ddp-backend=c10d \
       --task docking_pose --loss docking_pose --arch docking_pose  \
       --optimizer adam --adam-betas "(0.9, 0.99)" --adam-eps 1e-6 --clip-norm 1.0 \
       --lr-scheduler polynomial_decay --lr $lr --warmup-ratio $warmup --max-epoch $epoch --batch-size $batch_size \
       --mol-pooler-dropout $dropout --pocket-pooler-dropout $dropout \
       --fp16 --fp16-init-scale 4 --fp16-scale-window 256 --update-freq $update_freq --seed 1 \
       --tensorboard-logdir $save_dir/tsb \
       --log-interval 100 --log-format simple \
       --validate-interval 1 --keep-last-epochs 10 \
       --best-checkpoint-metric valid_loss  --patience 2000 --all-gather-list-size 4096000 \
       --finetune-mol-model $finetune_mol_model \
       --finetune-pocket-model $finetune_pocket_model \
       --dist-threshold $dist_threshold --recycling $recycling \
       --save-dir $save_dir \
       --find-unused-parameters

[screenshot 20240603-164614: memory usage rising steadily over the course of training]
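
For reference, here is a minimal sketch of what lazy, per-item loading from an LMDB file can look like with a PyTorch Dataset, as opposed to materializing the whole dataset in RAM. The path, key scheme, and record layout are assumptions for illustration, not the repository's actual data loader.

```python
# Hypothetical sketch: read one record at a time from an LMDB file.
# File layout, keys, and record contents are placeholders, not Uni-Mol's loader.
import pickle

import lmdb
from torch.utils.data import Dataset


class LazyLMDBDataset(Dataset):
    def __init__(self, lmdb_path):
        self.lmdb_path = lmdb_path
        self.env = None  # opened lazily so each DataLoader worker gets its own handle
        env = lmdb.open(lmdb_path, readonly=True, lock=False, subdir=False)
        with env.begin() as txn:
            self.length = txn.stat()["entries"]
        env.close()

    def _ensure_env(self):
        if self.env is None:
            self.env = lmdb.open(
                self.lmdb_path, readonly=True, lock=False,
                readahead=False, subdir=False,
            )

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Only the requested record is deserialized; nothing else is cached in RAM.
        self._ensure_env()
        with self.env.begin() as txn:
            data = txn.get(str(idx).encode())
        return pickle.loads(data)
```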

@zhuhui-in
Author

Could you share your training script, and let us know which parts of the code you modified?

I haven't modified any code.

@ZhouGengmo
Collaborator

ZhouGengmo commented Jun 5, 2024

We are reproducing your experiment.
The factor affecting memory here is most likely the value of all-gather-list-size.
It needs to be set according to the specific situation: if the batch size is reduced, there is less data to all-gather, so all-gather-list-size should be reduced as well.
Here it has been doubled compared to the original setting, which is probably the main factor driving memory usage.
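
For context on why this flag tracks batch size (assuming the all-gather buffer holds the pickled per-step logging stats from each worker, as in fairseq-style trainers), a quick way to sanity-check a value is to measure how large that pickled payload actually is. The dict below is a made-up stand-in for the real logging output:

```python
# Rough, hypothetical sketch: measure the pickled size of a per-step logging
# payload; the all-gather buffer only needs some headroom above this per worker.
import pickle

example_logging_output = {  # placeholder values, not real unicore output
    "loss": 1.234,
    "sample_size": 4,   # per-GPU batch size
    "distance_loss": 0.56,
    "coord_loss": 0.78,
}

payload = pickle.dumps(example_logging_output)
print(f"pickled payload: {len(payload)} bytes")
# A buffer orders of magnitude larger than this just pins extra host memory.
```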

@ZhouGengmo
Collaborator

The training script is below (I changed batch size to 4 and --all-gather-list-size to 4096000); the data is the PDBbind dataset provided by the authors.

Reproducing with this setting on a machine with 4x 32 GB V100 GPUs and 128 GB of RAM, the memory monitoring is shown below. Memory usage is fairly stable, peaking at 19.32 GB, so there does not appear to be a memory leak.
[screenshot: memory usage monitoring, peak 19.32 GB]
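
For anyone who wants to reproduce this kind of monitoring curve, here is a minimal sketch that samples host RAM with psutil while training runs separately; the interval and output file are arbitrary choices, not part of the repository:

```python
# Minimal host-memory monitor: sample system RAM usage at a fixed interval
# and append it to a CSV that can be plotted like the charts above.
import csv
import time

import psutil


def monitor(out_path="mem_usage.csv", interval_s=30.0):
    with open(out_path, "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "used_gb", "percent"])
        while True:
            vm = psutil.virtual_memory()
            writer.writerow([time.time(), round(vm.used / 1024**3, 2), vm.percent])
            f.flush()
            time.sleep(interval_s)


if __name__ == "__main__":
    monitor()
```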
