Training not successful #51

Open
xiyangyang99 opened this issue Oct 16, 2023 · 2 comments

Comments

@xiyangyang99

Training fails on 8× RTX 3090.
This is my training command:
CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m torch.distributed.launch --master_port=12000 --nnodes 1 --nproc_per_node 4 train.py --config /home/quchunguang/003-large-model/SAM-Adapter-PyTorch/configs/cod-sam-vit-h.yaml --tag exp1

This is the training log:
/home/quchunguang/anaconda3/envs/SAM-Adapter/lib/python3.8/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

/home/quchunguang/anaconda3/envs/SAM-Adapter/lib/python3.8/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
(the same UserWarning is printed once by each of the 4 worker processes)

It just stays like this indefinitely, and no further training output ever appears.

How can I fix this?
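
Incidentally, the FutureWarning at the top of the log only concerns the launcher: torch.distributed.launch still passes --local_rank as a command-line argument, while torchrun exports it as the LOCAL_RANK environment variable instead. A minimal sketch of reading the local rank either way (assuming train.py parses it with argparse, which I have not checked) would be:

import argparse
import os

# torchrun sets LOCAL_RANK in the environment; torch.distributed.launch
# passes --local_rank on the command line. This default covers both cases.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int,
                    default=int(os.environ.get("LOCAL_RANK", 0)))
args = parser.parse_args()
local_rank = args.local_rank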

@tianrun-chen
Owner

Greetings! As the current application will use over 30 GB of GPU memory even with batch size = 1, we suggest considering graphics cards with greater memory capacity.
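
For reference, each RTX 3090 has 24 GB of memory. A small sketch (not part of this repository) to print what each visible card actually reports:

import torch

# Print the total memory of every visible GPU; an RTX 3090 reports about 24 GiB,
# which is below the ~30 GB needed here at batch size 1.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")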

@xiyangyang99
Author

> Greetings! As the current application will use over 30 GB of GPU memory even with batch size = 1, we suggest considering graphics cards with greater memory capacity.

Thank you for your reply. I am using 8 × NVIDIA RTX 3090 GPUs and the machine has 188 GB of system memory. There is no log output during the training process, and the GPUs show no activity either.
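
If it helps to narrow this down, here is a minimal distributed sanity check (a sketch; the file name minimal_ddp_check.py is invented) that only initializes NCCL and performs a single all_reduce. If this also hangs with no output, the problem is in the distributed setup or NCCL configuration rather than in SAM-Adapter itself.

# minimal_ddp_check.py -- run with: torchrun --nproc_per_node 4 minimal_ddp_check.py
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x)  # sums the tensor across all 4 processes -> 4.0
    print(f"rank {dist.get_rank()}: all_reduce ok, value = {x.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()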
