-
How do I use MPI for distributed training?
-
You can set the launcher to mpi and use mpirun to launch the code.
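For example, a minimal single-node run with 2 processes might look like this (a sketch; the config path is a placeholder):

mpirun -np 2 python tools/train.py configs/your_config.py --launcher mpi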
-
@ZwwWayne Hi, I was also wondering how to use MPI for multi-machine training. Could you give an example here?
-
An example of using mpirun for distributed training with 2 GPUs on 2 nodes (1 GPU per node):

mpirun \
    --allow-run-as-root \
    --npernode 1 --np 2 \
    python tools/train.py ${CONFIG_FILE} --launcher mpi

Note: you should at least set the MASTER_ADDR environment variable, which is required by PyTorch. Refer to https://github.com/open-mmlab/mmcv/blob/master/mmcv/runner/dist_utils.py#L66
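Concretely, with Open MPI you can list the two nodes with -H and export the required variables to every rank with -x. A sketch, assuming two hosts named node1 and node2 (the hostnames and port are placeholders):

export MASTER_ADDR=node1   # address of the rank-0 node; required by PyTorch
export MASTER_PORT=29500   # any free port
mpirun --allow-run-as-root \
    -H node1,node2 --npernode 1 --np 2 \
    -x MASTER_ADDR -x MASTER_PORT \
    python tools/train.py ${CONFIG_FILE} --launcher mpi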
-
@yingfhu Great, thanks for the help.