What scale can mpi-operator support? #648

yxzhao6 · 2024-08-30T05:47:46Z

We are building supercomputing infrastructure for an internal cluster of up to thousands of expensive GPUs.
We are looking to adopt either mpi-operator or Slurm.

Slurm is widely adopted in large-scale HPC, so its scalability is well tested.

Are there any benchmark results for mpi-operator on clusters with more than 3000 GPUs?

tenzen-y · 2024-08-30T10:09:55Z

@terrytangyuan @alculquicondor Do you have any benchmark results?

alculquicondor · 2024-08-30T14:19:22Z

At that scale, the limitations don't come from mpi-operator, but from the network and how pods will land on it.

Do you have more details?

terrytangyuan · 2024-08-30T14:36:51Z

Agreed. I don't think there's anything from the controller side that blocks the scale. I don't have any public benchmark.
