Replies: 1 comment
- If you want to run vLLM data-parallel, use nginx as a load balancer (check 👉 the vLLM docs). BTW, if you are running vLLM on a single-node server, it is better to use tensor-parallel with respect to concurrency and latency (but only if your GPUs can communicate with each other quickly).
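  For illustration, a minimal sketch of that data-parallel setup, assuming the `vllm.entrypoints.openai.api_server` entry point (the module path and flags may differ across vLLM versions) and a placeholder model name: one API server per GPU, each pinned to its device via `CUDA_VISIBLE_DEVICES`, with a load balancer spreading requests across the ports.

  ```python
  # Sketch: start one vLLM OpenAI-compatible server per GPU, each on its own port.
  # A load balancer (e.g. nginx with an upstream block listing ports 8000-8003)
  # then spreads incoming requests across the replicas.
  import os
  import subprocess

  MODEL = "facebook/opt-1.3b"  # placeholder: any model that fits on a single GPU
  NUM_GPUS = 4
  BASE_PORT = 8000

  procs = []
  for gpu in range(NUM_GPUS):
      env = os.environ.copy()
      env["CUDA_VISIBLE_DEVICES"] = str(gpu)  # this server only sees one GPU
      procs.append(subprocess.Popen(
          ["python", "-m", "vllm.entrypoints.openai.api_server",
           "--model", MODEL,
           "--port", str(BASE_PORT + gpu)],
          env=env,
      ))

  # Keep the launcher alive as long as the servers run.
  for p in procs:
      p.wait()
  ```

  The nginx side is then just an `upstream` block listing the four local ports and a `proxy_pass` to it, along the lines of the load-balancing setup the vLLM docs point to.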
- In the doc, it is mentioned that

  If I'm not mistaken, using `tp=4`, the layers of the model will be distributed across 4 GPUs. So, eventually, this would be **model-parallel** inferencing. But since my model can fit into a single GPU, I would like to do data-parallel inference, where each GPU gets a replica of the model. That would allow me to send 4 inputs to 4 GPUs. Is this possible with vLLM? I could probably use Python multiprocessing, but how do I control GPU assignment?
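  For illustration, a minimal sketch of the multiprocessing idea described above, with a placeholder model and placeholder prompts. The key point is that `CUDA_VISIBLE_DEVICES` is set inside each worker before vLLM is imported, so every replica lands on its own GPU.

  ```python
  # Sketch: data-parallel offline inference with one model replica per GPU.
  # Each worker process is pinned to a single GPU via CUDA_VISIBLE_DEVICES,
  # then loads its own vLLM engine and handles a shard of the prompts.
  import os
  import multiprocessing as mp


  def worker(gpu_id, prompts):
      # Restrict this process to one GPU; vLLM then sees it as device 0.
      os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
      from vllm import LLM, SamplingParams  # import after setting the env var

      llm = LLM(model="facebook/opt-1.3b")  # placeholder single-GPU model
      outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
      for out in outputs:
          print(f"GPU {gpu_id}: {out.outputs[0].text!r}")


  if __name__ == "__main__":
      all_prompts = ["Hello", "How are you?", "Tell me a joke", "Summarize vLLM"]
      num_gpus = 4
      # Shard the prompts round-robin across the GPU workers.
      shards = [all_prompts[i::num_gpus] for i in range(num_gpus)]

      ctx = mp.get_context("spawn")  # spawn avoids CUDA issues with fork
      procs = [ctx.Process(target=worker, args=(i, shards[i])) for i in range(num_gpus)]
      for p in procs:
          p.start()
      for p in procs:
          p.join()
  ```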