Does vLLM aim to place one expert entirely on a single device? #13019
Imagium719 asked this question in Q&A · Unanswered · 0 replies
Hi there! I have a question about vLLM's behavior during multi-GPU inference for MoE models. Specifically, does vLLM aim to place each expert entirely on a single device as far as possible, or does it slice each expert and distribute the shards across multiple devices (i.e. tensor parallelism)? I think this matters because whole-expert placement puts minimal demands on GPU interconnect bandwidth, whereas sharding every expert across devices requires significantly higher bandwidth. Thanks in advance for your insights!
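
To illustrate what I mean, here is a minimal sketch of the two placement strategies. This is not vLLM code; the device count, expert count, and weight shapes are made up purely for illustration.

```python
# Illustrative sketch only (not vLLM's implementation): contrasts two ways
# of placing MoE expert weights across N devices. All names and shapes are
# hypothetical.
import numpy as np

num_devices = 4
num_experts = 8
d_model, d_ff = 16, 64

# Full weight of each expert's first FFN projection: shape (d_model, d_ff).
experts = [np.random.randn(d_model, d_ff) for _ in range(num_experts)]

# --- Whole-expert placement (expert parallelism) -----------------------
# Device i holds experts i, i+num_devices, ... in full. Only the routed
# token activations cross the interconnect (an all-to-all on tokens).
ep_placement = {d: [] for d in range(num_devices)}
for e, w in enumerate(experts):
    ep_placement[e % num_devices].append((e, w))            # whole (16, 64) matrix

# --- Sliced placement (tensor parallelism) ------------------------------
# Every device holds a (d_model, d_ff // num_devices) shard of *every*
# expert; each expert matmul then needs a collective (e.g. all-reduce)
# across all devices, which is much heavier on interconnect bandwidth.
tp_placement = {d: [] for d in range(num_devices)}
for e, w in enumerate(experts):
    for d, shard in enumerate(np.split(w, num_devices, axis=1)):
        tp_placement[d].append((e, shard))                   # (16, 16) slice

print("EP: device 0 holds experts", [e for e, _ in ep_placement[0]],
      "each of shape", ep_placement[0][0][1].shape)
print("TP: device 0 holds shards of experts", [e for e, _ in tp_placement[0]],
      "each of shape", tp_placement[0][0][1].shape)
```

My question is which of these two layouts (or some hybrid) vLLM targets when an MoE model is served across multiple GPUs.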