Default refresh metrics interval could be too low #189
Comments
Looking at the metrics code in question, it's more optimistic than I previously thought. Triton reads the metrics as it reads the data that the Python backend writes to shared memory, and vLLM's HTTP server is a multi-process service. Both have a higher ceiling than a single Python process. Also, the load-balancing-related metrics are updated at the end of each request. Of course, exactly what interval should be set still needs to be load tested.
Thanks for filing this @spacewander! One of the unwritten assumptions here is that each backend will be served by a single Gateway (or potentially 2 for a short time as a transition is occurring). We should really write that down in our docs as I think it helps provide context for other similar design decisions.
Interesting. So there won't be too many Gateways per backend.
Filed #197 to follow up, thanks for pointing out the gap in our docs!
Referenced code: gateway-api-inference-extension/pkg/ext-proc/main.go, lines 58 to 62 at commit 1b1d139.
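The embedded snippet did not survive the page export. Those lines most likely define the metrics refresh interval as a command-line flag; the sketch below is a reconstruction under that assumption (the flag names and the pods-refresh default are guesses, only the 50ms metrics default comes from this issue):

```go
// Reconstruction of the referenced flag definitions; check commit 1b1d139
// for the real names and values.
package main

import (
	"flag"
	"time"
)

var (
	// refreshPodsInterval: assumed companion flag for pod discovery.
	refreshPodsInterval = flag.Duration("refreshPodsInterval", 10*time.Second,
		"interval to refresh the set of backend pods")
	// refreshMetricsInterval: the 50ms default discussed in this issue.
	refreshMetricsInterval = flag.Duration("refreshMetricsInterval", 50*time.Millisecond,
		"interval to refresh metrics scraped from the inference engines")
)

func main() {
	flag.Parse()
}
```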
The gateway-api-inference-extension collects metrics from the inference engines for load-balancing decisions every 50ms by default. Assuming 20 inference gateways, each scraping every 50ms (i.e. 20 times per second), every inference engine has to handle 400 metrics requests per second. Taking the currently supported vLLM as an example, serving those requests means executing Python code, either through Triton or by exposing the metrics endpoint directly. A single Python thread can be under a lot of pressure handling 400 requests per second.
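For concreteness, a back-of-the-envelope sketch of that scrape load (a hypothetical helper, not project code):

```go
package main

import (
	"fmt"
	"time"
)

// scrapeQPS is the number of metrics requests per second a single engine
// receives when `gateways` gateways each poll it once per `interval`.
func scrapeQPS(interval time.Duration, gateways int) float64 {
	return float64(gateways) * float64(time.Second) / float64(interval)
}

func main() {
	fmt.Println(scrapeQPS(50*time.Millisecond, 20))  // 400 req/s at the current default
	fmt.Println(scrapeQPS(200*time.Millisecond, 20)) // 100 req/s with a 4x larger interval
}
```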
Also, it's not just about being able to process the requests: they also need to complete within 50ms in the vast majority of cases. If 50ms is only the P90 latency, then 10% of load-balancing decisions will be made on metrics staler than expected. So the scrapes need to finish within 50ms at P99 at least.
Fortunately, inference requests basically can't complete within 50ms (even with prefill/decode disaggregation, 50ms is not enough to finish the prefill phase), so this default can be adjusted up a bit. After all, the number of queued tasks won't change much over such a window (KV cache usage metrics are another story).
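To show where a larger interval would take effect, here is a schematic of the kind of per-engine refresh loop such a flag would drive. This is a simplified sketch, not the actual ext-proc implementation; the endpoint URL, the Metrics type, and scrapeMetrics are made up for illustration:

```go
package main

import (
	"context"
	"log"
	"time"
)

// Metrics holds the load-balancing signals scraped from one engine
// (fields are illustrative).
type Metrics struct {
	WaitingQueueSize int
	KVCacheUsage     float64
}

// scrapeMetrics stands in for an HTTP GET of the engine's metrics
// endpoint plus Prometheus text parsing.
func scrapeMetrics(ctx context.Context, endpoint string) (Metrics, error) {
	return Metrics{}, nil
}

// refreshLoop polls one engine every `interval`; this is where a larger
// default would directly cut the per-engine scrape load.
func refreshLoop(ctx context.Context, endpoint string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if _, err := scrapeMetrics(ctx, endpoint); err != nil {
				log.Printf("metrics refresh failed for %s: %v", endpoint, err)
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	refreshLoop(ctx, "http://vllm-pod:8000/metrics", 200*time.Millisecond)
}
```

Going from 50ms to, say, 200ms would cut the per-engine scrape load by 4x while the queue-depth signal stays essentially as fresh, per the argument above.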