Default refresh metrics interval could be too low #189

Open
spacewander opened this issue Jan 12, 2025 · 4 comments
@spacewander
Contributor

refreshMetricsInterval = flag.Duration(
"refreshMetricsInterval",
50*time.Millisecond,
"interval to refresh metrics")

The gateway-api-inference-extension collects metrics from inference engines every 50ms by default to drive load balancing decisions. With 20 inference gateways each scraping at that interval, every inference engine has to serve 400 metrics requests per second. Taking the currently supported vLLM as an example, serving those requests means executing Python code, either through Triton or by exposing the metrics endpoint directly, and a single Python thread can be under a lot of pressure handling 400 requests per second.
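To make the arithmetic explicit, here is a minimal sketch (not code from the repo; the gateway count is an assumed deployment size) of the per-engine scrape rate implied by the interval:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// The 50ms default comes from the flag quoted above; the gateway count
	// is a hypothetical deployment size used only for illustration.
	refreshInterval := 50 * time.Millisecond
	numGateways := 20

	scrapesPerGatewayPerSec := float64(time.Second) / float64(refreshInterval) // 20
	perEngineLoad := scrapesPerGatewayPerSec * float64(numGateways)            // 400

	fmt.Printf("each gateway scrapes %.0f times/s; with %d gateways an engine serves %.0f metrics requests/s\n",
		scrapesPerGatewayPerSec, numGateways, perEngineLoad)
}
```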

It's also not just about being able to process the requests: they need to complete within 50ms in the vast majority of cases. If 50ms is only the P90 latency, then 10% of refresh cycles feed stale data into load balancing decisions, producing unexpected results. So the metrics endpoint really needs a P99 within 50ms.

Fortunately, inference requests basically can't complete within 50ms (even with PD disaggregation, 50ms is not enough to finish the prefill phase), so this default can be raised a bit. After all, the number of queued requests won't change much within that window (KV cache usage metrics are another story).
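Since it is a standard flag.Duration, operators can already raise it per deployment without code changes. A minimal sketch of the parsing behaviour, assuming the flag is passed on the command line (the 200ms value is purely illustrative, not a recommendation):

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// Same declaration as in the extension: name, 50ms default, help text.
var refreshMetricsInterval = flag.Duration(
	"refreshMetricsInterval",
	50*time.Millisecond,
	"interval to refresh metrics")

func main() {
	// Passing e.g. -refreshMetricsInterval=200ms (illustrative value)
	// overrides the 50ms default and cuts the scrape rate by 4x.
	flag.Parse()
	fmt.Println("refreshing metrics every", *refreshMetricsInterval)
}
```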

@spacewander
Contributor Author

Looking at the metrics code in question, the situation is better than I previously thought. Triton reads the metrics the same way it reads the data the Python backend writes to shared memory, and vLLM's HTTP server is a multi-process service, so both have a higher ceiling than a single Python process. Also, the load-balancing-related metrics are only updated at the end of a request. Exactly what the interval should be still needs to be load tested, though.
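On the load-testing point: a minimal sketch of the kind of probe that would answer the P99 question, assuming the engine exposes a Prometheus-style /metrics endpoint (the URL, request rate, and duration below are placeholders to adjust for the actual setup):

```go
package main

import (
	"fmt"
	"net/http"
	"sort"
	"time"
)

func main() {
	// Placeholders: point this at a real /metrics endpoint and pick a rate
	// matching expected gateway count x refresh frequency.
	const target = "http://localhost:8000/metrics"
	const requestsPerSec = 400
	const duration = 10 * time.Second

	results := make(chan time.Duration, requestsPerSec*int(duration.Seconds()))
	ticker := time.NewTicker(time.Second / requestsPerSec)
	defer ticker.Stop()
	deadline := time.Now().Add(duration)
	inflight := 0

	for time.Now().Before(deadline) {
		<-ticker.C
		inflight++
		go func() {
			start := time.Now()
			resp, err := http.Get(target)
			if err == nil {
				resp.Body.Close()
			}
			// Failed requests are counted with their elapsed time; a real
			// test would track errors separately.
			results <- time.Since(start)
		}()
	}

	latencies := make([]time.Duration, 0, inflight)
	for i := 0; i < inflight; i++ {
		latencies = append(latencies, <-results)
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	p := func(q float64) time.Duration {
		return latencies[int(q*float64(len(latencies)-1))]
	}
	fmt.Printf("p50=%v p90=%v p99=%v (n=%d)\n", p(0.50), p(0.90), p(0.99), len(latencies))
}
```

If the measured P99 stays well under the configured refresh interval at the expected scrape rate, the interval is safe; otherwise it should be raised.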

@robscott
Member

Thanks for filing this @spacewander! One of the unwritten assumptions here is that each backend will be served by a single Gateway (or potentially 2 for a short time as a transition is occurring). We should really write that down in our docs as I think it helps provide context for other similar design decisions.

@spacewander
Contributor Author

Interesting. So there won't be too many Gateways per backend.

@robscott
Member

Filed #197 to follow up, thanks for pointing out the gap in our docs!
