Default refresh metrics interval could be too low #189
Comments
Looking at the metrics code in question, it's more optimistic than I previously thought. Triton reads the metrics as it reads the data that the Python backend writes to shared memory, and vLLM's HTTP server is a multi-process service. Both have a higher ceiling than a single Python process. Also, the load-balancing-related metrics are updated at the end of each request. Of course, exactly what interval should be set still needs to be load tested.
Thanks for filing this @spacewander! One of the unwritten assumptions here is that each backend will be served by a single Gateway (or potentially 2 for a short time as a transition is occurring). We should really write that down in our docs as I think it helps provide context for other similar design decisions.
Interesting. So there won't be too many Gateways per backend.
Filed #197 to follow up, thanks for pointing out the gap in our docs!
Referenced code: gateway-api-inference-extension/pkg/ext-proc/main.go, lines 58 to 62 at commit 1b1d139.
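The embedded snippet did not survive the page export. Those lines most likely define the metrics refresh interval as a command-line flag; the sketch below is a reconstruction under that assumption (the flag names and the pods-refresh default are guesses, only the 50ms metrics default comes from this issue):

```go
// Reconstruction of the referenced flag definitions; check commit 1b1d139
// for the real names and values.
package main

import (
	"flag"
	"time"
)

var (
	// refreshPodsInterval: assumed companion flag for pod discovery.
	refreshPodsInterval = flag.Duration("refreshPodsInterval", 10*time.Second,
		"interval to refresh the set of backend pods")
	// refreshMetricsInterval: the 50ms default discussed in this issue.
	refreshMetricsInterval = flag.Duration("refreshMetricsInterval", 50*time.Millisecond,
		"interval to refresh metrics scraped from the inference engines")
)

func main() {
	flag.Parse()
}
```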
The gateway-api-inference-extension collects metrics from the inference engines for load-balancing decisions every 50ms by default. Assuming 20 inference gateways, each scraping every 50ms (i.e. 20 times per second), every inference engine has to handle 400 metrics requests per second. Taking the currently supported vLLM as an example, serving those requests means executing Python code, either through Triton or by exposing the metrics endpoint directly. A single Python thread can be under a lot of pressure handling 400 requests per second.
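For concreteness, a back-of-the-envelope sketch of that scrape load (a hypothetical helper, not project code):

```go
package main

import (
	"fmt"
	"time"
)

// scrapeQPS is the number of metrics requests per second a single engine
// receives when `gateways` gateways each poll it once per `interval`.
func scrapeQPS(interval time.Duration, gateways int) float64 {
	return float64(gateways) * float64(time.Second) / float64(interval)
}

func main() {
	fmt.Println(scrapeQPS(50*time.Millisecond, 20))  // 400 req/s at the current default
	fmt.Println(scrapeQPS(200*time.Millisecond, 20)) // 100 req/s with a 4x larger interval
}
```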
Also, it's not just about being able to process the requests: they also need to complete within 50ms in the vast majority of cases. If 50ms is only the P90 latency, then 10% of load-balancing decisions will be made on metrics staler than expected. So the scrapes need to finish within 50ms at P99 at least.
Fortunately, inference requests basically can't complete within 50ms (even with prefill/decode disaggregation, 50ms is not enough to finish the prefill phase), so this default can be adjusted up a bit. After all, the number of queued tasks won't change much over such a window (KV cache usage metrics are another story).
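To show where a larger interval would take effect, here is a schematic of the kind of per-engine refresh loop such a flag would drive. This is a simplified sketch, not the actual ext-proc implementation; the endpoint URL, the Metrics type, and scrapeMetrics are made up for illustration:

```go
package main

import (
	"context"
	"log"
	"time"
)

// Metrics holds the load-balancing signals scraped from one engine
// (fields are illustrative).
type Metrics struct {
	WaitingQueueSize int
	KVCacheUsage     float64
}

// scrapeMetrics stands in for an HTTP GET of the engine's metrics
// endpoint plus Prometheus text parsing.
func scrapeMetrics(ctx context.Context, endpoint string) (Metrics, error) {
	return Metrics{}, nil
}

// refreshLoop polls one engine every `interval`; this is where a larger
// default would directly cut the per-engine scrape load.
func refreshLoop(ctx context.Context, endpoint string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if _, err := scrapeMetrics(ctx, endpoint); err != nil {
				log.Printf("metrics refresh failed for %s: %v", endpoint, err)
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	refreshLoop(ctx, "http://vllm-pod:8000/metrics", 200*time.Millisecond)
}
```

Going from 50ms to, say, 200ms would cut the per-engine scrape load by 4x while the queue-depth signal stays essentially as fresh, per the argument above.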