(7/5) [nexus-db-queries] Benchmark for VMM reservation #7498

smklein · 2025-02-06T22:12:47Z

Following up on the affinity work, I wanted to validate that the additional logic for affinity groups does not make the performance of the instance reservation query any worse than it was before.

Results to be posted shortly.

smklein · 2025-02-11T22:03:29Z

I set up this PR to include the following variables:

"Number of VMMs to reserve" (1, 8, 16)
"Number of tasks to concurrently reserve those VMMs" (1, 4, 8)

Then I normalized the total time by both "VMM count" and "task count" - the end result should give me "the average cost to reserve a single VMM", which can be compared directly between test cases with different parameters.

(I would expect the average cost to stay stable as we increase tasks, if there is no contention. Conversely, if there is contention, I would expect the average cost to increase as there are more tasks)
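As a rough illustration of the normalization described above, here is a minimal sketch. It assumes `total` is the summed time spent reserving across all tasks; the function name `per_vmm_cost` and the numbers in `main` are hypothetical and are not taken from the benchmark code in this PR.

```rust
use std::time::Duration;

/// Average cost to reserve a single VMM, normalizing the total time by
/// both the task count and the VMM count.
/// Hypothetical helper: assumes `total` is summed across all tasks and
/// that `tasks * vmms` is non-zero (division by zero would panic).
fn per_vmm_cost(total: Duration, tasks: u32, vmms: u32) -> Duration {
    total / (tasks * vmms)
}

fn main() {
    // Illustrative numbers only, not measurements from this PR.
    let uncontended = per_vmm_cost(Duration::from_millis(200), 1, 8);
    let contended = per_vmm_cost(Duration::from_millis(6400), 4, 8);

    println!("1 task:  {:?} per VMM", uncontended);
    println!("4 tasks: {:?} per VMM", contended);

    // With no contention the two values would be roughly equal; a larger
    // per-VMM cost at higher task counts points at contention.
    assert!(contended > uncontended);
}
```

Because the result is already divided by both counts, values from test cases with different parameters can be compared directly.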

I used the following diff as the "before" state (basically, "skip all the affinity stuff, pick the first sled returned like we used to do before"):

--- a/nexus/db-queries/src/db/datastore/sled.rs
+++ b/nexus/db-queries/src/db/datastore/sled.rs
@@ -565,27 +565,30 @@ impl DataStore {
                         "sled_ids" => ?sled_targets,
                     );
 
-                    let anti_affinity_sleds = lookup_anti_affinity_sleds_query(
-                        instance_id,
-                    ).get_results_async::<(AffinityPolicy, Uuid)>(&conn).await?;
-
-                    let affinity_sleds = lookup_affinity_sleds_query(
-                        instance_id,
-                    ).get_results_async::<(AffinityPolicy, Uuid)>(&conn).await?;
-
-                    let targets: HashSet<SledUuid> = sled_targets
-                        .into_iter()
-                        .map(|id| SledUuid::from_untyped_uuid(id))
-                        .collect();
-
-                    let sled_target = pick_sled_reservation_target(
-                        &opctx.log,
-                        targets,
-                        anti_affinity_sleds,
-                        affinity_sleds,
-                    ).map_err(|e| {
-                        err.bail(e)
-                    })?;
+//                    let anti_affinity_sleds = lookup_anti_affinity_sleds_query(
+//                        instance_id,
+//                    ).get_results_async::<(AffinityPolicy, Uuid)>(&conn).await?;
+//
+//                    let affinity_sleds = lookup_affinity_sleds_query(
+//                        instance_id,
+//                    ).get_results_async::<(AffinityPolicy, Uuid)>(&conn).await?;
+//
+//                    let targets: HashSet<SledUuid> = sled_targets
+//                        .into_iter()
+//                        .map(|id| SledUuid::from_untyped_uuid(id))
+//                        .collect();
+//
+//                    let sled_target = pick_sled_reservation_target(
+//                        &opctx.log,
+//                        targets,
+//                        anti_affinity_sleds,
+//                        affinity_sleds,
+//                    ).map_err(|e| {
+//                        err.bail(e)
+//                    })?;
+                    let Some(sled_target) = sled_targets.get(0).map(|id| SledUuid::from_untyped_uuid(*id)) else {
+                        return Err(err.bail(SledReservationError::NotFound));
+                    };
 
                     // Create a SledResourceVmm record, associate it with the target
                     // sled.

Here's my "before affinity" results:

vmm-reservation/1-tasks-1-vmms
                        time:   [24.820 ms 25.239 ms 26.022 ms]
vmm-reservation/1-tasks-8-vmms
                        time:   [25.511 ms 26.321 ms 27.123 ms]
vmm-reservation/1-tasks-16-vmms
                        time:   [25.389 ms 26.101 ms 26.796 ms]

(so far, looks normal: ~25ms to provision a vmm, under no contention)

vmm-reservation/4-tasks-1-vmms
                        time:   [178.52 ms 194.84 ms 211.42 ms]
vmm-reservation/4-tasks-8-vmms
                        time:   [328.27 ms 351.31 ms 376.81 ms]
Indexes which are experiencing contention
 table_name        | index_name                  | num_contention_events |
--------------------------------------------------------------------------
 sled_resource_vmm | lookup_vmm_resource_by_sled | 48                    |

Tables which are experiencing contention
 table_name        | num_contention_events |
--------------------------------------------
 sled_resource_vmm | 48                    |

Top ten longest contention events, grouped by table + index
 table_name        | index_name                  | events | time            |
-----------------------------------------------------------------------------
 sled_resource_vmm | lookup_vmm_resource_by_sled | 48     | 00:00:04.374975 |

vmm-reservation/4-tasks-16-vmms
                        time:   [400.36 ms 425.86 ms 455.52 ms]
Indexes which are experiencing contention
 table_name        | index_name                  | num_contention_events |
--------------------------------------------------------------------------
 sled_resource_vmm | lookup_vmm_resource_by_sled | 27                    |

Tables which are experiencing contention
 table_name        | num_contention_events |
--------------------------------------------
 sled_resource_vmm | 27                    |

Top ten longest contention events, grouped by table + index
 table_name        | index_name                  | events | time            |
-----------------------------------------------------------------------------
 sled_resource_vmm | lookup_vmm_resource_by_sled | 27     | 00:00:02.758899 |

vmm-reservation/8-tasks-1-vmms
                        time:   [608.84 ms 650.40 ms 697.09 ms]
Indexes which are experiencing contention
 table_name        | index_name                  | num_contention_events |
--------------------------------------------------------------------------
 sled_resource_vmm | lookup_vmm_resource_by_sled | 6                     |

Tables which are experiencing contention
 table_name        | num_contention_events |
--------------------------------------------
 sled_resource_vmm | 6                     |

Top ten longest contention events, grouped by table + index
 table_name        | index_name                  | events | time           |
----------------------------------------------------------------------------
 sled_resource_vmm | lookup_vmm_resource_by_sled | 6      | 00:00:00.55869 |

vmm-reservation/8-tasks-8-vmms
                        time:   [1.4944 s 1.5739 s 1.6575 s]
Indexes which are experiencing contention
 table_name        | index_name                  | num_contention_events |
--------------------------------------------------------------------------
 sled_resource_vmm | lookup_vmm_resource_by_sled | 85                    |

Tables which are experiencing contention
 table_name        | num_contention_events |
--------------------------------------------
 sled_resource_vmm | 85                    |

Top ten longest contention events, grouped by table + index
 table_name        | index_name                  | events | time            |
-----------------------------------------------------------------------------
 sled_resource_vmm | lookup_vmm_resource_by_sled | 85     | 00:00:11.014805 |

vmm-reservation/8-tasks-16-vmms
                        time:   [1.7373 s 1.7945 s 1.8472 s]
Indexes which are experiencing contention
 table_name        | index_name                  | num_contention_events |
--------------------------------------------------------------------------
 sled_resource_vmm | lookup_vmm_resource_by_sled | 240                   |

Tables which are experiencing contention
 table_name        | num_contention_events |
--------------------------------------------
 sled_resource_vmm | 240                   |

Top ten longest contention events, grouped by table + index
 table_name        | index_name                  | events | time          |
---------------------------------------------------------------------------
 sled_resource_vmm | lookup_vmm_resource_by_sled | 240    | 00:00:30.6663 |

This is not good, even without any affinity group queries. Under contention, the average time to provision a VMM gets significantly more expensive: from ~25 ms with a single task to ~1.8 s in the 8-tasks-16-vmms case.
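To put a number on "significantly more expensive", a small sketch like the following compares each contended median against the uncontended baseline. The medians are copied from the "before affinity" output above; the helper name `slowdown` is hypothetical.

```rust
/// Ratio of a contended median time to the uncontended baseline median.
/// Hypothetical helper, used only for this comparison.
fn slowdown(baseline_ms: f64, contended_ms: f64) -> f64 {
    contended_ms / baseline_ms
}

fn main() {
    // Median of vmm-reservation/1-tasks-1-vmms, in milliseconds.
    let baseline_ms = 25.239;

    // Median times (ms) for the contended cases, from the output above.
    let cases = [
        ("4-tasks-1-vmms", 194.84),
        ("4-tasks-8-vmms", 351.31),
        ("4-tasks-16-vmms", 425.86),
        ("8-tasks-1-vmms", 650.40),
        ("8-tasks-8-vmms", 1573.9),
        ("8-tasks-16-vmms", 1794.5),
    ];

    for &(name, ms) in cases.iter() {
        println!("{}: {:.1}x slower than baseline", name, slowdown(baseline_ms, ms));
    }
}
```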

Here's what I'm seeing afterwards, with the affinity logic enabled:

vmm-reservation/1-tasks-1-vmms
                        time:   [30.052 ms 30.831 ms 31.711 ms]
                        change: [+14.741% +20.115% +25.298%] (p = 0.00 < 0.05)
                        Performance has regressed.
vmm-reservation/1-tasks-8-vmms
                        time:   [30.808 ms 32.275 ms 33.270 ms]
                        change: [+16.103% +22.624% +28.348%] (p = 0.00 < 0.05)
                        Performance has regressed.
vmm-reservation/1-tasks-16-vmms
                        time:   [31.554 ms 32.854 ms 34.455 ms]
                        change: [+19.855% +25.872% +32.714%] (p = 0.00 < 0.05)
                        Performance has regressed.

Indexes which are experiencing contention
 table_name | index_name   | num_contention_events |
----------------------------------------------------
 project    | project_pkey | 1                     |

Tables which are experiencing contention
 table_name | num_contention_events |
-------------------------------------
 project    | 1                     |

Top ten longest contention events, grouped by table + index
 table_name | index_name   | events | time           |                                                                                                                                                                                                                                                                                                                                                                                                                                        
------------------------------------------------------                                                                                                                                                                                                                                                                                                                                                                                                                                        
 project    | project_pkey | 1      | 00:00:00.03319 |                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                               
vmm-reservation/4-tasks-1-vmms                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                        time:   [217.80 ms 232.71 ms 246.85 ms]                                                                                                                                                                                                                                                                                                                                                                                                                               
                        change: [+7.3937% +19.440% +32.538%] (p = 0.00 < 0.05)                                                                                                                                                                 
                        Change within noise threshold.                                                                                                                                             
Indexes which are experiencing contention                                                                              
 table_name        | index_name                  | num_contention_events |                                                                                                                                                                     
--------------------------------------------------------------------------                                                                                                                                                                     
 sled_resource_vmm | lookup_vmm_resource_by_sled | 9                     |                                                                                                                                                                     

Tables which are experiencing contention                                                                               
 table_name        | num_contention_events |                                                                           
--------------------------------------------                                                                           
 sled_resource_vmm | 9                     |                                                                           

Top ten longest contention events, grouped by table + index                                                            
 table_name        | index_name                  | events | time            |                                                                                                                                                                  
-----------------------------------------------------------------------------                                                                                                                                                                  
 sled_resource_vmm | lookup_vmm_resource_by_sled | 9      | 00:00:00.612948 |                                                                                                                                                                  
                                                                                                                                         
vmm-reservation/4-tasks-8-vmms                                                                                         
                        time:   [369.34 ms 386.13 ms 402.15 ms]                                                        
                        change: [+0.9448% +9.9104% +19.396%] (p = 0.04 < 0.05)                                                                                                                                                                 
                        Change within noise threshold.                                                                 
Indexes which are experiencing contention                                                                              
 table_name        | index_name                  | num_contention_events |                                                                                                                                                                     
--------------------------------------------------------------------------                                                                                                                                                                     
 sled_resource_vmm | lookup_vmm_resource_by_sled | 7                     |                                                                                                                                                                     

Tables which are experiencing contention                                                                               
 table_name        | num_contention_events |                                                                           
--------------------------------------------                                                                           
 sled_resource_vmm | 7                     |                                                                           

Top ten longest contention events, grouped by table + index                                                            
 table_name        | index_name                  | events | time            |                                                                                                                                                                  
-----------------------------------------------------------------------------                                                                                                                                                                  
 sled_resource_vmm | lookup_vmm_resource_by_sled | 7      | 00:00:00.693713 |                                                                                                                                                                  

vmm-reservation/8-tasks-8-vmms                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                        time:   [1.5965 s 1.6827 s 1.7774 s]                                                                                                                                                                                                                                                                                                                                                                                                                                  
                        change: [-0.4969% +6.9083% +15.201%] (p = 0.12 > 0.05)                                                                                                                                                                                                                                                                                                                                                                                                                
                        No change in performance detected.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
Indexes which are experiencing contention                                                                                                                                                                                                                                                                                                                                                                                                                                                     
 table_name        | index_name                  | num_contention_events |                                                                                                                                                                                                                                                                                                                                                                                                                    
--------------------------------------------------------------------------                                                                                                                                                                                                                                                                                                                                                                                                                    
 sled_resource_vmm | lookup_vmm_resource_by_sled | 166                   |                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
Tables which are experiencing contention                                                                                                                                                                                                                                                                                                                                                                                                                                                      
 table_name        | num_contention_events |                                                                                                                                                                                                                                                                                                                                                                                                                                                  
--------------------------------------------                                                                                                                                                                                                                                                                                                                                                                                                                                                  
 sled_resource_vmm | 166                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
Top ten longest contention events, grouped by table + index                                                                                                                                                                                                                                                                                                                                                                                                                                   
 table_name        | index_name                  | events | time           |                                                                                                                                                                                                                                                                                                                                                                                                                  
----------------------------------------------------------------------------                                                                                                                                                                                                                                                                                                                                                                                                                  
 sled_resource_vmm | lookup_vmm_resource_by_sled | 166    | 00:00:20.11064 |

vmm-reservation/8-tasks-16-vmms                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                        time:   [1.7266 s 1.8141 s 1.9078 s]                                                                                                                                                                                                                                                                                                                                                                                                                                  
                        change: [-5.0187% +1.0892% +7.6023%] (p = 0.74 > 0.05)                                                                                                                                                                                                                                                                                                                                                                                                                
                        No change in performance detected.                                                                                                                                                                                                                                                                                                                                                                                                                                    
Indexes which are experiencing contention                                                                                                                                                                                                                                                                                                                                                                                                                                                     
 table_name        | index_name                  | num_contention_events |                                                                                                                                                                                                                                                                                                                                                                                                                    
--------------------------------------------------------------------------                                                                                                                                                                                                                                                                                                                                                                                                                    
 sled_resource_vmm | lookup_vmm_resource_by_sled | 117                   |                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
Tables which are experiencing contention                                                                                                                                                                                                                                                                                                                                                                                                                                                      
 table_name        | num_contention_events |                                                                                                                                                                                                                                                                                                                                                                                                                                                  
--------------------------------------------                                                                                                                                                                                                                                                                                                                                                                                                                                                  
 sled_resource_vmm | 117                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
Top ten longest contention events, grouped by table + index                                                                                                                                                                                                                                                                                                                                                                                                                                   
 table_name        | index_name                  | events | time            |                                                                                                                                                                                                                                                                                                                                                                                                                 
-----------------------------------------------------------------------------                                                                                                                                                                                                                                                                                                                                                                                                                 
 sled_resource_vmm | lookup_vmm_resource_by_sled | 117    | 00:00:14.988616 |

The results here seem to indicate the following to me:

The addition of "queries for affinity groups" introduces a small overhead to the reservation query. This is most clearly visible in the "single task, uncontended case". To some extent, I think this cost is unavoidable - we are asking the query to do more than it was before.
Independent of affinity groups, VMM resource reservation is vulnerable to major transaction contention, which can significantly increase the "average cost to provision a VMM" when multiple tasks are making such a request concurrently.
The addition of affinity group queries - at least in scenarios where we don't have affinity group work to do, which I suspect will be the common case - does not make these "high contention cases" much worse. However, they're already bad from the get-go, so, take that news as you will.
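
To make "average cost to provision a VMM" concrete, here is a minimal sketch of the normalization and of a slowdown comparison between two benchmark cases. The function names are hypothetical (they are not taken from the benchmark code), and it assumes each task reserves the same number of VMMs.

```rust
use std::time::Duration;

/// Average cost to reserve one VMM: total wall-clock time divided by the
/// number of reservations made (tasks * VMMs per task).
/// Returns None if no reservations were made or the count overflows.
fn cost_per_vmm(total: Duration, tasks: u32, vmms_per_task: u32) -> Option<Duration> {
    let reservations = tasks.checked_mul(vmms_per_task)?;
    if reservations == 0 {
        return None;
    }
    Some(total / reservations)
}

/// How many times more expensive a contended case is than a baseline,
/// given two already-normalized per-VMM costs.
fn slowdown(contended: Duration, baseline: Duration) -> f64 {
    contended.as_secs_f64() / baseline.as_secs_f64()
}
```

Plugging in the median times reported above, `slowdown` for 8-tasks-8-vmms (1.6827 s) against 4-tasks-8-vmms (386.13 ms) comes out to roughly 4.4x, which is the kind of growth I'd expect from contention rather than from the affinity queries.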

smklein · 2025-02-11T22:56:19Z

Some notes based on my attempts to reduce contention within the transaction:

What this transaction is doing

This transaction is roughly:

SELECT-ing all viable sleds for allocation, JOIN-ed with existing sled_resource_vmm rows
It then GROUPs BY sled, and uses HAVING clauses to keep only the sleds that could have space for the proposed sled_resource_vmm
(With the addition of affinity stuff) we also query for reservations within our affinity/anti-affinity groups, and only pick a sled target based on where those existing sled_resource_vmm records are placed
Finally, we INSERT a record into the sled_resource_vmm table, if we can make an allocation.
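
As a rough in-memory analogue of the SELECT / GROUP BY / HAVING step, the sketch below sums existing reservations per sled and keeps only the sleds that still have room. The types and the single "threads" resource are simplifying assumptions for illustration; this is not the real schema or query.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct SledId(u32);

/// Hypothetical, simplified sled: one resource dimension only.
struct Sled {
    id: SledId,
    usable_threads: u32,
}

/// Hypothetical stand-in for a sled_resource_vmm row.
struct Reservation {
    sled: SledId,
    threads: u32,
}

/// Returns the sleds that could fit a reservation of `requested` threads.
fn sleds_with_space(sleds: &[Sled], existing: &[Reservation], requested: u32) -> Vec<SledId> {
    // "GROUP BY sled": sum the threads already reserved on each sled.
    let mut used: HashMap<SledId, u32> = HashMap::new();
    for r in existing {
        *used.entry(r.sled).or_insert(0) += r.threads;
    }
    // "HAVING": keep sleds where current usage plus the request still fits.
    sleds
        .iter()
        .filter(|s| {
            let in_use = used.get(&s.id).copied().unwrap_or(0);
            in_use.saturating_add(requested) <= s.usable_threads
        })
        .map(|s| s.id)
        .collect()
}
```

Note that the aggregation has to look at every existing reservation, which is the same property that makes the real transaction sensitive to concurrent INSERTs.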

From a contention perspective:

We read from the sled, affinity_group, anti_affinity_group, and sled_resource_vmm tables. Of all of those, the sled_resource_vmm table seems most likely to change.
To understand the "free space" on a sled, we need to read effectively all sled_resource_vmm records.
We're experiencing contention on that table because every one of these transactions reads from sled_resource_vmm and later INSERTs into it. Each INSERT invalidates concurrent reads, and can force all other transactions to restart.
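
A toy model of that restart behavior, to show why cost grows with task count. This is a deliberately pessimistic sketch, not how CockroachDB actually schedules or retries transactions: it assumes every pending transaction reads the same snapshot, and that any commit invalidates every other reader. All names are hypothetical.

```rust
/// Toy table with a version counter that is bumped on every INSERT.
struct VersionedTable {
    version: u64,
    rows: Vec<u32>,
}

/// Runs `txns` read-then-insert transactions "concurrently" and returns the
/// total number of restarts across all of them.
fn run_concurrent_inserts(table: &mut VersionedTable, txns: u32) -> u32 {
    let mut pending: Vec<u32> = (0..txns).collect();
    let mut restarts = 0;
    while !pending.is_empty() {
        // Every pending transaction reads the table at the same version.
        let snapshot = table.version;
        let mut retry = Vec::new();
        for id in pending {
            if table.version == snapshot {
                // Nothing changed since our read: the INSERT commits.
                table.rows.push(id);
                table.version += 1;
            } else {
                // Someone else inserted first: our read is stale, restart.
                restarts += 1;
                retry.push(id);
            }
        }
        pending = retry;
    }
    restarts
}
```

In this model one transaction commits per round, so N concurrent transactions incur N(N-1)/2 restarts in total (28 restarts for 8 tasks). The real numbers above are not that bad, but the trend is in the same direction.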

Ways to reduce contention

SELECT FOR UPDATE OF sled_resource_vmm

https://www.cockroachlabs.com/docs/v22.1/select-for-update

"SELECT FOR UPDATE" can be used to lock the rows we're trying to access, to prevent concurrent transactions from thrashing in a retry loop.

It's possible to add the following SQL to the start of the transaction:

   -- Lock the sled_resource_vmm rows that belong to active, non-deleted sleds.
   SELECT 1
   FROM sled_resource_vmm
   INNER JOIN sled ON sled.id = sled_resource_vmm.sled_id
   WHERE
     sled.policy = 'active' AND
     sled.time_deleted IS NULL
   FOR UPDATE OF sled_resource_vmm;

https://www.cockroachlabs.com/docs/stable/read-committed#when-to-use-locking-reads

Unfortunately, this query does not reduce contention, because of an issue called phantom reads. As the CockroachDB documentation puts it:

Note that locking reads do not prevent phantom reads that are caused by the insertion of new rows, since only existing rows can be locked.

In other words, although we can lock all existing sled_resource_vmm rows, this isn't enough -- the INSERT of a new row still invalidates prior reads, so the lock does not prevent the contention we're seeing.
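
The phantom-read problem can be illustrated with a toy model. This is only a sketch with hypothetical names, assuming a single lock holder and row-level locks; it is not CockroachDB's locking implementation.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy table: row id -> hardware threads reserved by that row.
struct LockingTable {
    rows: BTreeMap<u32, u32>,
    locked: BTreeSet<u32>,
}

impl LockingTable {
    /// Models a locking read: locks every row that exists right now and
    /// returns the total the reader observed.
    fn select_for_update(&mut self) -> u32 {
        self.locked = self.rows.keys().copied().collect();
        self.rows.values().sum()
    }

    /// A write to an existing row from another transaction is blocked
    /// while that row is locked.
    fn update_from_other_txn(&mut self, id: u32, threads: u32) -> Result<(), &'static str> {
        if !self.rows.contains_key(&id) {
            return Err("no such row");
        }
        if self.locked.contains(&id) {
            return Err("row is locked");
        }
        self.rows.insert(id, threads);
        Ok(())
    }

    /// An INSERT of a brand-new row: there is no existing row to lock,
    /// so nothing blocks it.
    fn insert_from_other_txn(&mut self, id: u32, threads: u32) -> Result<(), &'static str> {
        if self.rows.contains_key(&id) {
            return Err("row already exists");
        }
        self.rows.insert(id, threads);
        Ok(())
    }

    /// Re-reads the total, as a second read in the first transaction would.
    fn total(&self) -> u32 {
        self.rows.values().sum()
    }
}
```

After `select_for_update`, an update to a locked row is rejected, but an insert of a new row succeeds and changes `total()`, so the reader's earlier view of "space used" is stale even though every row it read was locked.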

SELECT FOR UPDATE... but multiple tables

https://www.cockroachlabs.com/docs/v22.1/select-for-update

It's possible to add the following to the start of the "sled_reservation_create" function:

   -- Same locking read, but without "OF": rows from both joined tables are locked.
   SELECT 1
   FROM sled_resource_vmm
   INNER JOIN sled ON sled.id = sled_resource_vmm.sled_id
   WHERE
     sled.policy = 'active' AND
     sled.time_deleted IS NULL
   FOR UPDATE;

Note - this differs from the prior query by dropping the OF sled_resource_vmm -- this query locks that table, as well as the sled table.

This has mixed results: overhead increases in the "low-contention" cases, but the "high-contention" cases get noticeably better. In the worst high-contention case, a VMM allocation takes ~600ms instead of ~1600ms, which is better, but still slow. However, this path seems unfortunate, because it would also lock out all other concurrent traffic on the sled table, even though that table isn't changing here.

Other options to consider?

Converting this entire transaction to a CTE (this will not be trivial, it has fairly complex logic which would need translation to SQL). This would reduce the round-trip time, and make it possible to retry within CRDB.
Restructure the underlying tables to make this less contentious. We could reduce the phantom read issue for the transaction if we didn't need to read all sled_resource_vmm records to find viable sleds - e.g., if we had a table representing "free space on a sled", which could be decremented, we wouldn't need to read other sled_resource_vmm records within this transaction (this is true for the not-using-affinity cases -- if we use affinity groups, I think this contention may still exist, since we need to explicitly look where other group members are located).
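
A minimal sketch of the "free space on a sled" idea, for the not-using-affinity case. The struct and field names are hypothetical and it tracks a single resource; the point is only that a reservation reads and decrements one counter per sled, instead of aggregating over every sled_resource_vmm record.

```rust
use std::collections::BTreeMap;

/// Hypothetical "free space" table: sled id -> free hardware threads.
struct SledFreeSpace {
    free_threads: BTreeMap<u32, u32>,
}

impl SledFreeSpace {
    /// Picks the first sled (in id order) with enough free threads,
    /// decrements its counter, and returns its id.
    /// Returns None if no sled can fit the request.
    fn reserve(&mut self, threads: u32) -> Option<u32> {
        let (&sled, free) = self
            .free_threads
            .iter_mut()
            .find(|(_, free)| **free >= threads)?;
        *free -= threads;
        Some(sled)
    }
}
```

With this shape, two reservations only touch the same data if they land on the same sled, so a new reservation on one sled no longer invalidates reads made for another.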

smklein added 9 commits February 6, 2025 14:09

[nexus-db-queries] Benchmark for VMM reservation

79b4252

Tweak usable hardware threads to make instance placement less flaky

04a4b98

Normalize reservation time, only benchmark creation pathway

db40d05

Normalize

b68239b

cleanup

ca8f890

Better contention info

5406dd4

Better contention information

bb0f349

restructure benchmark

6704be1

more refactoring

127285c
