(7/5) [nexus-db-queries] Benchmark for VMM reservation #7498
base: sled-resource-vmm
Conversation
I set up this PR to include the following variables:
Then I normalized the total time by both "VMM count" and "task count". The end result should give me "the average cost to reserve a single VMM", which can be compared directly between test cases with different parameters. (If there is no contention, I would expect the average cost to stay stable as we increase tasks; conversely, if there is contention, I would expect the average cost to increase as we add more tasks.)

With the following diff acting as a "before" state (basically: skip all the affinity stuff and pick the first sled returned, like we used to do before):
Here's my "before affinity" results:
(So far, this looks normal: ~25 ms to provision a VMM under no contention.)
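One plausible reading of the normalization described above, sketched in Python. (This is my own illustration, not code from the PR; the 800 ms / 4 VMM / 8 task figures are hypothetical, chosen so the result lines up with the ~25 ms baseline.)

```python
def avg_cost_per_vmm(total_time_ms: float, vmm_count: int, task_count: int) -> float:
    """Normalize a benchmark run's total wall-clock time down to the
    average cost of reserving a single VMM, so runs with different
    parameters can be compared directly."""
    return total_time_ms / (vmm_count * task_count)

# Hypothetical run: 8 tasks, each reserving 4 VMMs, 800 ms total.
print(avg_cost_per_vmm(800.0, vmm_count=4, task_count=8))  # 25.0
```

If contention is absent, this per-VMM number should stay flat as task count grows; if it climbs with task count, the tasks are interfering with each other.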
This is not good, even without any affinity group queries: under contention, we're seeing the average time to provision a VMM get significantly more expensive. Here's what I'm seeing afterwards:
The results here seem to indicate the following to me:
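To build intuition for why the average cost per VMM climbs with task count under contention, here's a toy model (my own sketch, not the actual benchmark or database) of optimistic transaction retries, where conflicting attempts serialize and every loser pays for its failed attempt:

```python
def avg_attempts_per_success(tasks: int) -> float:
    """Toy model: `tasks` concurrent transactions each retry until they
    commit; only one can commit per round, and every still-running task
    burns one attempt per round. Returns average attempts per commit."""
    attempts = 0
    remaining = tasks
    while remaining > 0:
        attempts += remaining  # every still-running task tries this round
        remaining -= 1         # exactly one wins and commits
    return attempts / tasks

print(avg_attempts_per_success(1))  # 1.0 -> no contention, no retries
print(avg_attempts_per_success(8))  # 4.5 -> wasted work grows with tasks
```

In this model the average work per successful commit grows roughly linearly with the number of contending tasks, which matches the shape of the regression seen here: per-VMM cost that is flat at low task counts but balloons at high ones.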
Some notes based on my attempts to reduce contention within the transaction:

What this transaction is doing

This transaction is roughly:
From a contention perspective:
Ways to reduce contention

SELECT FOR UPDATE OF sled_resource_vmm

https://www.cockroachlabs.com/docs/v22.1/select-for-update

"SELECT FOR UPDATE" can be used to lock the rows we're trying to access, preventing concurrent transactions from thrashing in a retry loop. It's possible to add the following SQL to the start of the transaction:

```sql
SELECT 1
FROM sled_resource_vmm
INNER JOIN sled ON sled.id = sled_resource_vmm.sled_id
WHERE
  sled.policy = 'active' AND
  sled.time_deleted IS NULL
FOR UPDATE OF sled_resource_vmm
```

https://www.cockroachlabs.com/docs/stable/read-committed#when-to-use-locking-reads

Unfortunately, this query does not reduce contention, because of an issue called phantom reads: as the CockroachDB documentation describes, FOR UPDATE only locks the rows that exist when the read executes, not rows inserted later that would have matched the predicate.
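The phantom-read hazard can be sketched with a toy in-memory model (no real database involved; the `sled_id`/`active` fields here are illustrative, not the actual schema):

```python
# Toy illustration of a phantom read: FOR UPDATE locks the rows that
# exist when the SELECT runs, but a concurrent transaction can still
# insert a *new* row matching the same predicate.
rows = [{"sled_id": 1, "active": True}, {"sled_id": 2, "active": False}]

# Transaction A: "lock" every currently-active row (what FOR UPDATE does).
locked = {id(r) for r in rows if r["active"]}

# Transaction B: insert a brand-new active row; no row lock prevents this.
rows.append({"sled_id": 3, "active": True})

# Rows matching A's predicate that A never locked: the phantoms.
phantoms = [r for r in rows if r["active"] and id(r) not in locked]
print(len(phantoms))  # 1
```

Because the new row still conflicts with transaction A's predicate at commit time, the locking read alone doesn't eliminate the retries.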
In other words, although we can lock all of the rows that currently exist in `sled_resource_vmm`, this doesn't stop concurrent transactions from inserting new rows that match the same predicate.

SELECT FOR UPDATE... but multiple tables

https://www.cockroachlabs.com/docs/v22.1/select-for-update

It's possible to add the following to the start of the "sled_reservation_create" function:
Note - this differs from the prior request by dropping the

This has mixed results: it increases overhead in the "low-contention" cases, but makes the "high-contention" cases a fair bit better. Our "worst case, high-contention" runs improve somewhat - instead of taking ~1600 ms per VMM allocation, they take ~600 ms per allocation, which is better, but still slow.

However, this path seems unfortunate, because it would also lock out all other concurrent traffic on the

Other options to consider?
Following up on the affinity work, I wanted to validate that the additional logic for affinity groups does not make the performance of the instance reservation query any worse than it was before.
Results to be posted shortly.