-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve tablet types to wait #17622
base: main
Are you sure you want to change the base?
Improve tablet types to wait #17622
Conversation
Signed-off-by: Eduardo J. Ortega U. <[email protected]>
Signed-off-by: Eduardo J. Ortega U. <[email protected]>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #17622 +/- ##
==========================================
+ Coverage 67.75% 67.77% +0.02%
==========================================
Files 1586 1587 +1
Lines 255726 255813 +87
==========================================
+ Hits 173261 173383 +122
+ Misses 82465 82430 -35 ☔ View full report in Codecov by Sentry. |
Signed-off-by: Eduardo J. Ortega U. <[email protected]>
The test needed an extra tablet because the vtgate start-up script is waiting for primary and replica, but we were only provisioning primary. Signed-off-by: Eduardo J. Ortega U. <[email protected]>
@@ -30,8 +30,10 @@ fi | |||
CELL=zone1 ../common/scripts/vtctld-up.sh | |||
|
|||
# start unsharded keyspace and tablet | |||
CELL=zone1 TABLET_UID=100 ../common/scripts/mysqlctl-up.sh | |||
SHARD=0 CELL=zone1 KEYSPACE=main TABLET_UID=100 ../common/scripts/vttablet-up.sh | |||
for T_UID in 100 101; do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need two tablets here because the vtgate
start-up script waits for primary,replica
and with a single tablet we end up with no replicas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we change vtgate to only wait for PRIMARY?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The script that starts the vtgate
(../common/vtgate-up.sh
) is common for all tests, not only region_sharding
, so changing it would affect all tests. It seemed less intrusive to modify this test only by having an extra vttablet
. But if you prefer, I can do it that way. LMK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, that's fine. If people start running out of resources while running the example, we can revisit.
Signed-off-by: Eduardo J. Ortega U. <[email protected]>
This reverts commit bc1d135. Signed-off-by: Eduardo J. Ortega U. <[email protected]>
This test is no longer applicable as the vtgate won't start without tablets of the typs specified in --tablet_types_to_wait Signed-off-by: Eduardo J. Ortega U. <[email protected]>
) | ||
|
||
func TestLoadKeyspaceWithNoTablet(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test case was removed as discussed on Slack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is an empty string an allowed value for --tablet_types_to_wait
? Or do we validate and reject it and default to PRIMARY, REPLICA
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I understood that default is empty but it fails the validation, so at least one tablet type have to be provided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I understood that default is empty but it fails the validation, so at least one tablet type have to be provided.
Correct, the vtgate
refuses to start with empty string. Furthermore, it validates that the string corresponds to a serving tablet type.
case errors.Is(err, context.DeadlineExceeded): | ||
log.Warning("TabletGateway timed out waiting for tablets to become available - retrying.") | ||
|
||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would require a new context for retry otherwise it will keep failing with context.DeadlineExceeded
we should add a test for this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unsure I follow what you would like the behavior to be, could you please elaborate? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should check the parent context sent to WaitForTablets to ensure it is still valid. If the context has already expired, the retry will continue to fail with the same error.
The key decision here is whether we should stop retrying and fail or attempt again with a new context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks for the clarification. However, I am unsure this is necessary. The upstream context passed by Init()
to WaitForTablets()
does not seem to be have a timeout; it's just a context.WithCancel()
. It is WaitForTablets()
who creates a new context with the timeout given by --gateway_initial_tablet_timeout
. Given that, it seems to me that the retries would not fail with the same error - unless the timeout is hit again. Is that not the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is my expectation as well.
@ejortegau this still needs a note in the release notes summary file. |
Description
This PR improves handling of
vtgate --tablet_types_to_wait
so that timeouts waiting for tablets are interpretedas not having successfully gotten healthchecks for all tablet types in all keyspaces of interest, leading to
retries instead of simply logging the error and continuing.
Related Issue(s)
#17412
Checklist
Deployment Notes
N/A