clusterd stops accepting connections when Timely workers are unresponsive #29145

Closed
teskje opened this issue Aug 21, 2024 · 1 comment · Fixed by #29221

Comments


teskje commented Aug 21, 2024

What version of Materialize are you using?

main

What is the issue?

When the Timely runtime is made unresponsive, e.g. due to a worker step_or_park call that doesn't yield, the clusterd process also stops accepting new controller connections. This can result in state buildup in the controller (from commands not being sent) and in alerts from our monitoring infrastructure (from the failed connection attempts), both of which are undesirable given that the problematic state can be triggered by any user installing a bad workload.

Ideally, the task that accepts controller connections in clusterd would be isolated enough from the Timely runtime to continue functioning even when Timely workers become unresponsive. It clearly isn't today, although we don't understand why.

Relevant Slack thread.
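
For illustration, here is a minimal, generic sketch of the kind of isolation described above. It is not the actual clusterd code and all names are hypothetical; it just shows the accept loop running on its own single-threaded tokio runtime on a dedicated OS thread (assuming a tokio dependency), so a worker that stops yielding elsewhere cannot wedge it:

// A generic sketch, not the clusterd implementation: the accept loop lives on
// its own single-threaded tokio runtime on a dedicated OS thread, so a worker
// that stops yielding elsewhere cannot block it. All names are hypothetical.

use std::thread;

fn main() {
    // Dedicated thread + runtime for accepting controller connections.
    thread::spawn(|| {
        let rt = tokio::runtime::Builder::new_current_thread()
            .enable_io()
            .build()
            .expect("build accept runtime");
        rt.block_on(async {
            let listener = tokio::net::TcpListener::bind("127.0.0.1:6878")
                .await
                .expect("bind");
            loop {
                let (_socket, addr) = listener.accept().await.expect("accept");
                println!("accepted controller connection from {addr}");
                // Hand the socket off to the protocol handler here.
            }
        });
    });

    // Stand-in for a Timely worker that stops yielding: even though this
    // thread spins forever, the accept loop above keeps serving connections.
    loop {
        std::hint::spin_loop();
    }
}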

Reproduction

This issue can be reproduced in staging (somehow not locally?) by deploying an Mz version with this diff (or a similar one) applied:

diff --git a/src/compute/src/server.rs b/src/compute/src/server.rs
index cc36e0316b..1ad88aed0f 100644
--- a/src/compute/src/server.rs
+++ b/src/compute/src/server.rs
@@ -449,6 +449,10 @@ impl<'w, A: Allocate + 'static> Worker<'w, A> {
             // Enable trace compaction.
             if let Some(compute_state) = &mut self.compute_state {
                 compute_state.traces.maintenance();
+
+                if compute_state.collections.keys().any(|id| id.is_user()) {
+                    loop {}
+                }
             }

             let start = Instant::now();

Then run CREATE VIEW v AS SELECT 1; CREATE DEFAULT INDEX ON v; on some test cluster and restart the environmentd pod (possibly multiple times).

Note how the controllers won't be able to connect to the test cluster anymore, logging connection errors such as:

error connecting to replica: status: Unavailable, message: \"http2 error\", details: [], metadata: MetadataMap { headers: {} }: transport error: http2 error: keep-alive timed out: operation timed out
error connecting to replica, retrying in 1s: transport error: tcp connect error: deadline has elapsed: tcp connect error: deadline has elapsed: deadline has elapsed
@teskje teskje added C-bug Category: something is broken A-CLUSTER Topics related to the CLUSTER layer A-controller Area: controllers labels Aug 21, 2024
@teskje teskje self-assigned this Aug 21, 2024

teskje commented Aug 23, 2024

There are two different issues at play here. The first was that I triggered the bad state by creating an MV, which left a persist_sink halfway initialized and in possession of a global persist lock that prevented the storage cluster from rendering new sources. This confused me for a while and prevented me from finding the actual bug, but I've now adjusted the repro above so it no longer uses an MV, which avoids this issue. Relevant Slack thread.

The actual issue is that when a client disconnects, it drops the stream created in forward_bidi_stream. This requires the stream to be cancel safe, which it is not! In particular, the stream calls ClusterClient::send, which in turn calls ClusterClient::build, which takes the TimelyContainer out of the cluster state, awaits the activator channel, and then puts the TimelyContainer back into the cluster state. If the future is canceled while awaiting the activator channel, as it is in this repro, the TimelyContainer is dropped. That also drops the WorkerGuard it holds, which causes the current task to block, in a non-async manner, waiting for the Timely worker to shut down, which it never does. So the whole tokio runtime becomes blocked, if it is single-threaded.
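
To make the hazard concrete, here is a minimal, self-contained sketch (assuming a tokio dependency; the types are hypothetical stand-ins for the cluster state and its guard, not the actual ClusterClient/TimelyContainer/WorkerGuard code): a value is taken out of shared state before an await point and only put back afterwards, its Drop blocks the thread, and canceling the future at the await point therefore wedges a single-threaded runtime:

use std::time::Duration;

// Stand-in for the worker guard: dropping it blocks, like joining a Timely
// worker thread that is stuck in `loop {}` and never shuts down.
struct Guard;

impl Drop for Guard {
    fn drop(&mut self) {
        std::thread::sleep(Duration::from_secs(3600));
    }
}

// Stand-in for the cluster state holding the Timely container.
struct State {
    guard: Option<Guard>,
}

impl State {
    // Not cancel safe: the guard is taken out of `self` before the await and
    // only put back afterwards. If the future is dropped at the await point,
    // the guard is dropped instead, and its blocking Drop runs right there.
    async fn rebuild(&mut self) {
        let guard = self.guard.take().expect("guard present");
        // Stand-in for awaiting the activator channel.
        tokio::time::sleep(Duration::from_secs(10)).await;
        self.guard = Some(guard);
    }
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let mut state = State { guard: Some(Guard) };

    // Simulate the client disconnect: the `rebuild` future is dropped while it
    // is suspended at its await point (forced here via a timeout).
    let _ = tokio::time::timeout(Duration::from_millis(10), state.rebuild()).await;

    // Dropping the timed-out future dropped the guard, whose blocking Drop has
    // stalled the single-threaded runtime; this line is reached only an hour
    // later. In the real bug the worker never shuts down, so the runtime never
    // recovers.
    println!("runtime unblocked");
}

This is only a sketch of the hazard; the actual fix landed in #29221 and may take a different approach.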

This is just the issue we stumble over with this repro; there might be other cancel-safety issues lurking in the gRPC server loop.
