clusterd stops accepting connections when Timely workers are unresponsive #29145

Closed
teskje opened this issue Aug 21, 2024 · 1 comment · Fixed by #29221

Comments


teskje commented Aug 21, 2024

What version of Materialize are you using?

main

What is the issue?

When the Timely runtime is made unresponsive, e.g. due to a worker step_or_park call that doesn't yield, the clusterd process also stops accepting new controller connections. This can result in state buildup in the controller (from commands not being sent) and in alerts from our monitoring infrastructure (from the failed connection attempts), both of which are undesirable given that the problematic state can be triggered by any user installing a bad workload.

Ideally, the task that accepts controller connections in clusterd would be isolated enough from the Timely runtime to continue functioning even when Timely workers become unresponsive. It clearly isn't today, although we don't understand why.

Relevant Slack thread.
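
For illustration, here is a minimal, generic sketch of the kind of isolation described above. It is not the actual clusterd code and all names are hypothetical; it just shows the accept loop running on its own single-threaded tokio runtime on a dedicated OS thread (assuming a tokio dependency), so a worker that stops yielding elsewhere cannot wedge it:

// A generic sketch, not the clusterd implementation: the accept loop lives on
// its own single-threaded tokio runtime on a dedicated OS thread, so a worker
// that stops yielding elsewhere cannot block it. All names are hypothetical.

use std::thread;

fn main() {
    // Dedicated thread + runtime for accepting controller connections.
    thread::spawn(|| {
        let rt = tokio::runtime::Builder::new_current_thread()
            .enable_io()
            .build()
            .expect("build accept runtime");
        rt.block_on(async {
            let listener = tokio::net::TcpListener::bind("127.0.0.1:6878")
                .await
                .expect("bind");
            loop {
                let (_socket, addr) = listener.accept().await.expect("accept");
                println!("accepted controller connection from {addr}");
                // Hand the socket off to the protocol handler here.
            }
        });
    });

    // Stand-in for a Timely worker that stops yielding: even though this
    // thread spins forever, the accept loop above keeps serving connections.
    loop {
        std::hint::spin_loop();
    }
}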

Reproduction

This issue can be reproduced in staging (somehow not locally?) by deploying an Mz version with this diff (or a similar one) applied:

diff --git a/src/compute/src/server.rs b/src/compute/src/server.rs
index cc36e0316b..1ad88aed0f 100644
--- a/src/compute/src/server.rs
+++ b/src/compute/src/server.rs
@@ -449,6 +449,10 @@ impl<'w, A: Allocate + 'static> Worker<'w, A> {
             // Enable trace compaction.
             if let Some(compute_state) = &mut self.compute_state {
                 compute_state.traces.maintenance();
+
+                if compute_state.collections.keys().any(|id| id.is_user()) {
+                    loop {}
+                }
             }

             let start = Instant::now();

Then run CREATE VIEW v AS SELECT 1; CREATE DEFAULT INDEX ON v; on some test cluster and restart the environmentd pod (possibly multiple times).

Note how the controllers won't be able to connect to the test cluster anymore, logging connection errors such as:

error connecting to replica: status: Unavailable, message: \"http2 error\", details: [], metadata: MetadataMap { headers: {} }: transport error: http2 error: keep-alive timed out: operation timed out
error connecting to replica, retrying in 1s: transport error: tcp connect error: deadline has elapsed: tcp connect error: deadline has elapsed: deadline has elapsed
@teskje teskje added C-bug Category: something is broken A-CLUSTER Topics related to the CLUSTER layer A-controller Area: controllers labels Aug 21, 2024
@teskje teskje self-assigned this Aug 21, 2024

teskje commented Aug 23, 2024

There are two different issues at play here. The first was that I triggered the bad state by creating an MV, which left a persist_sink halfway initialized and in possession of a global persist lock that prevented the storage cluster from rendering new sources. This confused me for a while and prevented me from finding the actual bug, but I've now adjusted the repro above so it no longer uses an MV, which avoids this issue. Relevant Slack thread.

The actual issue is that when a client disconnects, it drops the stream created in forward_bidi_stream. This requires the stream to be cancel safe, which it is not! In particular, the stream calls ClusterClient::send, which in turn calls ClusterClient::build, which takes the TimelyContainer out of the cluster state, awaits the activator channel, and then puts the TimelyContainer back into the cluster state. If the future is canceled while awaiting the activator channel, as it is in this repro, the TimelyContainer is dropped. That also drops the WorkerGuard it holds, which causes the current task to block, in a non-async manner, waiting for the Timely worker to shut down, which it never does. So the whole tokio runtime becomes blocked, if it is single-threaded.
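
To make the hazard concrete, here is a minimal, self-contained sketch (assuming a tokio dependency; the types are hypothetical stand-ins for the cluster state and its guard, not the actual ClusterClient/TimelyContainer/WorkerGuard code): a value is taken out of shared state before an await point and only put back afterwards, its Drop blocks the thread, and canceling the future at the await point therefore wedges a single-threaded runtime:

use std::time::Duration;

// Stand-in for the worker guard: dropping it blocks, like joining a Timely
// worker thread that is stuck in `loop {}` and never shuts down.
struct Guard;

impl Drop for Guard {
    fn drop(&mut self) {
        std::thread::sleep(Duration::from_secs(3600));
    }
}

// Stand-in for the cluster state holding the Timely container.
struct State {
    guard: Option<Guard>,
}

impl State {
    // Not cancel safe: the guard is taken out of `self` before the await and
    // only put back afterwards. If the future is dropped at the await point,
    // the guard is dropped instead, and its blocking Drop runs right there.
    async fn rebuild(&mut self) {
        let guard = self.guard.take().expect("guard present");
        // Stand-in for awaiting the activator channel.
        tokio::time::sleep(Duration::from_secs(10)).await;
        self.guard = Some(guard);
    }
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let mut state = State { guard: Some(Guard) };

    // Simulate the client disconnect: the `rebuild` future is dropped while it
    // is suspended at its await point (forced here via a timeout).
    let _ = tokio::time::timeout(Duration::from_millis(10), state.rebuild()).await;

    // Dropping the timed-out future dropped the guard, whose blocking Drop has
    // stalled the single-threaded runtime; this line is reached only an hour
    // later. In the real bug the worker never shuts down, so the runtime never
    // recovers.
    println!("runtime unblocked");
}

This is only a sketch of the hazard; the actual fix landed in #29221 and may take a different approach.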

This is just the issue we stumble over with this repro; there might be other cancel-safety issues lurking in the gRPC server loop.
