clusterd stops accepting connections when Timely workers are unresponsive #29145
Labels
A-CLUSTER
Topics related to the CLUSTER layer
A-controller
Area: controllers
C-bug
Category: something is broken
What version of Materialize are you using?
main
What is the issue?
When the Timely runtime is made unresponsive, e.g. due to a worker `step_or_park` that doesn't yield, the clusterd process also stops accepting new controller connections. This can result in state buildup in the controller (from commands not getting sent) and alerts by our monitoring infrastructure (from the failure to connect), both of which are undesirable given that the problematic state can be produced by any user installing a bad workload.
Ideally, the task that accepts controller connections in clusterd would be isolated enough from the Timely runtime to continue functioning even when Timely workers become unresponsive. It clearly isn't today, although we don't understand why.
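To make the desired isolation concrete, here is a minimal sketch using only the Rust standard library. It is not how clusterd is structured, and it does not reproduce the bug (whose cause is unknown). It only illustrates the property we want: an accept loop on its own thread keeps accepting while a worker never yields. The names `spin_worker` and `accepts_while_worker_spins` are hypothetical, and the busy loop is just a stand-in for a Timely worker stuck in `step_or_park`.

```rust
use std::io;
use std::net::{TcpListener, TcpStream};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a Timely worker whose step never yields.
/// It spins until `stop` is set, so the sketch can clean up afterwards.
fn spin_worker(stop: Arc<AtomicBool>) {
    while !stop.load(Ordering::Relaxed) {
        std::hint::spin_loop();
    }
}

/// Returns Ok(true) if a connection is accepted while the worker is spinning.
pub fn accepts_while_worker_spins() -> io::Result<bool> {
    // Start the non-yielding "worker" on its own OS thread.
    let stop = Arc::new(AtomicBool::new(false));
    let worker = {
        let stop = Arc::clone(&stop);
        thread::spawn(move || spin_worker(stop))
    };

    // Accept on a separate thread that shares nothing with the worker.
    // Port 0 asks the OS for any free port on the loopback interface.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    let acceptor = thread::spawn(move || listener.accept().is_ok());

    // Play the role of the controller and try to connect.
    let client = TcpStream::connect_timeout(&addr, Duration::from_secs(5));

    // Only wait for the acceptor if the connection attempt succeeded,
    // otherwise `accept` would block forever.
    let accepted = client.is_ok() && acceptor.join().unwrap_or(false);

    // Let the worker exit so the thread does not outlive the call.
    stop.store(true, Ordering::Relaxed);
    let _ = worker.join();

    Ok(accepted)
}
```

Calling `accepts_while_worker_spins()` should return `Ok(true)` on a machine where loopback networking is available. In clusterd the accept task and the Timely workers evidently share some resource that breaks this independence. Identifying that resource is the open question of this issue.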
Relevant Slack thread.
Reproduction
This issue can be reproduced in staging (somehow not locally?) by deploying an Mz version with this diff (or a similar one) applied:
Then running
CREATE VIEW v AS SELECT 1; CREATE DEFAULT INDEX ON v;
on some test cluster, then restarting the `environmentd` pod (possibly multiple times).
Note how the controllers won't be able to connect to the test cluster anymore, logging connection errors such as: