Timeout in cargo test: mz-environmentd::sql test_utilization_hold #29299

Open
def- opened this issue Aug 30, 2024 · 3 comments
Labels: C-bug (Category: something is broken), ci-flake

Comments

def- commented Aug 30, 2024

What version of Materialize are you using?

4de94aa

What is the issue?

Seen in Cargo test:

2024-08-30T13:41:54.351864Z  INFO sql: since has not yet advanced to expected time, retrying now_millis=619388520000 since=616796520001 query="SELECT * FROM mz_internal.mz_cluster_replica_statuses"
2024-08-30T13:41:58.972781Z  INFO mz_environmentd::test_util: connection closed
2024-08-30T13:41:58.977524Z  INFO mz_environmentd::test_util: connection error: error communicating with the server: Connection reset by peer (os error 104)
2024-08-30T13:41:58.977562Z  INFO mz_environmentd::test_util: connection closed
test test_utilization_hold has been running for over 60 seconds
2024-08-30T13:42:29.525230Z  WARN mz_adapter::coord: coordinator stuck for 30s last_message_kind=group_commit_initiate last_message_sql=<none>

This looks interesting to me.
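
As a rough check on the retry log line above, the gap between now_millis and since can be worked out directly. The sketch below only does arithmetic on the two values copied from that line; the helper name parse_millis is made up for illustration. The gap comes out 1 ms short of 30 days, which would be consistent with a 30-day window, but that reading is an inference from the numbers and not something stated in this issue.

```python
import re

# Log message copied from the first line of the CI output in this issue.
LOG_LINE = (
    "since has not yet advanced to expected time, retrying "
    "now_millis=619388520000 since=616796520001 "
    'query="SELECT * FROM mz_internal.mz_cluster_replica_statuses"'
)

MS_PER_DAY = 24 * 60 * 60 * 1000


def parse_millis(line: str, key: str) -> int:
    """Hypothetical helper: extract an integer `key=<digits>` field from a log line."""
    match = re.search(rf"\b{re.escape(key)}=(\d+)", line)
    if match is None:
        raise ValueError(f"field {key!r} not found in log line")
    return int(match.group(1))


now_millis = parse_millis(LOG_LINE, "now_millis")
since = parse_millis(LOG_LINE, "since")

# How far the reported since lags behind the reported now_millis.
gap_ms = now_millis - since
gap_days = gap_ms / MS_PER_DAY

print(f"gap: {gap_ms} ms (about {gap_days:.6f} days)")
print(f"short of 30 days by: {30 * MS_PER_DAY - gap_ms} ms")
```
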
ci-regexp: TIMEOUT .* mz-environmentd::sql test_utilization_hold
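
To illustrate what the ci-regexp above would and would not match, here is a minimal sketch. The pattern is copied from this issue; the sample lines are hypothetical and invented for illustration, and the use of an unanchored search (rather than a full-line match) is an assumption about how the CI tooling applies the pattern.

```python
import re

# Pattern copied verbatim from the ci-regexp line in this issue.
CI_REGEXP = re.compile(r"TIMEOUT .* mz-environmentd::sql test_utilization_hold")

# Hypothetical CI output lines, not taken from a real build log.
sample_lines = [
    "TIMEOUT [ 60.012s] mz-environmentd::sql test_utilization_hold",
    "PASS [ 1.204s] mz-environmentd::sql test_utilization_hold",
    "TIMEOUT [ 60.020s] mz-environmentd::server test_other",
]

# Assumption: the pattern is applied as an unanchored search on each line.
matches = [line for line in sample_lines if CI_REGEXP.search(line)]

for line in matches:
    print("matched:", line)
```

Only the first sample line matches: the second lacks the TIMEOUT status and the third names a different test.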

def- added the C-bug (Category: something is broken) and ci-flake labels on Aug 30, 2024.

chaas commented Sep 5, 2024

@def- was this a recurring flake? It seems like something caused the coordinator to get stuck, and I'm wondering whether it was a transient issue or a result of recent changes to the coordinator.

def- commented Sep 6, 2024

According to ci-failures, it first occurred on August 21 (in #27720) and then 10 times on other branches after that. Maybe #27720 is already the responsible change?

chaas commented Sep 6, 2024

@ParkMyCar Is there anything in that SHOW commands PR that could cause coord stalls? It's especially strange that we're only seeing it in this retain history test case. @def- have you seen any other test failures due to timeouts or the coordinator stuck error?
The catalog object being tested (mz_cluster_replica_statuses) hasn't been modified recently, so I'm thinking it may be an issue with the coord taking a long time when querying retained history objects. I'm not sure what would have caused that, though; I'll have to ask around.
