Latency slowly creeps up when running same few SELECTs against a cluster #29526

Open
def- opened this issue Sep 13, 2024 · 6 comments
Labels
C-bug Category: something is broken

Comments

@def-
Contributor

def- commented Sep 13, 2024

What version of Materialize are you using?

v0.116.0

What is the issue?

Seen in https://buildkite.com/materialize/qa-canary/builds/228#0191eb33-4b0c-406a-af5e-1e5bf13c4413 on my PR introducing that test: #29524
[chart: QA Canary query latency over time]
The test runs a few simple SELECT queries against a cluster. Only the SELECT 1 query is open loop at 100 queries per second (and is not affected), while the rest are closed loop (and strict serializable) and slowly get slower over time:

Statistics for SELECT 1 (pooled):
  queries: 60000
  qps:  100.00
  min:    2.02ms
  avg:   11.51ms
  p50:    4.21ms
  p95:   36.24ms
  p99:   79.82ms
  max:  496.96ms
  std:   19.93ms
  slope: -0.0129
Statistics for SELECT * FROM qa_canary_environment.public_table.table_mv (reuse connection):
  queries:  1713
  qps:    2.86
  min:   52.22ms
  avg:  350.28ms
  p50:  338.93ms
  p95:  482.84ms
  p99:  596.52ms
  max:  869.84ms
  std:   81.92ms
  slope: 0.0533
Statistics for SELECT COUNT(DISTINCT a_name) FROM qa_canary_environment.public_mysql_cdc.mysql_wmr WHERE degree > 1 (reuse connection):
  queries:  2013
  qps:    3.36
  min:   83.28ms
  avg:  298.07ms
  p50:  312.89ms
  p95:  360.71ms
  p99:  403.32ms
  max:  635.58ms
  std:   58.38ms
  slope: 0.0464
Statistics for SELECT min(c), max(c), count(*) FROM qa_canary_environment.public_table.table (reuse connection):
  queries:  1133
  qps:    1.89
  min:   95.89ms
  avg:  529.60ms
  p50:  538.24ms
  p95:  692.57ms
  p99:  787.96ms
  max:  923.38ms
  std:  124.02ms
  slope: 0.1532
Statistics for SELECT COUNT(DISTINCT a_name) FROM qa_canary_environment.public_pg_cdc.wmr WHERE degree > 1 (reuse connection):
  queries:  2020
  qps:    3.37
  min:   99.21ms
  avg:  297.04ms
  p50:  311.60ms
  p95:  357.89ms
  p99:  383.76ms
  max:  623.20ms
  std:   54.57ms
  slope: 0.0574
Statistics for SELECT COUNT(DISTINCT count_star) FROM qa_canary_environment.public_loadgen.sales_product_product_category WHERE count_distinct_product_id > 0 (reuse connection):
  queries:  2759
  qps:    4.60
  min:   12.42ms
  avg:  217.48ms
  p50:  224.08ms
  p95:  328.81ms
  p99:  352.85ms
  max:  497.71ms
  std:   91.04ms
  slope: 0.0305
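
For reference, a minimal sketch (assumptions only, not the actual qa-canary harness; the DSN, isolation setup, and psycopg usage are mine) of how a closed-loop driver and a latency "slope" metric like the one reported above can be computed:

```python
# Minimal sketch: run one closed-loop SELECT on a reused connection and
# compute a latency slope as a least-squares fit of latency vs. elapsed time.
# This is NOT the qa-canary harness; DSN and query are placeholders.
import time
import statistics
import psycopg  # assumption: psycopg 3 driver

DSN = "postgres://materialize@localhost:6875/materialize"  # hypothetical DSN
QUERY = "SELECT min(c), max(c), count(*) FROM qa_canary_environment.public_table.table"


def closed_loop(duration_s: float = 600.0) -> list[tuple[float, float]]:
    """Issue the query back-to-back on one connection; record (elapsed, latency)."""
    samples: list[tuple[float, float]] = []
    with psycopg.connect(DSN, autocommit=True) as conn:
        conn.execute("SET transaction_isolation = 'strict serializable'")
        start = time.monotonic()
        while (elapsed := time.monotonic() - start) < duration_s:
            t0 = time.monotonic()
            conn.execute(QUERY).fetchall()
            samples.append((elapsed, time.monotonic() - t0))
    return samples


def latency_slope(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of latency over elapsed time; a positive value means
    latency is creeping up during the run. Units here are ms of latency per
    second of wall clock, which may differ from the harness's convention."""
    xs = [t for t, _ in samples]
    ys = [lat * 1000.0 for _, lat in samples]
    return statistics.linear_regression(xs, ys).slope


if __name__ == "__main__":
    s = closed_loop()
    lats = sorted(lat * 1000.0 for _, lat in s)
    print(f"queries: {len(s)}  p95: {lats[int(0.95 * len(lats))]:.2f}ms  slope: {latency_slope(s):.4f}")
```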

The workload runs against the Materialize Production Sandbox (maybe I should move it to a dedicated staging environment to be better isolated from other noise?). Since the first attempt ran for only 10 minutes, I'm now retrying with 1 hour: https://buildkite.com/materialize/qa-canary/builds/229
The cluster itself (200cc, https://console.materialize.com/regions/aws-us-east-1/clusters/u3/qa_canary_environment_compute?timePeriod=180) always stayed at <=50% CPU usage. Since it's not overloaded, I expected the queries' performance to stay consistent over time.

def- added the C-bug (Category: something is broken) label on Sep 13, 2024
@def-
Contributor Author

def- commented Sep 13, 2024

One thing that came to mind: @antiguru mentioned that the query history currently doesn't cope with running many queries against Materialize (it eventually OOMs). Maybe that's related? I'm not even able to load the query history anymore.

@def-
Contributor Author

def- commented Sep 13, 2024

The mz_catalog_server usage keeps going up, so I guess this is related and fits with the query history being responsible:
[screenshots: mz_catalog_server resource usage climbing, 2024-09-13 15:48 and 15:49]
For OLTP-type workloads this is probably a blocker (and it also blocks me from enabling this test).
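
To confirm the creep independently of the console, here is a hedged sketch of polling the mz_catalog_server replica's utilization. It assumes the mz_internal.mz_cluster_replica_utilization relation and the mz_catalog cluster/replica relations exist in this shape on this version; check the docs before relying on it:

```python
# Hedged sketch: periodically print CPU/memory utilization of the
# mz_catalog_server cluster's replicas. Relation and column names are
# assumptions to verify against the Materialize docs for your version.
import time
import psycopg

DSN = "postgres://materialize@localhost:6875/materialize"  # hypothetical DSN

QUERY = """
SELECT r.name, u.process_id, u.cpu_percent, u.memory_percent
FROM mz_internal.mz_cluster_replica_utilization u
JOIN mz_catalog.mz_cluster_replicas r ON r.id = u.replica_id
JOIN mz_catalog.mz_clusters c ON c.id = r.cluster_id
WHERE c.name = 'mz_catalog_server'
"""

with psycopg.connect(DSN, autocommit=True) as conn:
    while True:
        for name, proc, cpu, mem in conn.execute(QUERY):
            print(f"{time.strftime('%H:%M:%S')} {name}/{proc}: cpu={cpu:.1f}% mem={mem:.1f}%")
        time.sleep(60)
```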

@antiguru
Member

Could we run this against a staging environment that has statement logging disabled? This way we could unblock the test, which is important to have on its own.

@def-
Contributor Author

def- commented Sep 13, 2024

Could we run this against a staging environment that has statement logging disabled? This way we could unblock the test, which is important to have on its own.

Yes, that will be my next step. I was hoping not to have to do this so early, both because of cost and the inconvenience of having to recreate sources etc.

@benesch
Member

benesch commented Sep 14, 2024

I was hoping not to have to do this so early, both because of cost and the inconvenience of having to recreate sources etc.

What environment is the test running against now? We could disable statement logging against that environment instead.
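
For reference, a hedged sketch of what disabling statement logging for an environment could look like, assuming the statement_logging_max_sample_rate system parameter is the relevant knob and that an admin connection (e.g. as mz_system on an internal endpoint) is available; both are assumptions to verify against the docs:

```python
# Hedged sketch: turn off statement-log sampling for the environment under test.
# Parameter name and admin endpoint are assumptions, not confirmed in this thread.
import psycopg

ADMIN_DSN = "postgres://mz_system@localhost:6877/materialize"  # hypothetical admin endpoint

with psycopg.connect(ADMIN_DSN, autocommit=True) as conn:
    # A max sample rate of 0 should mean no statements get logged at all.
    conn.execute("ALTER SYSTEM SET statement_logging_max_sample_rate = 0")
    print(conn.execute("SHOW statement_logging_max_sample_rate").fetchone())
```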

@benesch
Member

benesch commented Sep 14, 2024

Oh, just saw the conversation on Slack. Never mind.
