Fix iceberg data migration test timeout #25069
base: dev
Conversation
CI test results:
- test results on build#61767
- test results on build#61891
- test results on build#61893
Meh. It worked locally. Will debug.
- wait for consuming all messages regardless of translation state
- avoid race conditions when stopping the consumer
Sometimes maximum throughput is not desired.
1000 messages per second is not a crazy production rate, but a buffer of 5000 messages will only hold 5 seconds' worth, which is less than the unmount or translation delays.
Production rate reduced, as otherwise RPCN produces too much data while unmounting and it takes an unreasonable amount of time to complete. Sleep removed, as the verifier is more robust. Time limit increased, as sometimes unmount takes more time and it takes longer to get offline:
- offline mode waits for consuming up to the migration's blocking offset
- the consume thread waits for the query thread (limited comparison buffer, sketched below)
- the query thread may lag because translation lags
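A minimal sketch of the bounded comparison buffer described above, assuming a consume thread that feeds messages and a query thread that drains them for comparison; the class and method names are hypothetical and not the actual DatalakeVerifier internals:

# Hypothetical sketch, not the PR's code: the consume thread blocks once the
# buffer holds max_buffered_msgs unverified messages, so it effectively waits
# for the query thread, which in turn may lag behind Iceberg translation.
import threading
from collections import deque

class BoundedComparisonBuffer:
    def __init__(self, max_buffered_msgs=5000):
        self._cv = threading.Condition()
        self._msgs = deque()
        self._max = max_buffered_msgs

    def put(self, msg):
        # called by the consume thread: apply back-pressure when full
        with self._cv:
            while len(self._msgs) >= self._max:
                self._cv.wait()
            self._msgs.append(msg)
            self._cv.notify_all()

    def take_batch(self):
        # called by the query thread: drain what is buffered and unblock the producer
        with self._cv:
            while not self._msgs:
                self._cv.wait()
            batch = list(self._msgs)
            self._msgs.clear()
            self._cv.notify_all()
            return batch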
Force-pushed from 162ed94 to 3b4aa86
/dt
@@ -47,7 +47,8 @@ def __init__(self,
                 topic: str,
                 query_engine: QueryEngineBase,
                 compacted: bool = False,
-                table_override: Optional[str] = None):
+                table_override: Optional[str] = None,
+                buffer=5000):
nit: maybe call this max_buffered_msgs or somesuch?
         connect.start_stream(name="ducky_stream",
                              config=self.avro_stream_config(
-                                 self.TOPIC_NAME, "verifier_schema", 1000000))
+                                 self.TOPIC_NAME, "verifier_schema", 1000000,
+                                 1))
nit: maybe pull 1 out into some low_interval_ms and explain in a comment why it's necessary?
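A hypothetical illustration of that suggestion, using the avro_stream_config signature from this PR; the constant name and comment wording are made up, not code from the change:

# Hypothetical refactor of the call above: name the throttle value and say why it exists.
LOW_INTERVAL_MS = 1  # throttle the producer; at full speed it generates more data
                     # than unmount/translation can keep up with in this test

connect.start_stream(name="ducky_stream",
                     config=self.avro_stream_config(self.TOPIC_NAME,
                                                    "verifier_schema",
                                                    1000000,
                                                    LOW_INTERVAL_MS))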
        verifier = DatalakeVerifier(self.redpanda,
                                    self.TOPIC_NAME,
                                    self.dl.spark(),
                                    buffer=50000)
nit: maybe pull 50000 out into some high_buffered_msgs and explain in a comment why it's necessary?
@@ -127,8 +130,7 @@ def test_simple_unmount(self, cloud_storage_type):
         # the topic goes read-only during this wait
         self.wait_for_migration_states(out_migration_id, ['executed'])
         connect.stop_stream("ducky_stream", should_finish=False)
-        time.sleep(1)  # just it case: let verifier consume remaining messages
-        verifier.go_offline()
+        verifier.go_offline(600)
nit: comment explaining why this is necessary?
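A hypothetical example of such a comment, based on the reasoning given elsewhere in this PR; the wording, and the assumption that the argument is a timeout in seconds, are guesses and not part of the actual change:

# Hypothetical comment for the change above; the argument is assumed to be a
# time limit in seconds. Going offline can take a while: the verifier must
# first drain messages up to the migration's blocking offset, and translation
# itself may be lagging, so a generous budget is used.
verifier.go_offline(600)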
@@ -73,14 +73,14 @@ def __init__(self, test_context):
             redpanda=self.redpanda,
             include_query_engines=[QueryEngineType.SPARK])

-    def avro_stream_config(self, topic, subject, cnt=3000):
+    def avro_stream_config(self, topic, subject, cnt=3000, interval_ms=None):
Production rate reduced as otherwise RPCN produces too much data while unmounting
Hmm I thought we blocked writes while we unmounted. Are we sure we eventually actually quiesce? The tests I saw were hanging around for 12 minutes without completing -- I don't imagine unmount would take that long, would it?
Yes, we do block writes, but it still takes some time after we block them.
The 12 minutes you saw were a combination of many problems, swallowed errors in particular -- see the 1st commit.
The test is broken: it swallows errors, timeout ones in particular.
Make it propagate errors. To avoid timeouts: reduce the production rate, remove the sleep, and increase the time limit, as described in the commit messages above. A sketch of the error-propagation idea follows.
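A minimal sketch of one common way to stop a test from swallowing errors raised in a background worker thread; this is a generic pattern under the assumption that the lost errors originate in such a thread, not the actual fix in this PR:

# Generic sketch, not the PR's code: re-raise exceptions from a background
# worker thread when the test joins it, instead of silently losing them.
import threading

class PropagatingThread(threading.Thread):
    def run(self):
        self._exc = None
        try:
            super().run()
        except BaseException as e:
            self._exc = e

    def join(self, timeout=None):
        super().join(timeout)
        if self._exc is not None:
            raise self._exc  # surface e.g. a TimeoutError to the test runner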
Backports Required
Release Notes