Fix iceberg data migration test timeout #25069

bashtanov · 2025-02-09T23:45:17Z

The test is broken: it swallows errors, timeout errors in particular.
Make it propagate errors. To avoid timeouts:

- make sure verifier offline mode waits only for consumption, not for anything in the querying thread
- reduce the production rate
- allow more time to complete
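
A minimal sketch of the difference between swallowing and propagating a timeout. The helper names `wait_until`, `wait_swallowing` and `wait_propagating` are hypothetical stand-ins for illustration, not the verifier's actual code or ducktape's own helper:

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.01):
    """Poll `condition` until it returns True, else raise TimeoutError."""
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(backoff_sec)
    raise TimeoutError(f"condition not met within {timeout_sec}s")


def wait_swallowing(condition, timeout_sec):
    # Anti-pattern: the timeout is caught and dropped, so the caller
    # cannot tell success from failure and the test keeps going.
    try:
        wait_until(condition, timeout_sec)
    except Exception:
        pass


def wait_propagating(condition, timeout_sec):
    # No try/except: a timeout surfaces to the test and fails it.
    wait_until(condition, timeout_sec)
```

With the swallowing variant a test can run on long after the wait has already failed, which makes the real cause hard to see in the logs.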

Backports Required

Release Notes

none

vbotbuildovich · 2025-02-10T03:37:08Z

CI test results

test results on build#61767

test_id	test_kind	job_url	test_status	passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery	ducktape	https://buildkite.com/redpanda/redpanda/builds/61767#0194ed78-5b46-4f61-a4d1-c0b9f80c8fe8	FLAKY	1/3
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade	ducktape	https://buildkite.com/redpanda/redpanda/builds/61767#0194ed78-5b46-4a7a-b743-4765ba61ffe9	FLAKY	1/2
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_HADOOP	ducktape	https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-07a0-4a9d-b85f-052e159a33fa	FLAKY	1/2
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_JDBC	ducktape	https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-079e-47f9-8066-e78270290712	FLAKY	1/2
rptest.tests.datalake.mount_unmount_test.MountUnmountIcebergTest.test_simple_unmount.cloud_storage_type=CloudStorageType.S3	ducktape	https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-079f-4549-a7c5-cf74a8afb220	FLAKY	1/2
rptest.tests.e2e_shadow_indexing_test.ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy.short_retention=True.cloud_storage_type=CloudStorageType.ABS	ducktape	https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-079f-43ff-85ad-4a407350be3a	FLAKY	1/2
rptest.tests.log_compaction_test.LogCompactionTest.compaction_stress_test.cleanup_policy=compact.delete.key_set_cardinality=1000.storage_compaction_key_map_memory_kb=10	ducktape	https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-079f-43ff-85ad-4a407350be3a	FLAKY	1/2
rptest.tests.log_compaction_test.LogCompactionTest.compaction_stress_test.cleanup_policy=compact.delete.key_set_cardinality=1000.storage_compaction_key_map_memory_kb=3	ducktape	https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-079e-47f9-8066-e78270290712	FLAKY	1/2
rptest.tests.log_compaction_test.LogCompactionTest.compaction_stress_test.cleanup_policy=compact.key_set_cardinality=1000.storage_compaction_key_map_memory_kb=3	ducktape	https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-079f-43ff-85ad-4a407350be3a	FLAKY	1/2
rptest.tests.scaling_up_test.ScalingUpTest.test_scaling_up_with_recovered_topic	ducktape	https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-079f-43ff-85ad-4a407350be3a	FLAKY	1/3

test results on build#61891

test_id	test_kind	job_url	test_status	passed
rptest.tests.availability_test.AvailabilityTests.test_recovery_after_catastrophic_failure	ducktape	https://buildkite.com/redpanda/redpanda/builds/61891#01950c9d-c81b-4cd0-b9c8-8b55cd0d47f5	FLAKY	1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery	ducktape	https://buildkite.com/redpanda/redpanda/builds/61891#01950c9d-c81d-4026-9b9d-23174d15a298	FLAKY	1/2
rptest.tests.delete_records_test.DeleteRecordsTest.test_delete_records_concurrent_truncations.cloud_storage_enabled=True.truncate_point=start_offset	ducktape	https://buildkite.com/redpanda/redpanda/builds/61891#01950c9d-c81c-4acb-a996-8092df40b022	FLAKY	1/2
rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=3.unclean_abort=True.recovery=no_recovery.compacted=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61891#01950c9d-c81b-4cd0-b9c8-8b55cd0d47f5	FLAKY	1/2
storage_single_thread_rpunit.storage_single_thread_rpunit	unit	https://buildkite.com/redpanda/redpanda/builds/61891#01950c5a-10e6-4d7d-8288-a8b7815d2517	FLAKY	1/2

test results on build#61893

test_id	test_kind	job_url	test_status	passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery	ducktape	https://buildkite.com/redpanda/redpanda/builds/61893#01950e4e-95ab-4976-b770-c4685c51f5e9	FLAKY	1/2
rptest.tests.log_compaction_test.LogCompactionTest.compaction_stress_test.cleanup_policy=compact.delete.key_set_cardinality=1000.storage_compaction_key_map_memory_kb=3	ducktape	https://buildkite.com/redpanda/redpanda/builds/61893#01950e68-4471-4097-9721-039f49ff8226	FLAKY	1/5
storage_single_thread_rpunit.storage_single_thread_rpunit	unit	https://buildkite.com/redpanda/redpanda/builds/61893#01950e09-c5a2-43c5-9445-1a61bce8346c	FLAKY	1/2

bashtanov · 2025-02-11T16:28:58Z

Meh. It worked locally. Will debug.

Production rate reduced, as otherwise RPCN produces too much data while unmounting, so it takes an unreasonable time to complete.
Sleep removed, as the verifier is more robust.
Time limit increased, as sometimes unmount takes more time and it takes longer to get offline:
- offline mode waits for consumption up to the migration's blocking offset
- consume thread waits for query thread (limited comparison buffer)
- query thread may lag because translation lags
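
The dependency chain described above (consumer waits on a bounded buffer, which drains only as fast as the query side) can be sketched with a standard-library queue. This is an illustrative model only: the function `run` and its parameters are hypothetical, and the sleep merely stands in for translation lag.

```python
import queue
import threading
import time


def run(buffer_size, n_messages, query_delay_sec):
    """Return how many times the consumer found the buffer full."""
    buf = queue.Queue(maxsize=buffer_size)
    stalls = 0

    def query_thread():
        for _ in range(n_messages):
            buf.get()
            time.sleep(query_delay_sec)  # stands in for translation lag

    t = threading.Thread(target=query_thread)
    t.start()
    for i in range(n_messages):
        try:
            buf.put_nowait(i)
        except queue.Full:
            stalls += 1
            buf.put(i)  # consumer blocks until the query side catches up
    t.join()
    return stalls
```

When the buffer holds at least as many messages as are produced, the consumer never stalls; with a small buffer and a slow query side, the consumer's progress is bounded by the query thread.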

bashtanov · 2025-02-16T01:21:13Z

/dt

andrwng · 2025-02-19T22:20:04Z

tests/rptest/tests/datalake/datalake_verifier.py

@@ -47,7 +47,8 @@ def __init__(self,
                 topic: str,
                 query_engine: QueryEngineBase,
                 compacted: bool = False,
-                 table_override: Optional[str] = None):
+                 table_override: Optional[str] = None,
+                 buffer=5000):


nit: maybe call this max_buffered_msgs or somesuch?

andrwng · 2025-02-19T22:23:28Z

tests/rptest/tests/datalake/mount_unmount_test.py


        connect.start_stream(name="ducky_stream",
                             config=self.avro_stream_config(
-                                 self.TOPIC_NAME, "verifier_schema", 1000000))
+                                 self.TOPIC_NAME, "verifier_schema", 1000000,
+                                 1))


nit: maybe pull 1 out into some low_interval_ms and explain in a comment why it's necessary?

andrwng · 2025-02-19T22:25:01Z

tests/rptest/tests/datalake/mount_unmount_test.py

+        verifier = DatalakeVerifier(self.redpanda,
+                                    self.TOPIC_NAME,
+                                    self.dl.spark(),
+                                    buffer=50000)


nit: maybe pull 50000 out into some high_buffered_msgs and explain in a comment why it's necessary?

andrwng · 2025-02-19T22:25:27Z

tests/rptest/tests/datalake/mount_unmount_test.py

@@ -127,8 +130,7 @@ def test_simple_unmount(self, cloud_storage_type):
        # the topic goes read-only during this wait
        self.wait_for_migration_states(out_migration_id, ['executed'])
        connect.stop_stream("ducky_stream", should_finish=False)
-        time.sleep(1)  # just it case: let verifier consume remaining messages
-        verifier.go_offline()
+        verifier.go_offline(600)


nit: comment explaining why this is necessary?

andrwng · 2025-02-19T22:27:20Z

tests/rptest/tests/datalake/mount_unmount_test.py

@@ -73,14 +73,14 @@ def __init__(self, test_context):
            redpanda=self.redpanda,
            include_query_engines=[QueryEngineType.SPARK])

-    def avro_stream_config(self, topic, subject, cnt=3000):
+    def avro_stream_config(self, topic, subject, cnt=3000, interval_ms=None):


Production rate reduced as otherwise RPCN produces too much data while unmounting

Hmm I thought we blocked writes while we unmounted. Are we sure we eventually actually quiesce? The tests I saw were hanging around for 12 minutes without completing -- I don't imagine unmount would take that long, would it?

Yes we do, but it takes some time since we block it.
The 12 minutes you saw was a combination of many problems, swallowed errors in particular -- see 1st commit.

bashtanov requested review from bharathv, andrwng and mmaslankaprv February 9, 2025 23:45

bashtanov marked this pull request as draft February 10, 2025 08:52

bashtanov added 5 commits February 16, 2025 01:19

tests/datalake/verifier: do not swallow exceptions when waiting

a0400b6

tests/datalake/verifier: offline mode improvements

f3c8069

- wait for consuming all messages regardless of translation state
- avoid race conditions when stopping consumer

tests/util/rpcn/counter_stream_config: make rate configurable

4e46220

sometimes maximum throughput is not desired

tests/datalake/verifier: make buffer size configurable

cb742ba

1000 messages per second is not a crazy production rate, but a buffer of 5000 messages will only keep 5 seconds worth, which is less than unmount or translation delays
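
The arithmetic behind this message, as a small sketch. The function name `buffer_headroom_sec` is hypothetical; the 5000 and 1000 figures come from the commit message, and 50000 is the buffer size the test passes to the verifier in this PR.

```python
def buffer_headroom_sec(buffer_msgs: int, rate_msgs_per_sec: float) -> float:
    """Seconds of production a buffer can absorb while the query side is stalled."""
    return buffer_msgs / rate_msgs_per_sec


print(buffer_headroom_sec(5000, 1000))   # 5.0
print(buffer_headroom_sec(50000, 1000))  # 50.0
```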

bashtanov force-pushed the fix-iceberg-data-migration-test-timeout branch from 162ed94 to 3b4aa86 Compare February 16, 2025 01:20

bashtanov marked this pull request as ready for review February 16, 2025 09:11

andrwng reviewed Feb 19, 2025


bashtanov replied Feb 24, 2025
