Fix iceberg data migration test timeout #25069

Open
wants to merge 5 commits into
base: dev

bashtanov
Contributor

The test is broken: it swallows errors, timeout errors in particular.
Make it propagate errors. To avoid timeouts:

  • make sure the verifier's offline mode waits only for consumption, not for anything in the querying thread
  • reduce production rate
  • allow more time to complete
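
As an illustration of what "propagate errors" can mean for a test that runs work in background threads, here is a minimal sketch. The `Worker` class and its method names are hypothetical and are not taken from `DatalakeVerifier`; the idea is only that an exception raised inside a thread is recorded and re-raised on join, and that a join that runs out of time is reported as a timeout instead of being ignored.

```python
import threading


class Worker:
    """Hypothetical stand-in for a verifier thread (not the real DatalakeVerifier)."""

    def __init__(self, target):
        self._target = target
        self._error = None
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        try:
            self._target()
        except Exception as e:
            # Record the failure so the test thread can see it later.
            self._error = e

    def start(self):
        self._thread.start()

    def join(self, timeout_sec):
        self._thread.join(timeout_sec)
        if self._thread.is_alive():
            # The thread is still running: surface this as a timeout.
            raise TimeoutError(f"worker did not finish within {timeout_sec}s")
        if self._error is not None:
            # Re-raise in the caller so the test fails with the original error.
            raise self._error
```

With this shape, a failure inside the thread fails the test at `join()` with the original exception rather than leaving the test to hang until an outer time limit.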

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

  • none

@vbotbuildovich
Collaborator

vbotbuildovich commented Feb 10, 2025

CI test results

test results on build#61767
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/61767#0194ed78-5b46-4f61-a4d1-c0b9f80c8fe8 FLAKY 1/3
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/61767#0194ed78-5b46-4a7a-b743-4765ba61ffe9 FLAKY 1/2
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_HADOOP ducktape https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-07a0-4a9d-b85f-052e159a33fa FLAKY 1/2
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_JDBC ducktape https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-079e-47f9-8066-e78270290712 FLAKY 1/2
rptest.tests.datalake.mount_unmount_test.MountUnmountIcebergTest.test_simple_unmount.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-079f-4549-a7c5-cf74a8afb220 FLAKY 1/2
rptest.tests.e2e_shadow_indexing_test.ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy.short_retention=True.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-079f-43ff-85ad-4a407350be3a FLAKY 1/2
rptest.tests.log_compaction_test.LogCompactionTest.compaction_stress_test.cleanup_policy=compact.delete.key_set_cardinality=1000.storage_compaction_key_map_memory_kb=10 ducktape https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-079f-43ff-85ad-4a407350be3a FLAKY 1/2
rptest.tests.log_compaction_test.LogCompactionTest.compaction_stress_test.cleanup_policy=compact.delete.key_set_cardinality=1000.storage_compaction_key_map_memory_kb=3 ducktape https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-079e-47f9-8066-e78270290712 FLAKY 1/2
rptest.tests.log_compaction_test.LogCompactionTest.compaction_stress_test.cleanup_policy=compact.key_set_cardinality=1000.storage_compaction_key_map_memory_kb=3 ducktape https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-079f-43ff-85ad-4a407350be3a FLAKY 1/2
rptest.tests.scaling_up_test.ScalingUpTest.test_scaling_up_with_recovered_topic ducktape https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-079f-43ff-85ad-4a407350be3a FLAKY 1/3
test results on build#61891
test_id test_kind job_url test_status passed
rptest.tests.availability_test.AvailabilityTests.test_recovery_after_catastrophic_failure ducktape https://buildkite.com/redpanda/redpanda/builds/61891#01950c9d-c81b-4cd0-b9c8-8b55cd0d47f5 FLAKY 1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/61891#01950c9d-c81d-4026-9b9d-23174d15a298 FLAKY 1/2
rptest.tests.delete_records_test.DeleteRecordsTest.test_delete_records_concurrent_truncations.cloud_storage_enabled=True.truncate_point=start_offset ducktape https://buildkite.com/redpanda/redpanda/builds/61891#01950c9d-c81c-4acb-a996-8092df40b022 FLAKY 1/2
rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=3.unclean_abort=True.recovery=no_recovery.compacted=False ducktape https://buildkite.com/redpanda/redpanda/builds/61891#01950c9d-c81b-4cd0-b9c8-8b55cd0d47f5 FLAKY 1/2
storage_single_thread_rpunit.storage_single_thread_rpunit unit https://buildkite.com/redpanda/redpanda/builds/61891#01950c5a-10e6-4d7d-8288-a8b7815d2517 FLAKY 1/2
test results on build#61893
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/61893#01950e4e-95ab-4976-b770-c4685c51f5e9 FLAKY 1/2
rptest.tests.log_compaction_test.LogCompactionTest.compaction_stress_test.cleanup_policy=compact.delete.key_set_cardinality=1000.storage_compaction_key_map_memory_kb=3 ducktape https://buildkite.com/redpanda/redpanda/builds/61893#01950e68-4471-4097-9721-039f49ff8226 FLAKY 1/5
storage_single_thread_rpunit.storage_single_thread_rpunit unit https://buildkite.com/redpanda/redpanda/builds/61893#01950e09-c5a2-43c5-9445-1a61bce8346c FLAKY 1/2

@bashtanov bashtanov marked this pull request as draft February 10, 2025 08:52
@bashtanov
Contributor Author

Meh. It worked locally. Will debug.

- wait for all messages to be consumed, regardless of translation state
- avoid race conditions when stopping the consumer
sometimes maximum throughput is not desired
1000 messages per second is not a crazy production rate, but a buffer
of 5000 messages will only keep 5 seconds worth, which is less than
unmount or translation delays
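
The arithmetic behind this, as a small sketch. The helper name `buffered_seconds` is made up for illustration; the numbers are the ones from this PR (the default buffer of 5000, the buffer of 50000 used in the unmount test, and 1000 messages per second).

```python
def buffered_seconds(buffer_msgs: int, rate_msgs_per_sec: float) -> float:
    """How many seconds of production fit in the buffer at a given rate."""
    return buffer_msgs / rate_msgs_per_sec


print(buffered_seconds(5000, 1000))   # 5.0
print(buffered_seconds(50000, 1000))  # 50.0
```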
Production rate reduced as otherwise RPCN produces too much data while
unmounting, so it takes an unreasonable amount of time to complete.

Sleep removed as verifier is more robust.

Time limit increased as sometimes unmount takes more time and it takes
longer to get offline:
- offline mode waits for consumption up to the migration's blocking offset
- consume thread waits for query thread (limited comparison buffer)
- query thread may lag because translation lags
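
The chain of waits listed above can be modelled roughly as follows. This is a simplified sketch, not the verifier's actual code: `BoundedHandoff`, `consume`, `query_one` and `wait_consumed` are hypothetical names. It shows why a small comparison buffer makes the consume side depend on the query side, and why the offline wait needs an explicit time limit.

```python
import threading
import time


class BoundedHandoff:
    """Hypothetical model of a consume thread feeding a query thread."""

    def __init__(self, max_buffered_msgs: int):
        self._slots = threading.Semaphore(max_buffered_msgs)
        self._lock = threading.Lock()
        self._pending = []
        self.consumed = 0

    def consume(self, msg, timeout_sec: float) -> None:
        # Blocks while the buffer is full, i.e. while the query side lags.
        if not self._slots.acquire(timeout=timeout_sec):
            raise TimeoutError("comparison buffer full: query side is lagging")
        with self._lock:
            self._pending.append(msg)
            self.consumed += 1

    def query_one(self):
        # Called by the query side once a message has been checked.
        with self._lock:
            msg = self._pending.pop(0)
        self._slots.release()
        return msg


def wait_consumed(handoff: BoundedHandoff, target: int, timeout_sec: float) -> None:
    """Wait until `target` messages are consumed, or raise on timeout."""
    deadline = time.monotonic() + timeout_sec
    while handoff.consumed < target:
        if time.monotonic() > deadline:
            raise TimeoutError(f"consumed {handoff.consumed} of {target} messages")
        time.sleep(0.01)
```

In this model a larger buffer lets consumption run further ahead of a lagging query side, and a longer timeout covers the time the query side needs to catch up.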
@bashtanov bashtanov force-pushed the fix-iceberg-data-migration-test-timeout branch from 162ed94 to 3b4aa86 Compare February 16, 2025 01:20
@bashtanov
Contributor Author

/dt

@bashtanov bashtanov marked this pull request as ready for review February 16, 2025 09:11
@@ -47,7 +47,8 @@ def __init__(self,
 topic: str,
 query_engine: QueryEngineBase,
 compacted: bool = False,
-table_override: Optional[str] = None):
+table_override: Optional[str] = None,
+buffer=5000):
Contributor

nit: maybe call this max_buffered_msgs or somesuch?


 connect.start_stream(name="ducky_stream",
 config=self.avro_stream_config(
-self.TOPIC_NAME, "verifier_schema", 1000000))
+self.TOPIC_NAME, "verifier_schema", 1000000,
+1))
Contributor

nit: maybe pull 1 out into some low_interval_ms and explain in a comment why it's necessary?

verifier = DatalakeVerifier(self.redpanda,
self.TOPIC_NAME,
self.dl.spark(),
buffer=50000)
Contributor

nit: maybe pull 50000 out into some high_buffered_msgs and explain in a comment why it's necessary?

@@ -127,8 +130,7 @@ def test_simple_unmount(self, cloud_storage_type):
 # the topic goes read-only during this wait
 self.wait_for_migration_states(out_migration_id, ['executed'])
 connect.stop_stream("ducky_stream", should_finish=False)
-time.sleep(1) # just it case: let verifier consume remaining messages
-verifier.go_offline()
+verifier.go_offline(600)
Contributor

nit: comment explaining why this is necessary?

@@ -73,14 +73,14 @@ def __init__(self, test_context):
redpanda=self.redpanda,
include_query_engines=[QueryEngineType.SPARK])

-def avro_stream_config(self, topic, subject, cnt=3000):
+def avro_stream_config(self, topic, subject, cnt=3000, interval_ms=None):
Contributor

> Production rate reduced as otherwise RPCN produces too much data while unmounting

Hmm I thought we blocked writes while we unmounted. Are we sure we eventually actually quiesce? The tests I saw were hanging around for 12 minutes without completing -- I don't imagine unmount would take that long, would it?

Contributor Author

Yes we do, but it takes some time since we block it.
The 12 minutes you saw was a combination of many problems, swallowed errors in particular -- see 1st commit.
