dt: disable core dumps in crash tracker tests #25117

Draft
wants to merge 1 commit into base: dev

pgellert
Contributor

Currently, the `CrashLoopChecksTest.test_crash_report_with_signal` ducktape tests consistently fail in CDT. When the test sends a crash signal to the redpanda process, CDT generates a core dump. Writing that dump takes a long time (>1min), so the test times out while waiting for redpanda to stop (the wait is limited to 10s).

To fix this, this PR disables core dumps for CrashLoopChecksTest ducktape tests.

See the commit message for implementation details.

Fixes https://redpandadata.atlassian.net/browse/CORE-9044
Tested this on PR #25100

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

  • none

I chose to set the core_pattern as it is the most reliable and simplest
option for our current setup.

Note that in our current CDT setup, the kernel does not write core dumps
to a file directly. Instead, core_pattern uses the piping mechanism to
send each core dump to `apport`. Because of this, the RLIMIT_CORE of the
process (set through ulimit) is not respected, meaning ulimit alone
cannot disable core dumps.

For reference, see the "Piping core dumps to a program" section of
https://www.man7.org/linux/man-pages/man5/core.5.html
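
To make the distinction above concrete, here is a minimal Python sketch of how one might check which mode a node is in. It relies on the rule from core(5) that a `core_pattern` starting with `|` is treated as a program to pipe the dump to. The helper names (`is_piped_core_pattern`, `describe_core_dump_setup`, `read_current_setup`) are hypothetical and not part of the PR, and the RLIMIT_CORE wording in the output simply restates the commit message's claim.

```python
import resource
from pathlib import Path

# Standard procfs location of the kernel's core_pattern setting on Linux.
CORE_PATTERN_PATH = Path("/proc/sys/kernel/core_pattern")


def is_piped_core_pattern(pattern: str) -> bool:
    """Return True if the pattern asks the kernel to pipe dumps to a program."""
    # Per core(5), a leading '|' marks the rest of the line as a command.
    return pattern.startswith("|")


def describe_core_dump_setup(pattern: str, soft_limit: int) -> str:
    """Summarize how core dumps are handled for a pattern and RLIMIT_CORE."""
    if is_piped_core_pattern(pattern):
        # Restates the commit message: ulimit alone does not help here.
        return "piped to a program; RLIMIT_CORE=0 alone will not stop the dump"
    if soft_limit == 0:
        return "written to a file, but disabled by RLIMIT_CORE=0"
    return "written to a file"


def read_current_setup() -> str:
    """Inspect the local machine (Linux only); hypothetical helper."""
    pattern = CORE_PATTERN_PATH.read_text().strip()
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_CORE)
    return describe_core_dump_setup(pattern, soft_limit)
```

Calling `read_current_setup()` on a CDT node should report the piped case if the setup matches what the commit message describes.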
@pgellert pgellert requested review from a team February 19, 2025 16:17
@pgellert pgellert self-assigned this Feb 19, 2025
@pgellert pgellert requested review from rpdevmp, oleiman, a team and michael-redpanda and removed request for a team and oleiman February 19, 2025 16:17
@vbotbuildovich
Collaborator

CI test results

test results on build#62030
| test_id | test_kind | job_url | test_status | passed |
|---|---|---|---|---|
| rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery | ducktape | https://buildkite.com/redpanda/redpanda/builds/62030#01951f4a-0ff1-45d2-9d14-ccbd458b62db | FLAKY | 1/4 |
| rptest.tests.log_compaction_test.LogCompactionTest.compaction_stress_test.cleanup_policy=compact.delete.key_set_cardinality=1000.storage_compaction_key_map_memory_kb=3 | ducktape | https://buildkite.com/redpanda/redpanda/builds/62030#01951f4a-0fee-46a6-9ea5-74ecd3c04977 | FLAKY | 1/2 |
| rptest.tests.log_compaction_test.LogCompactionTest.compaction_stress_test.cleanup_policy=compact.key_set_cardinality=1000.storage_compaction_key_map_memory_kb=10 | ducktape | https://buildkite.com/redpanda/redpanda/builds/62030#01951f4a-0fee-46a6-9ea5-74ecd3c04977 | FLAKY | 1/3 |
| rptest.tests.log_compaction_test.LogCompactionTest.compaction_stress_test.cleanup_policy=compact.key_set_cardinality=1000.storage_compaction_key_map_memory_kb=3 | ducktape | https://buildkite.com/redpanda/redpanda/builds/62030#01951f4a-0ff0-4db1-9a89-26d0b5f53b26 | FLAKY | 1/3 |
| rptest.tests.partition_balancer_test.PartitionBalancerTest.test_unavailable_nodes | ducktape | https://buildkite.com/redpanda/redpanda/builds/62030#01951f4a-0fee-46a6-9ea5-74ecd3c04977 | FLAKY | 1/2 |
| rptest.tests.scaling_up_test.ScalingUpTest.test_scaling_up_with_recovered_topic | ducktape | https://buildkite.com/redpanda/redpanda/builds/62030#01951f4a-0ff0-4db1-9a89-26d0b5f53b26 | FLAKY | 1/2 |

@@ -57,6 +57,9 @@ def __init__(self, test_context):
)
self.broker = self.redpanda.nodes[0]

# Disable core dumps as they take a long time (>1min). Core dumps are uninteresting for this test, since this test intentionally triggers crashes.
self.broker.account.ssh("sysctl -w kernel.core_pattern='|/dev/null'")
Member

@StephanDollberg StephanDollberg Feb 19, 2025

Doesn't this have to be reverted at the end of the test?

Note that apport is a giant piece of garbage so there is a point for disabling/removing it in our ansible altogether (at which point ulimit should work again).

Member

> Doesn't this have to be reverted at the end of the test?

Yeah, I am concerned about running this on my local machine.

Contributor Author

Yeah, that's fair. This PR is meant to be a way of disabling apport, but yeah, good point about this affecting local machines and other tests. Let me reach out to devprod to ask them to disable apport or move to a non-piping core_pattern in CDT.
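
For reference, the revert concern raised in this thread could be handled by saving the original `core_pattern` and restoring it when the test finishes. The following is only a sketch under stated assumptions, not what this PR implements: `core_dumps_disabled` and the `run` callable are hypothetical names, and `run` stands in for whatever helper executes a shell command on the node and returns its stdout. The default pattern is the one used in the diff above.

```python
import shlex
from contextlib import contextmanager
from typing import Callable, Iterator

# Hypothetical runner type: executes a shell command on the node under test
# and returns its stdout as a string.
RunFn = Callable[[str], str]


@contextmanager
def core_dumps_disabled(run: RunFn, pattern: str = "|/dev/null") -> Iterator[None]:
    """Temporarily override kernel.core_pattern, restoring it on exit."""
    # `sysctl -n` prints only the value, without the key name.
    original = run("sysctl -n kernel.core_pattern").strip()
    run(f"sysctl -w kernel.core_pattern={shlex.quote(pattern)}")
    try:
        yield
    finally:
        # Restore the saved value even if the test body raised.
        run(f"sysctl -w kernel.core_pattern={shlex.quote(original)}")
```

A test could wrap its crash-triggering section in `with core_dumps_disabled(run): ...`. This still changes a machine-wide setting for the duration of the block, so it would not fully address the concern about local machines or about other tests running concurrently on the same node.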

Member

Cool. Yeah, I'm not opposed to this, but as a drive-by review, it seems like there might be other ways to approach this, such as a one-off test outside the normal test harnesses. I'm not immediately sure what the best option is, though.

@pgellert pgellert marked this pull request as draft February 20, 2025 11:07
@pgellert
Contributor Author

Putting this to draft mode while I discuss with devprod an alternative approach of making a CDT-only change to core_pattern / disabling apport.
