Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bgpd: rfapi: fix mem leak when killed #18045

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

paulzlabn
Copy link
Contributor

Something changed in the bgpd termination sequence, leading to rfapi objects not being freed via their usual timed expiration pathway. This change frees rfapi peer BPIs immediately instead of expiring via timer when bgp is terminating.

@frrbot frrbot bot added the bgp label Feb 6, 2025
@paulzlabn paulzlabn force-pushed the ziemba/250205-rfapi-mem-cleanup branch from d8bea53 to 125a480 Compare February 6, 2025 16:46
@donaldsharp
Copy link
Member

as a note for something that needs exploration in the rfapi code -> There is a pattern of setting the vnc.import.timer thread pointer to NULL and then calling event_add_timer.

This pattern has caused problems in the past where memory allocated and pointed to by the event/thread get's leaked or worse when that event hasn't expired but the data structures that are used by that event/thread get yanked out from underneath it(say some form of no command) and the underlying code has no way to get ahold of that event/thread since the pointer to it in it's space is gone. This will cause crashes! I don't necessarily think this should hold up this memory leak problem but it is a pattern that the rfapi code should be aware of and fix.

@donaldsharp
Copy link
Member

	assert(bpi->extra);
	if (lifetime > UINT32_MAX / 1001) {
		/* sub-optimal case, but will probably never happen */
		bpi->extra->vnc->vnc.import.timer = NULL;
		event_add_timer(bm->master, timer_service_func, wcb, lifetime,
				&bpi->extra->vnc->vnc.import.timer);
	} else {
		static uint32_t jitter;
		uint32_t lifetime_msec;

		/*
		 * the goal here is to spread out the timers so they are
		 * sortable in the skip list
		 */
		if (++jitter >= 1000)
			jitter = 0;

		lifetime_msec = (lifetime * 1000) + jitter;

		bpi->extra->vnc->vnc.import.timer = NULL;
		event_add_timer_msec(bm->master, timer_service_func, wcb,
				     lifetime_msec,
				     &bpi->extra->vnc->vnc.import.timer);
	}

here is the code I am talking about

@paulzlabn
Copy link
Contributor Author

setting the vnc.import.timer thread pointer to NULL and then calling event_add_timer.

Interesting. Even the original code from 2009 had an equivalent pattern import.timer = thread_add_timer(...).
It would probably be good to replace setting-to-NULL with an assert and then see what falls out. I'll have a look at that.

@donaldsharp
Copy link
Member

I've definately fixed a few bugs from code using that pattern from taht time period.

@donaldsharp
Copy link
Member

==3115346==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 40 byte(s) in 1 object(s) allocated from:
    #0 0x7f27348b4a57 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x7f27342910a9 in qcalloc lib/memory.c:106
    #2 0x55a4670295b7 in rfapiRibStartTimer bgpd/rfapi/rfapi_rib.c:328
    #3 0x55a467031168 in process_pending_node bgpd/rfapi/rfapi_rib.c:1371
    #4 0x55a467031c4f in rib_do_callback_onepass bgpd/rfapi/rfapi_rib.c:1465
    #5 0x55a467032123 in rfapiRibDoQueuedCallback bgpd/rfapi/rfapi_rib.c:1507
    #6 0x7f27343cc2aa in work_queue_run lib/workqueue.c:282
    #7 0x7f27343a3ad9 in event_call lib/event.c:1984
    #8 0x7f2734262141 in frr_run lib/libfrr.c:1246
    #9 0x55a466c3ce19 in main bgpd/bgp_main.c:557
    #10 0x7f2733629d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

SUMMARY: AddressSanitizer: 40 byte(s) leaked in 1 allocation(s).
::::::::::::::
./bgp_rfapi_basic_sanity.test_bgp_rfapi_basic_sanity/r3.asan.bgpd.3094674
::::::::::::::

=================================================================
==3094674==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 40 byte(s) in 1 object(s) allocated from:
    #0 0x7f0ba40b4a57 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x7f0ba3a910a9 in qcalloc lib/memory.c:106
    #2 0x5575d77325b7 in rfapiRibStartTimer bgpd/rfapi/rfapi_rib.c:328
    #3 0x5575d773a168 in process_pending_node bgpd/rfapi/rfapi_rib.c:1371
    #4 0x5575d773ac4f in rib_do_callback_onepass bgpd/rfapi/rfapi_rib.c:1465
    #5 0x5575d773b123 in rfapiRibDoQueuedCallback bgpd/rfapi/rfapi_rib.c:1507
    #6 0x7f0ba3bcc2aa in work_queue_run lib/workqueue.c:282
    #7 0x7f0ba3ba3ad9 in event_call lib/event.c:1984
    #8 0x7f0ba3a62141 in frr_run lib/libfrr.c:1246
    #9 0x5575d7345e19 in main bgpd/bgp_main.c:557
    #10 0x7f0ba2e29d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

does this need the same fix?

@paulzlabn
Copy link
Contributor Author

#2 0x5575d77325b7 in rfapiRibStartTimer bgpd/rfapi/rfapi_rib.c:328
#3 0x5575d773a168 in process_pending_node bgpd/rfapi/rfapi_rib.c:1371
#4 0x5575d773ac4f in rib_do_callback_onepass bgpd/rfapi/rfapi_rib.c:1465
#5 0x5575d773b123 in rfapiRibDoQueuedCallback bgpd/rfapi/rfapi_rib.c:1507

does this need the same fix?

I think this will be a different fix, because the allocation here didn't occur while shutting down. But I suspect that both this case as well as the case addressed by this PR were triggered by some change in the ordering of FRR infrastructure cleanup during shutdown (which revealed some latent omission in the rfapi cleanup routine).

@paulzlabn
Copy link
Contributor Author

ci:rerun

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants