WIP: Remove dependency on kallsyms with eBPF #5217

krzysiek4321 · 2025-02-22T05:21:06Z

I've been trying to improve perf tools' startup time to make fixing broken tests/features of perf bearable on NixOS. I've learned that processing /proc/kallsyms is a costly operation; on my Ryzen 5 system it takes around 100 ms just to read through it.

Aside from reducing how many times kallsyms is read, I started to look into how to remove the dependency on kallsyms altogether.
From the kernel source code (kernel/kallsyms.c) and the printk() documentation https://docs.kernel.org/core-api/printk-formats.html#symbols-function-pointers I learned about the kernel's special pointer format specifiers.

Writing a custom kernel module just for the purpose of extracting symbol names didn't feel right to me, so I've tried to use eBPF for that purpose. eBPF programs have these helpers available that are promising:

- bpf_snprintf
- bpf_trace_printk

I've looked into how these two are implemented and found that both of them use bstr_printf underneath. But before any printing is done, the format string must first go through bpf_bprintf_prepare, which disallows certain flags.

Fortunately for us, thanks to Florent Revest
https://lore.kernel.org/bpf/[email protected]/ %pB is accepted. Should we consider adding that information to the bpf-helpers man page?

Overview of the new approach:

1. eBPF profiler does low-latency recording of stack frames as it used to;
2. Go through all kernel ips recorded in stack traces and save them in bpf_hashmap(K=u64, V=char[MAX_SYM_LEN]) with an empty value for now;
3. Call eBPF converter program, which will populate values in the hashmap with bpf_snprintf(%pB);
4. Report recorded callstacks using the now-populated hashmap for ksyms.

In order to reliably trigger the converter program, I decided to use USDT.

Running `time ./profile -F 2344 1` on a mostly idle system, I got:

Before:
real 0m1,215s
user 0m0,058s
sys 0m0,157s

After:
real 0m1,045s
user 0m0,009s
sys 0m0,026s

I ask the community here for your opinion, help and guidance to make this mergeable.

Using %pB slightly changes the format of a symbol name. Example: kmem_cache_alloc_noprof+0x2cf/0x300
It would be trivial to remove the suffix if it's necessary. Generating flamegraphs with Brendan Gregg's perl script still works.
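
To illustrate what removing the suffix could look like, here is a minimal sketch in userspace C. The function name `strip_sym_offset` is hypothetical and not part of this PR; the sketch assumes symbol names never contain `+`, so everything from the first `+` onward is the offset/size part (plus any module name the kernel appends after it).

```c
#include <string.h>

/*
 * Truncate a "%pB"-style string such as "name+0x2cf/0x300" to "name".
 * The buffer is modified in place. Anything after the first '+' is cut,
 * which also drops a trailing " [module]" annotation if one is present.
 * Strings without a '+' (for example a raw hex address printed for an
 * unresolved ip) are left unchanged.
 */
void strip_sym_offset(char *sym)
{
	char *plus = strchr(sym, '+');

	if (plus)
		*plus = '\0';
}
```

Calling it on a writable copy of `kmem_cache_alloc_noprof+0x2cf/0x300` would leave `kmem_cache_alloc_noprof` in the buffer.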

For now, max_entries of the hashmap is hardcoded. Would you make it configurable, like stack-storage-size, or compute it by collecting all ips into a set and using that set's size as the value for max_entries?
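
For the second option, a rough sketch of counting distinct ips in userspace might look like the following. The helper name `count_unique_ips` is hypothetical; it sorts a flat array of collected addresses and counts the distinct values rather than building a real set, which is an assumption made to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for 64-bit addresses; avoids overflow from subtraction. */
static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/*
 * Sort ips in place and return the number of distinct values.
 * The distinct values end up packed at the front of the array.
 */
size_t count_unique_ips(uint64_t *ips, size_t n)
{
	size_t i, uniq = 0;

	if (n == 0)
		return 0;

	qsort(ips, n, sizeof(*ips), cmp_u64);
	for (i = 1; i < n; i++) {
		if (ips[i] != ips[uniq])
			ips[++uniq] = ips[i];
	}
	return uniq + 1;
}
```

The returned count could then serve as max_entries. Note that libbpf's bpf_map__set_max_entries() only takes effect before the object holding the map is loaded, so this assumes the symbol map is created after sampling has finished.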

In the Makefile:
When I added bpf/usdt.bpf.h, clang failed because it couldn't find asm/errno.h, so I added -v to cflags when V is set.
As a quick fix, I hardcoded the include path to the x86_64 host header files. In contrast to clang, gcc correctly has this in its default include path. Do you have any idea how to set this path, preferably so that it doesn't work only on Ubuntu?

Experimentally I added BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID to bpf_get_stackid. It doesn't seem to break the program and should make recording faster. If somebody knows why adding them is a bad idea please do tell.

If this gets a positive reaction, I will look into converting other tools here to use eBPF instead of /proc/kallsyms when possible.

Best,
Krzysztof


If there is an error and we jump to cleanup it's still freed in profile_bpf__destroy. Valgrind doesn't go well with eBPF programs, but to make it happier init empty with zeros explicitly.

krzysiek4321 · 2025-02-24T01:39:55Z

I've now watched "Evolution of stack trace captures with BPF" by Andrii Nakryiko.
Originally posted by @ekyooo in #5181 (comment)

I think we should consider keeping BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID
since for some users latency of profiling may be more important
and if hash collision happens we are losing data anyway.

Perhaps we could add a --fast option whose description mentions the possibility of overwriting data on hash collisions.
Or perhaps make this the default behavior and add a flag to turn it off.

krzysiek4321 requested review from drzaeus77, goldshtn, yonghong-song, 4ast, brendangregg and davemarchevsky as code owners February 22, 2025 05:21

Stop profiling before printing results

91684f1
