Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Remove dependency on kallsyms with eBPF #5217

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

krzysiek4321
Copy link

I've been trying to improve perf tools' startup time to make working on fixing broken tests/features of perf bearable on NixOs. I've learned that processing /proc/kallsyms is a costly operation; on my ryzen5 system around 100ms just to read through it.

Aside from decreasing how many times kallsyms are read, I started to look into how to remove dependency on kallsyms.
From kernel source code kernel/kallsyms.c and printk() documentation https://docs.kernel.org/core-api/printk-formats.html#symbols-function-pointers I learned about special pointer formatting flags in kernel.

Writing a custom kernel module just for the purpose of extracting symbol names didn't feel right to me, so I've tried to use eBPF for that purpose. eBPF programs have these helpers available that are promising:

  • bpf_snprintf
  • bpf_trace_printk

I've looked into how these two are implemented and found that both of them use bstr_printf underneath. But before any printing is done, the format string must first go through bpf_bprintf_prepare, which disallows certain flags.

Fortunately for us, thanks to Florent Revest
https://lore.kernel.org/bpf/[email protected]/ %pB is accepted. We might consider adding that info to bpf-helpers man page?

Overview of the new approach:

  1. eBPF profiler does low-latency recording of stack frames as it used to;
  2. Go through all kernel ips recorded in stack traces and save them in bpf_hashmap(K=u64, V=char[MAX_SYM_LEN]) with an empty value for now;
  3. Call eBPF converter program, which will populate values in the hashmap with bpf_snprintf(%pB);
  4. Report recorded callstacks using the now-populated hashmap for ksyms.

In order to reliably trigger the converter program, I decided to use USDT.

Running time ./profile -F 2344 1 on a mostly idle system I got Before:
real 0m1,215s
user 0m0,058s
sys 0m0,157s

After:
real 0m1,045s
user 0m0,009s
sys 0m0,026s

I ask the community here for your opinion, help and guidance to make this mergeable.

Using %pB slightly changes the format of a symbol name. Example: kmem_cache_alloc_noprof+0x2cf/0x300
It would be trivial to remove the suffix if it's necessary. Generating flamegraphs with Brendan Gregg's perl script still works.

For now, max_entries of the hashmap is hardcoded. Would you make it dynamic, like stack-storage-size or compute it by collecting all ips into a set and take this set's size as value for max_entries?

In Makefile:
When adding bpf/usdt.bpf.h clang errored it couldn't find asm/errno.h so I added -v to cflags when V is set.
As a quick fix, I hardcoded include path to x86_64 host header files. In contrast to clang, gcc corretly has this in it's default include path. Do you have any idea how to set this path, preferably so it doesn't only work on ubuntu?

Experimentally I added BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID to bpf_get_stackid. It doesn't seem to break the program and should make recording faster. If somebody knows why adding them is a bad idea please do tell.

If this gets a positive reaction, I will look into converting other tools here to use eBPF instead of /proc/kallsyms when possible.

Best,
Krzysztof

I've been trying to improve perf tools' startup time to make working on
fixing broken tests/features of perf bearable on NixOs.
I've learned that processing /proc/kallsyms is a costly operation;
on my ryzen5 system around 100ms just to read through it.

Aside from decreasing how many times kallsyms are read, I started to look
into how to remove dependency on kallsyms.
From kernel source code kernel/kallsyms.c and printk() documentation
https://docs.kernel.org/core-api/printk-formats.html#symbols-function-pointers
I learned about special pointer formatting flags in kernel.

Writing a custom kernel module just for the purpose of extracting symbol
names didn't feel right to me, so I've tried to use eBPF for that purpose.
eBPF programs have these helpers available that are promising:
- bpf_snprintf
- bpf_trace_printk

I've looked into how these two are implemented and found that both of
them use bstr_printf underneath. But before any printing is done, the format
string must first go through bpf_bprintf_prepare, which disallows certain
flags.

Fortunately for us, thanks to Florent Revest
https://lore.kernel.org/bpf/[email protected]/
%pB is accepted. We might consider adding that info to bpf-helpers man page?

Overview of the new approach:
1. eBPF profiler does low-latency recording of stack frames as it used
to;
2. Go through all kernel ips recorded in stack traces and save them in
   bpf_hashmap(K=u64, V=char[MAX_SYM_LEN]) with an empty value for now;
3. Call eBPF converter program, which will populate values in the hashmap
   with bpf_snprintf(%pB);
4. Report recorded callstacks using the now-populated hashmap for ksyms.

In order to reliably trigger the converter program, I decided to use
USDT.

Running `time ./profile -F 2344 1` on a mostly idle system I got
Before:
real    0m1,215s
user    0m0,058s
sys     0m0,157s

After:
real    0m1,045s
user    0m0,009s
sys     0m0,026s

I ask the community here for your opinion, help and guidance to make
this mergeable.

Using %pB slightly changes the format of a symbol name. Example:
kmem_cache_alloc_noprof+0x2cf/0x300
It would be trivial to remove the suffix if it's necessary.
Generating flamegraphs with Brendan Gregg's perl script still works.

For now, max_entries of the hashmap is hardcoded. Would you make it
dynamic, like stack-storage-size or compute it by collecting all ips into
a set and take this set's size as value for max_entries?

In Makefile:
When adding bpf/usdt.bpf.h clang errored it couldn't find asm/errno.h so
I added -v to cflags when V is set.
As a quick fix, I hardcoded include path to x86_64 host header files.
In contrast to clang, gcc corretly has this in it's default include path.
Do you have any idea how to set this path, preferably so it doesn't
only work on ubuntu?

Experimentally I added BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID to
bpf_get_stackid. It doesn't seem to break the program and should make
recording faster. If somebody knows why adding them is a bad idea please
do tell.

If this gets a positive reaction, I will look into converting other tools
here to use eBPF instead of /proc/kallsyms when possible.

Best,
Krzysztof
If there is an error and we jump to cleanup it's still freed in
profile_bpf__destroy.

Valgrind doesn't go well with eBPF programs, but to make it happier
init empty with zeros explicitly.
@krzysiek4321
Copy link
Author

I've watched now Evolution of stack trace captures with BPF - Andrii Nakryiko
Originally posted by @ekyooo in #5181 (comment)

I think we should consider keeping BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID
since for some users latency of profiling may be more important
and if hash collision happens we are losing data anyway.

Perhaps creating an option --fast, where the possibility of overwriting data on hash colisions is mentioned in it's description.
Perhaps make this the default behavior and make a flag to turn it off.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant