another apparent case of binary search trouble #137
As you say, this seems complicated. It seems that we could have something like
If PC is 9, we will find entry 3, but we want to find entry 1. I'm not sure how best to handle that case.
Yes, that is the complication I was referring to above, where my approach wouldn't work. Still, I think it is strictly more general than the existing code. As for a proper fix: if we can assume that ranges only nest and never partially overlap, then I'd "simply" introduce an extra field in each entry that points to its direct "parent" entry, the parent being the entry that nests "on top" of it. We'd then add a simple linear pass after sorting to establish those links.
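A minimal sketch of that parent-link idea, assuming ranges only nest and never partially overlap, and that entries are sorted by low address ascending and, for equal lows, by high address descending. `struct range_entry`, `link_parents` and `lookup_pc` are hypothetical names, not libbacktrace's; the `parent` index plays the role of the proposed extra field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an address-range entry:
   the half-open range [low, high) plus the proposed parent link.  */
struct range_entry
{
  uintptr_t low;
  uintptr_t high;
  ptrdiff_t parent; /* Index of the innermost enclosing entry, or -1.  */
};

/* Linear pass run once after sorting.  Assumes entries are sorted by
   low ascending (high descending on ties) and that ranges only nest.  */
static void
link_parents (struct range_entry *e, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      assert (i == 0 || e[i - 1].low <= e[i].low);
      /* Start from the previous entry and climb its ancestors until we
         reach one that still covers e[i].low: under the nesting
         assumption, that is e[i]'s innermost enclosing entry.  */
      ptrdiff_t p = (ptrdiff_t) i - 1;
      while (p >= 0 && e[p].high <= e[i].low)
        p = e[p].parent;
      e[i].parent = p;
    }
}

/* Returns the index of the innermost entry containing PC, or -1.  */
static ptrdiff_t
lookup_pc (const struct range_entry *e, size_t n, uintptr_t pc)
{
  /* Binary search: afterwards LO is the number of entries with
     low <= pc, so e[lo - 1] is the last such entry.  */
  size_t lo = 0, hi = n;
  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;
      if (e[mid].low <= pc)
        lo = mid + 1;
      else
        hi = mid;
    }
  /* Every entry containing PC is e[lo - 1] or encloses it, so if that
     entry ends too early, follow the parent links outwards.  */
  ptrdiff_t p = (ptrdiff_t) lo - 1;
  while (p >= 0 && pc >= e[p].high)
    p = e[p].parent;
  return p;
}
```

The cost is one extra word per entry, whether or not the entry is actually nested.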
Here is my attempt to address this, using the logic I described above: alk/gperftools@e1d65de. Let me know what you think. It seems to work (I can indeed see cases like #137 (comment) in my testing), and I managed to convince myself it is "quite obvious" the right thing to do. But surely we should have something better than that. Is there anything I can do to cover this logic with tests?
Thanks for the patch. I'm a little concerned by the increased memory usage. A large program can have a lot of unit_addrs. It would be one thing if the new pointer were often non-NULL, but it seems to me that it will almost always be NULL.
Sure. We could then have a separate array of ranges just for "top-level" entries, i.e. entries without a parent and with at least one child. Then, when dealing with normal entries, we'd simply scan backwards until the start of the matching top-level entry (if any). I am happy to try that too. What do you think?
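A rough sketch of this variant, assuming entries are sorted by low ascending (high descending on ties) and that ranges only nest. Instead of a per-entry pointer, it keeps a separate array `tops` holding the indices of top-level entries (no parent, at least one child). `struct range`, `find_top_level` and `lookup_pc` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified address-range entry: [low, high).  */
struct range
{
  uintptr_t low;
  uintptr_t high;
};

/* Fill TOPS (capacity N) with the indices of top-level entries, i.e.
   entries with no parent and at least one child.  Returns the count.  */
static size_t
find_top_level (const struct range *e, size_t n, size_t *tops)
{
  size_t count = 0;
  size_t root = 0;          /* Current outermost entry.  */
  int root_has_child = 0;
  for (size_t i = 0; i < n; i++)
    {
      assert (i == 0 || e[i - 1].low <= e[i].low);
      if (i > 0 && e[i].low < e[root].high)
        {
          /* Starts inside the current root, so (given nesting) it is
             one of its descendants.  */
          root_has_child = 1;
          continue;
        }
      if (root_has_child)
        tops[count++] = root;
      root = i;
      root_has_child = 0;
    }
  if (root_has_child)
    tops[count++] = root;
  return count;
}

/* Returns the index of the innermost entry containing PC, or -1.  */
static ptrdiff_t
lookup_pc (const struct range *e, size_t n, const size_t *tops,
           size_t ntops, uintptr_t pc)
{
  /* K is the last entry with low <= pc.  */
  size_t lo = 0, hi = n;
  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;
      if (e[mid].low <= pc)
        lo = mid + 1;
      else
        hi = mid;
    }
  if (lo == 0)
    return -1;
  size_t k = lo - 1;
  if (pc < e[k].high)
    return (ptrdiff_t) k;   /* Common case: no nesting involved.  */

  /* Top-level ranges are disjoint, so only the last one starting at
     or before PC can contain it.  */
  lo = 0;
  hi = ntops;
  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;
      if (e[tops[mid]].low <= pc)
        lo = mid + 1;
      else
        hi = mid;
    }
  if (lo == 0 || pc >= e[tops[lo - 1]].high)
    return -1;
  size_t top = tops[lo - 1];

  /* Scan backwards, but no further than the top-level entry.  Every
     entry visited starts at or before PC, and the first one that also
     ends after PC is the innermost one.  */
  for (size_t j = k; j > top; j--)
    if (pc < e[j - 1].high)
      return (ptrdiff_t) (j - 1);
  return -1;  /* Not reached when ranges only nest.  */
}
```

Compared with a per-entry pointer, the extra memory here is proportional only to the number of top-level entries.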
Maybe? Not sure.
Okay, I'll give it a shot.
Of course it'll need at least some cleanup (I know my coding style doesn't always match this file's; there is debugging stuff left in; variable names, etc.), but as a proof of concept it should do. Indeed, at least in my test program, the number of top-level entries is quite small: 116 in total (when built with ./configure CXX='g++ -no-pie' CXXFLAGS='-O0 -fno-inline -ggdb3' CFLAGS='-O0 -fno-inline -ggdb3' --disable-shared). Let me know what you think.
I thought of a different approach: when there are overlapping ranges, fill in any gaps by inserting ranges that cover the gaps. This takes a bit more time when setting up the ranges, but means that the lookup code stays simple and hopefully efficient. See #140. Can you check whether that handles your original test case? Thanks.
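One way to picture the gap-filling idea, sketched under the assumption that entries are sorted by low ascending (high descending on ties) and that ranges only nest: flatten the nested ranges into disjoint pieces, each labelled with the unit of its innermost entry, so that a plain binary search finds the right unit. This is an illustrative variant, not the code in #140; `struct range`, `flatten_walk` and `flatten` are hypothetical names, and the output array needs room for up to 2 * n pieces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified entry: [low, high) plus the unit it maps to.  */
struct range
{
  uintptr_t low;
  uintptr_t high;
  int unit;
};

/* Emit the parts of e[i] not covered by its children, recursing into
   each child.  Returns the index just past e[i]'s subtree.  */
static size_t
flatten_walk (const struct range *e, size_t n, size_t i,
              struct range *out, size_t *nout)
{
  uintptr_t cur = e[i].low;   /* First address of e[i] not yet emitted.  */
  size_t j = i + 1;
  while (j < n && e[j].low < e[i].high)
    {
      /* e[j] is a direct child; nesting means it ends inside e[i].  */
      assert (e[j].high <= e[i].high);
      if (cur < e[j].low)
        out[(*nout)++] = (struct range) { cur, e[j].low, e[i].unit };
      cur = e[j].high;
      j = flatten_walk (e, n, j, out, nout);
    }
  if (cur < e[i].high)
    out[(*nout)++] = (struct range) { cur, e[i].high, e[i].unit };
  return j;
}

/* Flatten all entries into OUT (room for 2 * N pieces).  Returns the
   number of disjoint pieces written, in address order.  */
static size_t
flatten (const struct range *e, size_t n, struct range *out)
{
  size_t nout = 0;
  for (size_t i = 0; i < n;)
    i = flatten_walk (e, n, i, out, &nout);
  return nout;
}
```

The recursion depth here equals the maximum nesting depth of the input.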
This is a nice idea, but something isn't right with the implementation. I am trying to figure that out. Meanwhile, reproducing this is straightforward. Here is a Dockerfile that shows the issue:
I was actually expecting a much easier reproduction, as I see it across multiple gcc versions. But somehow neither the latest Ubuntu nor Debian stable has the problem; Debian unstable and testing do. The last step runs the stack trace test, which is expected to fail, and the output is a symbolized backtrace. Here is what "docker build ." gave me:
I.e. the RUN_ALL_TESTS() line failed pcinfo symbolization, and so did a couple of MallocBlock methods. I checked, and they fail for the same reason as before: failing to find the unit.
I am not entirely sure this is fully "it", but here is a change that appears to make it work.
Another notable thing is that gcc (or the linker plus gcc) produces a number of clearly bogus ranges with a low address of 0 and lengths from a few tens of bytes to a couple of kilobytes. Those cause the maximum recursion depth of the walk function to reach 82. That is probably still fine, but it might be worth simply dropping those bogus entries (the ones outside the ELF file's known VMA ranges).
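A sketch of the filtering suggested here, assuming the caller already knows the address ranges of the ELF file's loaded segments (how those are obtained is left open). `struct range`, `struct segment`, `inside_some_segment` and `drop_bogus_ranges` are hypothetical names. Keeping only entries that lie entirely inside some segment is one possible policy; a looser overlap test could be substituted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified entry: [low, high).  */
struct range
{
  uintptr_t low;
  uintptr_t high;
};

/* Hypothetical description of one loaded segment: [start, end).  */
struct segment
{
  uintptr_t start;
  uintptr_t end;
};

/* Nonzero if [low, high) lies entirely within one of the segments.  */
static int
inside_some_segment (const struct segment *segs, size_t nsegs,
                     uintptr_t low, uintptr_t high)
{
  for (size_t s = 0; s < nsegs; s++)
    if (low >= segs[s].start && high <= segs[s].end)
      return 1;
  return 0;
}

/* Compact E in place, keeping only entries inside a known segment and
   preserving their order.  Returns the new number of entries.  */
static size_t
drop_bogus_ranges (struct range *e, size_t n,
                   const struct segment *segs, size_t nsegs)
{
  size_t kept = 0;
  for (size_t i = 0; i < n; i++)
    {
      assert (e[i].low <= e[i].high);
      if (inside_some_segment (segs, nsegs, e[i].low, e[i].high))
        e[kept++] = e[i];
    }
  return kept;
}
```

Because the relative order of surviving entries is preserved, a filter like this could run either before or after sorting.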
And those bogus entries appear to be real, not some bug in how libbacktrace reads address ranges; dwarfdump sees them too:
Thanks. It looks like some address ranges appear in multiple compilation units. My guess is that this is due to the linker merging some duplicate functions. For example:
That in itself is fine. But libbacktrace has an optimization in
to something like
which sorts to
and now we have the overlapping ranges. I will go ahead and commit the change, modified as you suggested. Thanks for digging into this.
Fixes ianlancetaylor/libbacktrace#137.
	* dwarf.c (resolve_unit_addrs_overlap_walk): New static function.
	(resolve_unit_addrs_overlap): New static function.
	(build_dwarf_data): Call resolve_unit_addrs_overlap.
Thanks. A couple of comments:
*) From #139, my understanding is that similar treatment might be needed for functions.
*) What are your thoughts about the possible recursion depth of this code? In the worst case it can get very deep, and it might even have security implications.
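On the recursion-depth question, one option is to replace recursion with an explicit stack, so that the worst case (depth equal to the number of entries) costs heap memory rather than call-stack frames. A sketch, assuming entries sorted by low ascending (high descending on ties) and ranges that only nest: it flattens them into disjoint pieces labelled with the innermost entry's unit. `struct range` and `flatten_iter` are hypothetical names; the caller supplies `stack` with room for n indices and `out` with room for 2 * n pieces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified entry: [low, high) plus the unit it maps to.  */
struct range
{
  uintptr_t low;
  uintptr_t high;
  int unit;
};

/* Non-recursive flattening.  STACK holds the indices of the currently
   open (enclosing) entries; its depth never exceeds N.  */
static size_t
flatten_iter (const struct range *e, size_t n, struct range *out,
              size_t *stack)
{
  size_t nout = 0;
  size_t depth = 0;
  uintptr_t cur = 0;  /* First not-yet-emitted address of the top entry.  */

  for (size_t i = 0; i <= n; i++)
    {
      /* Close every open entry that ends at or before e[i].low, or all
         of them once the input is exhausted, emitting its tail.  */
      while (depth > 0 && (i == n || e[stack[depth - 1]].high <= e[i].low))
        {
          const struct range *top = &e[stack[--depth]];
          if (cur < top->high)
            out[nout++] = (struct range) { cur, top->high, top->unit };
          cur = top->high;
        }
      if (i == n)
        break;
      assert (i == 0 || e[i - 1].low <= e[i].low);
      /* e[i] opens inside the current top entry (if any): emit the gap
         between what has been emitted so far and e[i].low.  */
      if (depth > 0 && cur < e[i].low)
        out[nout++] = (struct range) { cur, e[i].low,
                                       e[stack[depth - 1]].unit };
      stack[depth++] = i;
      cur = e[i].low;
    }
  return nout;
}
```

The worst-case depth is still n, but it is bounded by the size of the `stack` allocation rather than by the thread's stack size.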
Hi. I have been hitting occasional symbolization failures. They tend to occur when building with fewer optimizations (-O1 -fno-inline -ggdb3 in my case).
When stepping through in a debugger, I found that we reach the "found_entry = 0" case in dwarf_lookup_pc. Thankfully, git blame pointed me at #44, which indeed seems relevant.
My entries end up like this:
I.e. the entry with the highest suitable low bound has too low a high bound, but the entry immediately before it has a lower low bound and a high enough high bound. So the previous entry is the one we need, but it does not start at exactly the same low bound as the current one.
I am not sure what the possibilities are w.r.t. entries nesting or overlapping. But this patch seems to be more general than your code, which only steps to the previous entry if it starts at exactly the same low bound as the current one.
Because I am not sure about the nesting (or even overlapping) possibilities, I cannot be sure that my patch is truly correct. But in some sense it is "more correct" than your current approach. Otherwise we might need a more elaborate fix (like establishing proper parent-child links).
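A sketch of the more general lookup described in this post, assuming entries sorted by low ascending (high descending on ties): binary-search for the last entry whose low is <= pc, then, if that entry ends too early, keep stepping backwards to any earlier entry, not only ones sharing the same low bound. `struct range` and `lookup_pc` are hypothetical names, not the actual dwarf_lookup_pc code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified entry: [low, high).  */
struct range
{
  uintptr_t low;
  uintptr_t high;
};

/* Returns the index of an entry containing PC, or -1.  */
static ptrdiff_t
lookup_pc (const struct range *e, size_t n, uintptr_t pc)
{
  /* Afterwards LO is the number of entries with low <= pc.  */
  size_t lo = 0, hi = n;
  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;
      if (e[mid].low <= pc)
        lo = mid + 1;
      else
        hi = mid;
    }
  /* Step backwards from the last candidate.  Every entry visited
     already starts at or before PC, so only its end needs checking.
     If ranges only nest, the first hit is the innermost one.  */
  for (size_t j = lo; j > 0; j--)
    {
      assert (e[j - 1].low <= pc);
      if (pc < e[j - 1].high)
        return (ptrdiff_t) (j - 1);
    }
  return -1;
}
```

The catch is that a PC not covered by any entry can scan all the way back to index 0, so a miss costs O(n) rather than O(log n); that is the cost the parent-link and top-level-array ideas discussed in this thread try to avoid.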
Let me know what you think and, if my patch is okay, how you want it sent (a pull request, or is the patch above fine, etc.). Also, whether and how we can cover it with unit tests.