Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

another apparent case of binary search trouble #137

Closed
alk opened this issue Oct 3, 2024 · 14 comments
Closed

another apparent case of binary search trouble #137

alk opened this issue Oct 3, 2024 · 14 comments

Comments

@alk
Copy link

alk commented Oct 3, 2024

Hi. I faced occasional symbolization failures. They tend to occur when building with less optimizations (-O1 -fno-inline -ggdb3 in my case).

When stepping in debugger I found that we're reaching "found_entry = 0" case in dwarf_lookup_pc. Thankfully, git blame pointed me at #44, which indeed seems relevant.

My entries end up like this:

(gdb) p pc
$3 = 4410269
(gdb) p *entry
$4 = {low = 4409762, high = 4409766, u = 0x7ffff7900000}
(gdb) p entry[1]
$5 = {low = 4411113, high = 4412614, u = 0x7ffff78d7360}
(gdb) p entry[-1]
$6 = {low = 4409046, high = 4411113, u = 0x7ffff78d7360}

I.e. entry with highest suitable low bound, has too low high bound, but entry immediately prior to that has lower low but high enough high. I.e. previous entry is the entry we need, but it isn't at exactly same low bound as current.

I am not sure what are the possibilities w.r.t. entries nesting or overlapping. But this patch seems to be more general than your code (which requires previous entry to be exactly at same low as current in order to step to that previous entry).

diff --git a/dwarf.c b/dwarf.c
index 9115836..18027f4 100644
--- a/dwarf.c
+++ b/dwarf.c
@@ -4031,9 +4031,13 @@ dwarf_lookup_pc (struct backtrace_state *state, struct dwarf_data *ddata,
        }
       if (entry == ddata->addrs)
        break;
-      if ((entry - 1)->low < entry->low)
-       break;
-      --entry;
+      /* Previous entry might be more "general" than current and so be
+         a better chance of meeting found_entry condition above */
+      if (entry[-1].high >= entry[0].high) {
+       --entry;
+       continue;
+      }
+      break;
     }
   if (!found_entry)
     {

Because I am not sure about nesting (or even overlapping) possibilities, a so I cannot be sure if my patch is truly correct. But in some sense it is "more correct" than your current approach. Otherwise might need some more elaborate fix (like establishing proper parent-child links).

Let me know what you think and if my patch is okay, how you want this to be sent (pull request, or just patch above is fine etc). Also if/how we can somehow cover it by unit tests.

@ianlancetaylor
Copy link
Owner

As you say, this seems complicated. It seems that we could have something like

  1. low = 0, high = 10
  2. low = 2, high = 5
  3. low = 5, high = 8
  4. low = 10, high = 20

If PC is 9, we will find entry 3, but we want to find entry 1. I'm not sure how best to handle that case.

@alk
Copy link
Author

alk commented Oct 3, 2024

Yes, that is the complication I was referring to above where my approach wouldn't work. Still, I think it is strictly more general that existing code.

As for proper fix. If we can assume that only nesting is possible without overlaps, then I'd "simply" introduce extra field to entry that points to it's direct "parent" entry. Parent entry being entry that nests "on top". We'd then add simple linear pass after sorting to establish those links.

alk added a commit to alk/gperftools that referenced this issue Oct 4, 2024
@alk
Copy link
Author

alk commented Oct 4, 2024

Here is my attempt to address this: alk/gperftools@e1d65de using logic I described above.

Let me know what you think. It seems to work (I can indeed see cases like #137 (comment) in my testing). And I managed to convince myself it is "quite obvious" the right thing to do. But we should surely something better. Anything I can to do somehow cover this logic by tests ?

@ianlancetaylor
Copy link
Owner

Thanks for the patch. I'm a little concerned by the increased memory usage. A large program can have a lot of unit_addrs. It would be one thing if the new pointer were often non-NULL, but it seems to me that it will almost always be NULL.

@alk
Copy link
Author

alk commented Oct 4, 2024

Sure. We can then have separate array of ranges simply for "top level" entries. I.e. entries without parent and with at least one child. Then when dealing with normal entries we'll simply scan backwards until the start of matching top-level entry (if any). I am happy to try that too. What do you think?

@ianlancetaylor
Copy link
Owner

Maybe? Not sure.

@alk
Copy link
Author

alk commented Oct 4, 2024

okay I'll give it a shot.

@alk
Copy link
Author

alk commented Oct 4, 2024

This alk/gperftools@d2795cd

Of course it'll need at least some cleaning (I know my coding style doesn't always match this file's; there is debugging stuff left; variable names etc), but as a proof of concept it should do.

Indeed at least in my test program number of top entries is quite small. 116 in total (when built ./configure CXX='g++ -no-pie' CXXFLAGS='-O0 -fno-inline -ggdb3' CFLAGS='-O0 -fno-inline -ggdb3' --disable-shared)

Let me know what you think.

@ianlancetaylor
Copy link
Owner

I thought of a different approach: when there are overlapping ranges, fill in any gaps by inserting ranges that cover the gaps. This takes a bit more time setting up the overlap, but means that the lookup code stays simple and hopefully efficient. See #140. Can you see whether that handles your original test case? Thanks.

@alk
Copy link
Author

alk commented Oct 16, 2024

This is nice idea. But something isn't right with the implementation. I am trying to figure that out.

Meanwhile reproducing this is straightforward. Here is Dockerfile that shows the issue:

FROM debian:testing

RUN apt-get update && apt-get install -y build-essential libtool automake autoconf git

RUN mkdir /stuff
WORKDIR /stuff

RUN git clone -b issue137 https://github.com/ianlancetaylor/libbacktrace
RUN cd libbacktrace && ./configure --prefix=/p && make -j2 install

RUN git clone -b external-backtrace https://github.com/alk/gperftools
RUN cd gperftools && ./autogen.sh && ./configure --disable-shared CXX='g++ -no-pie' CXXFLAGS='-O0 -ggdb3 -fno-inline' LDFLAGS=" -L/p/lib"
RUN cd gperftools && make -j8 debugallocation_test

RUN cd gperftools && ldd ./debugallocation_test && TCM_DEBUG_BT_SYMBOLIZATION_TEST=1 ./debugallocation_test --gtest_filter=*StackTrace*

I was actually expecting much easier reproduction, as I see it across multiple gcc versions. But somehow neither latest ubuntu, nor debian stable has the problem. Debian unstable or testing do.

Last step runs stack trace test that must fail and the outcome is symbolized backtrace. Here is what "docker build ." run gave me:

Step 10/10 : RUN cd gperftools && ldd ./debugallocation_test && TCM_DEBUG_BT_SYMBOLIZATION_TEST=1 ./debugallocation_test --gtest_filter=*StackTrace*
 ---> Running in 86aba2427995
	linux-vdso.so.1 (0x00007f6908b94000)
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f6908800000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f6908aa5000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f6908a78000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f690860a000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f6908b96000)
Note: Google Test filter = *StackTrace*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from DebugAllocationTest
[ RUN      ] DebugAllocationTest.StackTraceWithDanglingWriteAtExitTest
Found a corrupted memory buffer in MallocBlock (may be offset from user ptr): buffer index: 0, buffer ptr: 0x3a2b8700, size of buffer: 52
Buffer byte 32 is 0x01 (should be 0xcd).
Buffer byte 33 is 0x00 (should be 0xcd).
Buffer byte 34 is 0x00 (should be 0xcd).
Buffer byte 35 is 0x00 (should be 0xcd).
Deleted by thread 0x7fb32dce7698
    @ 0x41d0e2 MallocBlock::ProcessFreeQueue(MallocBlock*, unsigned long, int) + 113
    @ 0x41d032 MallocBlock::Deallocate(int, unsigned long) + 292
    @ 0x4187db DebugDeallocate src/debugallocation.cc:1050
    @ 0x4975c7 tc_delete src/debugallocation.cc:1336
    @ 0x40cb69 FooBar::StackTraceMarker(int*) src/tests/debugallocation_test.cc:195
    @ 0x40cbb0 DebugAllocationTest_StackTraceWithDanglingWriteAtExitTest_Test::TestBody() src/tests/debugallocation_test.cc:203
    @ 0x482d4b void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) vendor/googletest/googletest/src/gtest.cc:2638
    @ 0x47c49b void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) vendor/googletest/googletest/src/gtest.cc:2674
    @ 0x45bc3c testing::Test::Run() vendor/googletest/googletest/src/gtest.cc:2713
    @ 0x45c632 testing::TestInfo::Run() vendor/googletest/googletest/src/gtest.cc:2859
    @ 0x45cf4a testing::TestSuite::Run() vendor/googletest/googletest/src/gtest.cc:3037
    @ 0x46d80e testing::internal::UnitTestImpl::RunAllTests() vendor/googletest/googletest/src/gtest.cc:5967
    @ 0x483e71 bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 102
    @ 0x47d720 bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) vendor/googletest/googletest/src/gtest.cc:2674
    @ 0x46bfc1 testing::UnitTest::Run() vendor/googletest/googletest/src/gtest.cc:5546
    @ 0x4101fb RUN_ALL_TESTS() + 16
Memory was written to after being freed.  MallocBlock: 0x3a2b8700, user ptr: 0x3a2b8720, size: 52.  If you can't find the source of the error, try using ASan (https://github.com/google/sanitizers), Valgrind, or Purify, or study the output of the deleter's stack printed above.
Aborted (core dumped)
The command '/bin/sh -c cd gperftools && ldd ./debugallocation_test && TCM_DEBUG_BT_SYMBOLIZATION_TEST=1 ./debugallocation_test --gtest_filter=*StackTrace*' returned a non-zero code: 134

I.e. RUN_ALL_TESTS() line failed pcinfo symbolization, but also couple MallocBlock methods too. And I checked they fail for the same reason as before, failing to find unit.

@alk
Copy link
Author

alk commented Oct 16, 2024

I am not entirely sure this is fully "it" but here is a change that appears to make it work.

diff --git a/vendor/libbacktrace/dwarf.c b/vendor/libbacktrace/dwarf.c
index 23d90506..e89e9873 100644
--- a/vendor/libbacktrace/dwarf.c
+++ b/vendor/libbacktrace/dwarf.c
@@ -1634,6 +1634,17 @@ resolve_unit_addrs_overlap_walk (struct backtrace_state *state,
 
   for (from = *pfrom, to = *pto; from < old_count; from++, to++)
     {
+      /* If we are in the scope of a larger range which can no longer
+        cover any further ranges, return back to the caller.  */
+
+      if (enclosing != NULL
+         && enclosing->high <= old_addrs[from].low)
+       {
+         *pfrom = from;
+         *pto = to;
+         return 1;
+       }
+
       new_addrs[to] = old_addrs[from];
 
       /* If we are in scope of a larger range, fill in any gaps
@@ -1663,17 +1674,6 @@ resolve_unit_addrs_overlap_walk (struct backtrace_state *state,
          new_addrs[to].u = enclosing->u;
        }
 
-      /* If we are in the scope of a larger range which can no longer
-        cover any further ranges, return back to the caller.  */
-
-      if (enclosing != NULL
-         && enclosing->high <= old_addrs[from].low)
-       {
-         *pfrom = from + 1;
-         *pto = to + 1;
-         return 1;
-       }
-
       /* If this range has a larger scope than the next one, use it to
         fill in any gaps.  */
 

Another notable thing is gcc (or linker+gcc) produce a number of very bogus ranges with low address of 0 and length from few tens of bytes to couple kilos. And that causes maximal recursion depth for walk thingy to reach 82. Probably still fine, but perhaps it might be worth just dropping those bogus entries (outside of ELF files known VMA ranges).

@alk
Copy link
Author

alk commented Oct 16, 2024

And those bogus entries appear to be real, not some bug in libbacktrace reading address ranges. I.e. dwarfdump sees them too

COMPILE_UNIT<header overall offset = 0x000e1c8d>:
< 0><0x0000000c>  DW_TAG_compile_unit
                    DW_AT_producer              GNU C++17 14.2.0 -momit-leaf-frame-pointer -mtune=generic -march=x86-64 -ggdb3 -O0 -std=gnu++17 -fsized-deallocation -fno-omit-frame-pointer -fno-inline -fasynchronous-unwind-tables
                    DW_AT_language              DW_LANG_C_plus_plus_14
                    DW_AT_name                  src/stacktrace.cc
                    DW_AT_comp_dir              /home/me/src/External/gperftools
                    DW_AT_ranges                0x00003b51

      Offset of rnglists entries: 0x00003b51
      [ 0] start,end             0x0042efc0 0x00430f67
      [ 1] start,end             0x00418dfb 0x00418e62
      [ 2] start,end             0x00000000 0x00000117
      [ 3] start,end             0x00419019 0x004190cd
      [ 4] start,end             0x004190ce 0x00419130
      [ 5] start,end             0x00419130 0x00419179
      [ 6] end of list                                
                    DW_AT_low_pc                0x00000000
                    DW_AT_stmt_list             0x0001bccc
                    DW_AT_macros                0x0005dfd2

@ianlancetaylor
Copy link
Owner

Thanks. It looks some address ranges appear in multiple compilation units. My guess is that this is due to the linker merging some duplicate functions. For example RUN_ALL_TESTS is an inline function. It most likely appear multiple times in object files passed to the linker. The linker is eliminating the copies but not eliminating the debug info for the old copies. Instead, the debug info for the old copies changes to point to the single remaining copy.

That in itself is fine. But libbacktrace has an optimization in add_unit_addr in which it merges adjacent ranges that are seen consecutively in the debug info. I think what is happening here is for the single remaining copy of RUN_ALL_TESTS we merge the surrounding ranges. Then for the other copies, we don't, because they occur in compilation units that are mostly in other parts of the address space. So we go from something like

low 10 high 20
low 20 high 30 // RUN_ALL_TESTS
low 30 high 40
...
low 100 high 110
low 20 high 30 // merged RUN_ALL_TESTS
low 130 high 140

to something like

low 10 high 40
log 100 high 110
low 20 high 30
low 130 high 140

which sorts to

low 10 high 40
low 20 high 30
low 100 high 110
low 130 high 140

and now we have the overlapping ranges.

I will go ahead and commit the change, modified as you suggested.

Thanks for digging into this.

ianlancetaylor added a commit to ianlancetaylor/cgosymbolizer that referenced this issue Oct 18, 2024
Fixes ianlancetaylor/libbacktrace#137.

	* dwarf.c (resolve_unit_addrs_overlap_walk): New static function.
	(resolve_unit_addrs_overlap): New static function.
	(build_dwarf_data): Call resolve_unit_addrs_overlap.
hubot pushed a commit to gcc-mirror/gcc that referenced this issue Oct 18, 2024
Fixes ianlancetaylor/libbacktrace#137.

	* dwarf.c (resolve_unit_addrs_overlap_walk): New static function.
	(resolve_unit_addrs_overlap): New static function.
	(build_dwarf_data): Call resolve_unit_addrs_overlap.
@alk
Copy link
Author

alk commented Oct 18, 2024

Thanks. Couple comments:

*) from #139 my understanding is that similar treatment might be needed for functions

*) what are your thoughts about possible deep recursion depths from this code? In the worst case it can get very deep and who knows if it could cause some security implications.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants