Segmentation Fault in gdb.reverse/sigall-precsave.exp #11

Open · louspe-linaro opened this issue Jun 1, 2023 · 1 comment

louspe-linaro commented:

CUDA GDB segmentation faults when running the gdb.reverse/sigall-precsave.exp test case under valgrind.

make check-gdb RUNTESTFLAGS="--target_board=cuda sigall-precsave.exp GDB=$(which valgrind)' --fullpath-after='$HOME' --log-file=/tmp/vallog-'$RANDOM' /usr/local/cuda-12.1/bin/cuda-gdb'"

where cuda.exp is:

set_board_info gdb_prompt "\\(cuda-gdb\\)"

and placed in the boards folder.

Backtrace:

Breakpoint 2, main () at /home/user/binutils-gdb/gdb/testsuite/gdb.reverse/sigall-reverse.c:1407
1407      return 0;    /* end of main */
(cuda-gdb) PASS: gdb.reverse/sigall-precsave.exp: run to end of main
delete breakpoints
Delete all breakpoints? (y or n) y
(cuda-gdb) info breakpoints
No breakpoints or watchpoints.
(cuda-gdb) record save /home/user/build/gdb/testsuite/outputs/gdb.reverse/sigall-precsave/sigall.precsave
warning: Memory read failed for corefile section, 4096 bytes at 0xffffffffff600000.


Fatal signal: Segmentation fault
----- Backtrace -----
0x62e137 ???
0x7769df ???
0x776a4f ???
0x4c75d8f ???
0x0 ???
---------------------
A fatal error internal to GDB has been detected, further
debugging is not possible.  GDB will now terminate.
agontarek self-assigned this and added the bug label on Jun 2, 2023.

agontarek (Member) commented:
@louspe-linaro I now see the following while running this test:

(cuda-gdb) file gdb/testsuite/outputs/gdb.reverse/sigall-precsave/sigall-precsave
Reading symbols from gdb/testsuite/outputs/gdb.reverse/sigall-precsave/sigall-precsave...
(cuda-gdb) delete breakpoints
(cuda-gdb) info breakpoints
No breakpoints or watchpoints.
(cuda-gdb) break gen_ABRT
Breakpoint 1 at 0x15d3: file gdb/testsuite/gdb.reverse/sigall-reverse.c, line 398.
(cuda-gdb) run 
Starting program: gdb/testsuite/outputs/gdb.reverse/sigall-precsave/sigall-precsave 


Fatal signal: Segmentation fault
----- Backtrace -----
0xc711a4 gdb_internal_backtrace_1
        gdb/bt-utils.c:122
0xc7123c _Z22gdb_internal_backtracev
        gdb/bt-utils.c:168
0xeee836 handle_fatal_signal
        gdb/event-top.c:956
0xeee981 handle_sigsegv
        gdb/event-top.c:1029
0x4c7951f ???
        ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
0x4037000 ???
---------------------
A fatal error internal to GDB has been detected, further
debugging is not possible.  GDB will now terminate.

This is a bug, please report it.  For instructions, see:
<https://www.gnu.org/software/gdb/bugs/>.

warning: linux_ptrace_test_ret_to_nx: PC 0x580e3dc0 is neither near return address 0x4037000 nor is the return instruction 0x10649ec!
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, gen_ABRT () at gdb/testsuite/gdb.reverse/sigall-reverse.c:398
398       kill (getpid (), SIGABRT);
(cuda-gdb) record
(cuda-gdb) PASS: gdb.reverse/sigall-precsave.exp: turn on process record

I have encountered this segfault report while running under valgrind in the past, but it doesn't impact debuggability from what I can tell. Does this match what you see on your end?

The summary is showing passes for this test:

		=== gdb Summary ===

# of expected passes		957

agontarek pushed a commit that referenced this issue Mar 13, 2024
This commit fixes an issue that was discovered while writing the tests
for the previous commit.

I noticed that, when GDB restarts an inferior, the executable_changed
event would trigger twice.  The first notification would originate
from:

  #0  exec_file_attach (filename=0x4046680 "/tmp/hello.x", from_tty=0) at ../../src/gdb/exec.c:513
  #1  0x00000000006f3adb in reopen_exec_file () at ../../src/gdb/corefile.c:122
  #2  0x0000000000e6a3f2 in generic_mourn_inferior () at ../../src/gdb/target.c:3682
  #3  0x0000000000995121 in inf_child_target::mourn_inferior (this=0x2fe95c0 <the_amd64_linux_nat_target>) at ../../src/gdb/inf-child.c:192
  #4  0x0000000000995cff in inf_ptrace_target::mourn_inferior (this=0x2fe95c0 <the_amd64_linux_nat_target>) at ../../src/gdb/inf-ptrace.c:125
  #5  0x0000000000a32472 in linux_nat_target::mourn_inferior (this=0x2fe95c0 <the_amd64_linux_nat_target>) at ../../src/gdb/linux-nat.c:3609
  #6  0x0000000000e68a40 in target_mourn_inferior (ptid=...) at ../../src/gdb/target.c:2761
  #7  0x0000000000a323ec in linux_nat_target::kill (this=0x2fe95c0 <the_amd64_linux_nat_target>) at ../../src/gdb/linux-nat.c:3593
  #8  0x0000000000e64d1c in target_kill () at ../../src/gdb/target.c:924
  #9  0x00000000009a19bc in kill_if_already_running (from_tty=1) at ../../src/gdb/infcmd.c:328
  #10 0x00000000009a1a6f in run_command_1 (args=0x0, from_tty=1, run_how=RUN_STOP_AT_MAIN) at ../../src/gdb/infcmd.c:381
  #11 0x00000000009a20a5 in start_command (args=0x0, from_tty=1) at ../../src/gdb/infcmd.c:527
  #12 0x000000000068dc5d in do_simple_func (args=0x0, from_tty=1, c=0x35c7200) at ../../src/gdb/cli/cli-decode.c:95

While the second originates from:

  #0  exec_file_attach (filename=0x3d7a1d0 "/tmp/hello.x", from_tty=0) at ../../src/gdb/exec.c:513
  #1  0x0000000000dfe525 in reread_symbols (from_tty=1) at ../../src/gdb/symfile.c:2517
  #2  0x00000000009a1a98 in run_command_1 (args=0x0, from_tty=1, run_how=RUN_STOP_AT_MAIN) at ../../src/gdb/infcmd.c:398
  #3  0x00000000009a20a5 in start_command (args=0x0, from_tty=1) at ../../src/gdb/infcmd.c:527
  #4  0x000000000068dc5d in do_simple_func (args=0x0, from_tty=1, c=0x35c7200) at ../../src/gdb/cli/cli-decode.c:95

In the first case the call to exec_file_attach first passes through
reopen_exec_file.  The reopen_exec_file performs a modification time
check on the executable file, and only calls exec_file_attach if the
executable has changed on disk since it was last loaded.
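
For reference, the gist of that modification-time check can be
sketched as follows.  This is a simplified illustration rather than
the actual gdb source; `cached_mtime' and `exec_filename' stand in
for gdb's internal state:

  #include <sys/stat.h>
  #include <ctime>

  static time_t cached_mtime;        /* mtime recorded at last load.  */
  static const char *exec_filename;  /* currently loaded executable.  */

  void exec_file_attach (const char *filename, int from_tty);

  /* Reload the executable only if it changed on disk since it was
     last loaded.  */
  static void
  reopen_exec_file_sketch ()
  {
    struct stat st;
    if (exec_filename == nullptr || stat (exec_filename, &st) != 0)
      return;

    if (st.st_mtime != cached_mtime)
      {
        cached_mtime = st.st_mtime;
        exec_file_attach (exec_filename, 0 /* from_tty */);
      }
  }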

However, in the second case things work a little differently.  In this
case GDB is really trying to reread the debug symbols.  As such, we
iterate over the objfiles list and, for each objfile, check the
modification time; if the file on disk has changed then we reload the
debug symbols from that file.

However, there is an additional check: if the objfile has the same
name as the executable then we will also call exec_file_attach, but
we do so without checking the cached modification time that indicates
when the executable was last reloaded.  As a result, we reload the
executable twice.

In this commit I propose that reread_symbols be changed to
unconditionally call reopen_exec_file before performing the objfile
iteration.  This ensures that, if the executable has changed, it will
be reloaded, while if it has already been reloaded recently, it will
not be reloaded a second time.

After handling the executable, GDB can then iterate over the objfiles
list and reload them in the normal way.
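
In outline, the proposed ordering looks something like the sketch
below.  The helpers `file_changed_on_disk', `reload_symbols_for' and
`current_objfiles' are hypothetical stand-ins for gdb's internal
machinery, used here only to show the control flow:

  #include <vector>

  struct objfile;
  void reopen_exec_file ();                    /* mtime-checked path */
  bool file_changed_on_disk (objfile *obj);    /* hypothetical */
  void reload_symbols_for (objfile *obj);      /* hypothetical */
  std::vector<objfile *> current_objfiles ();  /* hypothetical */

  /* Handle the executable exactly once, up front, then reload any
     other objfiles whose backing files changed on disk.  */
  void
  reread_symbols_sketch ()
  {
    /* Let the mtime-checked path decide whether the executable needs
       reloading; this fires executable_changed at most once.  */
    reopen_exec_file ();

    /* Reload debug symbols for every changed objfile, without
       calling exec_file_attach a second time.  */
    for (objfile *obj : current_objfiles ())
      if (file_changed_on_disk (obj))
        reload_symbols_for (obj);
  }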

With this done I now see the executable reloaded only once when GDB
restarts an inferior, which means I can remove the kfail that I added
to the gdb.python/py-exec-file.exp test in the previous commit.

Approved-By: Tom Tromey <tom@tromey.com>
agontarek pushed a commit that referenced this issue Mar 13, 2024
It was pointed out on the mailing list that a recently added
test (gdb.python/py-progspace-events.exp) was failing when run with
the native-extended-gdbserver board.  This test was added with this
commit:

  commit 59912fb
  Date:   Tue Sep 19 11:45:36 2023 +0100

      gdb: add Python events for program space addition and removal

It turns out though that the test is failing due to an existing bug
in GDB; the new test just exposes the problem.  Additionally, the
failure doesn't really rely on the new functionality added in the
above commit.  I reduced the test to a simple set of steps that
reproduced the failure and ran them against GDB 13, where the test
passes; so the bug was introduced since then.  In fact, the bug was
introduced with this commit:

  commit a282736
  Date:   Fri Sep 8 15:48:16 2023 +0100

      gdb: remove final user of the executable_changed observer

This commit changed how the per-inferior auxv data cache is managed,
specifically, when the cache is cleared, and it is this that leads to
the failure.

This bug is interesting because it exposes a number of issues with
GDB.  I'll explain all of the problems I see, though ultimately I
only propose fixing one of them in this commit, which is enough to
resolve the crash we are currently seeing.

The crash that we are seeing manifests like this:

  ...
  [Inferior 2 (process 3970384) exited normally]
  +inferior 1
  [Switching to inferior 1 [process 3970383] (/tmp/build/gdb/testsuite/outputs/gdb.python/py-progspace-events/py-progspace-events)]
  [Switching to thread 1.1 (Thread 3970383.3970383)]
  #0  breakpt () at /tmp/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.python/py-progspace-events.c:28
  28	{ /* Nothing.  */ }
  (gdb) step
  +step
  terminate called after throwing an instance of 'gdb_exception_error'

  Fatal signal: Aborted
  ... etc ...

What's happening is that GDB attempts to refill the auxv cache as a
result of the gdbarch_has_shared_address_space call in
program_space::~program_space, the backtrace looks like this:

  #0  0x00007fb4f419a9a5 in raise () from /lib64/libpthread.so.0
  #1  0x00000000008b635d in handle_fatal_signal (sig=6) at ../../src/gdb/event-top.c:912
  #2  <signal handler called>
  #3  0x00007fb4f38e3625 in raise () from /lib64/libc.so.6
  #4  0x00007fb4f38cc8d9 in abort () from /lib64/libc.so.6
  #5  0x00007fb4f3c70756 in __gnu_cxx::__verbose_terminate_handler() [clone .cold] () from /lib64/libstdc++.so.6
  #6  0x00007fb4f3c7c6dc in __cxxabiv1::__terminate(void (*)()) () from /lib64/libstdc++.so.6
  #7  0x00007fb4f3c7b6e9 in __cxa_call_terminate () from /lib64/libstdc++.so.6
  #8  0x00007fb4f3c7c094 in __gxx_personality_v0 () from /lib64/libstdc++.so.6
  #9  0x00007fb4f3a80c63 in _Unwind_RaiseException_Phase2 () from /lib64/libgcc_s.so.1
  #10 0x00007fb4f3a8154e in _Unwind_Resume () from /lib64/libgcc_s.so.1
  #11 0x0000000000e8832d in target_read_alloc_1<unsigned char> (ops=0x408a3a0, object=TARGET_OBJECT_AUXV, annex=0x0) at ../../src/gdb/target.c:2266
  #12 0x0000000000e73dea in target_read_alloc (ops=0x408a3a0, object=TARGET_OBJECT_AUXV, annex=0x0) at ../../src/gdb/target.c:2315
  #13 0x000000000058248c in target_read_auxv_raw (ops=0x408a3a0) at ../../src/gdb/auxv.c:379
  #14 0x000000000058243d in target_read_auxv () at ../../src/gdb/auxv.c:368
  #15 0x000000000058255c in target_auxv_search (match=0x0, valp=0x7ffdee17c598) at ../../src/gdb/auxv.c:415
  #16 0x0000000000a464bb in linux_is_uclinux () at ../../src/gdb/linux-tdep.c:433
  #17 0x0000000000a464f6 in linux_has_shared_address_space (gdbarch=0x409a2d0) at ../../src/gdb/linux-tdep.c:440
  #18 0x0000000000510eae in gdbarch_has_shared_address_space (gdbarch=0x409a2d0) at ../../src/gdb/gdbarch.c:4889
  #19 0x0000000000bc7558 in program_space::~program_space (this=0x4544aa0, __in_chrg=<optimized out>) at ../../src/gdb/progspace.c:124
  #20 0x00000000009b245d in delete_inferior (inf=0x47b3de0) at ../../src/gdb/inferior.c:290
  #21 0x00000000009b2c10 in prune_inferiors () at ../../src/gdb/inferior.c:480
  #22 0x00000000009c5e3e in fetch_inferior_event () at ../../src/gdb/infrun.c:4558
  #23 0x000000000099b4dc in inferior_event_handler (event_type=INF_REG_EVENT) at ../../src/gdb/inf-loop.c:42
  #24 0x0000000000cbc64f in remote_async_serial_handler (scb=0x4090a30, context=0x408a6b0) at ../../src/gdb/remote.c:14859
  #25 0x0000000000d83d3a in run_async_handler_and_reschedule (scb=0x4090a30) at ../../src/gdb/ser-base.c:138
  #26 0x0000000000d83e1f in fd_event (error=0, context=0x4090a30) at ../../src/gdb/ser-base.c:189

So this is problem #1: if we throw an exception while deleting a
program_space then it is not caught, and is going to crash GDB.

Problem #2 becomes evident when we ask why GDB is throwing an error in
this case; the error is thrown because the remote target, operating in
non-async mode, can't read the auxv data while an inferior is running
and GDB is waiting for a stop reply.  The question, then, is why GDB
gets into a position where it tries to interact with the remote target
in this way, at this time.  The cause is the prune_inferiors call
which can be seen in the above backtrace.

In prune_inferiors we check if the inferior is deletable, and if it
is, we delete it.  The problem is, I think, we should also check if
the target is currently in a state that would allow us to delete the
inferior.  We don't currently have such a check available, we'd need
to add one, but for the remote target, this would return false if the
remote is in async mode and the remote is currently waiting for a stop
reply.  With this change in place GDB would defer deleting the
inferior until the remote target has stopped, at which point GDB would
be able to refill the auxv cache successfully.

And then, problem #3 becomes evident when we ask why GDB needs to
refill the auxv cache now when it didn't need to for GDB 13.  This is
where the second commit mentioned above (a282736) comes in.
Prior to this commit, the auxv cache was cleared by the
executable_changed observer, while after that commit the auxv cache
was cleared by the new_objfile observer -- but only when the
new_objfile observer is used in the special mode that actually means
that all objfiles have been unloaded (I know, the overloading of the
new_objfile observer is horrible, and unnecessary, but it's not really
important for this bug).

The difference arises because the new_objfile observer is triggered
from clear_symtab_users, which in turn is called from
program_space::~program_space.  The new_objfile observer for auxv does
this:

  static void
  auxv_new_objfile_observer (struct objfile *objfile)
  {
    if (objfile == nullptr)
      invalidate_auxv_cache_inf (current_inferior ());
  }

That is, when all the objfiles are unloaded, we clear the auxv cache
for the current inferior.

The problem is that, when we look at the prune_inferiors ->
delete_inferior -> ~program_space path, we see that the current
inferior is not going to be an inferior that exists within the
program_space being deleted: delete_inferior removes the deleted
inferior from the global inferior list, and then only deletes the
program_space if program_space::empty() returns true, which is only
the case if the current inferior isn't within the program_space to
delete, and no other inferior exists within that program_space
either.
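
Schematically, that ordering looks like this (an illustration of the
flow described above, not the exact gdb source; `pspace_of' and
`unlink_inferior' are hypothetical helpers):

  struct inferior;
  struct program_space { bool empty () const; };
  program_space *pspace_of (inferior *inf);  /* hypothetical */
  void unlink_inferior (inferior *inf);      /* hypothetical */

  /* By the time ~program_space runs, the dying inferior has already
     been unlinked, so current_inferior() necessarily refers to some
     other inferior, with no particular relationship to this program
     space.  */
  void
  delete_inferior_sketch (inferior *inf)
  {
    program_space *pspace = pspace_of (inf);
    unlink_inferior (inf);   /* drop from the global inferior list */
    delete inf;
    if (pspace->empty ())
      delete pspace;         /* runs ~program_space */
  }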

What this means is that when the new_objfile observer is called we
can't rely on the current inferior having any relationship with the
program space in which the objfiles were removed.  This was an error
in commit a282736; the only thing we can rely on is the
current program space.  As a result of this mistake, after commit
a282736, GDB was sometimes clearing the auxv cache for a random
inferior.  In the native target case this was harmless as we can
always refill the cache when needed, but in the remote target case, if
we need to refill the cache when the remote target is executing, then
we get the crash we observed.

And additionally, if we think about this a little more, we see that
commit a282736 made another mistake.  When all the objfiles are
removed, they are removed from a program_space; a program_space might
contain multiple inferiors, so surely we should clear the auxv cache
for all of the matching inferiors?

Given these two insights, that the current_inferior is not relevant,
only the current_program_space, and that we should be clearing the
cache for all inferiors in the current_program_space, we can update
auxv_new_objfile_observer to:

  if (objfile == nullptr)
    {
      for (inferior *inf : all_inferiors ())
	{
	  if (inf->pspace == current_program_space)
	    invalidate_auxv_cache_inf (inf);
	}
    }

With this change we now clear the auxv cache for the correct
inferiors, and GDB no longer needs to refill the cache at an
inconvenient time, which avoids the crash we were seeing.

And finally, we reach problem #4.  Inspired by the observation that
using the current_inferior from within the ~program_space function
was not correct, I added some debugging to see if current_inferior()
was called anywhere else (below ~program_space), and the answer is
yes, it's called often.  Mostly the culprit is GDB doing:

  current_inferior ()->top_target ()-> ....

But I think all of these calls are most likely doing the wrong thing,
and only work because the top target in all these cases is shared
between all inferiors, e.g. it's the native target, or the remote
target for all inferiors.  But if we had a truly multi-connection
setup, then we might start to see odd behaviour.

Problem #1 I'm just ignoring for now; I guess at some point we might
run into it again, and then we'd need to solve it.  But in this case
I wasn't sure what a "good" solution would look like.  We need the
auxv data in order to implement the linux_is_uclinux() function.  If
we can't get the auxv data then what should we do, assume yes, or
assume no?  The right answer would probably be to propagate the error
back up the stack, but then we reach ~program_space, and throwing
exceptions from a destructor is problematic, so we'd need to catch
and deal with the error at that point.  The linux_is_uclinux() call
is made from within gdbarch_has_shared_address_space(), which is used
like:

  if (!gdbarch_has_shared_address_space (target_gdbarch ()))
    delete this->aspace;

So, we would have to choose: delete the address space or not.  If we
delete it on error, then we might delete an address space that is
shared within another program space.  If we don't delete the address
space, then we might leak it.  Neither choice is great.

A better solution might be to have the address spaces be reference
counted, then we could remove the gdbarch_has_shared_address_space
call completely, and just rely on the reference count to auto-delete
the address space when appropriate.
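
As a sketch of that idea, using std::shared_ptr purely for
illustration (gdb has its own reference-counting utilities; this is
not a proposal for the actual interface):

  #include <memory>

  struct address_space { /* registers, memory mapping state, ...  */ };

  /* If program spaces held shared ownership of their address space,
     the last ~program_space to run would free it automatically, and
     the gdbarch_has_shared_address_space query (with its auxv read)
     could be dropped from the destructor entirely.  */
  struct program_space_sketch
  {
    std::shared_ptr<address_space> aspace;
  };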

The solution for problem #2 I already hinted at above: we should have
a new target_can_delete_inferiors() call, invoked from
prune_inferiors; this would prevent GDB from trying to delete
inferiors when a (remote) target is in a state where we know it can't
delete them.  Deleting an inferior often (always?) requires sending
packets to the remote, and if the remote is waiting for a stop reply
then this will never work, so the pruning should be deferred in this
case.
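
A hypothetical shape for that check (the hook name and the
`deletable' predicate are invented here; nothing like this exists in
gdb yet):

  #include <vector>

  struct inferior;
  bool deletable (inferior *inf);        /* hypothetical predicate */
  bool target_can_delete_inferiors ();   /* hypothetical new hook */
  void delete_inferior (inferior *inf);

  /* Prune dead inferiors, but defer the whole operation while the
     target can't service it (e.g. a remote target that is waiting
     for a stop reply); a later event-loop iteration will retry.  */
  void
  prune_inferiors_sketch (std::vector<inferior *> &inferiors)
  {
    for (inferior *inf : inferiors)
      {
        if (!deletable (inf))
          continue;
        if (!target_can_delete_inferiors ())
          continue;  /* the new check: defer deletion for now */
        delete_inferior (inf);
      }
  }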

The solution for problem #3 is included in this commit.

And, for problem #4, I'm not sure what the right solution is.  Maybe
delete_inferior should ensure the inferior to be deleted is in place
when ~program_space is called?  But that seems a little weird, as the
current inferior would, in theory, still be using the current
program_space...

Anyway, after this commit, the gdb.python/py-progspace-events.exp test
now passes when run with the native-extended-remote board.

Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=30935
Approved-By: Simon Marchi <simon.marchi@efficios.com>
Change-Id: I41f0e6e2d7ecc1e5e55ec170f37acd4052f46eaf
agontarek pushed a commit that referenced this issue Mar 13, 2024
Running the
gdb.threads/step-over-thread-exit-while-stop-all-threads.exp testcase
added later in the series against gdbserver, after the
TARGET_WAITKIND_NO_RESUMED fix from the following patch, would run
into an infinite loop in stop_all_threads, leading to a timeout:

  FAIL: gdb.threads/step-over-thread-exit-while-stop-all-threads.exp: displaced-stepping=off: target-non-stop=on: iter 0: continue (timeout)

This is really a latent bug: stop_all_threads stops listening to
events from a target as soon as it sees a TARGET_WAITKIND_NO_RESUMED,
ignoring that TARGET_WAITKIND_NO_RESUMED may be delayed.
handle_no_resumed knows how to handle delayed no-resumed events, but
stop_all_threads was never taught to.

In more detail, here's what happens with that testcase:

#1 - Multiple threads report breakpoint hits to gdb.

#2 - gdb picks one event, and it's for thread 1.  All other stops are
     left pending.  thread 1 needs to move past a breakpoint, so gdb
     stops all threads to start an inline step over for thread 1.
     While stopping threads, some of the threads that were still
     running report events that are also left pending.

#3 - gdb steps thread 1

#4 - Thread 1 exits while stepping (it steps over an exit syscall),
     gdbserver reports thread exit for thread 1

#5 - Thread 1 was the last resumed thread, so gdbserver also reports
     no-resumed:

    [remote]   Notification received: Stop:w0;p3445d0.3445d3
    [remote] Sending packet: $vStopped#55
    [remote] Packet received: N
    [remote] Sending packet: $vStopped#55
    [remote] Packet received: OK

#6 - gdb processes the thread exit for thread 1, finishes the step
     over and restarts threads.

#7 - gdb picks the next event to process out of one of the resumed
     threads with pending events:

    [infrun] random_resumed_with_pending_wait_status: Found 32 events, selecting #11

#8 - This is again a breakpoint hit and the breakpoint needs to be
     stepped over too, so gdb starts a step-over dance again.

#9 - We reach stop_all_threads, which finds that some threads need to
     be stopped.

#10 - wait_one finally consumes the no-resumed event queued by #5.
      Seeing this, wait_one disables target async, to stop listening
      for events out of the remote target.

#11 - We still haven't seen all the stops expected, so
      stop_all_threads tries another iteration.

#12 - Because the remote target is no longer async, and there are no
      other targets, wait_one returns no-resumed immediately without
      polling the remote target.

#13 - We still haven't seen all the stops expected, so
      stop_all_threads tries another iteration.  goto #12, looping
      forever.

Fix this by explicitly enabling/re-enabling target async on targets
that can async, before waiting for stops.
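
Conceptually, the fix amounts to something like the following inside
stop_all_threads's wait loop.  This is a sketch of the idea rather
than the committed patch, using gdb-style names:

  /* Before each round of waiting for stop events, re-enable async
     mode on every process target that supports it, so that a target
     which went non-async after reporting no-resumed is polled again
     instead of being skipped forever.  */
  for (process_stratum_target *target : all_non_exited_process_targets ())
    {
      switch_to_target_no_thread (target);
      if (target_can_async_p () && !target_is_async_p ())
        target_async (true);
    }
  /* ...then keep calling wait_one () until all stops are seen.  */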

Reviewed-By: Andrew Burgess <aburgess@redhat.com>
Change-Id: Ie3ffb0df89635585a6631aa842689cecc989e33f
agontarek pushed a commit that referenced this issue Mar 13, 2024
When using GDB on native Linux, it can happen that, while attempting
to detach an inferior, the inferior has exited or been killed, yet is
still in the list of lwps.  Should that happen, the assert in
x86_linux_update_debug_registers in gdb/nat/x86-linux-dregs.c will
trigger.  The line in question looks like this:

  gdb_assert (lwp_is_stopped (lwp));

For this case, the lwp isn't stopped - it's dead.

The bug which brought this problem to my attention is one in which
the pwntools library uses GDB to debug a process; as the script is
shutting things down, it kills the process that GDB is debugging and
also sends GDB a SIGTERM signal, which causes GDB to detach all
inferiors prior to exiting.  Here's a link to the bug:

https://bugzilla.redhat.com/show_bug.cgi?id=2192169

The following shell command mimics part of what the pwntools
reproducer script does (with regard to shutting things down), but
reproduces the bug much less reliably.  I have found it necessary to
run the command a bunch of times before seeing the bug.  (I usually
see it within 5-10 repetitions.)  If you choose to try this command,
make sure that you have no running "cat" or "gdb" processes first!

  cat </dev/zero >/dev/null & \
  (sleep 5; (kill -KILL `pgrep cat` & kill -TERM `pgrep gdb`)) & \
  sleep 1 ; \
  gdb -q -iex 'set debuginfod enabled off' -ex 'set height 0' \
      -ex c /usr/bin/cat `pgrep cat`

So, basically, the idea here is to kill both gdb and cat at roughly
the same time.  If we happen to attempt the detach before the process
lwp has been deleted from GDB's (linux native) LWP data structures,
then the assert will trigger.  The relevant part of the backtrace
looks like this:

  #8  0x00000000008a83ae in x86_linux_update_debug_registers (lwp=0x1873280)
      at gdb/nat/x86-linux-dregs.c:146
  #9  0x00000000008a862f in x86_linux_prepare_to_resume (lwp=0x1873280)
      at gdb/nat/x86-linux.c:81
  #10 0x000000000048ea42 in x86_linux_nat_target::low_prepare_to_resume (
      this=0x121eee0 <the_amd64_linux_nat_target>, lwp=0x1873280)
      at gdb/x86-linux-nat.h:70
  #11 0x000000000081a452 in detach_one_lwp (lp=0x1873280, signo_p=0x7fff8ca3441c)
      at gdb/linux-nat.c:1374
  #12 0x000000000081a85f in linux_nat_target::detach (
      this=0x121eee0 <the_amd64_linux_nat_target>, inf=0x16e8f70, from_tty=0)
      at gdb/linux-nat.c:1450
  #13 0x000000000083a23b in thread_db_target::detach (
      this=0x1206ae0 <the_thread_db_target>, inf=0x16e8f70, from_tty=0)
      at gdb/linux-thread-db.c:1385
  #14 0x0000000000a66722 in target_detach (inf=0x16e8f70, from_tty=0)
      at gdb/target.c:2526
  #15 0x0000000000a8f0ad in kill_or_detach (inf=0x16e8f70, from_tty=0)
      at gdb/top.c:1659
  #16 0x0000000000a8f4fa in quit_force (exit_arg=0x0, from_tty=0)
      at gdb/top.c:1762
  #17 0x000000000070829c in async_sigterm_handler (arg=0x0)
      at gdb/event-top.c:1141

My colleague, Andrew Burgess, has done some recent work on other
problems with detach.  Upon hearing of this problem, he came up with a
case which reliably reproduces the problem and tests for a few other
problems as well.  In addition to testing detach when the inferior has
terminated due to a signal, it also tests detach when the inferior has
exited normally.  Andrew observed that the linux-native-only
"checkpoint" command would be affected too, so the test also tests
those cases when there's an active checkpoint.

For the LWP exit / termination case with no checkpoint, that's handled
via newly added checks of the waitstatus in detach_one_lwp in
linux-nat.c.
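
Illustratively, those checks boil down to noticing that the lwp
already carries an exit status before touching it.  The predicate
below is a sketch (lwp_info::waitstatus and target_waitstatus::kind()
mirror gdb's real types, but this helper itself is invented for
illustration):

  /* Return true if this lwp's process is already gone, so the detach
     path should skip debug-register updates and PTRACE_DETACH.  */
  static bool
  lwp_already_gone (struct lwp_info *lp)
  {
    target_waitkind kind = lp->waitstatus.kind ();
    return (kind == TARGET_WAITKIND_EXITED
            || kind == TARGET_WAITKIND_SIGNALLED);
  }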

For the checkpoint detach problem, I chose to pass the lwp_info
to linux_fork_detach in linux-fork.c.  With that in place, suitable
tests were added before attempting a PTRACE_DETACH operation.

I added a few asserts at the beginning of linux_fork_detach and
modified the caller code so that the newly added asserts shouldn't
trigger.  (That's what the 'pid == inferior_ptid.pid' check is about
in gdb/linux-nat.c.)

Lastly, I'll note that the checkpoint code needs some work with regard
to background execution.  This patch doesn't attempt to fix that
problem, but it doesn't make it any worse.  It does slightly improve
the situation with detach because, due to the check noted above,
linux_fork_detach() won't be called for the wrong inferior when there
are multiple inferiors.  (There are at least two other problems with
the checkpoint code when there are multiple inferiors.  See:
https://sourceware.org/bugzilla/show_bug.cgi?id=31065)

This commit also adds a new test,
gdb.base/process-dies-while-detaching.exp.  Andrew Burgess is the
primary author of this test case.  Its design is similar to that of
gdb.threads/main-thread-exit-during-detach.exp, which was also written
by Andrew.

This test checks that GDB correctly handles several cases that can
occur when GDB attempts to detach an inferior process.  The process
can exit or be terminated (e.g.  via SIGKILL) prior to GDB's event
loop getting a chance to remove it from GDB's internal data
structures.  To complicate things even more, detach works differently
when a checkpoint (created via GDB's "checkpoint" command) exists for
the inferior.  This test checks all four possibilities: process exit
with no checkpoint, process termination with no checkpoint, process
exit with a checkpoint, and process termination with a checkpoint.

Co-Authored-By: Andrew Burgess <aburgess@redhat.com>
Approved-By: Andrew Burgess <aburgess@redhat.com>