
Conversation

athierry-oct
Contributor

This PR addresses a race condition encountered when using LLDB through the Python scripting interface.

I'm relatively new to LLDB, so feedback is very welcome, especially if there's a more appropriate way to address this issue.

Bug Description

When running a script that repeatedly calls debugger.GetListener().WaitForEvent() in a loop, and at some point invokes process.Kill() from within that loop to terminate the session, a race condition can occur if process.Kill() is called around the same time a breakpoint is hit.

Race Condition Details

The issue arises when the following sequence of events happens:

  1. The process's private state transitions to stopped when the breakpoint is hit.
  2. process.Kill() calls Process::Destroy(), which invokes Process::WaitForProcessToStop(). At this point:
    • private_state = stopped
    • public_state = running (the public state has not yet been updated)
  3. The public stop event is broadcast before the hijack listener is installed.
  4. As a result, the stop event is delivered to the non-hijack listener.
  5. The interrupt request sent by Process::StopForDestroyOrDetach() is ignored because the process is already stopped (private_state = stopped).
  6. No public stop event reaches the hijack listener.
  7. Process::WaitForProcessToStop() hangs waiting for a public stop event, but the event is never received.
  8. process.Kill() times out after 20 seconds.
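
For illustration, here is a minimal sketch of the scripting pattern described above. This is not the original reproducer; the binary name, symbol name and timings are made up:

import time
import lldb

debugger = lldb.SBDebugger.Create()
debugger.SetAsync(True)                       # deliver process events asynchronously
target = debugger.CreateTarget("./a.out")     # illustrative binary
target.BreakpointCreateByName("do_work")      # illustrative symbol
process = target.LaunchSimple(None, None, ".")

listener = debugger.GetListener()
event = lldb.SBEvent()
deadline = time.time() + 10                   # arbitrary point at which we end the session

while True:
    if listener.WaitForEvent(1, event):
        state = lldb.SBProcess.GetStateFromEvent(event)
        if state == lldb.eStateStopped:
            process.Continue()                # keep the target running
    if time.time() > deadline:
        # If a breakpoint fires right around this call, its public stop event
        # may still be sitting unfetched in this listener's queue - that is the
        # window in which Kill() hangs and then times out after 20 seconds.
        process.Kill()
        break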

Fix Summary

This patch modifies Process::WaitForProcessToStop() to ensure that any pending events in the non-hijack listener queue are processed before checking the hijack listener. This guarantees that any missed public state change events are handled, preventing the hang.

Additional Context

A discussion of this issue, including a script to reproduce the bug, can be found here:
LLDB hangs when killing process at the same time a breakpoint is hit


Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@athierry-oct athierry-oct marked this pull request as ready for review June 19, 2025 15:51
@llvmbot llvmbot added the lldb label Jun 19, 2025
@llvmbot
Member

llvmbot commented Jun 19, 2025

@llvm/pr-subscribers-lldb

Author: None (athierry-oct)


Full diff: https://github.com/llvm/llvm-project/pull/144919.diff

1 file affected:

  • (modified) lldb/source/Target/Process.cpp (+12-1)
diff --git a/lldb/source/Target/Process.cpp b/lldb/source/Target/Process.cpp
index 61a3d05bc3746..c75acdb1ecd80 100644
--- a/lldb/source/Target/Process.cpp
+++ b/lldb/source/Target/Process.cpp
@@ -683,7 +683,18 @@ StateType Process::WaitForProcessToStop(
 
   while (state != eStateInvalid) {
     EventSP event_sp;
-    state = GetStateChangedEvents(event_sp, timeout, hijack_listener_sp);
+
+    // A stop event may have been queued in the non-hijack listener before
+    // the hijack listener was installed (e.g., if a breakpoint is hit
+    // around the same time SBProcess::Kill() is called). Check and process
+    // those events first to avoid missing them.
+    if (PeekAtStateChangedEvents()) {
+      // Get event from non-hijack listener
+      state = GetStateChangedEvents(event_sp, timeout, nullptr);
+    } else {
+      state = GetStateChangedEvents(event_sp, timeout, hijack_listener_sp);
+    }
+
     if (event_sp_ptr && event_sp)
       *event_sp_ptr = event_sp;
 

@athierry-oct
Contributor Author

@jimingham tagging you since you were in the discussion in LLDB hangs when killing process at the same time a breakpoint is hit

@labath
Collaborator

labath commented Jun 20, 2025

Jim is the process state king here, but FWIW, I think you are onto something here -- I just think it needs to be done more carefully. Fetching the events from the non-hijack listener in this way is racy as there could be another thread trying to do the same (that's sort of why we bother with hijacking in the first place). Instead, what I think could work is to -- in the process of installing the hijack listener -- take (hijack) the pending events from the real listener and enqueue them to the hijack listener instead. This should be done while holding the event mutex of the real listener to make sure the other thread doesn't trample over this.

Then you should be able to listen on the hijack listener to your heart's content.

(I don't know if this would matter in practice, but for completeness, we probably should also re-enqueue any unlistened-for events from the hijack listener back to the real listener.)

@JDevlieghere JDevlieghere requested a review from jimingham June 20, 2025 14:44
@jimingham
Collaborator

jimingham commented Jun 23, 2025

Your solution is going in the right direction, but is trying to solve a problem "after the fact" that we can solve better by preventing the public stop events from making their way to the non-primary listener before the primary listener gets a chance to fully process them.

Process event listeners are a bit special because we have to keep the "process event state" in sync with the value returned from Process::GetState. It would be hard to figure out the meaning of events if, for instance, you could still have the "stopped" event sitting unfetched in the event queue but the process state were shown as stopped. To ensure this, we designate one of the listeners (the one passed into the Process constructor) as the "primary process event listener". That one gets sent the event first, and we're supposed to only change the process state when the primary listener has finished handling the event. Note, while a listener that hijacks the process event broadcaster gets designated as the primary event listener while it is hijacking the broadcaster, the converse is not true: the primary listener does not have to be hijacking the process event stream. In fact, it's usually just the listener you passed in when you created the Process.

So why doesn't that all just work?

The problem is that when lldb goes to run the DoOnRemoval callback for the Process event (that's what calls your python breakpoint code, for instance), the code needs to see the process state as stopped - so that any commands will execute in the right context. Currently, the only way to do that is to switch the public state to stopped. But as you correctly analyzed, if you set the public state to stopped while we're in the middle of handling the stop, then any other code that's waiting on the state change will trigger, possibly before the breakpoint has decided whether to stop or not, which is incorrect. To really honor the contract above, we need to make sure the stop event doesn't get delivered to the non-primary listeners till the stop is fully handled.

We already have code to deal with a slightly different aspect of this.

The other reason that process events are a bit special is that we know that to implement a "user level operation" like source line stepping, the debugger often has to stop and start numerous times. Those stops have no meaning at the user level, so we want to suppress them. The way we do that is to have a two stage handling of process events and their corresponding "state". First the "private" side of lldb fetches the event (usually on the private state thread), and sets the "private state" to stopped without changing the "public state". Then the thread plan logic is run to decide whether we are indeed going to stop to the user. If we are, we rebroadcast the event to the primary process listener - and as a result set the "public state" to stopped.

That's fine, but the code that runs the "Thread Plans" on the private state thread needs to see the state as stopped. However, by the time we're calling Process::GetState, we're usually in code that has no way to know what context it's getting run in.

We implement this for code running in the private state thread by making Process::GetState smart about who is using it. If you look at Process::GetState, you will see it doesn't just return the public state. That's because all the thread plan logic has to run on the Private State Thread before it is appropriate to set the public state to stopped. Instead, when code running on the "Private State Thread" asks for the process state, it will get the private state. But code running on any other thread will still see the public state. That's done by checking "CurrentThreadIsPrivateStateThread" in Process::GetState.

I think the correct solution to this problem is to add the thread that's going to run DoOnRemoval for the primary process event listener to the private state thread so that both of them will see the private state in Process::GetState. Then we only flip the public state when DoOnRemoval is all the way done. That way - like with the private event handling - the code doing the handling will see the right process state, so it will function correctly. And then only when the callback has completed will the public state get switched (causing the event to be broadcast to the other clients waiting on the change in public state).

That way the thread that is running the breakpoint command will run uninterrupted but in the correct context without any of the other threads waiting on the process knowing about the stop too early.

And that way the non-primary process event listeners will be purely advisory as they are supposed to be.
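
To make this concrete from the scripting side: at the SB API level the primary listener is, by default, the debugger's own listener, and a script can substitute its own at launch time. A rough sketch (binary name and listener name are made up):

import lldb

debugger = lldb.SBDebugger.Create()
debugger.SetAsync(True)
target = debugger.CreateTarget("./a.out")      # illustrative binary

# The listener handed to the process at launch becomes the primary process
# event listener; listeners attached to the process broadcaster later are the
# "advisory" ones discussed above.
primary = lldb.SBListener("my.primary.listener")
launch_info = lldb.SBLaunchInfo(None)
launch_info.SetListener(primary)

error = lldb.SBError()
process = target.Launch(launch_info, error)

event = lldb.SBEvent()
while True:
    if not primary.WaitForEvent(1, event):
        continue
    state = lldb.SBProcess.GetStateFromEvent(event)
    if state == lldb.eStateExited:
        break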

@jimingham
Collaborator

Note, I have a patch I'm almost done with (finishing the test case now) that handles a similar version of this problem. "Attach" and "Launch" don't spin up the private state thread until they know the attach/launch succeeded, which means that the first event that comes in is NOT handled on the private state thread, so if it has any callbacks, they will see the process state as still running. In that case, I'm going to fix it by saying that if you are running code before the private state thread has been spun up, then you see the private state.

@labath
Collaborator

labath commented Jun 24, 2025

I think I understand most of what you're saying, and I think the proposed solution may solve the problem you have in mind -- but I don't think it can solve the problem I have in mind (and I'm not sure which of those is OP's problem :P). The race I think I see is in Process::StopForDestroyOrDetach.

We first check the process state (both of them), and if it's running, we install the hijack listener (and wait on it). However if the stopped event is sent (enqueued to the primary listener) before the hijack listener is installed, the listener will never get anything. I don't see how fiddling with thread identities can help there. Even if it does, it will only help for one particular thread. OP is calling Kill from the event handling thread, but I don't think we require that. If Kill is called from some other thread, I think it could still run into the same bug (although the race window will most likely be smaller, since the event thread will be alive and pulling events from the queue).

@jimingham
Collaborator

We'd have to make the change and see to be sure, but what I would expect is that if any other thread tried to call Kill while whoever is currently the primary process listener is handling the stop, it would immediately fail because the process is not stopped. Right now, the Kill is sometimes allowed because the public state is "stopped" even though we haven't actually finished handling the stop event. So I think what I was suggesting will fix the original problem.

The race Pavel is pointing out also seems real to me, but that should be fixed at the source, rather than after the fact. Process::HijackProcessEvents should check the previous "primary event listener" to see if it has any events and if so they should be transferred to the Hijack Listener. We should also do the same thing in RestoreProcessEvents. That way we'll never drop an event when shuffling primary listeners.

@athierry-oct
Contributor Author

Thanks for the feedback and the good discussion!

I think I understand most of what you're saying, and I think the proposed solution may solve the problem you have in mind -- but I don't think it can solve the problem I have in mind (and I'm not sure which of those is OP's problem :P). The race I think I see is in Process::StopForDestroyOrDetach.

The race I have in mind is the one described by @labath, i.e. this one:

We first check the process state (both of them), and if it's running, we install the hijack listener (and wait on it). However if the stopped event is sent (enqueued to the primary listener) before the hijack listener is installed, the listener will never get anything. I don't see how fiddling with thread identities can help there. Even if it does, it will only help for one particular thread. OP is calling Kill from the event handling thread, but I don't think we require that. If Kill is called from some other thread, I think it could still run into the same bug (although the race window will most likely be smaller, since the event thread will be alive and pulling events from the queue).

I'm not exactly sure if it's also related to what you're describing @jimingham. From my (limited) understanding, it looks like a separate issue: in my case I'm calling Kill() from the same thread that's handling the stop, so nobody is actually handling the stop (because the thread handling the stop is blocked inside Kill()).

The race Pavel is pointing out also seems real to me, but that should be fixed at the source, rather than after the fact. Process::HijackProcessEvents should check the previous "primary event listener" to see if it has any events and if so they should be transferred to the Hijack Listener. We should also do the same thing in RestoreProcessEvents. That way we'll never drop an event when shuffling primary listeners.

I can try to submit another version of the patch doing that

@athierry-oct athierry-oct force-pushed the lldb-process-fix-hang-on-kill branch from dfe59d0 to 52cb939 Compare July 3, 2025 18:29
@athierry-oct
Contributor Author

I've just pushed a new version that moves the pending events from the previous listener to the hijack listener when hijacking, and vice-versa when restoring the previous listener.

In the patch:

  • Transferring the events between listeners is done atomically with adding/removing the hijack listener (by holding m_listeners_mutex). It should not be possible for an event to be received in the previous listener after the transfer of events but before the switch of listeners, because m_listeners_mutex is also locked by PrivateBroadcastEvent().
  • In HijackBroadcaster() and RestoreBroadcaster(), we're holding both m_listeners_mutex and the listeners' m_events_mutex. So in theory, a deadlock is possible if some other thread locks m_events_mutex and then m_listeners_mutex. I've not seen any place in the code where that happens, but please let me know if I missed something.

Collaborator

@jimingham jimingham left a comment


I don't think we ever have a use case where the Hijack listener would only want to see "new" events after the Hijack point. And that's certainly not what we want for process events where the full sequence is important. So the behavior you're encoding here is correct for the ways we use Listeners.

I had to think a second about whether we wanted the notification behavior of AddEvent (it both adds the event to the queue and wakes up any Listeners). But you have to do that. For instance, the original listener is very likely sitting in WaitForEvent - it has no way of knowing it's been hijacked - and so the events you transferred back have to wake it up from the Wait or we'll stall.

@jimingham
Collaborator

jimingham commented Jul 3, 2025

I don't think the locking should be a real issue. You should only hold the m_events mutex for a listener when adding or removing an event from the Listener's event queue. But it doesn't make sense to actually do work when holding the event mutex.

For instance, the primary place where events get pulled is Listener::FindNextEventInternal. That function does trigger arbitrary work to be done after the event is fetched - in the DoOnRemoval call. Since that's just a callout into a plugin provided callback we can't guarantee it won't do something that ends up requiring the listeners mutex.

But we are careful to do this like:

if (remove) {
  m_events.erase(pos);
  // Unlock the event queue here.  We've removed this event and are about
  // to return it so it should be okay to get the next event off the queue
  // here - and it might be useful to do that in the "DoOnRemoval".
  lock.unlock();
  event_sp->DoOnRemoval();
}

where that lock was the lock_guard for the m_events_mutex, so that the actual work done on fetching the event happens after we drop the m_events_mutex.

I can't think of a good way to enforce this, but I'm also not sure that's necessary. However, it might be a good idea to leave a comment by the definition of m_events_mutex saying that this should only be held when adding, peeking at, or removing events from the queue, and not over any function that does arbitrary work on the event's behalf.

Collaborator

@labath labath left a comment


Looks good, but you should add a unit test for the moving functionality. You know, things like: queue some events, hijack them, make sure they're received at the new listener, and vice versa.

@athierry-oct athierry-oct force-pushed the lldb-process-fix-hang-on-kill branch from 52cb939 to 89fb977 Compare July 4, 2025 21:13
@athierry-oct
Contributor Author

@jimingham Thanks for the explanation!

@labath I've added some unit tests, let me know if you need something else

Collaborator

@labath labath left a comment


That's pretty much what I had in mind. Thanks.

Collaborator


add a newline here

Comment on lines 339 to 341
Member


Nit: reflow

Suggested change
-// Move pending events from the previous listener to the
-// hijack listener. This ensures that no relevant event queued before the
-// transition is missed by the hijack listener.
+// Move pending events from the previous listener to the hijack listener.
+// This ensures that no relevant event queued before the transition is missed
+// by the hijack listener.

Comment on lines 347 to 349
Member


Suggested change
-if (prev_listener && listener_sp) {
-  prev_listener->MoveEvents(listener_sp, &m_broadcaster, event_mask);
-}
+if (prev_listener && listener_sp)
+  prev_listener->MoveEvents(listener_sp, &m_broadcaster, event_mask);

Comment on lines 392 to 395
Member


Suggested change
-if (listener_sp && prev_listener && !m_hijacking_masks.empty()) {
-  listener_sp->MoveEvents(prev_listener, &m_broadcaster,
-                          m_hijacking_masks.back());
-}
+if (listener_sp && prev_listener && !m_hijacking_masks.empty())
+  listener_sp->MoveEvents(prev_listener, &m_broadcaster,
+                          m_hijacking_masks.back());

Comment on lines 192 to 196
Member


No need to check log, that's part of the macro:

Suggested change
-if (log)
-  LLDB_LOGF(
-      log, "%p Listener('%s')::MoveEvents moving event %p to %p('%s')",
-      static_cast<void *>(this), m_name.c_str(),
-      static_cast<void *>(event_sp.get()),
+LLDB_LOGF(
+    log, "%p Listener('%s')::MoveEvents moving event %p to %p('%s')",
+    static_cast<void *>(this), m_name.c_str(),
+    static_cast<void *>(event_sp.get()),

Comment on lines 56 to 58
Member


Suggested change
-// Transfers all events matching the specified broadcaster and type to the
-// destination listener. This can be useful when setting up a hijack listener,
-// to ensure that no relevant events are missed.
+/// Transfers all events matching the specified broadcaster and type to the
+/// destination listener. This can be useful when setting up a hijack listener,
+/// to ensure that no relevant events are missed.

This patch addresses a race condition that can occur when a script using
the Python API calls SBProcess::Kill() around the same time a breakpoint
is hit.

If a stop event is broadcast before the hijack listener is installed, it
may be delivered to the default listener and missed entirely by the
hijack listener. This causes Process::WaitForProcessToStop() to hang
until it times out, waiting for a public stop event that is never
received.

To fix this, we now transfer all pending events from the original
listener to the hijack listener before hijacking, using a new
Listener::MoveEvents method. We also do the reverse operation when
restoring the original listener to avoid dropping any event.
@athierry-oct athierry-oct force-pushed the lldb-process-fix-hang-on-kill branch from 89fb977 to 28e1078 Compare July 8, 2025 13:15
@athierry-oct
Contributor Author

@JDevlieghere done!

@athierry-oct
Contributor Author

Gentle ping — is this PR ready to merge? I don’t have merge privileges, so if one of the reviewers could take care of it when it's ready, I’d appreciate it!

@jimingham
Collaborator

It doesn't look like there are any outstanding emendations. Pushing.

@jimingham jimingham merged commit 8612926 into llvm:main Jul 15, 2025
5 checks passed

@athierry-oct Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

@llvm-ci
Collaborator

llvm-ci commented Jul 15, 2025

LLVM Buildbot has detected a new failure on builder lldb-aarch64-ubuntu running on linaro-lldb-aarch64-ubuntu while building lldb at step 6 "test".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/59/builds/21064

Here is the relevant piece of the build log for the reference
Step 6 (test) failure: build (failure)
...
PASS: lldb-api :: commands/disassemble/basic/TestFrameDisassemble.py (30 of 2287)
UNSUPPORTED: lldb-api :: commands/dwim-print/objc/TestDWIMPrintObjC.py (31 of 2287)
PASS: lldb-api :: commands/expression/calculator_mode/TestCalculatorMode.py (32 of 2287)
PASS: lldb-api :: commands/expression/anonymous-struct/TestCallUserAnonTypedef.py (33 of 2287)
PASS: lldb-api :: commands/expression/argument_passing_restrictions/TestArgumentPassingRestrictions.py (34 of 2287)
PASS: lldb-api :: commands/expression/bitfield_enums/TestBitfieldEnums.py (35 of 2287)
PASS: lldb-api :: api/listeners/TestListener.py (36 of 2287)
PASS: lldb-api :: commands/expression/call-function/TestCallBuiltinFunction.py (37 of 2287)
UNSUPPORTED: lldb-api :: commands/expression/call-throws/TestCallThatThrows.py (38 of 2287)
PASS: lldb-api :: commands/command/script/TestCommandScript.py (39 of 2287)
FAIL: lldb-api :: commands/expression/call-restarts/TestCallThatRestarts.py (40 of 2287)
******************** TEST 'lldb-api :: commands/expression/call-restarts/TestCallThatRestarts.py' FAILED ********************
Script:
--
/usr/bin/python3.10 /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/llvm-project/lldb/test/API/dotest.py -u CXXFLAGS -u CFLAGS --env LLVM_LIBS_DIR=/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./lib --env LLVM_INCLUDE_DIR=/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/include --env LLVM_TOOLS_DIR=/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./bin --arch aarch64 --build-dir /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lldb-test-build.noindex --lldb-module-cache-dir /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lldb-test-build.noindex/module-cache-lldb/lldb-api --clang-module-cache-dir /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lldb-test-build.noindex/module-cache-clang/lldb-api --executable /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./bin/lldb --compiler /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./bin/clang --dsymutil /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./bin/dsymutil --make /usr/bin/gmake --llvm-tools-dir /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./bin --lldb-obj-root /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/tools/lldb --lldb-libs-dir /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./lib --cmake-build-type Release /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/llvm-project/lldb/test/API/commands/expression/call-restarts -p TestCallThatRestarts.py
--
Exit Code: 1

Command Output (stdout):
--
lldb version 22.0.0git (https://github.com/llvm/llvm-project.git revision 8612926c306c5191a5fb385dd11467728c59e982)
  clang revision 8612926c306c5191a5fb385dd11467728c59e982
  llvm revision 8612926c306c5191a5fb385dd11467728c59e982
Skipping the following test categories: ['libc++', 'msvcstl', 'dsym', 'gmodules', 'debugserver', 'objc']

--
Command Output (stderr):
--
FAIL: LLDB (/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/bin/clang-aarch64) :: test (TestCallThatRestarts.ExprCommandThatRestartsTestCase)
======================================================================
FAIL: test (TestCallThatRestarts.ExprCommandThatRestartsTestCase)
   Test calling function that hits a signal and restarts.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/llvm-project/lldb/packages/Python/lldbsuite/test/decorators.py", line 581, in wrapper
    func(*args, **kwargs)
  File "/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/llvm-project/lldb/packages/Python/lldbsuite/test/decorators.py", line 581, in wrapper
    func(*args, **kwargs)
  File "/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/llvm-project/lldb/test/API/commands/expression/call-restarts/TestCallThatRestarts.py", line 29, in test
    self.call_function()
  File "/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/llvm-project/lldb/test/API/commands/expression/call-restarts/TestCallThatRestarts.py", line 136, in call_function
    self.assertEqual(
AssertionError: 281473636542668 != 187650944797248 : Continuing returned to the place we started.
Config=aarch64-/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/bin/clang
----------------------------------------------------------------------
Ran 1 test in 0.595s

FAILED (failures=1)


@jimingham
Collaborator

@athierry-oct - this patch seems to have caused some failures on the ubuntu bot. If you think you can resolve those failures quickly, then please do. If you aren't able to do that give me a ping and I'll revert this till you can figure this out.

@athierry-oct
Contributor Author

@jimingham Yes can you revert it please, while I figure this out?
Thank you

jimingham added a commit that referenced this pull request Jul 15, 2025
…(#144919)" (#148915)

This was causing a couple of failures on the Ubuntu bots. Reverting
while we wait on a fix for those issues.
This reverts commit 8612926.
@jimingham
Collaborator

The revert was:

#148915

Good hunting!

jimingham added a commit to jimingham/from-apple-llvm-project that referenced this pull request Jul 15, 2025
…lvm#144919)"

This was causing a couple of failures on the Ubuntu bots.  Reverting while we
wait on a fix for those issues.
This reverts commit 8612926.
@athierry-oct
Contributor Author

Hi, sorry for the delay. I investigated the TestCallThatRestarts.py failure, and I think I’ve figured out what’s going on.

Toward the end of the test, we run:

value = frame.EvaluateExpression("call_me (%d)" % (num_sigchld), options)

Then we call:

error = process.Continue()

This triggers Process::ResumeSynchronous(), which:

  • Hijacks the process events
  • Calls PrivateResume()
  • Waits for a stop event via WaitForProcessToStop()

The issue is that a public stop event is sent at the end of EvaluateExpression(), and that event is still sitting in the primary listener’s queue when Process::ResumeSynchronous() hijacks the events. With my changes, that old stop event gets moved to the hijacker’s queue. So ResumeSynchronous() ends up grabbing it (even though it happened before the resume) and returns too early.

It looks like moving pending events during hijacking might not always be the right thing to do. In the case of ResumeSynchronous(), I think we want to make sure the stop event we wait for happens after hijacking and resuming.

One idea: we could add a bool move_pending_events flag to HijackProcessEvents() and RestoreProcessEvents(). It would default to false, and we could set it to true in StopForDestroyOrDetach(). This way, only the behavior of StopForDestroyOrDetach() is modified for now.

I gave this a quick try, and now the check-lldb-api suite passes on my machine.

Does that approach sound reasonable to you?

@jimingham
Collaborator

Seems to me the problem there was how EvaluateExpression knew that the expression was successfully or unsuccessfully completed without pulling that public Stop event off the event queue? It shouldn't be possible to strand events like that.

@labath
Collaborator

labath commented Aug 4, 2025

Agreed. I think the right fix is to make sure that event gets consumed by the EvaluateExpression machinery.

@athierry-oct
Contributor Author

Thanks for your help!

I did a bit more digging : in TestCallThatRestarts.py, it seems the stop event is sent by EvaluateExpression() to the primary listener because the process is stopped by a signal (SIGCHLD). So, the execution of the thread plan that runs during expression evaluation is interrupted, and we end up in the following code:

Process.cpp

  // If the process exited during the run of the thread plan, notify everyone.

  if (event_to_broadcast_sp) {
    if (log)
      log->PutCString("Process::RunThreadPlan(): rebroadcasting event.");
    BroadcastEvent(event_to_broadcast_sp);
  }

Here's the relevant log:

python3          Process::RunThreadPlan(): execution interrupted: 0x7f3748007600 Event: broadcaster = 0x133665e8 (lldb.process), type = 0x00000001 (state-changed), data = { process = 0x133665b0 (pid = 1200636), state = stopped} <1 threads> <0x1251fc [ip 0x7f42dffa653b] signal SIGCHLD>
python3          Process::RunThreadPlan: ExecutionInterrupted - for plan: 0x1307b8d0 not discarding.
python3          0x000000001335E690 Broadcaster("lldb.process")::RestoreBroadcaster (about to pop listener("lldb.process.listener.run-thread-plan")=0x000000001305F8D0)
python3          Process::RunThreadPlan(): rebroadcasting event.
python3          0x000000001335e690 Broadcaster("lldb.process")::BroadcastEvent (event_sp = 0x7f3748007600 Event: broadcaster = 0x133665e8 (lldb.process), type = 0x00000001 (state-changed), data = { process = 0x133665b0 (pid = 1200636), state = stopped}, unique=false) hijack = 0x0000000000000000
python3          0x1242a210 Listener('lldb.Debugger')::AddEvent (event_sp = {0x7f3748007600})
python3          -- [UserExpression::Execute] Execution of expression completed --
python3          == [UserExpression::Evaluate] Execution completed abnormally ==

IIUC LLDB broadcasts this stop event to the primary listener to let the user (eg. the IDE) know that the process stopped during expression evaluation.

In light of this, do you still think EvaluateExpression() should consume the stop event instead of rebroadcasting it to the primary listener? Or should the test itself handle the event to clear the queue after EvaluateExpression() but before calling ResumeSynchronous() ?

@jimingham
Collaborator

jimingham commented Aug 7, 2025

Thanks for tracking this down!

If the user has set "UnwindOnException" to false in their ExpressionOptions, we need to propagate that stop event and the driver would need to consume it to show the user that they are stopped at the crash site for further debugging. So we definitely have to rebroadcast the event and the driver needs to be waiting to consume it. So having the test check for the event (just like the lldb driver would do) is the correct way to use lldb.
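
Concretely, that check could look something like this in the test (a rough sketch reusing the existing test's variables; the event-draining loop after EvaluateExpression() is the new part):

value = frame.EvaluateExpression("call_me (%d)" % (num_sigchld), options)
self.assertTrue(value.IsValid())
self.assertFalse(value.GetError().Success())

# Pull the stop event that was rebroadcast to the debugger's listener (just
# as a driver would), so that the later process.Continue() /
# ResumeSynchronous() does not pick up this stale event.
listener = self.dbg.GetListener()
event = lldb.SBEvent()
while listener.GetNextEventForBroadcaster(process.GetBroadcaster(), event):
    # Expect the eStateStopped event produced while the expression ran.
    state = lldb.SBProcess.GetStateFromEvent(event)

error = process.Continue()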

@athierry-oct
Contributor Author

Sounds good, thank you!

In the test, it seems the intention was originally to set UnwindOnError to true, but it's still set to false when evaluating the expression (there's no call to options.SetUnwindOnError(True) after the comment that says "now set UnwindOnError to true"):

TestCallThatRestarts.py

options.SetUnwindOnError(False)

[...]

# Okay, now set UnwindOnError to true, and then make the signal behavior to stop
# and see that now we do stop at the signal point:

self.runCmd("process handle SIGCHLD -s 1 -p 1 -n 1")

value = frame.EvaluateExpression("call_me (%d)" % (num_sigchld), options)
self.assertTrue(value.IsValid())
self.assertFalse(value.GetError().Success())

# Set signal handling back to no-stop, and continue and we should end
# up back in out starting frame:
self.runCmd("process handle SIGCHLD -s 0 -p 1 -n 1")

error = process.Continue()
self.assertSuccess(error, "Continuing after stopping for signal succeeds.")

frame = self.thread.GetFrameAtIndex(0)
self.assertEqual(
    frame.GetPC(),
    self.orig_frame_pc,
    "Continuing returned to the place we started.",
)

Do you prefer:

  • keeping UnwindOnError set to false and having the test consume the stop event
  • or setting UnwindOnError to true as the comment says, and adapting the rest of the test (because after continuing, we won't stop at the same place as when UnwindOnError is false, and the last assertion will fail)?

@jimingham
Collaborator

You can easily unwind when you're done looking at a crashed expression with thread return -x - there is probably an SB API way to do that as well but I don't remember it off the top of my head. So you could support the UnwindOnError(False) pretty easily by doing that after successfully stopping in the expression.
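
If memory serves, the SB API counterpart is SBThread::UnwindInnermostExpression - treat the exact name as an assumption - so in the test it would look roughly like:

# After stopping inside the interrupted expression, unwind back to where the
# expression was entered - the SB equivalent of `thread return -x`.
thread = process.GetSelectedThread()
error = thread.UnwindInnermostExpression()   # assumed SB API name, see above
self.assertSuccess(error, "Unwound the expression frame")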

@athierry-oct
Contributor Author

Sorry for the long delay, I'm back from vacation and tackling this issue in parallel with other tasks.

I've started investigating the other failing tests (such as the DAP tests). For all failing tests, it seems essentially the root cause is the same: moving events between primary and hijacked listeners exposes places where events were left unprocessed in the primary listener or hijack listener queue.

Focusing first on EvaluateExpression() and TestCallThatRestarts, the test failure is due to a stop event being rebroadcast to the primary listener when the process stops during expression evaluation, and then being unexpectedly consumed by process.Continue() (see previous messages).
To solve this, at first I thought this stop event should be manually consumed in the test after the call to frame.EvaluateExpression() and before the call to process.Continue().
But now I'm realizing that, since the tests are running in sync mode without an event loop, this event should not have to be handled explicitly by the user (i.e. the Python test script in our case).

I also came across the code of SBDebugger::HandleCommand(), and noticed that in sync mode it consumes events after running the command:

void SBDebugger::HandleCommand(const char *command) {
  LLDB_INSTRUMENT_VA(this, command);

  ... // Handle the command

   // In sync mode, consume events produced by the command

    if (!m_opaque_sp->GetAsyncExecution()) {
      SBProcess process(GetCommandInterpreter().GetProcess());
      ProcessSP process_sp(process.GetSP());
      if (process_sp) {
        EventSP event_sp;
        ListenerSP lldb_listener_sp = m_opaque_sp->GetListener();
        while (lldb_listener_sp->GetEventForBroadcaster(
            process_sp.get(), event_sp, std::chrono::seconds(0))) {
          SBEvent event(event_sp);
          HandleProcessEvent(process, event, GetOutputFile(), GetErrorFile());
        }
      }
    }
  }
}

Would it make sense to do something similar for SBFrame::EvaluateExpression()? This way, events would still be handled transparently when evaluating an expression in sync mode.

@labath
Collaborator

labath commented Sep 2, 2025

I think it'd be better to not do that. HandleCommand does that because it has no idea what events are going to be produced by the command it runs. In case of EvaluateExpression, it feels like there ought to be a way to know which events are going to be produced and consumed, and I think it'd be better to handle this where the events are being processed (when we're listening on the hijack listener) -- instead of letting the hijack listener forward events to who-knows-where, and then trying to clean up after it.

@athierry-oct
Contributor Author

athierry-oct commented Sep 2, 2025

@labath ok, thanks for explaining why HandleCommand() works this way

So in synchronous mode, should EvaluateExpression() (and more specifically, RunThreadPlan()) avoid rebroadcasting the stop event when expression evaluation fails due to a process stop? I may be missing something, but it seems to me that rebroadcasting the event is only necessary in async mode, since there’s an actual driver that needs to be notified. In sync mode, to find out what happened, the user can simply read the error returned by the function they used to evaluate the expression, e.g., SBFrame::EvaluateExpression()

@labath
Collaborator

labath commented Sep 3, 2025

As before, I'll defer to @jimingham, but I think that (a)sync mode should not influence the behavior of evaluating an expression. I mean, the operation is always synchronous (waits for the expression to complete before returning), right?

If so, then I don't think we need to broadcast anything (in either mode) -- the process starts out being stopped, and we leave it stopped when we're done, so any outside listener can be blissfully unaware of what is happening. The situation would be different if the process lands in a different state as a result of the evaluation (e.g. because the expression causes it to exit), but it looks like RunThreadPlan already handles that.

@jimingham
Collaborator

jimingham commented Sep 3, 2025

RunThreadPlan has to handle two different use cases. In the normal case where -i is 1 and -u is 1, the expression evaluation always pretends that nothing happened in the target. The stop-reason is set back to whatever it was before expression evaluation and no events are generated. After all, expressions can be IR interpreted so their success doesn't necessarily depend on running the process.

But when say -i is 0 and you hit a breakpoint while running the expression, then we need to promote that breakpoint-caused stop event to the user so the process state will change to "stopped at a breakpoint in the middle of this expression". If you are in sync mode, then that stop event should get consumed on the way out of EvaluateExpression, but in async mode that event should be propagated so that the process event listener can fetch the event and so the public state gets correctly reset.

@athierry-oct
Contributor Author

in async mode that event should be propagated so that the process event listener can fetch the event and so the public state gets correctly reset

@jimingham currently the event is also propagated in sync mode (and this propagated event stays unprocessed forever). Should a condition be added in RunThreadPlan() in order to not propagate it in sync mode? Or should it still be propagated in sync mode but the propagated event should then be consumed from within EvaluateExpression()?

@jimingham
Collaborator

The latter. We have to send the event so that lldb will know that the expression stopped mid-way through. But process events have to be consumed to have any effect, so in sync mode EvaluateExpression should consume the event and then print the stop status.
