[lldb] Fix race condition in Process::WaitForProcessToStop() #144919
Conversation
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be notified. If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write permissions for the repository, in which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username. If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment "Ping". The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers. If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.
@llvm/pr-subscribers-lldb
Author: None (athierry-oct)
Changes
This PR addresses a race condition encountered when using LLDB through the Python scripting interface. I'm relatively new to LLDB, so feedback is very welcome, especially if there's a more appropriate way to address this issue.
Bug Description
When running a script that repeatedly calls debugger.GetListener().WaitForEvent() in a loop, and at some point invokes process.Kill() from within that loop to terminate the session, a race condition can occur if process.Kill() is called around the same time a breakpoint is hit.
Race Condition Details
The issue arises when the following sequence of events happens:
1. The private state is set to stopped when the breakpoint is hit.
2. process.Kill() calls Process::Destroy(), which invokes Process::WaitForProcessToStop(). At this point, private_state = stopped but public_state = running (the public state has not yet been updated).
3. Process::StopForDestroyOrDetach() is ignored because the process is already stopped (private_state = stopped).
4. Process::WaitForProcessToStop() hangs waiting for a public stop event, but the event is never received, and process.Kill() times out after 20 seconds.
Fix Summary
This patch modifies Process::WaitForProcessToStop() to ensure that any pending events in the non-hijack listener queue are processed before checking the hijack listener. This guarantees that any missed public state change events are handled, preventing the hang.
Additional Context
A discussion of this issue, including a script to reproduce the bug, can be found here: LLDB hangs when killing process at the same time a breakpoint is hit
Full diff: https://github.com/llvm/llvm-project/pull/144919.diff
1 Files Affected:
diff --git a/lldb/source/Target/Process.cpp b/lldb/source/Target/Process.cpp
index 61a3d05bc3746..c75acdb1ecd80 100644
--- a/lldb/source/Target/Process.cpp
+++ b/lldb/source/Target/Process.cpp
@@ -683,7 +683,18 @@ StateType Process::WaitForProcessToStop(
while (state != eStateInvalid) {
EventSP event_sp;
- state = GetStateChangedEvents(event_sp, timeout, hijack_listener_sp);
+
+ // A stop event may have been queued in the non-hijack listener before
+ // the hijack listener was installed (e.g., if a breakpoint is hit
+ // around the same time SBProcess::Kill() is called). Check and process
+ // those events first to avoid missing them.
+ if (PeekAtStateChangedEvents()) {
+ // Get event from non-hijack listener
+ state = GetStateChangedEvents(event_sp, timeout, nullptr);
+ } else {
+ state = GetStateChangedEvents(event_sp, timeout, hijack_listener_sp);
+ }
+
if (event_sp_ptr && event_sp)
*event_sp_ptr = event_sp;
@jimingham tagging you since you were in the discussion in LLDB hangs when killing process at the same time a breakpoint is hit
Jim is the process state king here, but FWIW, I think you are onto something here -- I just think it needs to be done more carefully. Fetching the events from the non-hijack listener in this way is racy as there could be another thread trying to do the same (that's sort of why we bother with hijacking in the first place). Instead, what I think could work is to -- in the process of installing the hijack listener -- take (hijack) the pending events from the real listener and enqueue them to the hijack listener instead. This should be done while holding the event mutex of the real listener to make sure the other thread doesn't trample over this. Then you should be able to listen on the hijack listener to your heart's content. (I don't know if this would matter in practice, but for completeness, we probably should also re-enqueue any unlistened-for events from the hijack listener back to the real listener.)
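A rough sketch of that idea (illustrative only, not the implementation that ended up in the patch; it assumes Listener keeps its pending events in an m_events container guarded by m_events_mutex, names that come up later in this review):

void Listener::MoveEvents(const lldb::ListenerSP &destination_sp,
                          Broadcaster *broadcaster, uint32_t event_mask) {
  // Hold this listener's event mutex so a concurrent GetEvent()/WaitForEvent()
  // can't pull the events we are about to move.
  std::lock_guard<std::mutex> guard(m_events_mutex);
  for (auto pos = m_events.begin(); pos != m_events.end();) {
    Event *event = pos->get();
    if (event->BroadcasterIs(broadcaster) &&
        (event->GetType() & event_mask) != 0) {
      // AddEvent enqueues the event on the destination listener and wakes up
      // anyone blocked waiting on it.
      destination_sp->AddEvent(*pos);
      pos = m_events.erase(pos);
    } else {
      ++pos;
    }
  }
}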
Your solution is going in the right direction, but is trying to solve a problem "after the fact" that we can solve better by preventing the public stop events from making their way to the non-primary listener before the primary listener gets a chance to fully process them. Process event listeners are a bit special because we have to keep the "process event state" in sync with the value returned from Process::GetState. It would be hard to figure out the meaning of events if for instance you could still have the "stopped" event sitting unfetched in the event queue, but the process state were shown as stopped. To ensure this, we designate one of the listeners (the one passed into the Process constructor) as the "primary process event listener". That one gets sent the event first, and we're supposed to only change the process state when the primary listener has finished handling the event. Note, while a listener that hijacks the process event broadcaster gets designated as the primary event listener while it is hijacking the broadcaster, the converse is not true, the primary listener does not have to be hijacking the process event stream. In fact, it's usually just the listener you passed in when you created the Process. So why doesn't that all just work? The problem is that when lldb goes to run the DoOnRemoval callback for the Process event (that's what calls your python breakpoint code, for instance), the code needs to see the process state as stopped - so that any commands will execute in the right context. Currently, the only way to do that is to switch the public state to stopped. But as you correctly analyzed, if you set the public state to stopped while we're in the middle of handling the stop, then any other code that's waiting on the state change will trigger, possibly before the breakpoint has decided whether to stop or not, which is incorrect. To really honor the contract above, we need to make sure the stop event doesn't get delivered to the non-primary listeners till the stop is fully handled. We already have code to deal with a slightly different aspect of this. The other reason that process events are a bit special is that we know that to implement a "user level operation" like source line stepping, the debugger often has to stop and start numerous times. Those stops have no meaning at the user level, so we want to suppress them. The way we do that is to have a two stage handling of process events and their corresponding "state". First the "private" side of lldb fetches the event (usually on the private state thread), and sets the "private state" to stopped without changing the "public state". Then the thread plan logic is run to decide whether we are indeed going to stop to the user. If we are, we rebroadcast the event to the primary process listener - and as a result set the "public state" to stopped. That's fine, but the code that runs the "Thread Plans" on the private state thread needs to see the state as stopped. However, by the time we're calling Process::GetState, we're usually in code that has no way to know what context it's getting run in. We implement this for code running in the private state thread by making Process::GetState smart about who is using it. If you look at Process::GetState, you will see it doesn't just return the public state. That's because all the thread plan logic has to run on the Private State Thread before it is appropriate to set the public state to stopped. 
Instead, when code running on the "Private State Thread" asks for the process state, it will get the private state. But code running on any other thread will still see the public state. That's done by checking "CurrentThreadIsPrivateStateThread" in Process::GetState. I think the correct solution to this problem is to treat the thread that's going to run DoOnRemoval for the primary process event listener like the private state thread, so that both of them will see the private state in Process::GetState. Then we only flip the public state when DoOnRemoval is all the way done. That way - like with the private event handling - the code doing the handling will see the right process state, so it will function correctly. And then only when the callback has completed will the public state get switched (causing the event to be broadcast to the other clients waiting on the change in public state). That way the thread that is running the breakpoint command will run uninterrupted but in the correct context, without any of the other threads waiting on the process knowing about the stop too early. And that way the non-primary process event listeners will be purely advisory, as they are supposed to be.
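A simplified paraphrase of the check Jim refers to (not a verbatim copy of Process::GetState; the member names are as described above):

StateType Process::GetState() {
  // Code on the private state thread sees the private state; everyone else
  // sees the public state.
  return CurrentThreadIsPrivateStateThread() ? m_private_state.GetValue()
                                             : m_public_state.GetValue();
}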
Note, I have a patch I'm almost done with (finishing the test case now) that handles a similar version of this problem. "Attach" and "Launch" don't spin up the private state thread until they know the attach/launch succeeded, which means that the first event that comes in is NOT handled on the private state thread, so if it has any callbacks, they will see the process state as still running. In that case, I'm going to fix it by saying that if you are running code before the private state thread has been spun up, then you see the private state.
I think I understand most of what you're saying, and I think the proposed solution may solve the problem you have in mind -- but I don't think it can solve the problem I have in mind (and I'm not sure which of those is OP's problem :P). The race I think I see is in Process::WaitForProcessToStop(). We first check the process state (both of them), and if it's running, we install the hijack listener (and wait on it). However, if the stopped event is sent (enqueued to the primary listener) before the hijack listener is installed, the hijack listener will never get anything. I don't see how fiddling with thread identities can help there. Even if it does, it will only help for one particular thread. OP is calling Kill from the event handling thread, but I don't think we require that. If Kill is called from some other thread, I think it could still run into the same bug (although the race window will most likely be smaller, since the event thread will be alive and pulling events from the queue).
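To make that window concrete, here is the sequence being described, sketched against a simplified shape of the Destroy()/WaitForProcessToStop() flow (illustrative comments only, not the actual code):

StateType state = GetState();            // (1) observes eStateRunning
                                         // (2) the public stop event is
                                         //     enqueued on the primary
                                         //     listener right about here
HijackProcessEvents(hijack_listener_sp); // (3) hijack installed too late
state = GetStateChangedEvents(event_sp, timeout, hijack_listener_sp);
                                         // (4) waits on the hijack listener,
                                         //     which never receives the event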
We'd have to make the change and see to be sure, but what I would expect is that if any other thread tried to call Kill while whoever is currently the primary process listener is handling the stop, it would immediately fail because the process is not stopped. Right now, the Kill is sometimes allowed because the public state is "stopped" even though we haven't actually finished handling the stop event. So I think what I was suggesting will fix the original problem. The race Pavel is pointing out also seems real to me, but that should be fixed at the source, rather than after the fact. Process::HijackProcessEvents should check the previous "primary event listener" to see if it has any events, and if so they should be transferred to the Hijack Listener. We should also do the same thing in RestoreProcessEvents. That way we'll never drop an event when shuffling primary listeners.
Thanks for the feedback and the good discussion!
The race I have in mind is the one described by @labath above.
I'm not exactly sure if it's also related to what you're describing @jimingham. From my (limited) understanding, it looks like a separate issue: in my case I'm calling process.Kill() from the same thread that is waiting on the listener's events.
I can try to submit another version of the patch doing that.
(force-pushed from dfe59d0 to 52cb939)
I've just pushed a new version that moves the pending events from the previous listener to the hijack listener when hijacking, and vice-versa when restoring the previous listener.
I don't think we ever have a use case where the Hijack listener would only want to see "new" events after the Hijack point. And that's certainly not what we want for process events where the full sequence is important. So the behavior you're encoding here is correct for the ways we use Listeners.
I had to think a second about whether we wanted the notification behavior of AddEvent (it both adds the event to the queue and wakes up any Listeners). But you have to do that. For instance, the original listener is very likely sitting in WaitForEvent - it has no way of knowing it's been hijacked - and so the events you transferred back have to wake it up from the Wait or we'll stall.
I don't think the locking should be a real issue. You should only hold the m_events mutex for a listener when adding or removing an event from the Listener's event queue. But it doesn't make sense to actually do work when holding the event mutex. For instance, the primary place where events get pulled is Listener::FindNextEventInternal. That function does trigger arbitrary work to be done after the event is fetched - in the DoOnRemoval call. Since that's just a callout into a plugin-provided callback, we can't guarantee it won't do something that ends up requiring the listener's mutex. But we are careful to fetch the event while holding the mutex and to run DoOnRemoval only after the mutex has been released.
I can't think of a good way to enforce this, but I'm also not sure that's necessary. However, it might be a good idea to leave a comment by the definition of m_events_mutex saying that it should only be held when adding, peeking at, or removing events from the queue, and not over any function that does arbitrary work on the event's behalf.
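A sketch of that pattern (illustrative only, not the actual FindNextEventInternal code; pos stands for the iterator of the event being fetched):

lldb::EventSP event_sp;
{
  std::lock_guard<std::mutex> guard(m_events_mutex);
  event_sp = *pos;     // take the event off the queue while holding the lock
  m_events.erase(pos);
}                      // the lock is released before any callbacks run
// DoOnRemoval can call into plugin-provided code that may need the listener's
// mutex itself, so it must not run while m_events_mutex is held.
if (event_sp)
  event_sp->DoOnRemoval();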
Looks good, but you should add a unit test for the moving functionality. You know, things like: queue some events, hijack them, make sure they're received at the new listener, and vice versa.
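Roughly the kind of test being suggested here, as a sketch only (it assumes the patch moves pending events in Broadcaster::HijackBroadcaster()/RestoreBroadcaster(); the 0x1 event mask and names are placeholders, and the tests actually added to the PR may differ):

#include "gtest/gtest.h"
#include "lldb/Utility/Broadcaster.h"
#include "lldb/Utility/Event.h"
#include "lldb/Utility/Listener.h"

using namespace lldb_private;

TEST(ListenerTest, HijackAndRestoreMovePendingEvents) {
  Broadcaster broadcaster(nullptr, "test-broadcaster");
  lldb::ListenerSP original_sp = Listener::MakeListener("original");
  original_sp->StartListeningForEvents(&broadcaster, 0x1);

  // Queue an event before any hijack listener exists.
  broadcaster.BroadcastEvent(0x1);

  // Hijacking should move the pending event to the hijack listener.
  lldb::ListenerSP hijack_sp = Listener::MakeListener("hijack");
  broadcaster.HijackBroadcaster(hijack_sp, 0x1);

  lldb::EventSP event_sp;
  EXPECT_TRUE(hijack_sp->GetEvent(event_sp, std::chrono::seconds(0)));
  EXPECT_FALSE(original_sp->GetEvent(event_sp, std::chrono::seconds(0)));

  // An event still pending on the hijack listener should move back when the
  // original listener is restored.
  broadcaster.BroadcastEvent(0x1);
  broadcaster.RestoreBroadcaster();
  EXPECT_TRUE(original_sp->GetEvent(event_sp, std::chrono::seconds(0)));
}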
(force-pushed from 52cb939 to 89fb977)
@jimingham Thanks for the explanation! @labath I've added some unit tests, let me know if you need something else.
That's pretty much what I had in mind. Thanks.
add a newline here
lldb/source/Utility/Broadcaster.cpp
Outdated
Nit: reflow
Suggested change:
-// Move pending events from the previous listener to the
-// hijack listener. This ensures that no relevant event queued before the
-// transition is missed by the hijack listener.
+// Move pending events from the previous listener to the hijack listener.
+// This ensures that no relevant event queued before the transition is missed
+// by the hijack listener.
lldb/source/Utility/Broadcaster.cpp
Outdated
Suggested change:
-if (prev_listener && listener_sp) {
-  prev_listener->MoveEvents(listener_sp, &m_broadcaster, event_mask);
-}
+if (prev_listener && listener_sp)
+  prev_listener->MoveEvents(listener_sp, &m_broadcaster, event_mask);
lldb/source/Utility/Broadcaster.cpp
Outdated
Suggested change:
-if (listener_sp && prev_listener && !m_hijacking_masks.empty()) {
-  listener_sp->MoveEvents(prev_listener, &m_broadcaster,
-                          m_hijacking_masks.back());
-}
+if (listener_sp && prev_listener && !m_hijacking_masks.empty())
+  listener_sp->MoveEvents(prev_listener, &m_broadcaster,
+                          m_hijacking_masks.back());
lldb/source/Utility/Listener.cpp
Outdated
No need to check log, that's part of the macro:
Suggested change:
-if (log)
-  LLDB_LOGF(
-      log, "%p Listener('%s')::MoveEvents moving event %p to %p('%s')",
-      static_cast<void *>(this), m_name.c_str(),
-      static_cast<void *>(event_sp.get()),
+LLDB_LOGF(
+    log, "%p Listener('%s')::MoveEvents moving event %p to %p('%s')",
+    static_cast<void *>(this), m_name.c_str(),
+    static_cast<void *>(event_sp.get()),
lldb/include/lldb/Utility/Listener.h
Outdated
Suggested change:
-// Transfers all events matching the specified broadcaster and type to the
-// destination listener. This can be useful when setting up a hijack listener,
-// to ensure that no relevant events are missed.
+/// Transfers all events matching the specified broadcaster and type to the
+/// destination listener. This can be useful when setting up a hijack listener,
+/// to ensure that no relevant events are missed.
This patch addresses a race condition that can occur when a script using the Python API calls SBProcess::Kill() around the same time a breakpoint is hit. If a stop event is broadcast before the hijack listener is installed, it may be delivered to the default listener and missed entirely by the hijack listener. This causes Process::WaitForProcessToStop() to hang until it times out, waiting for a public stop event that is never received. To fix this, we now transfer all pending events from the original listener to the hijack listener before hijacking, using a new Listener::MoveEvents method. We also do the reverse operation when restoring the original listener to avoid dropping any event.
(force-pushed from 89fb977 to 28e1078)
@JDevlieghere done!
Gentle ping, is this PR ready to merge? I don't have merge privileges, so if one of the reviewers could take care of it when it's ready, I'd appreciate it!
It doesn't look like there are any outstanding emendations. Pushing.
@athierry-oct Congratulations on having your first Pull Request (PR) merged into the LLVM Project! Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR. Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues. How to do this, and the rest of the post-merge process, is covered in detail here. If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again. If you don't get any reports, no action is required from you. Your changes are working as expected, well done!
LLVM Buildbot has detected a new failure. Full details are available at: https://lab.llvm.org/buildbot/#/builders/59/builds/21064
@athierry-oct - this patch seems to have caused some failures on the Ubuntu bot. If you think you can resolve those failures quickly, then please do. If you aren't able to do that, give me a ping and I'll revert this till you can figure this out.
@jimingham Yes, can you revert it please, while I figure this out?
The revert is the commit referenced below. Good hunting!
Revert "[lldb] Fix race condition in Process::WaitForProcessToStop() (llvm#144919)"
This was causing a couple of failures on the Ubuntu bots. Reverting while we wait on a fix for those issues.
This reverts commit 8612926.
Hi, sorry for the delay. I investigated the TestCallThatRestarts.py failure. Toward the end of the test, we run:
value = frame.EvaluateExpression("call_me (%d)" % (num_sigchld), options)
Then we call:
error = process.Continue()
This is where the failure shows up.
The issue is that a public stop event is sent at the end of the expression evaluation and is still pending on the primary listener afterwards. It looks like moving pending events during hijacking might not always be the right thing to do in this case. One idea: we could add a way to avoid moving those events in situations like this one. I gave this a quick try, and now the test passes.
Does that approach sound reasonable to you?
Seems to me the problem there was how EvaluateExpression knew that the expression was successfully or unsuccessfully completed without pulling that public Stop event off the event queue? It shouldn't be possible to strand events like that.
Agreed. I think the right fix is to make sure that event gets consumed by the EvaluateExpression machinery.
Thanks for your help! I did a bit more digging: in Process.cpp, Process::RunThreadPlan() has this:
  // If the process exited during the run of the thread plan, notify everyone.
  if (event_to_broadcast_sp) {
    if (log)
      log->PutCString("Process::RunThreadPlan(): rebroadcasting event.");
    BroadcastEvent(event_to_broadcast_sp);
  }
IIUC LLDB broadcasts this stop event to the primary listener to let the user (e.g. the IDE) know that the process stopped during expression evaluation. In light of this, do you still think the event should be consumed by the EvaluateExpression machinery?
Thanks for tracking this down! If the user has set "UnwindOnException" to false in their ExpressionOptions, we need to propagate that stop event and the driver would need to consume it to show the user that they are stopped at the crash site for further debugging. So we definitely have to rebroadcast the event and the driver needs to be waiting to consume it. So having the test check for the event (just like the lldb driver would do) is the correct way to use lldb.
Sounds good, thank you! In the test, it seems the intention was originally to set UnwindOnError to true, but it's still set to false when evaluating the expression (there's no call to options.SetUnwindOnError(True) after the initial options.SetUnwindOnError(False)). From TestCallThatRestarts.py:
options.SetUnwindOnError(False)
[...]
# Okay, now set UnwindOnError to true, and then make the signal behavior to stop
# and see that now we do stop at the signal point:
self.runCmd("process handle SIGCHLD -s 1 -p 1 -n 1")
value = frame.EvaluateExpression("call_me (%d)" % (num_sigchld), options)
self.assertTrue(value.IsValid())
self.assertFalse(value.GetError().Success())
# Set signal handling back to no-stop, and continue and we should end
# up back in out starting frame:
self.runCmd("process handle SIGCHLD -s 0 -p 1 -n 1")
error = process.Continue()
self.assertSuccess(error, "Continuing after stopping for signal succeeds.")
frame = self.thread.GetFrameAtIndex(0)
self.assertEqual(
frame.GetPC(),
self.orig_frame_pc,
"Continuing returned to the place we started.",
)
Which approach do you prefer?
You can easily unwind when you're done looking at a crashed expression with
Sorry for the long delay, I'm back from vacation and tackling this issue in parallel with other tasks. I've started investigating the other failing tests (such as the DAP tests). For all failing tests, it seems essentially the root cause is the same: moving events between primary and hijacked listeners exposes places where events were left unprocessed in the primary listener or hijack listener queue. Focusing first on one of them, I also came across the code of SBDebugger::HandleCommand(), which in synchronous mode consumes the process events produced by the command:
void SBDebugger::HandleCommand(const char *command) {
  LLDB_INSTRUMENT_VA(this, command);
  ... // Handle the command
    // In sync mode, consume events produced by the command
    if (!m_opaque_sp->GetAsyncExecution()) {
      SBProcess process(GetCommandInterpreter().GetProcess());
      ProcessSP process_sp(process.GetSP());
      if (process_sp) {
        EventSP event_sp;
        ListenerSP lldb_listener_sp = m_opaque_sp->GetListener();
        while (lldb_listener_sp->GetEventForBroadcaster(
            process_sp.get(), event_sp, std::chrono::seconds(0))) {
          SBEvent event(event_sp);
          HandleProcessEvent(process, event, GetOutputFile(), GetErrorFile());
        }
      }
    }
  }
}
Would it make sense to do something similar for EvaluateExpression()?
I think it'd be better to not do that. HandleCommand does that because it has no idea what events are going to be produced by the command it runs. In case of EvaluateExpression, it feels like there ought to be a way to know which events are going to be produced and consumed, and I think it'd be better to handle this where the events are being processed (when we're listening on the hijack listener) -- instead of letting the hijack listener forward events to who-knows-where, and then trying to clean up after it.
@labath ok, thanks for explaining why HandleCommand() works this way. So in synchronous mode, should this stop event be broadcast at all?
As before, I'll defer to @jimingham, but I think that (a)sync mode should not influence the behavior of evaluating an expression. I mean, the operation is always synchronous (waits for the expression to complete before returning), right? If so, then I don't think we need to broadcast anything (in either mode) -- the process starts out being stopped, and we leave it stopped when we're done, so any outside listener can be blissfully unaware of what is happening. The situation would be different if the process lands in a different state as a result of the evaluation (e.g. because the expression causes it to exit), but it looks like RunThreadPlan already handles that.
RunThreadPlan has to handle two different use cases. In the normal case where -i is 1 and -u is 1, the expression evaluation always pretends that nothing happened in the target. The stop-reason is set back to whatever it was before expression evaluation and no events are generated. After all, expressions can be IR interpreted so their success doesn't necessarily depend on running the process. But when say -i is 0 and you hit a breakpoint while running the expression, then we need to promote that breakpoint-caused stop event to the user so the process state will change to "stopped at a breakpoint in the middle of this expression". If you are in sync mode, then that stop event should get consumed on the way out of EvaluateExpression, but in async mode that event should be propagated so that the process event listener can fetch the event and so the public state gets correctly reset.
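For reference, the -i / -u behaviors Jim mentions correspond to SBExpressionOptions in the API. A small sketch (frame is assumed to be a valid SBFrame; the expression string is just an example):

lldb::SBExpressionOptions options;
options.SetIgnoreBreakpoints(false); // -i 0: a breakpoint hit during the
                                     // expression stops it mid-way
options.SetUnwindOnError(false);     // -u 0: leave the process stopped at the
                                     // failure point instead of unwinding the
                                     // expression frames
lldb::SBValue value = frame.EvaluateExpression("call_me(1)", options);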
@jimingham currently the event is also propagated in sync mode (and this propagated event stays unprocessed forever). Should a condition be added in RunThreadPlan() in order to not propagate it in sync mode? Or should it still be propagated in sync mode but the propagated event should then be consumed from within EvaluateExpression()?
The latter. We have to send the event so that lldb will know that the expression stopped mid-way through. But process events have to be consumed to have any effect, so in sync mode EvaluateExpression should consume the event and then print the stop status.
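For illustration only, "consume the event and then print the stop status" in sync mode could look roughly like the HandleCommand() pattern quoted earlier (a hypothetical sketch, not necessarily where or how the final fix lands):

// After RunThreadPlan() has rebroadcast the stop event, a synchronous
// EvaluateExpression() path would drain it instead of leaving it stranded.
if (!debugger.GetAsyncExecution()) {
  lldb::EventSP event_sp;
  lldb::ListenerSP listener_sp = debugger.GetListener();
  while (listener_sp->GetEventForBroadcaster(process_sp.get(), event_sp,
                                             std::chrono::seconds(0))) {
    // Report the stop reason / status to the user here.
  }
}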
This PR addresses a race condition encountered when using LLDB through the Python scripting interface.
I'm relatively new to LLDB, so feedback is very welcome, especially if there's a more appropriate way to address this issue.
Bug Description
When running a script that repeatedly calls debugger.GetListener().WaitForEvent() in a loop, and at some point invokes process.Kill() from within that loop to terminate the session, a race condition can occur if process.Kill() is called around the same time a breakpoint is hit.
Race Condition Details
The issue arises when the following sequence of events happens:
1. The private state is set to stopped when the breakpoint is hit.
2. process.Kill() calls Process::Destroy(), which invokes Process::WaitForProcessToStop(). At this point:
   - private_state = stopped
   - public_state = running (the public state has not yet been updated)
3. Process::StopForDestroyOrDetach() is ignored because the process is already stopped (private_state = stopped).
4. Process::WaitForProcessToStop() installs its hijack listener, but the public stop event has already been queued on the default (non-hijack) listener.
5. Process::WaitForProcessToStop() hangs waiting for a public stop event, but the event is never received.
6. process.Kill() times out after 20 seconds.
Fix Summary
This patch modifies Process::WaitForProcessToStop() to ensure that any pending events in the non-hijack listener queue are processed before checking the hijack listener. This guarantees that any missed public state change events are handled, preventing the hang.
Additional Context
A discussion of this issue, including a script to reproduce the bug, can be found here:
LLDB hangs when killing process at the same time a breakpoint is hit
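Not the reproduction script from the linked discussion; just a minimal sketch, using the C++ SB API, of the usage pattern described above (the inferior path and breakpoint name are placeholders):

#include "lldb/API/LLDB.h"

int main() {
  lldb::SBDebugger::Initialize();
  lldb::SBDebugger debugger = lldb::SBDebugger::Create();
  debugger.SetAsync(true);

  lldb::SBTarget target = debugger.CreateTarget("/path/to/inferior");
  target.BreakpointCreateByName("do_work"); // hypothetical symbol
  lldb::SBProcess process = target.LaunchSimple(nullptr, nullptr, ".");

  // Drain events in a loop, as the script in the bug report does.
  lldb::SBListener listener = debugger.GetListener();
  lldb::SBEvent event;
  for (int i = 0; i < 100; ++i)
    listener.WaitForEvent(1, event); // handle or ignore each event

  // Tearing the session down here can race with a breakpoint being hit at
  // the same moment; before the fix this could hang for ~20 seconds.
  process.Kill();

  lldb::SBDebugger::Destroy(debugger);
  lldb::SBDebugger::Terminate();
  return 0;
}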