
Commit dbd9eee

aagit authored and gregkh committed
userfaultfd: fix SIGBUS resulting from false rwsem wakeups
[ Upstream commit 15a77c6 ]

With >=32 CPUs the userfaultfd selftest triggered a graceful but
unexpected SIGBUS because VM_FAULT_RETRY was returned by
handle_userfault() even though the UFFDIO_COPY wasn't completed.

This seems to be caused by rwsem waking the thread blocked in
handle_userfault() before up_read() has run, i.e. before the wait_event
sequence is complete. Keeping the second wait_event sequence identical to
the first one would require running userfaultfd_must_wait() again to know
whether the loop should be repeated, and it would also require retaking
the rwsem and revalidating the whole vma status.

It seems simpler to wait for the targeted wakeup, so that if false
wakeups materialize we keep waiting for our specific wakeup event, unless
of course there are signals or the uffd was released.

Debug code collecting the stack trace of the wakeup showed this:

  $ ./userfaultfd 100 99999
  nr_pages: 25600, nr_pages_per_cpu: 800
  bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 138
    69 82 34 30 139 40 40 31 20 19 43 13 15 28 27 38 21 43 56 22 1 17 31 8 4 2
  bounces: 99997, mode: rnd ver poll, Bus error (core dumped)

  save_stack_trace+0x2b/0x50
  try_to_wake_up+0x2a6/0x580
  wake_up_q+0x32/0x70
  rwsem_wake+0xe0/0x120
  call_rwsem_wake+0x1b/0x30
  up_write+0x3b/0x40
  vm_mmap_pgoff+0x9c/0xc0
  SyS_mmap_pgoff+0x1a9/0x240
  SyS_mmap+0x22/0x30
  entry_SYSCALL_64_fastpath+0x1f/0xbd
  0xffffffffffffffff

  FAULT_FLAG_ALLOW_RETRY missing 70
  CPU: 24 PID: 1054 Comm: userfaultfd Tainted: G W 4.8.0+ #30
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
    rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
  Call Trace:
  dump_stack+0xb8/0x112
  handle_userfault+0x572/0x650
  handle_mm_fault+0x12cb/0x1520
  __do_page_fault+0x175/0x500
  trace_do_page_fault+0x61/0x270
  do_async_page_fault+0x19/0x90
  async_page_fault+0x25/0x30

This always happens when the main userfault selftest thread is running
clone() while glibc runs either mprotect or mmap (both taking mmap_sem
with down_write()) to allocate the thread stacks of the background
threads, while the locking/userfault threads already run at full throttle
and are susceptible to false wakeups that may cause handle_userfault() to
return earlier than expected (which results in a graceful SIGBUS at the
next attempt).

This was reproduced only with >=32 CPUs because with fewer CPUs the loop
that starts the background threads with clone() finishes too quickly,
while with 32 CPUs there is already significant activity on ~32 locking
and userfault threads by the time the last background threads are started
with clone().

This >=32 CPUs SMP race condition is likely reproducible only with the
selftest because of the much heavier userfault load it generates compared
to real applications.

The write-protect (WP) support will have to allow "one more"
VM_FAULT_RETRY, and a patch floating around that provides it also hid
this problem, but in reality it only succeeded at hiding it: false
wakeups could still happen the second time handle_userfault() is invoked,
even if the race is so rare that getting false wakeups twice in a row is
practically impossible to reproduce.

This full fix is needed for correctness; the only alternative would be to
allow VM_FAULT_RETRY to be returned an unbounded number of times. With
this fix the WP support can stick to a strict "one more" VM_FAULT_RETRY
logic (no need to return it an infinite number of times to avoid the
SIGBUS).
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrea Arcangeli <[email protected]>
Reported-by: Shubham Kumar Sharma <[email protected]>
Tested-by: Mike Kravetz <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Cc: Michael Rapoport <[email protected]>
Cc: "Dr. David Alan Gilbert" <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
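For context on the fault path the commit describes: the selftest stresses the ordinary userfaultfd flow in which a thread touches a registered-but-missing page, blocks inside handle_userfault(), and must only resume once another thread has resolved the fault with UFFDIO_COPY (whose completion performs the targeted wakeup). The sketch below is a minimal userspace illustration of that flow based on standard userfaultfd(2) usage, not the kernel selftest; the fault_handler name is illustrative, a 4096-byte page size is assumed, and on newer kernels unprivileged use may additionally require vm.unprivileged_userfaultfd=1. Build with -pthread.

/*
 * Minimal userfaultfd sketch: main() faults on a registered page and
 * blocks in handle_userfault(); fault_handler() resolves the fault
 * with UFFDIO_COPY, which wakes the faulting thread.
 */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define PAGE_SZ 4096UL   /* assumed page size for this sketch */

static void *fault_handler(void *arg)
{
    int uffd = *(int *)arg;
    static char src[PAGE_SZ] __attribute__((aligned(PAGE_SZ)));

    memset(src, 0x41, sizeof(src));

    for (;;) {
        struct pollfd pfd = { .fd = uffd, .events = POLLIN };
        struct uffd_msg msg;

        if (poll(&pfd, 1, -1) < 0)
            break;
        if (read(uffd, &msg, sizeof(msg)) != (ssize_t)sizeof(msg))
            continue;
        if (msg.event != UFFD_EVENT_PAGEFAULT)
            continue;

        struct uffdio_copy copy = {
            .dst = msg.arg.pagefault.address & ~(PAGE_SZ - 1),
            .src = (unsigned long)src,
            .len = PAGE_SZ,
        };
        /* Populates the missing page and wakes the blocked thread. */
        if (ioctl(uffd, UFFDIO_COPY, &copy) == -1)
            perror("UFFDIO_COPY");
    }
    return NULL;
}

int main(void)
{
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd == -1) { perror("userfaultfd"); return 1; }

    struct uffdio_api api = { .api = UFFD_API };
    if (ioctl(uffd, UFFDIO_API, &api) == -1) { perror("UFFDIO_API"); return 1; }

    char *area = mmap(NULL, PAGE_SZ, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED) { perror("mmap"); return 1; }

    struct uffdio_register reg = {
        .range = { .start = (unsigned long)area, .len = PAGE_SZ },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1) { perror("UFFDIO_REGISTER"); return 1; }

    pthread_t monitor;
    pthread_create(&monitor, NULL, fault_handler, &uffd);

    /*
     * This read faults into handle_userfault() and should only be
     * released by the UFFDIO_COPY above.
     */
    printf("first byte after UFFDIO_COPY: 0x%02x\n", (unsigned char)area[0]);
    return 0;
}

If the kernel-side wait is released by anything other than the UFFDIO_COPY wakeup (the false rwsem wakeup described above), the page is still missing on retry and the access turns into exactly the graceful SIGBUS the selftest observed.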
1 parent 9618fba commit dbd9eee

File tree: 1 file changed (+35, -2 lines)

fs/userfaultfd.c

Lines changed: 35 additions & 2 deletions
@@ -63,6 +63,7 @@ struct userfaultfd_wait_queue {
         struct uffd_msg msg;
         wait_queue_t wq;
         struct userfaultfd_ctx *ctx;
+        bool waken;
 };
 
 struct userfaultfd_wake_range {
@@ -86,6 +87,12 @@ static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
         if (len && (start > uwq->msg.arg.pagefault.address ||
                     start + len <= uwq->msg.arg.pagefault.address))
                 goto out;
+        WRITE_ONCE(uwq->waken, true);
+        /*
+         * The implicit smp_mb__before_spinlock in try_to_wake_up()
+         * renders uwq->waken visible to other CPUs before the task is
+         * waken.
+         */
         ret = wake_up_state(wq->private, mode);
         if (ret)
                 /*
@@ -264,6 +271,7 @@ int handle_userfault(struct fault_env *fe, unsigned long reason)
         struct userfaultfd_wait_queue uwq;
         int ret;
         bool must_wait, return_to_userland;
+        long blocking_state;
 
         BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
@@ -333,10 +341,13 @@ int handle_userfault(struct fault_env *fe, unsigned long reason)
         uwq.wq.private = current;
         uwq.msg = userfault_msg(fe->address, fe->flags, reason);
         uwq.ctx = ctx;
+        uwq.waken = false;
 
         return_to_userland =
                 (fe->flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
                 (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);
+        blocking_state = return_to_userland ? TASK_INTERRUPTIBLE :
+                         TASK_KILLABLE;
 
         spin_lock(&ctx->fault_pending_wqh.lock);
         /*
@@ -349,8 +360,7 @@ int handle_userfault(struct fault_env *fe, unsigned long reason)
          * following the spin_unlock to happen before the list_add in
          * __add_wait_queue.
          */
-        set_current_state(return_to_userland ? TASK_INTERRUPTIBLE :
-                          TASK_KILLABLE);
+        set_current_state(blocking_state);
         spin_unlock(&ctx->fault_pending_wqh.lock);
 
         must_wait = userfaultfd_must_wait(ctx, fe->address, fe->flags, reason);
@@ -362,6 +372,29 @@ int handle_userfault(struct fault_env *fe, unsigned long reason)
                 wake_up_poll(&ctx->fd_wqh, POLLIN);
                 schedule();
                 ret |= VM_FAULT_MAJOR;
+
+                /*
+                 * False wakeups can orginate even from rwsem before
+                 * up_read() however userfaults will wait either for a
+                 * targeted wakeup on the specific uwq waitqueue from
+                 * wake_userfault() or for signals or for uffd
+                 * release.
+                 */
+                while (!READ_ONCE(uwq.waken)) {
+                        /*
+                         * This needs the full smp_store_mb()
+                         * guarantee as the state write must be
+                         * visible to other CPUs before reading
+                         * uwq.waken from other CPUs.
+                         */
+                        set_current_state(blocking_state);
+                        if (READ_ONCE(uwq.waken) ||
+                            READ_ONCE(ctx->released) ||
+                            (return_to_userland ? signal_pending(current) :
+                             fatal_signal_pending(current)))
+                                break;
+                        schedule();
+                }
         }
 
         __set_current_state(TASK_RUNNING);
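The last hunk is the heart of the fix: instead of treating the return from the single schedule() as proof that the UFFDIO_COPY completed, the task rechecks uwq.waken after every wakeup and goes back to sleep if the wakeup was a stray one (for example from rwsem), leaving only the targeted wakeup, signals and uffd release as exit conditions. The same "recheck the predicate after every wakeup" discipline is familiar from userspace, where pthread_cond_wait() is also allowed to return spuriously. Below is a minimal sketch of that analogy using plain POSIX threads, not the kernel primitives; the waken and released flags mirror uwq.waken and ctx->released only illustratively.

/*
 * Userspace analogy of the uwq.waken loop: the waiter sleeps until its
 * own flag is set and simply goes back to sleep on any other wakeup.
 * Build with -pthread.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool waken;      /* stands in for uwq.waken */
static bool released;   /* stands in for ctx->released */

static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    /* Sleep until *our* event fires; ignore every spurious wakeup. */
    while (!waken && !released)
        pthread_cond_wait(&cond, &lock);
    bool w = waken, r = released;
    pthread_mutex_unlock(&lock);
    printf("woken for the targeted event (waken=%d released=%d)\n", w, r);
    return NULL;
}

static void *waker(void *arg)
{
    (void)arg;
    sleep(1);
    pthread_mutex_lock(&lock);
    waken = true;   /* the "targeted" wakeup, like wake_userfault() */
    pthread_mutex_unlock(&lock);
    pthread_cond_signal(&cond);
    return NULL;
}

int main(void)
{
    pthread_t w, k;

    pthread_create(&w, NULL, waiter, NULL);
    pthread_create(&k, NULL, waker, NULL);
    pthread_join(w, NULL);
    pthread_join(k, NULL);
    return 0;
}

As in the patch, the waiter never concludes anything from merely having been woken; only the flag set by its dedicated waker (or the release path) ends the loop.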
