|
| 1 | +Using Stack Trace Variables to Find a Kyber Bug |
| 2 | +=============================================== |
| 3 | + |
| 4 | +| Author: Omar Sandoval |
| 5 | +| Date: June 9th, 2021 |
| 6 | +
|
| 7 | +.. highlight:: pycon |
| 8 | + |
| 9 | +Jakub Kicinski reported a crash in the `Kyber I/O scheduler |
| 10 | +<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/block/kyber-iosched.c>`_ |
| 11 | +when he was testing Linux 5.12. He captured a core dump and asked me to debug |
| 12 | +it. This is a quick writeup of that investigation. |
| 13 | + |
| 14 | +First, we can get the task that crashed:: |
| 15 | + |
| 16 | + >>> task = per_cpu(prog["runqueues"], prog["crashing_cpu"]).curr |
| 17 | + |
| 18 | +Then, we can get its stack trace:: |
| 19 | + |
| 20 | + >>> trace = prog.stack_trace(task) |
| 21 | + >>> trace |
| 22 | + #0 queued_spin_lock_slowpath (../kernel/locking/qspinlock.c:471:3) |
| 23 | + #1 queued_spin_lock (../include/asm-generic/qspinlock.h:85:2) |
| 24 | + #2 do_raw_spin_lock (../kernel/locking/spinlock_debug.c:113:2) |
| 25 | + #3 spin_lock (../include/linux/spinlock.h:354:2) |
| 26 | + #4 kyber_bio_merge (../block/kyber-iosched.c:573:2) |
| 27 | + #5 blk_mq_sched_bio_merge (../block/blk-mq-sched.h:37:9) |
| 28 | + #6 blk_mq_submit_bio (../block/blk-mq.c:2182:6) |
| 29 | + #7 __submit_bio_noacct_mq (../block/blk-core.c:1015:9) |
| 30 | + #8 submit_bio_noacct (../block/blk-core.c:1048:10) |
| 31 | + #9 submit_bio (../block/blk-core.c:1125:9) |
| 32 | + #10 submit_stripe_bio (../fs/btrfs/volumes.c:6553:2) |
| 33 | + #11 btrfs_map_bio (../fs/btrfs/volumes.c:6642:3) |
| 34 | + #12 btrfs_submit_data_bio (../fs/btrfs/inode.c:2440:8) |
| 35 | + #13 submit_one_bio (../fs/btrfs/extent_io.c:175:9) |
| 36 | + #14 submit_extent_page (../fs/btrfs/extent_io.c:3229:10) |
| 37 | + #15 __extent_writepage_io (../fs/btrfs/extent_io.c:3793:9) |
| 38 | + #16 __extent_writepage (../fs/btrfs/extent_io.c:3872:8) |
| 39 | + #17 extent_write_cache_pages (../fs/btrfs/extent_io.c:4514:10) |
| 40 | + #18 extent_writepages (../fs/btrfs/extent_io.c:4635:8) |
| 41 | + #19 do_writepages (../mm/page-writeback.c:2352:10) |
| 42 | + #20 __writeback_single_inode (../fs/fs-writeback.c:1467:8) |
| 43 | + #21 writeback_sb_inodes (../fs/fs-writeback.c:1732:3) |
| 44 | + #22 __writeback_inodes_wb (../fs/fs-writeback.c:1801:12) |
| 45 | + #23 wb_writeback (../fs/fs-writeback.c:1907:15) |
| 46 | + #24 wb_check_background_flush (../fs/fs-writeback.c:1975:10) |
| 47 | + #25 wb_do_writeback (../fs/fs-writeback.c:2063:11) |
| 48 | + #26 wb_workfn (../fs/fs-writeback.c:2091:20) |
| 49 | + #27 process_one_work (../kernel/workqueue.c:2275:2) |
| 50 | + #28 worker_thread (../kernel/workqueue.c:2421:4) |
| 51 | + #29 kthread (../kernel/kthread.c:292:9) |
| 52 | + #30 ret_from_fork+0x1f/0x2a (../arch/x86/entry/entry_64.S:294) |
| 53 | + |
| 54 | +It looks like ``kyber_bio_merge()`` tried to lock an invalid spinlock. For |
| 55 | +reference, this is the source code of ``kyber_bio_merge()``: |
| 56 | + |
| 57 | +.. code-block:: c |
| 58 | + :lineno-start: 563 |
| 59 | +
|
| 60 | + static bool kyber_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio, |
| 61 | + unsigned int nr_segs) |
| 62 | + { |
| 63 | + struct kyber_hctx_data *khd = hctx->sched_data; |
| 64 | + struct blk_mq_ctx *ctx = blk_mq_get_ctx(hctx->queue); |
| 65 | + struct kyber_ctx_queue *kcq = &khd->kcqs[ctx->index_hw[hctx->type]]; |
| 66 | + unsigned int sched_domain = kyber_sched_domain(bio->bi_opf); |
| 67 | + struct list_head *rq_list = &kcq->rq_list[sched_domain]; |
| 68 | + bool merged; |
| 69 | +
|
| 70 | + spin_lock(&kcq->lock); |
| 71 | + merged = blk_bio_list_merge(hctx->queue, rq_list, bio, nr_segs); |
| 72 | + spin_unlock(&kcq->lock); |
| 73 | +
|
| 74 | + return merged; |
| 75 | + } |
| 76 | +
|
| 77 | +When printed, the ``kcq`` structure containing the spinlock indeed looks like |
| 78 | +garbage (omitted for brevity). |
| 79 | + |
| 80 | +A crash course on the Linux kernel block layer: for each block device, there is |
| 81 | +a "software queue" (``struct blk_mq_ctx *ctx``) for each CPU and a "hardware |
| 82 | +queue" (``struct blk_mq_hw_ctx *hctx``) for each I/O queue provided by the |
| 83 | +device. Each hardware queue has one or more software queues assigned to it. |
| 84 | +Kyber keeps additional data per hardware queue (``struct kyber_hctx_data |
| 85 | +*khd``) and per software queue (``struct kyber_ctx_queue *kcq``). |
| 86 | + |
| 87 | +Let's try to figure out where the bad ``kcq`` came from. It should be an |
| 88 | +element of the ``khd->kcqs`` array (``khd`` is optimized out, but we can |
| 89 | +recover it from ``hctx->sched_data``):: |
| 90 | + |
| 91 | + >>> trace[4]["khd"] |
| 92 | + (struct kyber_hctx_data *)<absent> |
| 93 | + >>> hctx = trace[4]["hctx"] |
| 94 | + >>> khd = cast("struct kyber_hctx_data *", hctx.sched_data) |
| 95 | + >>> trace[4]["kcq"] - khd.kcqs |
| 96 | + (ptrdiff_t)1 |
| 97 | + >>> hctx.nr_ctx |
| 98 | + (unsigned short)1 |
| 99 | + |
| 100 | +So the ``kcq`` is for the second software queue, but the hardware queue is only |
| 101 | +supposed to have one software queue. Let's see which CPU was assigned to the |
| 102 | +hardware queue:: |
| 103 | + |
| 104 | + >>> hctx.ctxs[0].cpu |
| 105 | + (unsigned int)6 |
| 106 | + |
| 107 | +Here's the problem: we're not running on CPU 6, we're running on CPU 19:: |
| 108 | + |
| 109 | + >>> prog["crashing_cpu"] |
| 110 | + (int)19 |
| 111 | + |
| 112 | +And CPU 19 is assigned to a different hardware queue that actually does have |
| 113 | +two software queues:: |
| 114 | + |
| 115 | + >>> ctx = per_cpu_ptr(hctx.queue.queue_ctx, 19) |
| 116 | + >>> other_hctx = ctx.hctxs[hctx.type] |
| 117 | + >>> other_hctx == hctx |
| 118 | + False |
| 119 | + >>> other_hctx.nr_ctx |
| 120 | + (unsigned short)2 |
| 121 | + |
| 122 | +The bug is that the caller gets the ``hctx`` for the current CPU, then |
| 123 | +``kyber_bio_merge()`` gets the ``ctx`` for the current CPU, and if the thread |
| 124 | +is migrated to another CPU in between, they won't match. The fix is to get a |
| 125 | +consistent view of the ``hctx`` and ``ctx``. The commit that fixes this is |
| 126 | +`here |
| 127 | +<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=efed9a3337e341bd0989161b97453b52567bc59d>`_. |
0 commit comments