Repeatable kernel crash in #737 #859

radford-for-smpte · 2015-02-28T04:36:56Z

I have a node.js app that repeatably causes the following kernel error:

pi@pi ~ $ uname -a
Linux pi 3.12.36+ #737 PREEMPT Wed Jan 14 19:40:07 GMT 2015 armv6l GNU/Linux

Feb 27 19:33:22 pi kernel: [169472.321768] Modules linked in: snd_bcm2835 snd_pcm snd_page_alloc snd_seq snd_seq_device snd_timer snd pl2303 usbserial leds_gpio led_class spi_bcm2708
Feb 27 19:33:22 pi kernel: [169472.339986] CPU: 0 PID: 2472 Comm: node Not tainted 3.12.36+ #737
Feb 27 19:33:22 pi kernel: [169472.347687] task: d6178c80 ti: d61ee000 task.ti: d61ee000
Feb 27 19:33:22 pi kernel: [169472.354723] PC is at vfp_save_state+0x0/0x28
Feb 27 19:33:22 pi kernel: [169472.360637] LR is at vfp_sync_hwstate+0x70/0x7c
Feb 27 19:33:22 pi kernel: [169472.366797] pc : [<c00099c4>]    lr : [<c0009588>]    psr: 60000013
Feb 27 19:33:22 pi kernel: [169472.366797] sp : d61efeb0  ip : 00000018  fp : beb11de0
Feb 27 19:33:22 pi kernel: [169472.381537] r10: 01771044  r9 : d6179028  r8 : 00000000
Feb 27 19:33:22 pi kernel: [169472.388375] r7 : d61effb0  r6 : beb11cd0  r5 : 80000780  r4 : d61ee030
Feb 27 19:33:22 pi kernel: [169472.396549] r3 : d61ee0f8  r2 : c062ce60  r1 : c0000780  r0 : d61ee0f8
Feb 27 19:33:22 pi kernel: [169472.404701] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
Feb 27 19:33:22 pi kernel: [169472.414917] Control: 00c5387d  Table: 16140008  DAC: 00000015
Feb 27 19:33:22 pi kernel: [169472.563714] [<c00099c4>] (vfp_save_state+0x0/0x28) from [<c0009588>] (vfp_sync_hwstate+0x70/0x7c)
Feb 27 19:33:22 pi kernel: [169472.576493] [<c0009588>] (vfp_sync_hwstate+0x70/0x7c) from [<c0009794>] (vfp_preserve_user_clear_hwstate+0x20/0x8c)
Feb 27 19:33:22 pi kernel: [169472.590885] [<c0009794>] (vfp_preserve_user_clear_hwstate+0x20/0x8c) from [<c0010a8c>] (setup_sigframe+0x190/0x1a0)
Feb 27 19:33:22 pi kernel: [169472.605262] [<c0010a8c>] (setup_sigframe+0x190/0x1a0) from [<c0010ee8>] (do_signal+0x2e8/0x440)
Feb 27 19:33:22 pi kernel: [169472.617874] [<c0010ee8>] (do_signal+0x2e8/0x440) from [<c00111d8>] (do_work_pending+0xa4/0xb4)
Feb 27 19:33:22 pi kernel: [169472.630384] [<c00111d8>] (do_work_pending+0xa4/0xb4) from [<c000df00>] (work_pending+0xc/0x20)
Feb 27 19:33:23 pi kernel: [169472.661933] ---[ end trace ca4423d502cf7049 ]---
Feb 27 19:33:23 pi kernel: [169472.675291] note: node[2472] exited with preempt_count 2
Feb 27 19:33:23 pi kernel: [169472.705002] Modules linked in: snd_bcm2835 snd_pcm snd_page_alloc snd_seq snd_seq_device snd_timer snd pl2303 usbserial leds_gpio led_class spi_bcm2708
Feb 27 19:33:23 pi kernel: [169472.748913] CPU: 0 PID: 2472 Comm: node Tainted: G      D      3.12.36+ #737
Feb 27 19:33:23 pi kernel: [169472.776293] [<c00143d4>] (unwind_backtrace+0x0/0xec) from [<c00116b8>] (show_stack+0x10/0x14)
Feb 27 19:33:23 pi kernel: [169472.805157] [<c00116b8>] (show_stack+0x10/0x14) from [<c04459b8>] (__schedule_bug+0x48/0x64)
Feb 27 19:33:23 pi kernel: [169472.833943] [<c04459b8>] (__schedule_bug+0x48/0x64) from [<c044aebc>] (__schedule+0x518/0x5ac)
Feb 27 19:33:23 pi kernel: [169472.862904] [<c044aebc>] (__schedule+0x518/0x5ac) from [<c0045758>] (__cond_resched+0x24/0x34)
Feb 27 19:33:23 pi kernel: [169472.891872] [<c0045758>] (__cond_resched+0x24/0x34) from [<c044b6c4>] (_cond_resched+0x3c/0x44)
Feb 27 19:33:23 pi kernel: [169472.920995] [<c044b6c4>] (_cond_resched+0x3c/0x44) from [<c00c3004>] (__get_user_pages.part.91+0xa0/0x408)
Feb 27 19:33:23 pi kernel: [169472.950871] [<c00c3004>] (__get_user_pages.part.91+0xa0/0x408) from [<c00b8214>] (get_user_pages_fast+0x60/0x78)
Feb 27 19:33:23 pi kernel: [169472.981397] [<c00b8214>] (get_user_pages_fast+0x60/0x78) from [<c0065af8>] (get_futex_key+0xec/0x244)
Feb 27 19:33:23 pi kernel: [169473.010922] [<c0065af8>] (get_futex_key+0xec/0x244) from [<c0065d00>] (futex_wake+0x30/0x160)
Feb 27 19:33:23 pi kernel: [169473.039707] [<c0065d00>] (futex_wake+0x30/0x160) from [<c0067e0c>] (do_futex+0x120/0xb78)
Feb 27 19:33:23 pi kernel: [169473.068342] [<c0067e0c>] (do_futex+0x120/0xb78) from [<c00688e4>] (SyS_futex+0x80/0x160)
Feb 27 19:33:23 pi kernel: [169473.097280] [<c00688e4>] (SyS_futex+0x80/0x160) from [<c001d328>] (mm_release+0xfc/0x138)
Feb 27 19:33:23 pi kernel: [169473.126477] [<c001d328>] (mm_release+0xfc/0x138) from [<c0021774>] (do_exit+0x110/0x970)
Feb 27 19:33:23 pi kernel: [169473.155673] [<c0021774>] (do_exit+0x110/0x970) from [<c00119f0>] (die+0x334/0x390)
Feb 27 19:33:23 pi kernel: [169473.184507] [<c00119f0>] (die+0x334/0x390) from [<c0008334>] (do_undefinstr+0x1b0/0x1dc)
Feb 27 19:33:23 pi kernel: [169473.213890] [<c0008334>] (do_undefinstr+0x1b0/0x1dc) from [<c044d0ec>] (__und_svc_finish+0x0/0x34)
Feb 27 19:33:23 pi kernel: [169473.244100] Exception stack(0xd61efe28 to 0xd61efe70)
Feb 27 19:33:23 pi kernel: [169473.259851] fe20:                   d61ee0f8 c0000780 c062ce60 d61ee0f8 d61ee030 80000780
Feb 27 19:33:23 pi kernel: [169473.288918] fe40: beb11cd0 d61effb0 00000000 d6179028 01771044 beb11de0 00000018 d61efeb0
Feb 27 19:33:23 pi kernel: [169473.317875] fe60: c0009588 c00099c4 60000013 ffffffff
Feb 27 19:33:23 pi kernel: [169473.333576] [<c044d0ec>] (__und_svc_finish+0x0/0x34) from [<c00099c4>] (vfp_save_state+0x0/0x28)
Feb 27 19:33:23 pi kernel: [169473.363003] [<c00099c4>] (vfp_save_state+0x0/0x28) from [<c0009588>] (vfp_sync_hwstate+0x70/0x7c)
Feb 27 19:33:23 pi kernel: [169473.392348] [<c0009588>] (vfp_sync_hwstate+0x70/0x7c) from [<c0009794>] (vfp_preserve_user_clear_hwstate+0x20/0x8c)
Feb 27 19:33:23 pi kernel: [169473.423359] [<c0009794>] (vfp_preserve_user_clear_hwstate+0x20/0x8c) from [<c0010a8c>] (setup_sigframe+0x190/0x1a0)
Feb 27 19:33:23 pi kernel: [169473.454368] [<c0010a8c>] (setup_sigframe+0x190/0x1a0) from [<c0010ee8>] (do_signal+0x2e8/0x440)
Feb 27 19:33:23 pi kernel: [169473.483620] [<c0010ee8>] (do_signal+0x2e8/0x440) from [<c00111d8>] (do_work_pending+0xa4/0xb4)
Feb 27 19:33:23 pi kernel: [169473.512866] [<c00111d8>] (do_work_pending+0xa4/0xb4) from [<c000df00>] (work_pending+0xc/0x20)

Happy to help debug this more, but I'm not sure exactly how. As I said, it's very repeatable.

popcornmix · 2015-02-28T12:45:31Z

The 3.12 kernel is no longer supported. Can you apt-get upgrade to get the current 3.18 kernel and confirm whether the issue is still present?

radford-for-smpte · 2015-02-28T15:41:40Z

OK thanks, upgraded. I'll update if the problem persists...

radford-for-smpte · 2015-03-02T15:01:09Z

OK problem still occurs:

[44897.980959] BUG: unsupported FP instruction in kernel mode
[44897.988358] Internal error: Oops - undefined instruction: 0 [#1] PREEMPT ARM
[44897.998862] Modules linked in: snd_bcm2835 snd_pcm snd_seq snd_seq_device snd_timer snd pl2303 usbserial spi_bcm2708 uio_pdrv_genirq uio
[44898.014916] CPU: 0 PID: 2415 Comm: node Not tainted 3.18.7+ #755
[44898.022729] task: d6321b00 ti: d62ec000 task.ti: d62ec000
[44898.029912] PC is at vfp_save_state+0x0/0x28
[44898.035972] LR is at vfp_sync_hwstate+0x7c/0x88
[44898.042276] pc : [<c0009d20>]    lr : [<c00098b0>]    psr: 60000113
[44898.042276] sp : d62ede70  ip : d62ede88  fp : d62ede84
[44898.057358] r10: d6321f58  r9 : d62ec000  r8 : 00000000
[44898.064377] r7 : d62edec0  r6 : bea1acd8  r5 : 80000780  r4 : d62ec030
[44898.072714] r3 : d62ec0f8  r2 : c0855904  r1 : c0000780  r0 : d62ec0f8
[44898.081022] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[44898.091720] Control: 00c5387d  Table: 172e8008  DAC: 00000015
[44898.099297] Process node (pid: 2415, stack limit = 0xd62ec1b0)
[44898.106934] Stack: (0xd62ede70 to 0xd62ee000)
[44898.113119] de60:                                     bea1abd0 d62ec000 d62edea4 d62ede88
[44898.124980] de80: c0009ad4 c0009840 00000120 00000000 bea1aae0 00000000 d62edebc d62edea8
[44898.136965] dea0: c0011874 c0009ab0 d62edfb0 bea1aae0 d62edf84 d62edec0 c0011cb8 c00116e0
[44898.148962] dec0: 0028fa94 04000000 b6cbcb20 7ffbfeff fffffffe 00000011 00000000 00040001
[44898.160958] dee0: 00004d1e 000003e8 00000000 00000001 00000003 c0556c74 a0317588 000028d5
[44898.173013] df00: d62edf5c d62edf10 c00266a8 c0026234 00010000 c0811200 00000001 00400000
[44898.185199] df20: c0823388 60000193 00000000 00000000 d62edf5c d62edf40 c0069544 c00c3f98
[44898.197503] df40: c08104c0 d62ec000 c0837108 d62ec008 d62ec000 00000011 d62edfb0 d62ec008
[44898.209942] df60: d62ec000 00000000 d62edfb0 00000000 d62ec000 01d40044 d62edfac d62edf88
[44898.222522] df80: c0011fcc c0011a04 00c5387d 00000001 0030db24 60000010 f200b200 00c5387d
[44898.235135] dfa0: 00000000 d62edfb0 c000e918 c0011f18 01d40040 00000000 01d40038 330080a8
[44898.247828] dfc0: 00000002 003eedd4 bea1adec 4f6080c9 4610b3e9 4c6096c4 01d40044 bea1ade0
[44898.260625] dfe0: bea1add4 bea1add0 39d0a35c 0030db24 60000010 ffffffff 00000000 00000000
[44898.273491] [<c0009d20>] (vfp_save_state) from [<c0009ad4>] (vfp_preserve_user_clear_hwstate+0x30/0x9c)
[44898.287607] [<c0009ad4>] (vfp_preserve_user_clear_hwstate) from [<c0011874>] (setup_sigframe+0x1a0/0x1b0)
[44898.301888] [<c0011874>] (setup_sigframe) from [<c0011cb8>] (do_signal+0x2c0/0x400)
[44898.314226] [<c0011cb8>] (do_signal) from [<c0011fcc>] (do_work_pending+0xc0/0x110)
[44898.326552] [<c0011fcc>] (do_work_pending) from [<c000e918>] (work_pending+0xc/0x20)
[44898.338944] Code: e12fff1e e1a0200d e1a0e009 eafffe76 (eca00b20)
[44898.363805] ---[ end trace de1f09b100f292e5 ]---
[44898.377259] note: node[2415] exited with preempt_count 2

popcornmix · 2015-03-02T15:21:43Z

Can you provide a tar file containing something that can be run on raspbian to produce this error?

popcornmix · 2015-03-02T15:26:33Z

Your problem looks similar to:
#600
Interestingly that appeared to get fixed by an update to BOINC:
http://boinc.berkeley.edu/dev/forum_thread.php?id=9222
not by a kernel change, so I suspect the bug is in your user code.

Ferroin · 2015-03-04T16:21:33Z

If you are hitting an illegal instruction in kernel mode, it's not a userspace issue, it's a kernel bug. The only reason that it doesn't happen with the updated version of BOINC is that it was modified to not hit that codepath in the kernel, thus avoiding triggering the bug.

popcornmix · 2015-03-04T17:29:06Z

@Ferroin Do you know the exact change that was made to BOINC? It might help us understand the issue.

Ferroin · 2015-03-04T18:14:02Z

No, but it shouldn't take me long to figure out. I use a local build of the BOINC client on all my systems, so I have a local copy of the upstream repository anyway. I'll see if I can figure it out and hopefully be back with an answer shortly.

Ferroin · 2015-03-04T18:31:39Z

Hmm, I'm not seeing any tag for the release they are saying is working, despite having just updated my local copy today.

In theory, it shouldn't be too hard to find out what is causing this, as there isn't a huge amount of stuff in the kernel that actually uses VFP. Based on the backtraces it looks like something in the kernel's signal handling code is what's having the issues, although why that would be I have no idea. The actual stacktrace appears to be almost identical to the one seen in #600, and I believe the BOINC client changed how it's using signals to manage individual apps between 7.2 and 7.4, so the signal handling code would make some sense as a common issue.

It would be interesting though to see if the same thing happens on a Pi2, as ARMv7-A has a different superset of the ISA than ARMv6.

popcornmix · 2015-03-04T18:41:24Z

There should be no VFP code in the kernel.

Ferroin · 2015-03-04T19:11:39Z

In the core kernel there isn't any that uses it for floating point calculations, although there are some SIMD optimizations that use it (CryptoAPI has some modules that do this if they detect VFP support, but I don't know for certain of anywhere else that uses it). Also, some non-FP op-codes might get used during context switch when saving the FPU state, and that looks like what is causing the issue here, although why it is only triggering during the context switch for the signal handler I have no idea.

P33M · 2015-03-06T11:50:17Z

It's an undefined instruction exception around the VFP save/restore state.

From http://lxr.free-electrons.com/source/arch/arm/vfp/vfpmodule.c#L516

It appears that the macros are using MCR/MRC accesses to talk to VFP registers. Pi 1 is "unusual" in having a VFPv2 implementation which isn't widely used.

While waiting on availability of the node.js implementation that causes this - perhaps a printk inside vfp_sync_hwstate() will tell us if this function is ever called without crashing?

radford-for-smpte · 2015-03-06T16:04:27Z

Re sample code, it's a rather large project that, among other things, polls a lot of different equipment. It'd be rather difficult to run externally. I'll see if I can whittle it down to a smaller, more portable utility that still exhibits the crash.

Since my last post I had upgraded node to the latest v0.10.28 and I had thought the problem went away. Now after 3 days, the problem recurred:

[327065.395401] BUG: unsupported FP instruction in kernel mode
[327065.402939] Internal error: Oops - undefined instruction: 0 [#1] PREEMPT ARM
[327065.413619] Modules linked in: snd_bcm2835 snd_pcm snd_seq snd_seq_device snd_timer snd pl2303 usbserial spi_bcm2708 uio_pdrv_genirq uio
[327065.429854] CPU: 0 PID: 4453 Comm: node Not tainted 3.18.8+ #763
[327065.437801] task: d631a880 ti: d6010000 task.ti: d6010000
[327065.445117] PC is at vfp_save_state+0x0/0x28
[327065.451313] LR is at vfp_sync_hwstate+0x7c/0x88
[327065.457752] pc : [<c0009c4c>]    lr : [<c00097dc>]    psr: 60000113
[327065.457752] sp : d6011e70  ip : d6011e88  fp : d6011e84
[327065.473102] r10: d631acd8  r9 : d6010000  r8 : 00000000
[327065.480260] r7 : d6011ec0  r6 : bea9ec68  r5 : 80000780  r4 : d6010030
[327065.488732] r3 : d60100f8  r2 : c083cf44  r1 : c0000780  r0 : d60100f8
[327065.497177] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[327065.507988] Control: 00c5387d  Table: 163dc008  DAC: 00000015
[327065.515650] Process node (pid: 4453, stack limit = 0xd60101b0)
[327065.523400] Stack: (0xd6011e70 to 0xd6012000)
[327065.529666] 1e60:                                     bea9eb60 d6010000 d6011ea4 d6011e88
[327065.541613] 1e80: c0009a00 c000976c 00000120 00000000 bea9ea70 00000000 d6011ebc d6011ea8
[327065.553692] 1ea0: c00117b4 c00099dc d6011fb0 bea9ea70 d6011f84 d6011ec0 c0011bf8 c0011620
[327065.565774] 1ec0: 00529b4c 04000000 b6c69b20 7ffbfeff fffffffe 00000011 00000000 00040001
[327065.577862] 1ee0: 00007eea 000003e8 00000000 00000003 00000003 c05476e4 d8a878b0 00012976
[327065.590006] 1f00: d6011f5c d6011f10 c0026250 c0025ddc 00010000 c07fac0c 00000001 00400000
[327065.602274] 1f20: c080b520 60000193 00000000 00000000 d6011f5c d6011f40 c006742c c00c1dac
[327065.614657] 1f40: c07f9fa0 d6010000 c081f228 d6010008 d6010000 00000011 d6011fb0 d6010008
[327065.627183] 1f60: d6010000 00000000 d6011fb0 00000000 d6010000 01153044 d6011fac d6011f88
[327065.639847] 1f80: c0011f0c c0011944 00c5387d 00000001 001fa8e0 60000010 f200b200 00c5387d
[327065.652544] 1fa0: 00000000 d6011fb0 c000e858 c0011e58 01153040 00000000 01153038 229080a8
[327065.665319] 1fc0: 00000000 01153040 006e5768 590080c9 234dab1d 455afdc4 01153044 bea9ed7c
[327065.678202] 1fe0: bea9ed84 bea9ed60 2400a35c 001fa8e0 60000010 ffffffff 00000000 00000000
[327065.691147] [<c0009c4c>] (vfp_save_state) from [<c0009a00>] (vfp_preserve_user_clear_hwstate+0x30/0x9c)
[327065.705338] [<c0009a00>] (vfp_preserve_user_clear_hwstate) from [<c00117b4>] (setup_sigframe+0x1a0/0x1b0)
[327065.719687] [<c00117b4>] (setup_sigframe) from [<c0011bf8>] (do_signal+0x2c0/0x400)
[327065.732082] [<c0011bf8>] (do_signal) from [<c0011f0c>] (do_work_pending+0xc0/0x110)
[327065.744478] [<c0011f0c>] (do_work_pending) from [<c000e858>] (work_pending+0xc/0x20)
[327065.756948] Code: e12fff1e e1a0200d e1a0e009 eafffe76 (eca00b20)
[327065.777848] ---[ end trace a6e0bc6dd3e22dcd ]---
[327065.791353] note: node[4453] exited with preempt_count 2

johnspackman · 2015-03-12T08:43:39Z

We have a lot of Pis in production (70+) and this seems to crop up from time to time in recent builds; I now have a couple in my office that are doing it more frequently and one which does it every morning. The odd thing is that with the Pi which does it every day, I have replaced the SD card and the Pi itself (both brand new out of the packaging, the SD card is supplied by Raspberry themselves), leaving only the PSU, HDMI cable, and the TV. The PSU is a standard PSU from RS components.

Last night I restarted it and by 2:35am it had gone again.

Stack trace below

pi@player-00002:~$ uname -a
Linux player-00002 3.12.28+ #709 PREEMPT Mon Sep 8 15:28:00 BST 2014 armv6l GNU/Linux

pi@player-00002:~$ /opt/vc/bin/vcgencmd version
Sep  8 2014 19:02:48
Copyright (c) 2012 Broadcom
version 3f2f2607186be72e4945cfa8edc77872dfc73195 (clean) (release)

pi@player-00002:~$ node -v
v0.10.26

[Thu Mar 12 02:36:35 2015] BUG: unsupported FP instruction in kernel mode
[Thu Mar 12 02:36:35 2015] Internal error: Oops - undefined instruction: 0 [#1] PREEMPT ARM
[Thu Mar 12 02:36:35 2015] Modules linked in: tun snd_bcm2835 leds_gpio snd_soc_bcm2708_i2s led_class regmap_mmio snd_soc_core snd_compress regmap_i2c snd_pcm_dmaengine regmap_spi snd_pcm snd_page_alloc snd_timer snd
[Thu Mar 12 02:36:35 2015] CPU: 0 PID: 2161 Comm: node Not tainted 3.12.28+ #709
[Thu Mar 12 02:36:35 2015] task: d71bcb00 ti: d609c000 task.ti: d609c000
[Thu Mar 12 02:36:35 2015] PC is at vfp_save_state+0x0/0x28
[Thu Mar 12 02:36:35 2015] LR is at vfp_sync_hwstate+0x70/0x7c
[Thu Mar 12 02:36:35 2015] pc : [<c00099c4>]    lr : [<c0009588>]    psr: 60000013
[Thu Mar 12 02:36:35 2015] sp : d609deb0  ip : 00000018  fp : bec84690
[Thu Mar 12 02:36:35 2015] r10: 02550044  r9 : d71bce90  r8 : 00000000
[Thu Mar 12 02:36:35 2015] r7 : d609dfb0  r6 : bec84590  r5 : 80000780  r4 : d609c030
[Thu Mar 12 02:36:35 2015] r3 : d609c0f8  r2 : c061e920  r1 : c0000780  r0 : d609c0f8
[Thu Mar 12 02:36:35 2015] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[Thu Mar 12 02:36:35 2015] Control: 00c5387d  Table: 17248008  DAC: 00000015
[Thu Mar 12 02:36:35 2015] Process node (pid: 2161, stack limit = 0xd609c1b8)
[Thu Mar 12 02:36:35 2015] Stack: (0xd609deb0 to 0xd609e000)
[Thu Mar 12 02:36:35 2015] dea0:                                     d609dec0 bec84488 d609c000 c0009794
[Thu Mar 12 02:36:35 2015] dec0: 00000000 bec84398 00000000 c0010a48 5ac3c35a d609dfb0 bec84398 c0010ea4
[Thu Mar 12 02:36:35 2015] dee0: 00529190 04000000 b6ce9f20 7ffbfeff fffffffe 00000011 00000000 00040001
[Thu Mar 12 02:36:35 2015] df00: 00007d64 000003e8 00000000 00000000 00000000 c061fa90 c061fc90 c0011114
[Thu Mar 12 02:36:35 2015] df20: c67429a0 00001883 00000001 00000001 00000004 c061f128 c061f120 d609c000
[Thu Mar 12 02:36:35 2015] df40: 00000000 00000008 c061f12c c0023a8c d609c000 00000000 0000000a 0028abd1
[Thu Mar 12 02:36:35 2015] df60: d609c010 00400000 bec84690 60000193 00000003 00000011 d609c018 d609c000
[Thu Mar 12 02:36:35 2015] df80: 00000000 d609dfb0 00000000 d609c000 02550044 c0011194 00000000 2ae4dc40
[Thu Mar 12 02:36:35 2015] dfa0: 60000010 f200b200 00000002 c000df00 449080a5 449080a5 449080a4 00002be4
[Thu Mar 12 02:36:35 2015] dfc0: 449080a5 00000002 000010bf 00000002 58e241c5 b6ba8101 02550044 bec84690
[Thu Mar 12 02:36:35 2015] dfe0: 2ae4dc00 bec84688 2ae9e9fc 2ae4dc40 60000010 ffffffff 00000000 00000000
[Thu Mar 12 02:36:35 2015] [<c00099c4>] (vfp_save_state+0x0/0x28) from [<c0009588>] (vfp_sync_hwstate+0x70/0x7c)
[Thu Mar 12 02:36:35 2015] [<c0009588>] (vfp_sync_hwstate+0x70/0x7c) from [<c0009794>] (vfp_preserve_user_clear_hwstate+0x20/0x8c)
[Thu Mar 12 02:36:35 2015] [<c0009794>] (vfp_preserve_user_clear_hwstate+0x20/0x8c) from [<c0010a48>] (setup_sigframe+0x190/0x1a0)
[Thu Mar 12 02:36:35 2015] [<c0010a48>] (setup_sigframe+0x190/0x1a0) from [<c0010ea4>] (do_signal+0x2e8/0x440)
[Thu Mar 12 02:36:35 2015] [<c0010ea4>] (do_signal+0x2e8/0x440) from [<c0011194>] (do_work_pending+0xa4/0xb4)
[Thu Mar 12 02:36:35 2015] [<c0011194>] (do_work_pending+0xa4/0xb4) from [<c000df00>] (work_pending+0xc/0x20)
[Thu Mar 12 02:36:35 2015] Code: e1a0f00e e1a0200d e1a0e009 eafffe89 (eca00b20) 
[Thu Mar 12 02:36:35 2015] ---[ end trace ac93b69b8274a1e0 ]---
[Thu Mar 12 02:36:35 2015] note: node[2161] exited with preempt_count 2
[Thu Mar 12 02:36:35 2015] BUG: scheduling while atomic: node/2161/0x40000003
[Thu Mar 12 02:36:35 2015] Modules linked in: tun snd_bcm2835 leds_gpio snd_soc_bcm2708_i2s led_class regmap_mmio snd_soc_core snd_compress regmap_i2c snd_pcm_dmaengine regmap_spi snd_pcm snd_page_alloc snd_timer snd
[Thu Mar 12 02:36:35 2015] CPU: 0 PID: 2161 Comm: node Tainted: G      D      3.12.28+ #709
[Thu Mar 12 02:36:35 2015] [<c001444c>] (unwind_backtrace+0x0/0xec) from [<c0011730>] (show_stack+0x10/0x14)
[Thu Mar 12 02:36:35 2015] [<c0011730>] (show_stack+0x10/0x14) from [<c043d490>] (__schedule_bug+0x48/0x64)
[Thu Mar 12 02:36:35 2015] [<c043d490>] (__schedule_bug+0x48/0x64) from [<c0442b54>] (__schedule+0x518/0x5ac)
[Thu Mar 12 02:36:35 2015] [<c0442b54>] (__schedule+0x518/0x5ac) from [<c004566c>] (__cond_resched+0x24/0x34)
[Thu Mar 12 02:36:35 2015] [<c004566c>] (__cond_resched+0x24/0x34) from [<c044335c>] (_cond_resched+0x3c/0x44)
[Thu Mar 12 02:36:35 2015] [<c044335c>] (_cond_resched+0x3c/0x44) from [<c00c2030>] (__get_user_pages.part.91+0xa0/0x408)
[Thu Mar 12 02:36:35 2015] [<c00c2030>] (__get_user_pages.part.91+0xa0/0x408) from [<c00b754c>] (get_user_pages_fast+0x60/0x78)
[Thu Mar 12 02:36:35 2015] [<c00b754c>] (get_user_pages_fast+0x60/0x78) from [<c00657dc>] (get_futex_key+0xec/0x244)
[Thu Mar 12 02:36:35 2015] [<c00657dc>] (get_futex_key+0xec/0x244) from [<c00659e4>] (futex_wake+0x30/0x160)
[Thu Mar 12 02:36:35 2015] [<c00659e4>] (futex_wake+0x30/0x160) from [<c0067abc>] (do_futex+0x120/0xb78)
[Thu Mar 12 02:36:35 2015] [<c0067abc>] (do_futex+0x120/0xb78) from [<c0068594>] (SyS_futex+0x80/0x160)
[Thu Mar 12 02:36:35 2015] [<c0068594>] (SyS_futex+0x80/0x160) from [<c001d2ac>] (mm_release+0xfc/0x138)
[Thu Mar 12 02:36:35 2015] [<c001d2ac>] (mm_release+0xfc/0x138) from [<c00216e8>] (do_exit+0x110/0x970)
[Thu Mar 12 02:36:35 2015] [<c00216e8>] (do_exit+0x110/0x970) from [<c0011a68>] (die+0x334/0x390)
[Thu Mar 12 02:36:35 2015] [<c0011a68>] (die+0x334/0x390) from [<c0008334>] (do_undefinstr+0x1b0/0x1dc)
[Thu Mar 12 02:36:35 2015] [<c0008334>] (do_undefinstr+0x1b0/0x1dc) from [<c0444d6c>] (__und_svc_finish+0x0/0x34)
[Thu Mar 12 02:36:35 2015] Exception stack(0xd609de28 to 0xd609de70)
[Thu Mar 12 02:36:35 2015] de20:                   d609c0f8 c0000780 c061e920 d609c0f8 d609c030 80000780
[Thu Mar 12 02:36:35 2015] de40: bec84590 d609dfb0 00000000 d71bce90 02550044 bec84690 00000018 d609deb0
[Thu Mar 12 02:36:35 2015] de60: c0009588 c00099c4 60000013 ffffffff
[Thu Mar 12 02:36:35 2015] [<c0444d6c>] (__und_svc_finish+0x0/0x34) from [<c00099c4>] (vfp_save_state+0x0/0x28)
[Thu Mar 12 02:36:36 2015] [<c00099c4>] (vfp_save_state+0x0/0x28) from [<c0009588>] (vfp_sync_hwstate+0x70/0x7c)
[Thu Mar 12 02:36:36 2015] [<c0009588>] (vfp_sync_hwstate+0x70/0x7c) from [<c0009794>] (vfp_preserve_user_clear_hwstate+0x20/0x8c)
[Thu Mar 12 02:36:36 2015] [<c0009794>] (vfp_preserve_user_clear_hwstate+0x20/0x8c) from [<c0010a48>] (setup_sigframe+0x190/0x1a0)
[Thu Mar 12 02:36:36 2015] [<c0010a48>] (setup_sigframe+0x190/0x1a0) from [<c0010ea4>] (do_signal+0x2e8/0x440)
[Thu Mar 12 02:36:36 2015] [<c0010ea4>] (do_signal+0x2e8/0x440) from [<c0011194>] (do_work_pending+0xa4/0xb4)
[Thu Mar 12 02:36:36 2015] [<c0011194>] (do_work_pending+0xa4/0xb4) from [<c000df00>] (work_pending+0xc/0x20)

P33M · 2015-03-12T10:13:21Z

Just built from raspberrypi/linux with bcmrpi_defconfig.

Some disassembly:

(gdb) disassemble  vfp_sync_hwstate
Dump of assembler code for function vfp_sync_hwstate:
   0xc0009718 <+0>:     mov     r12, sp
   0xc000971c <+4>:     push    {r4, r5, r11, r12, lr, pc}
   0xc0009720 <+8>:     sub     r11, r12, #4
   0xc0009724 <+12>:    push    {lr}            ; (str lr, [sp, #-4]!)
   0xc0009728 <+16>:    bl      0xc000eba0 <__gnu_mcount_nc>
   0xc000972c <+20>:    mov     r3, sp
   0xc0009730 <+24>:    bic     r4, r3, #8128   ; 0x1fc0
   0xc0009734 <+28>:    bic     r4, r4, #63     ; 0x3f
   0xc0009738 <+32>:    ldr     r3, [r4, #4]
   0xc000973c <+36>:    add     r3, r3, #1
   0xc0009740 <+40>:    str     r3, [r4, #4]
   0xc0009744 <+44>:    ldr     r2, [pc, #76]   ; 0xc0009798 <vfp_sync_hwstate+128>
   0xc0009748 <+48>:    add     r3, r0, #248    ; 0xf8
   0xc000974c <+52>:    ldr     r0, [r2, #4]
   0xc0009750 <+56>:    cmp     r0, r3
   0xc0009754 <+60>:    beq     0xc0009780 <vfp_sync_hwstate+104>
   0xc0009758 <+64>:    ldr     r3, [r4, #4]
   0xc000975c <+68>:    sub     r3, r3, #1
   0xc0009760 <+72>:    cmp     r3, #0
   0xc0009764 <+76>:    str     r3, [r4, #4]
   0xc0009768 <+80>:    ldmne   sp, {r4, r5, r11, sp, pc}
   0xc000976c <+84>:    ldr     r3, [r4]
   0xc0009770 <+88>:    tst     r3, #2
   0xc0009774 <+92>:    ldmeq   sp, {r4, r5, r11, sp, pc}
   0xc0009778 <+96>:    bl      0xc0537420 <preempt_schedule>
   0xc000977c <+100>:   ldm     sp, {r4, r5, r11, sp, pc}
   0xc0009780 <+104>:   vmrs    r5, fpexc
   0xc0009784 <+108>:   orr     r1, r5, #1073741824     ; 0x40000000
   0xc0009788 <+112>:   vmsr    fpexc, r1
   0xc000978c <+116>:   bl      0xc0009bf4 <vfp_save_state>
   0xc0009790 <+120>:   vmsr    fpexc, r5
   0xc0009794 <+124>:   b       0xc0009758 <vfp_sync_hwstate+64>
   0xc0009798 <+128>:   addgt   pc, r2, r4, lsl #8
End of assembler dump.
(gdb) disassemble  vfp_save_state
Dump of assembler code for function vfp_save_state:
   0xc0009bf4 <+0>:     vstmia  r0!, {d0-d15}
   0xc0009bf8 <+4>:     vmrs    r2, fpscr
   0xc0009bfc <+8>:     tst     r1, #-2147483648        ; 0x80000000
   0xc0009c00 <+12>:    beq     0xc0009c14 <vfp_save_state+32>
   0xc0009c04 <+16>:    vmrs    r3, fpinst      @ Impl def
   0xc0009c08 <+20>:    tst     r1, #268435456  ; 0x10000000
   0xc0009c0c <+24>:    beq     0xc0009c14 <vfp_save_state+32>
   0xc0009c10 <+28>:    vmrs    r12, fpinst2    @ Impl def
   0xc0009c14 <+32>:    stm     r0, {r1, r2, r3, r12}
   0xc0009c18 <+36>:    bx      lr
End of assembler dump.
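
As a cross-check, the word in parentheses on the Oops "Code:" lines (eca00b20) can be decoded by hand and compared with the first instruction of vfp_save_state above. The following is a minimal Python sketch, not a general ARM decoder: the function names are made up for illustration, and the field layout assumed is the standard ARM coprocessor load/store-multiple encoding.

```python
import re

def faulting_word(code_line):
    """Return the parenthesised (faulting) word of an Oops 'Code:' line as an int."""
    match = re.search(r"\(([0-9a-fA-F]{8})\)", code_line)
    if match is None:
        raise ValueError("no faulting instruction marked in this line")
    return int(match.group(1), 16)

def decode_vfp_block_transfer(word):
    """Split a coprocessor load/store-multiple word into its bit fields."""
    return {
        "cond": (word >> 28) & 0xF,                           # 0xE = always
        "is_coproc_load_store": ((word >> 25) & 0x7) == 0b110,
        "increment": bool((word >> 23) & 1),                  # U bit
        "writeback": bool((word >> 21) & 1),                  # W bit ("r0!")
        "load": bool((word >> 20) & 1),                       # L bit, 0 = store
        "base_reg": (word >> 16) & 0xF,                       # Rn
        "first_reg": (((word >> 22) & 1) << 4) | ((word >> 12) & 0xF),
        "coproc": (word >> 8) & 0xF,                          # 11 = double-precision VFP
        "words": word & 0xFF,                                 # 32-bit words transferred
    }

line = "Code: e12fff1e e1a0200d e1a0e009 eafffe76 (eca00b20)"
fields = decode_vfp_block_transfer(faulting_word(line))
# Each double-precision register occupies two words.
print(fields, "registers:", fields["words"] // 2)
```

The fields come out as a store with writeback from base register r0, coprocessor 11, 32 words (16 double registers starting at d0), which agrees with `vstmia r0!, {d0-d15}` at vfp_save_state+0.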

popcornmix · 2015-03-12T11:23:50Z

@P33M

> While waiting on availability of the node.js implementation that causes this - perhaps a printk inside vfp_sync_hwstate() will tell us if this function is ever called without crashing?

I did add that printk and it occurs many times a second (without crashing).

johnspackman · 2015-03-16T09:12:08Z

Still happens on the latest kernel:

pi@player-00002:~$ uname -a
Linux player-00002 3.18.7+ #755 PREEMPT Thu Feb 12 17:14:31 GMT 2015 armv6l GNU/Linux

pi@player-00002:~$ /opt/vc/bin/vcgencmd version
Feb 14 2015 22:23:03
Copyright (c) 2012 Broadcom
version 7789db485409720b0e523a3d6b86b12ed56fd152 (clean) (release)

pi@player-00002:~$ node -v
v0.10.26

stack dump

[Sat Mar 14 13:57:27 2015] BUG: unsupported FP instruction in kernel mode
[Sat Mar 14 13:57:27 2015] Internal error: Oops - undefined instruction: 0 [#1] PREEMPT ARM
[Sat Mar 14 13:57:27 2015] Modules linked in: tun snd_bcm2835 snd_pcm snd_timer snd evdev uio_pdrv_genirq uio
[Sat Mar 14 13:57:27 2015] CPU: 0 PID: 2205 Comm: node Not tainted 3.18.7+ #755
[Sat Mar 14 13:57:27 2015] task: d7370d80 ti: d59a2000 task.ti: d59a2000
[Sat Mar 14 13:57:27 2015] PC is at vfp_save_state+0x0/0x28
[Sat Mar 14 13:57:27 2015] LR is at vfp_sync_hwstate+0x7c/0x88
[Sat Mar 14 13:57:27 2015] pc : [<c0009d20>]    lr : [<c00098b0>]    psr: 60000013
[Sat Mar 14 13:57:27 2015] sp : d59a3e70  ip : d59a3e88  fp : d59a3e84
[Sat Mar 14 13:57:27 2015] r10: d73711d8  r9 : d59a2000  r8 : 00000000
[Sat Mar 14 13:57:27 2015] r7 : d59a3ec0  r6 : bef82590  r5 : 80000780  r4 : d59a2030
[Sat Mar 14 13:57:27 2015] r3 : d59a20f8  r2 : c0855904  r1 : c0000780  r0 : d59a20f8
[Sat Mar 14 13:57:27 2015] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[Sat Mar 14 13:57:27 2015] Control: 00c5387d  Table: 17344008  DAC: 00000015
[Sat Mar 14 13:57:27 2015] Process node (pid: 2205, stack limit = 0xd59a21b0)
[Sat Mar 14 13:57:27 2015] Stack: (0xd59a3e70 to 0xd59a4000)
[Sat Mar 14 13:57:27 2015] 3e60:                                     bef82488 d59a2000 d59a3ea4 d59a3e88
[Sat Mar 14 13:57:27 2015] 3e80: c0009ad4 c0009840 00000120 00000000 bef82398 00000000 d59a3ebc d59a3ea8
[Sat Mar 14 13:57:27 2015] 3ea0: c0011874 c0009ab0 d59a3fb0 bef82398 d59a3f84 d59a3ec0 c0011cb8 c00116e0
[Sat Mar 14 13:57:27 2015] 3ec0: 00529190 04000000 b6c73b20 7ffbfeff fffffffe 00000011 00000000 00040001
[Sat Mar 14 13:57:27 2015] 3ee0: 000004c4 000003e8 00000000 00000000 00000001 c0556c74 8bb01020 0000ad6d
[Sat Mar 14 13:57:27 2015] 3f00: d59a3f5c d59a3f10 c00266a8 c0026234 00000000 c0811200 00000001 00400000
[Sat Mar 14 13:57:27 2015] 3f20: c0823388 60000193 00000000 00000000 d59a3f5c d59a3f40 c0069544 c00c3f98
[Sat Mar 14 13:57:27 2015] 3f40: c08104c0 d59a2000 c0837108 d59a2008 d59a2000 00000011 d59a3fb0 d59a2008
[Sat Mar 14 13:57:27 2015] 3f60: d59a2000 00000000 d59a3fb0 00000000 d59a2000 0129b044 d59a3fac d59a3f88
[Sat Mar 14 13:57:27 2015] 3f80: c0011fcc c0011a04 00c5387d 00000001 4ea4dc40 60000010 f200b200 00c5387d
[Sat Mar 14 13:57:27 2015] 3fa0: 00000000 d59a3fb0 c000e918 c0011f18 5b7080a5 5b7080a5 5b7080a4 00002cf2
[Sat Mar 14 13:57:27 2015] 3fc0: 5b7080a5 00000002 0000008a 00000002 2ca241c5 2b8bfe61 0129b044 bef82690
[Sat Mar 14 13:57:27 2015] 3fe0: 4ea4dc00 bef82688 4ea9e9fc 4ea4dc40 60000010 ffffffff 177fa821 177fac21
[Sat Mar 14 13:57:27 2015] [<c0009d20>] (vfp_save_state) from [<c0009ad4>] (vfp_preserve_user_clear_hwstate+0x30/0x9c)
[Sat Mar 14 13:57:27 2015] [<c0009ad4>] (vfp_preserve_user_clear_hwstate) from [<c0011874>] (setup_sigframe+0x1a0/0x1b0)
[Sat Mar 14 13:57:27 2015] [<c0011874>] (setup_sigframe) from [<c0011cb8>] (do_signal+0x2c0/0x400)
[Sat Mar 14 13:57:27 2015] [<c0011cb8>] (do_signal) from [<c0011fcc>] (do_work_pending+0xc0/0x110)
[Sat Mar 14 13:57:27 2015] [<c0011fcc>] (do_work_pending) from [<c000e918>] (work_pending+0xc/0x20)
[Sat Mar 14 13:57:27 2015] Code: e12fff1e e1a0200d e1a0e009 eafffe76 (eca00b20) 
[Sat Mar 14 13:57:27 2015] ---[ end trace 0073c74fe454cde0 ]---
[Sat Mar 14 13:57:27 2015] note: node[2205] exited with preempt_count 2

popcornmix · 2015-03-16T12:24:26Z

We really need a test case for this. Clear instructions of what to type from a clean raspbian install to provoke the error. Ideally just "run this executable" and see panic.

johnspackman · 2015-03-16T13:02:19Z

I will be able to work on a reproducible test case later this week; the problem is that it can take hours to occur and I can't trap the signal (AFAIK) to resolve it to a line of JS code. The only clue I have is that every time this happens, there is a zombie process for tput (spawned by the node app).

However, this is starting to look like it's related to the screen - we've got quite a few of these in production, and I'm reasonably sure that there are quite a few examples of exactly the same software running on them without this error (or if it does occur then it's nowhere near as frequent). What differs is the screen on each site, and I am having difficulties creating a 100% reliable mode detection & reset, particularly on old LGs.

Is there some kind of debugging setup I can hook up to my Pi here so that I can report back when it crops up? I'm happy to open up ports in the firewall for someone to get direct access to the box if necessary

catschulze · 2015-03-18T16:34:44Z

I think I have found a hint what might be happening here. See the ARM1176JZF-S Technical Reference Manual, DDI0301H_arm1176jzfs_r0p7_trm.pdf (freely available on infocenter.arm.com), section 20.4.3, "Floating-point exception register, FPEXC" on page 20-16: The VFP11 coprocessor might detect a situation that requires reporting an exception too late to notify the ARM11 core, and in these cases the VFP11 sets the EX bit (0x8000'0000) in register FPEXC so that the exception is triggered when the next VFP operation is executed. As you can see from P33M's disassembly, the FPEXC register has been read and saved to ARM register r5 at the time of the kernel panic (and ARM register r1 contains the same value ORed with 0x4000'000, the EN(able) bit - r1 is what will be actually saved to the task state in the stm opcode at the end of vfp_save_state). From johnspackman's register dump, I see that r5 = 0x8000'0780, therefore the EX bit is apparently set. And the VFP11 now does exactly what the technical reference manual specifies, it bounces the vstmia opcode at the start of vfp_save_state. The kernel seems to be unprepared to deal with this exception in kernel space and dies. (Note that vfp_save_state actually contains code that checks for the EX bit, but unfortunately only to decide whether the FPINST and potentially also FPINST2 register(s) need to be saved to the task's context besides FPEXC as well.)

From my understanding of the VFP11 technical reference manual, it would be necessary to clear the EX bit in FPEXC before trying to execute VFP opcodes in the kernel (but saving the unmodified value of FPEXC to the task's context, so that the set EX bit will be restored when switching back to the task that had it set). To quote from the TRM: "EX must be cleared by the exception handling routine" - apparently this is also true for the state saving routine during context switch.
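The ordering proposed here (save FPEXC unmodified, clear EX, only then issue a VFP opcode) can be sketched with a toy model. This is not kernel code: `ToyVfp`, `saveState` and the `clearExFirst` flag are hypothetical names, and a "bounce" is simply modelled as a thrown error.

```javascript
// Toy model, not kernel code: all names here are hypothetical.
const FPEXC_EX = 0x80000000; // pending-exception bit
const FPEXC_EN = 0x40000000; // enable bit

class ToyVfp {
  constructor(fpexc) {
    this.fpexc = fpexc >>> 0;
  }
  // Any VFP opcode issued while EX is set is bounced (modelled as a throw).
  issue(opcode) {
    if ((this.fpexc & FPEXC_EX) !== 0) {
      throw new Error("bounced: " + opcode);
    }
    return opcode + " ok";
  }
}

// Save the FPEXC value into the context, optionally clear EX, then store registers.
function saveState(vfp, context, clearExFirst) {
  context.fpexc = vfp.fpexc; // keep the unmodified value, EX included
  if (clearExFirst) {
    vfp.fpexc = (vfp.fpexc & ~FPEXC_EX) >>> 0;
  }
  vfp.fpexc = (vfp.fpexc | FPEXC_EN) >>> 0;
  return vfp.issue("vstmia");
}

const pending = 0x80000780; // FPEXC value from the register dump

let withoutFix;
try {
  withoutFix = saveState(new ToyVfp(pending), {}, false);
} catch (e) {
  withoutFix = e.message;
}

const ctx = {};
const withFix = saveState(new ToyVfp(pending), ctx, true);

console.log(withoutFix);
console.log(withFix, "saved fpexc = 0x" + ctx.fpexc.toString(16));
```

In the model, the saved context still carries EX, so a later restore could hand the pending exception back to the task that caused it.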

I still wonder what causes this to trigger only in rare cases (I have never observed it on my systems). Perhaps you need to have a "pending VFP exception" just at the time of context switch, i.e. the last VFP operation executed in userspace before a task switch has to cause a "delayed exception", and no other user space VFP instructions (that would cause the exception to be triggered in user space) come in between, so that the exception sits and waits to bomb when the first VFP instruction is executed in kernel space (vstmia to save all VFP11 registers during task switch)? (It would be interesting to see the FPINST and FPINST2 register values as well, I wonder what VFP instruction causes this exception.)

P33M · 2015-03-18T16:45:01Z

Thanks for that detailed analysis.

This may also help with creating a test case - have two userspace programs that make extensive use of the VFP unit and have them trigger FP exceptions regularly. Have them both running at the same priority so lots of context switches happen and see if we get some crashes...

radford-for-smpte · 2015-03-18T16:53:36Z

OP here, I've learned a few things that might help diagnose the problem. The node.js daemon that exhibited the crash was, among other things, forking two processes every few seconds: vcgencmd and raspistill. By removing both of those, I've been running for 12 days now with no problems.

I'm guessing a small node.js script that repeatedly forks those tools every 10 sec or so might reproduce the problem (after a few days of running).

johnspackman · 2015-03-19T10:46:21Z

I've just made it happen with a test script in under 2 hours; here's the script:

var child_process = require("child_process");

var log = { trace: console.log, debug: console.log, warn: console.log, error: console.log, info: console.log };

var util = {
process: {
  exec : function(cmd/* , args, opts, callback */) {
    var tmp = [].slice.call(arguments, 1);
    var args = Array.isArray(tmp[0]) ? tmp.shift() : [];
    var callback = typeof tmp[tmp.length - 1] === 'function' ? tmp.pop() : null;
    var opts = (typeof tmp[0] === 'object' ? tmp.shift() : null)||{};

    var proc = child_process.spawn(cmd, args, {
      encoding: "utf8"
    });
    log.debug("spawn: " + cmd + " " + JSON.stringify(args) + ", PID=" + proc.pid);
    var lnrStdout, lnrStderr;
    var outputReceived = false;
    proc.stdout.on("data", lnrStdout = function(data) {
      outputReceived = true;
      log.trace(cmd + " stdout: " + data);
      if (opts.onStdout)
        opts.onStdout(data);
      if (opts.onConsoleOutput)
        opts.onConsoleOutput(data);
      if (opts.copyToConsole)
        process.stdout.write(data);
    });
    proc.stderr.on("data", lnrStderr = function(data) {
      outputReceived = true;
      log.trace(cmd + " stderr: " + data);
      if (opts.onStderr)
        opts.onStderr(data);
      if (opts.onConsoleOutput)
        opts.onConsoleOutput(data);
      if (opts.copyToConsole)
        process.stdout.write(data);
    });
    var closed = false;
    var exited = false;
    var exitCode = null;
    var exitSignal = null;
    proc.on("close", function() {
      log.trace(cmd + " closed");
      proc.stdout.removeListener("data", lnrStdout);
      proc.stderr.removeListener("data", lnrStderr);
      closed = true;
      if (closed && exited && callback) {
        callback(null, exitCode, exitSignal);
        callback = null;
      }
    });
    proc.on("exit", function(code, signal) {
      log.trace(cmd + " exit code=" + code + ", signal=" + signal);
      exited = true;
      exitCode = code;
      exitSignal = signal;
      if (closed && exited && callback) {
        callback(null, exitCode, signal);
        callback = null;
      }
    });
    proc.on("error", function(err) {
      log.trace(cmd + " error=" + err);
      exited = !outputReceived;
      exitCode = -1;
      exitSignal = null;
      if (closed && exited && callback) {
        callback(null, exitCode, exitSignal);
        callback = null;
      }
    });
    return proc;
  },

  execAndCapture : function(cmd, args, callback) {
    var stdout = [];
    function capture(data) {
      stdout.push(data);
    }
    var proc = this.exec(cmd, args, {
      onStdout: capture,
      onStderr: capture
    }, function(err, code, signal) {
      callback(err, stdout.join(""), code, signal);
    });
    return proc;
  },

  execToConsole : function(cmd, args, callback) {
    var proc = this.exec(cmd, args, {
      copyToConsole: true
    }, callback);
    return proc;
  }
}
};

function blankScreen(cb) {
  util.process.execToConsole("tput", ["clear"], function() {
    util.process.exec("cp", [ "/dev/zero", "/dev/fb0" ], cb);
  });
}

function test() {
  console.log("Blanking");
  blankScreen(function() {
    console.log("Blanked");
    setTimeout(test, 1);
  });
}

test();

Everything seemed fine until I turned the screen off with "tvservice -o" (because I realised that on this machine, the failures had always happened overnight and our app turns the screen off during out of hours); I checked back a short while later and this was in dmesg:

[ 6230.810677] BUG: unsupported FP instruction in kernel mode
[ 6230.816359] Internal error: Oops - undefined instruction: 0 [#1] PREEMPT ARM
[ 6230.823552] Modules linked in: tun snd_bcm2835 snd_pcm snd_timer snd uio_pdrv_genirq uio
[ 6230.831973] CPU: 0 PID: 2393 Comm: node Not tainted 3.18.7+ #755
[ 6230.838103] task: d60e6c00 ti: d736c000 task.ti: d736c000
[ 6230.843628] PC is at vfp_save_state+0x0/0x28
[ 6230.847997] LR is at vfp_sync_hwstate+0x7c/0x88
[ 6230.852628] pc : [<c0009d20>]    lr : [<c00098b0>]    psr: 60000013
[ 6230.852628] sp : d736de70  ip : d736de88  fp : d736de84
[ 6230.864306] r10: d60e7058  r9 : d736c000  r8 : 00000000
[ 6230.869632] r7 : d736dec0  r6 : be894628  r5 : 80000780  r4 : d736c030
[ 6230.876280] r3 : d736c0f8  r2 : c0855904  r1 : c0000780  r0 : d736c0f8
[ 6230.882926] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[ 6230.890192] Control: 00c5387d  Table: 173c0008  DAC: 00000015
[ 6230.896047] Process node (pid: 2393, stack limit = 0xd736c1b0)
[ 6230.901991] Stack: (0xd736de70 to 0xd736e000)
[ 6230.906442] de60:                                     be894520 d736c000 d736dea4 d736de88
[ 6230.914775] de80: c0009ad4 c0009840 00000120 00000000 be894430 00000000 d736debc d736dea8
[ 6230.923108] dea0: c0011874 c0009ab0 d736dfb0 be894430 d736df84 d736dec0 c0011cb8 c00116e0
[ 6230.931438] dec0: 00529190 04000000 b6c43b20 7ffbfeff fffffffe 00000011 00000000 00040001
[ 6230.939770] dee0: 00002efa 000003e8 00000000 00000000 00000003 c0556c74 b93c0690 000005aa
[ 6230.948101] df00: d736df5c d736df10 c00266a8 c0026234 00010000 c08121dc 00000001 00400000
[ 6230.956432] df20: c0823388 60000193 00000000 00000000 d736df5c d736df40 c0069544 c00c3f98
[ 6230.964765] df40: c08104c0 d736c000 c0837108 d736c008 d736c000 00000011 d736dfb0 d736c008
[ 6230.973095] df60: d736c000 00000000 d736dfb0 00000000 d736c000 01c57044 d736dfac d736df88
[ 6230.981425] df80: c0011fcc c0011a04 00c5387d 00000001 001fa83c 60000010 f200b200 00c5387d
[ 6230.989755] dfa0: 00000000 d736dfb0 c000e918 c0011f18 01c57040 00000000 01c57038 334080a8
[ 6230.998086] dfc0: 00000002 00301de8 be89473c 229080c9 b6a58dfd 26b4e068 01c57044 be894730
[ 6231.006417] dfe0: be894724 be894720 2ea0a35c 001fa83c 60000010 ffffffff 00000000 00000000
[ 6231.014769] [<c0009d20>] (vfp_save_state) from [<c0009ad4>] (vfp_preserve_user_clear_hwstate+0x30/0x9c)
[ 6231.024363] [<c0009ad4>] (vfp_preserve_user_clear_hwstate) from [<c0011874>] (setup_sigframe+0x1a0/0x1b0)
[ 6231.034130] [<c0011874>] (setup_sigframe) from [<c0011cb8>] (do_signal+0x2c0/0x400)
[ 6231.041951] [<c0011cb8>] (do_signal) from [<c0011fcc>] (do_work_pending+0xc0/0x110)
[ 6231.049772] [<c0011fcc>] (do_work_pending) from [<c000e918>] (work_pending+0xc/0x20)
[ 6231.057672] Code: e12fff1e e1a0200d e1a0e009 eafffe76 (eca00b20)
[ 6231.063891] ---[ end trace 6d5e72d57e400b9c ]---
[ 6231.068611] note: node[2393] exited with preempt_count 2

Although this might seem contrived, it turns out that during out of hours our app was needlessly calling blankScreen() every couple of seconds.

I'm running it again now to see if it happens again :)

johnspackman · 2015-03-19T11:28:47Z

That script has just done it a second time, this time 30 minutes after I turned the screen off.

catschulze · 2015-03-19T17:07:39Z

I have tried to prepare a small user space program in C which could potentially trigger this bug. As my RPi is currently busy (and I cannot risk crashing it now), I have tried to perform the tests in QEmu. However, while I can actually see undefined operation exceptions hitting the CPU, QEmu apparently does not let me single step in exception handlers which have interrupts disabled (?), and it also seems that it does not model the delayed (asynchronous) exceptions the VFP11 uses. Perhaps someone else may be interested in compiling and running this test code on the real hardware? (It might be a good idea to run at least two instances in parallel.)

#include <math.h>
#include <stdio.h>

int main()
{
        float x; /* sqrt(-1.0) test, input */
        float y; /* sqrt(-1.0) test, output */
        float a; /* 1.0/0.0 test, input 1 */
        float b; /* 1.0/0.0 test, input 2 */
        float c; /* 1.0/0.0 test, output */
        int i;   /* busy loop counter */
        asm ( /* Activate all 6 exception traps in FPSCR */
                "vmrs\tr0, fpscr\n\t"
                "orr\tr0, r0, #0x9f00\n\t"
                "vmsr\tfpscr, r0"
                :
                :
                : "r0"
        );
        while(1) {
                x = -1.0;
                y =  0.0;
                asm ( /* Perform sqrt(-1.0) test */
                        "fsqrts\t%0, %1"
                        : "=w" (y)
                        : "w" (x)
                        :
                );
                for(i=0; i<=0x00ffffff; i++);
                printf("Result: x=%f y=%f\n", x, y);
                fflush(0);
                a = 1.0;
                b = 0.0;
                c = 0.0;
                asm ( /* Perform 1.0/0.0 test */
                        "fdivs\t%0, %1, %2"
                        : "=w" (c)
                        : "w" (a), "w" (b)
                        :
                );
                for(i=0; i<=0x00ffffff; i++);
                printf("Result: a=%f b=%f c=%f\n", a, b, c);
                fflush(0);
        }
        return 0;
}

The idea behind this is to configure the VFP11 to report all exceptions as interrupts (no quiet NaNs or INFs), and then provide a series of sqrt(-1.0) and 1.0/0.0 operations, each followed by a busy loop, so that the VFP11 delayed exception status is kept alive during this loop and the interrupt is only triggered at the time the VFP11 is used to print the results to stdout.
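For reference, a small sketch that decodes the 0x9f00 mask the test program ORs into FPSCR. The bit names follow the ARM VFP documentation as I remember it (IOE, DZE, OFE, UFE, IXE in bits 8 to 12, IDE in bit 15); the table and function names are made up for this example.

```javascript
// FPSCR trap-enable bits; positions as recalled from the ARM VFP documentation.
const FPSCR_TRAP_ENABLES = [
  { bit: 8,  name: "IOE", meaning: "invalid operation" },
  { bit: 9,  name: "DZE", meaning: "division by zero" },
  { bit: 10, name: "OFE", meaning: "overflow" },
  { bit: 11, name: "UFE", meaning: "underflow" },
  { bit: 12, name: "IXE", meaning: "inexact" },
  { bit: 15, name: "IDE", meaning: "input denormal" }
];

// Return the names of the trap-enable bits that are set in an FPSCR value.
function enabledTraps(fpscr) {
  return FPSCR_TRAP_ENABLES
    .filter((t) => ((fpscr >>> t.bit) & 1) === 1)
    .map((t) => t.name);
}

console.log(enabledTraps(0x9f00));
```

All six names come back for 0x9f00. Of these, the sqrt(-1.0) test is the invalid-operation case and the 1.0/0.0 test is the division-by-zero case.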

P33M · 2015-03-19T17:14:30Z

Interesting. If I do this on a Pi 2 everything works fine. I can spawn multiple instances.

pi@raspberrypi:~$ ./argh
Result: x=-1.000000 y=nan
Result: a=1.000000 b=0.000000 c=inf
Result: x=-1.000000 y=nan
Result: a=1.000000 b=0.000000 c=inf
Result: x=-1.000000 y=nan
Result: a=1.000000 b=0.000000 c=inf
Result: x=-1.000000 y=nan
Result: a=1.000000 b=0.000000 c=inf

If I do this on a Pi 1, I get

pi@raspberrypi:~$ gdb argh
GNU gdb (GDB) 7.4.1-debian
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "arm-linux-gnueabihf".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/pi/argh...done.
(gdb) run
Starting program: /home/pi/argh

Program received signal SIGFPE, Arithmetic exception.
0x00008434 in main () at test.c:23
23                      asm ( /* Perform sqrt(-1.0) test */

catschulze · 2015-03-19T17:38:47Z

The SIGFPE didn't show up in QEmu, either (emulating a vexpress-a9 board, which has an ARM Cortex-A9 CPU, closer to the ARM Cortex-A7 CPU of the Pi 2 than to the ARM1176 CPU of the Pi 1). At least we know that the illegal operations really trigger an exception to be signalled from the VFP11 to the ARM core (causing the kernel to signal SIGFPE to the program). I don't know why this doesn't happen on the Pi 2, but I also only have a TRM for the ARM11, not for the Cortex CPUs, so I cannot check whether they behave differently or need other flags to be set up. Apparently, you get the "correct" results NaN and +INF, so it might also be caused by the kernel's VFP support handler just setting these values and returning to user space without signalling SIGFPE...

OK, having the program terminated by the SIGFPE is not what was intended. Can you install a signal handler for SIGFPE that just ignores this signal (calling signal() for SIGFPE with handler set to SIG_IGN), so that we can see what happens if the kernel performs a context switch while the delayed exception flag EX in register FPEXC is set?

radford-for-smpte · 2015-03-19T18:01:22Z

I can confirm that johnspackman's node sample did indeed reproduce the crash pretty quickly (30min for me).

Not sure what it means, but the C utility just does this:

pi@pi /tmp $ ./go
Floating point exception

catschulze · 2015-03-19T18:19:26Z

@Studio-Dude The "Floating point exception" message seems to happen only on the Pi 1, as P33M found out. My tests were done in QEmu and didn't show this message, so my code was not prepared to handle it. If you'd like to try, here's the updated code:

#include <math.h>
#include <stdio.h>
#include <signal.h>

int main()
{
        float x; /* sqrt(-1.0) test, input */
        float y; /* sqrt(-1.0) test, output */
        float a; /* 1.0/0.0 test, input 1 */
        float b; /* 1.0/0.0 test, input 2 */
        float c; /* 1.0/0.0 test, output */
        int i;   /* busy loop counter */

        signal(SIGFPE, SIG_IGN); /* Floating point exceptions are expected and not fatal */

        asm ( /* Activate all 6 exception traps in FPSCR */
                "vmrs\tr0, fpscr\n\t"
                "orr\tr0, r0, #0x9f00\n\t"
                "vmsr\tfpscr, r0"
                :
                :
                : "r0"
        );

        while(1) {
                x = -1.0;
                y =  0.0;
                asm ( /* Perform sqrt(-1.0) test */
                        "fsqrts\t%0, %1"
                        : "=w" (y)
                        : "w" (x)
                        :
                );
                for(i=0; i<=0x00ffffff; i++);
                printf("Result: x=%f y=%f\n", x, y);
                fflush(0);
                a = 1.0;
                b = 0.0;
                c = 0.0;
                asm ( /* Perform 1.0/0.0 test */
                        "fdivs\t%0, %1, %2"
                        : "=w" (c)
                        : "w" (a), "w" (b)
                        :
                );
                for(i=0; i<=0x00ffffff; i++);
                printf("Result: a=%f b=%f c=%f\n", a, b, c);
                fflush(0);
        }
        return 0;
}

This should now prevent the FPE from being reported (and the program subsequently aborted), but I could not try it on real ARM11/VFP11 hardware (it works flawlessly on an ARM Cortex A9 simulated in QEmu, however), so please report back if there are still issues with the code.

radford-for-smpte · 2015-03-19T18:26:51Z

OK here's what that new version does:

pi@pi /tmp $ ./new
Result: x=-1.000000 y=-1.000000
Segmentation fault
pi@pi /tmp $ ./new
Segmentation fault
pi@pi /tmp $ ./new
Result: x=-1.000000 y=nan
Segmentation fault
pi@pi /tmp $ ./new
Result: x=-1.000000 y=nan
Segmentation fault


pi@pi /tmp $ dmesg
[1101281.083856] VFP: Error: unhandled bounce
[1101281.093060] VFP: EXC 0x40000000 SCR 0x00009f01 INST 0xed5b7a07
[1101281.104121] VFP: s 0: 0x00000000 s 1: 0x7ff80000
[1101281.113767] VFP: s 2: 0x00000000 s 3: 0x00000000
[1101281.123105] VFP: s 4: 0x00000000 s 5: 0x00000000
[1101281.132096] VFP: s 6: 0x00000000 s 7: 0x00000000
[1101281.140736] VFP: s 8: 0x00000000 s 9: 0x00000000
[1101281.149072] VFP: s10: 0x00000000 s11: 0x00000000
[1101281.157155] VFP: s12: 0x00000000 s13: 0xbff00000
[1101281.165022] VFP: s14: 0x00000000 s15: 0x7ff80000
[1101281.172551] VFP: s16: 0x7fc00000 s17: 0x00000000
[1101281.179949] VFP: s18: 0x00000000 s19: 0x00000000
[1101281.187249] VFP: s20: 0x00000000 s21: 0x00000000
[1101281.194422] VFP: s22: 0x00000000 s23: 0x00000000
[1101281.201644] VFP: s24: 0x00000000 s25: 0x00000000
[1101281.208887] VFP: s26: 0x00000000 s27: 0x00000000
[1101281.216133] VFP: s28: 0x00000000 s29: 0x00000000
[1101281.223270] VFP: s30: 0x00000000 s31: 0x00000000
[1101281.453740] VFP: Error: unhandled bounce
[1101281.460452] VFP: EXC 0x40000000 SCR 0x00009f01 INST 0xed5b7a07
[1101281.469259] VFP: s 0: 0x00000000 s 1: 0x7ff80000
[1101281.476987] VFP: s 2: 0x00000000 s 3: 0x00000000
[1101281.484619] VFP: s 4: 0x00000000 s 5: 0x00000000
[1101281.492364] VFP: s 6: 0x00000000 s 7: 0x00000000
[1101281.500157] VFP: s 8: 0x00000000 s 9: 0x00000000
[1101281.507891] VFP: s10: 0x00000000 s11: 0x00000000
[1101281.515668] VFP: s12: 0x00000000 s13: 0xbff00000
[1101281.523336] VFP: s14: 0x00000000 s15: 0x7ff80000
[1101281.530893] VFP: s16: 0x7ff80000 s17: 0x00000000
[1101281.538323] VFP: s18: 0x00000000 s19: 0x00000000
[1101281.545555] VFP: s20: 0x00000000 s21: 0x00000000
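The register words in this dump can be read as IEEE 754 values with a few lines of JavaScript. This is only a reading aid: the helper names are made up, and pairing s(2n) with s(2n+1) into a double assumes the program was holding doubles in those registers at the time (plausible, since printf receives floats as doubles, but not confirmed here).

```javascript
// Interpret one 32-bit word as a single-precision float.
function bitsToFloat32(word) {
  const view = new DataView(new ArrayBuffer(4));
  view.setUint32(0, word >>> 0); // DataView defaults to big-endian for both calls
  return view.getFloat32(0);
}

// Interpret two 32-bit words (high word, low word) as one double.
function bitsToFloat64(high, low) {
  const view = new DataView(new ArrayBuffer(8));
  view.setUint32(0, high >>> 0);
  view.setUint32(4, low >>> 0);
  return view.getFloat64(0);
}

// Values taken from the dmesg output above.
console.log("s16 as float:       ", bitsToFloat32(0x7fc00000));
console.log("s13:s12 as double:  ", bitsToFloat64(0xbff00000, 0x00000000));
console.log("s1:s0 as double:    ", bitsToFloat64(0x7ff80000, 0x00000000));
console.log("s15:s14 as double:  ", bitsToFloat64(0x7ff80000, 0x00000000));
```

Read this way, s16 is a single-precision NaN, s13:s12 is -1.0 and the other two pairs are double-precision NaNs, which fits the x=-1.000000 y=nan line printed before the segfault.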

Christopher Alexander Tobias Schulze - May 2, 2015, 11:57 a.m.

This patch fixes a problem with VFP state save and restore related to exception handling (panic with message "BUG: unsupported FP instruction in kernel mode") present on VFP11 floating point units (as used with ARM1176JZF-S CPUs, e.g. on first generation Raspberry Pi boards). This patch was developed and discussed on raspberrypi/linux#859.

A precondition to see the crashes is that floating point exception traps are enabled. In this case, the VFP11 might determine that a FPU operation needs to trap at a point in time when it is not possible to signal this to the ARM11 core any more. The VFP11 will then set the FPEXC.EX bit and store the trapped opcode in FPINST. (In some cases, a second opcode might have been accepted by the VFP11 before the exception was detected and could be reported to the ARM11 - in this case, the VFP11 also sets FPEXC.FP2V and stores the second opcode in FPINST2.)

If FPEXC.EX is set, the VFP11 will "bounce" the next FPU opcode issued by the ARM11 CPU, which will be seen by the ARM11 as an undefined opcode trap. The VFP support code examines the FPEXC.EX and FPEXC.FP2V bits to decide what actions to take, i.e., whether to emulate the opcodes found in FPINST and FPINST2, and whether to retry the bounced instruction.

If a user space application has left the VFP11 in this "pending trap" state, the next FPU opcode issued to the VFP11 might actually be the VSTMIA operation vfp_save_state() uses to store the FPU registers to memory (in our test cases, when building the signal stack frame). In this case, the kernel crashes as described above.

This patch fixes the problem by making sure that vfp_save_state() is always entered with FPEXC.EX cleared. (The current value of FPEXC has already been saved, so this does not corrupt the context. Clearing FPEXC.EX has no effects on FPINST or FPINST2. Also note that many callers already modify FPEXC by setting FPEXC.EN before invoking vfp_save_state().)

This patch also addresses a second problem related to FPEXC.EX: After returning from signal handling, the kernel reloads the VFP context from the user mode stack. However, the current code explicitly clears both FPEXC.EX and FPEXC.FP2V during reload. As VFP11 requires these bits to be preserved, this patch disables clearing them for VFP implementations belonging to architecture 1. There should be no negative side effects: the user can set both bits by executing FPU opcodes anyway, and while user code may now place arbitrary values into FPINST and FPINST2 (e.g., non-VFP ARM opcodes) the VFP support code knows which instructions can be emulated, and rejects other opcodes with "unhandled bounce" messages, so there should be no security impact from allowing reloading FPEXC.EX and FPEXC.FP2V.

Signed-off-by: Christopher Alexander Tobias Schulze <[email protected]>
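A minimal sketch of the reload rule described in the second part of the patch text, written as a plain JavaScript toy rather than the kernel's C. The function name and the `vfpArch` parameter are made up for illustration, and the FP2V bit position (bit 28) is an assumption on my part, since the thread only gives the values for EX and EN.

```javascript
// Toy illustration of the reload rule; not the actual patch.
const FPEXC_EX = 0x80000000;   // bit 31, value quoted earlier in the thread
const FPEXC_FP2V = 0x10000000; // bit 28, ASSUMED position for this example

// Decide which FPEXC value to reload from a saved user context.
// Architecture 1 (VFP11) keeps EX and FP2V; everything else clears them as before.
function fpexcForReload(saved, vfpArch) {
  const v = saved >>> 0;
  if (vfpArch === 1) {
    return v;
  }
  return (v & ~(FPEXC_EX | FPEXC_FP2V)) >>> 0;
}

console.log(fpexcForReload(0x90000780, 1).toString(16));
console.log(fpexcForReload(0x90000780, 2).toString(16));
```

The first call keeps both bits and the second strips them, which is the behavioural difference the patch text describes between architecture 1 and later VFP implementations.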

