NULL pointer dereference #1681

andersthomson · 2016-10-13T15:52:20Z

Hi,

I got this on a serial port and then the system rebooted.

popcornmix · 2016-10-13T16:07:00Z

Always include complete dmesg log, rather than just the point where the error occurs - there may have been something interesting earlier.
@P33M any ideas?

andersthomson · 2016-10-14T06:17:34Z

Here's the full log from power on.

Sifting further back in minicom, I found another NULL pointer error which, to the untrained eye, appears to be the same thing (PC and LR are at the same place). Should I add it to this bug or start a new one?

NULL_Pointer_dereference_1_full.txt

pelwell · 2016-10-14T07:11:28Z

Add it here - it's probably related.

The full log shows how much USB equipment you have connected to your Pi, but it doesn't indicate what was active at the time. Please state what the Pi was doing at the time of the crash, including stream bitrates etc.

andersthomson · 2016-10-14T10:33:13Z

NULL_Pointer_dereference_137505.txt

Here's the other one.

The first one crashed at about 03:00 in the morning. The only background activity I have then is tvheadend scanning muxes for new channels. That's something it does continuously using free tuners (I have two of them).

I'm guessing a bit here, but tuning to a mux results in the whole mux being pulled over USB (say ~10 channels at 2 Mbps each), and tvheadend checks the transport stream for the required channel or channel info.

I've sustained watching two channels on different muxes while recording a third. That means two full transport streams in, 2 * 2 Mbps out over Ethernet, and 1 * 2 Mbps to the USB HDD. So the midnight idle scanning should be fairly light in comparison.

P33M · 2016-10-14T11:49:07Z

Effectively you have 2x TV tuners and 1x USB HDD active, is that correct?

What do the Telldus/FTDI devices do - are they active as well?

andersthomson · 2016-10-14T12:12:56Z

Yes that is right.

Telldus is a home automation thing to speak over 433 MHz to gadgets (lamps on/off). It did not do anything at the time.

FTDI is a smartcard reader used while descrambling encrypted TV channels. The process is that tvh notices that a stream is encrypted, sends a few keys (~32 bytes) to the card (via USB/FTDI) and gets a few keys back. Those keys are used to decode the MPEG-TS until a new incoming key is detected.

No tv watching/recording should mean no descrambling going on.

P33M · 2016-10-14T12:35:24Z

A test you can do is to boot with the following parameter in /boot/cmdline.txt: dwc_otg.host_channels=4 - this may provoke the error earlier or more often. It will artificially limit performance, though - you may find that recording becomes intermittent.

andersthomson · 2016-10-14T12:43:12Z

Alright, I'll take that for a spin.

andersthomson · 2016-10-17T09:53:12Z

Tried this, and the observable effect is that, under load, it appeared to drop packets (visible on the TVs). However, it did not trigger a hang, reboot, backtrace or anything of the sort.

P33M · 2016-10-17T12:17:11Z

People have bumped into this before in rare cases: somehow the list of host channels gets corrupted, which results in the null pointer dereference. Grepping for where the list is manipulated turns up code that is all under the global driver spinlock. One theory was that repeatedly running up against an empty-list condition somehow triggered the bug.

Reducing the number of host channels to 4 would artificially provoke the empty-list case far more often (in your case visible by dropped packets). If the bug goes away when the maximum number of available host channels is reduced, this is the opposite of what I would expect.
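To make the "empty-list case" concrete, here is a toy model of a fixed pool of host channels. It is not driver code and does not reproduce the bug; it only illustrates why a smaller pool runs dry more often under the same load. The workload numbers (3 new transfers per tick, each holding a channel for 2 ticks) and the comparison against 8 channels are assumptions chosen for illustration.

```python
from collections import deque


def count_empty_hits(num_channels, ticks, arrivals_per_tick=3, hold_ticks=2):
    """Count allocation attempts that find the free list empty."""
    free = deque(range(num_channels))
    busy = deque()  # (release_tick, channel), appended in release order
    empty_hits = 0
    for tick in range(ticks):
        # Return channels whose transfers have completed.
        while busy and busy[0][0] <= tick:
            free.append(busy.popleft()[1])
        # Each new transfer needs one free channel.
        for _ in range(arrivals_per_tick):
            if not free:
                empty_hits += 1  # the "empty list" condition
                continue
            busy.append((tick + hold_ticks, free.popleft()))
    return empty_hits


for n in (8, 4):
    print(n, "channels:", count_empty_hits(n, ticks=10), "empty-list hits")
```

With this workload the steady-state demand is 6 channels, so a pool of 8 never runs dry while a pool of 4 hits the empty list on every other tick.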

Given that you have mostly high-speed devices, can you try without the FSM FIQ? Remove dwc_otg.host_channels=4 and instead add dwc_otg.fiq_enable=1 dwc_otg.fiq_fsm_enable=0 to /boot/cmdline.txt. You may get intermittent behaviour from the card reader/FTDI device though.
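Since /boot/cmdline.txt is a single line of space-separated parameters, the swap described above amounts to removing one token and appending two on that same line. A minimal sketch of that edit on the string itself; the helper names and the sample line are hypothetical, and a real cmdline.txt will contain different parameters.

```python
def remove_param(cmdline: str, key: str) -> str:
    """Return cmdline without any `key` or `key=value` token."""
    tokens = [t for t in cmdline.split()
              if t != key and not t.startswith(key + "=")]
    return " ".join(tokens)


def set_param(cmdline: str, key: str, value: str) -> str:
    """Return cmdline with key=value present exactly once, on one line."""
    return f"{remove_param(cmdline, key)} {key}={value}".strip()


# Hypothetical starting point: the earlier host_channels test is still set.
before = "console=tty1 root=/dev/mmcblk0p2 rootwait dwc_otg.host_channels=4"

after = remove_param(before, "dwc_otg.host_channels")
after = set_param(after, "dwc_otg.fiq_enable", "1")
after = set_param(after, "dwc_otg.fiq_fsm_enable", "0")
print(after)
```

Writing the result back to the file and rebooting is left out on purpose; keep a copy of the original line before changing it.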

andersthomson · 2016-10-17T14:09:51Z

I won't go as far as to say that the bug goes away with 4 channels. I've just been able to observe that it didn't make things visibly worse.

To get real confidence, it needs to have exceeded 5-7 days of uptime at least.

Should I keep at it with 4 channels (the family will not like it), or should I go with the new fiq thing?

P33M · 2016-10-17T14:56:53Z

Restart with the FIQ parameters altered. It should provide a better differential diagnosis.

andersthomson · 2016-10-20T12:43:30Z

NULL_Pointer_dereference_329112.txt
Got an oops restart this afternoon (no TV watching active). It too seems to be USB related. The full log is attached.

The i2c errors prior to the oops are a regular occurrence. The effect is that the tuner in question becomes unusable, but the machine does not normally oops. (I have a script which catches this in dmesg and forces a reboot.) In this case, though, they are suspiciously close to the oops in the timeline...

I'll keep using the FIQ thing...
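The watchdog script mentioned above is not shown in this thread. As a rough idea of what such a check could look like, the sketch below counts kernel log lines matching a caller-supplied pattern and reports whether a threshold is reached. The function names and the threshold are assumptions, and the pattern has to be taken from the real i2c messages in the attached logs.

```python
import re
from typing import Iterable


def count_matches(lines: Iterable[str], pattern: str) -> int:
    """Count log lines that match the given regular expression."""
    rx = re.compile(pattern, re.IGNORECASE)
    return sum(1 for line in lines if rx.search(line))


def should_reboot(lines: Iterable[str], pattern: str, threshold: int = 1) -> bool:
    """True once at least `threshold` matching lines have been seen."""
    return count_matches(lines, pattern) >= threshold
```

The lines can come from the output of `dmesg` split on newlines; what to do when `should_reboot` returns true (log, restart tvheadend, reboot) is up to the caller.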

andersthomson · 2016-10-26T17:34:23Z

I had to drop the FIQ thing because it introduced massive drops on HD reception (family complained).

So, with that off, I got another reboot tonight. This one has slightly more info to it:

[261294.489044] WARN::dwc_otg_hcd_handle_hc_fsm:2619: Unexpected state received on hc=4 fsm=1 on transfer to device 10 ep 0x1
[286062.972080] Unable to handle kernel NULL pointer dereference at virtual address 00000034
[286062.980261] pgd = 80004000
[286062.983051] [00000034] *pgd=00000000
[286062.986723] Internal error: Oops: 805 [#1] SMP ARM

Full log attached.
NULL_Pointer_dereference_286076.txt

Did that give any useful clues?
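For reference, the timestamps in the quoted lines can be compared directly. The sketch below parses the two relevant lines exactly as quoted above and computes how long after the FSM warning the oops occurred, plus the faulting address. A small address such as 0x34 is what one would typically expect when a field at that offset is read or written through a NULL struct pointer; which structure is involved is not established here.

```python
import re

LOG = """\
[261294.489044] WARN::dwc_otg_hcd_handle_hc_fsm:2619: Unexpected state received on hc=4 fsm=1 on transfer to device 10 ep 0x1
[286062.972080] Unable to handle kernel NULL pointer dereference at virtual address 00000034
"""

STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def parse(line):
    """Split a dmesg line into (seconds_since_boot, message)."""
    m = STAMP.match(line)
    return (float(m.group(1)), m.group(2)) if m else None


entries = [e for e in map(parse, LOG.splitlines()) if e]
warn_t = next(t for t, msg in entries if msg.startswith("WARN::"))
oops_t = next(t for t, msg in entries if "NULL pointer dereference" in msg)
gap_s = oops_t - warn_t

addr_hex = re.search(r"virtual address ([0-9a-fA-F]+)", LOG).group(1)
fault_addr = int(addr_hex, 16)

print(f"gap: {gap_s:.0f} s ({gap_s / 3600:.1f} h), fault address: {fault_addr:#x}")
```

In this log the warning precedes the oops by roughly 6.9 hours, so the two are not adjacent in time.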

andersthomson · 2016-11-15T16:02:47Z

Hi @P33M
I've now run with and without the fiq cmdline.txt option and I'm as certain as can be that you are on to something here. I've had one fiq run of 8 days (most uptime ever on this rpi2!), and it only got rebooted due to a house-wide power outage.

The only downside to running with fiq is that there is a little bit of packet drops (enough to bother the family).

What would be the next step in isolating this?

P33M · 2016-11-15T17:42:10Z

I've had another look at the logs and it seems like there's a pattern here.

In 3 of the logs you've provided, things had already gone wrong "for other reasons" before the panic: one "SYN flood detected (?)", one unexpected FSM state, and one flurry of broken i2c transfers (which you say happens quite often). Each of these resulted in dmesg errors and, as you say, some of them are suspiciously close in time to the null pointer dereference.

I'd wager a broken error-path that only rarely gets triggered is causing the null pointer later on - gives an avenue of investigation at least.

andersthomson · 2016-11-16T16:17:50Z

The SYN flood thing can be traced to tvheadend having an odd GUI. On every first connect to it after a reboot, I get that warning. I suspect it closes the TCP connection after each HTTP response or some such, and the object-rich GUI yields a ton of connect(2)s in a short time.

I tend to agree with the guess of a broken error path. I have no idea what paths (error or otherwise) are changed by the fiq cmdline hack, though. Would moving to a newer (kernel.org) kernel change the relevant paths? If I read rpi-kernel@ right, there has been substantial work on the USB host side (swapping one driver for another or some such).

Any other ideas on how to narrow things down?

P33M · 2017-05-18T11:17:14Z

Dupe of #830 - latest rpi-update has a fix for the crash. Reopen if this still occurs with latest updated kernel.

P33M mentioned this issue Oct 14, 2016

oops in dwc_otg #1654

Closed

P33M closed this as completed May 18, 2017
