NULL pointer dereference #1681


Closed
andersthomson opened this issue Oct 13, 2016 · 18 comments

Comments

@andersthomson

Hi,

I got this on a serial port and then the system rebooted.

NULL_Pointer_dereference_1.txt
usb.txt

@popcornmix
Collaborator

Always include the complete dmesg log rather than just the point where the error occurs; there may have been something interesting earlier.
@P33M any ideas?

@andersthomson
Author

Here's the full log from power on.

Sifting further back in minicom, I found another NULL pointer error which, to the untrained eye, appears to be the same thing (PC and LR are at the same place). Should I add it to this bug or start a new one?

NULL_Pointer_dereference_1_full.txt

@pelwell
Contributor

pelwell commented Oct 14, 2016

Add it here - it's probably related.

The full log shows how much USB equipment you have connected to your Pi, but it doesn't indicate what was active at the time. Please state what the Pi was doing at the time of the crash, including stream bitrates etc.

@andersthomson
Author

andersthomson commented Oct 14, 2016

NULL_Pointer_dereference_137505.txt

Here's the other one.

The first one crashed at about 03:00 in the morning. The only background activity I have at that time is tvheadend scanning muxes for new channels. That's something it does continuously using free tuners (I have two of them).

I'm guessing a bit here, but any tuning to a mux results in the whole mux being pulled over USB (say ~10 channels at 2 Mbps each), and tvheadend checks the transport stream for the channel, or channel info, required.

I've sustained watching two channels on different muxes while recording a third. That means two full transport streams in, 2 * 2 Mbps out over the Ethernet, and 1 * 2 Mbps to the USB HD. So the midnight idle scanning should be fairly light in comparison.
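As a rough tally of the figures above, the sustained load can be added up as follows. This is only a back-of-the-envelope sketch: the channel count and per-channel bitrate are the guessed round numbers from this comment, not measurements.

```python
# Back-of-the-envelope tally using the round figures quoted above.
CHANNELS_PER_MUX = 10   # "say ~10 channels" per mux (a guess)
CHANNEL_MBPS = 2        # "2 Mbps each" (a guess)

# One full transport stream, i.e. everything carried on a single mux.
mux_mbps = CHANNELS_PER_MUX * CHANNEL_MBPS

tuners_in = 2 * mux_mbps      # two tuners, each pulling a whole mux over USB
eth_out = 2 * CHANNEL_MBPS    # two channels streamed out over Ethernet
hdd_out = 1 * CHANNEL_MBPS    # one channel recorded to the USB HD

total_mbps = tuners_in + eth_out + hdd_out
print(f"per mux: {mux_mbps} Mbps, tuners in: {tuners_in} Mbps, total: {total_mbps} Mbps")
```

With these assumed numbers the tuner input dominates the total, which is why a single idle scan (one mux in, nothing out) should be light by comparison.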

@P33M
Contributor

P33M commented Oct 14, 2016

Effectively you have 2x TV tuners and 1x USB HDD active, is that correct?

What do the Telldus/FTDI devices do - are they active as well?

@P33M P33M mentioned this issue Oct 14, 2016
@andersthomson
Author

Yes that is right.

Telldus is a home automation thing to speak over 433 MHz to gadgets (lamps on/off). It did not do anything at the time.

FTDI is a smartcard reader used while descrambling encoded TV channels. The effective process is that tvh notices that a stream is encrypted, sends a few keys (~32 bytes) over to the card (via USB/FTDI/card) and gets a few keys back. Those keys are used to decode the MPEG-TS until a new incoming key is detected.

No tv watching/recording should mean no descrambling going on.

@P33M
Contributor

P33M commented Oct 14, 2016

A test you can do is to boot with the following parameter in /boot/cmdline.txt: dwc_otg.host_channels=4 - this may provoke the error earlier or more often. It will artificially limit performance, though - you may find that recording becomes intermittent.
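For illustration, here is a minimal sketch of adding such a parameter to a kernel command line string. The helper name `set_cmdline_param` and the example command line are hypothetical; the only parts taken from this thread are the parameter name and value. The kernel command line in cmdline.txt is a single line of space-separated tokens, so the helper keeps everything on one line.

```python
def set_cmdline_param(cmdline: str, key: str, value: str) -> str:
    """Return the command line with key=value set, replacing any existing entry for key."""
    tokens = cmdline.split()
    # Drop any previous setting of the same key so it is not given twice.
    kept = [t for t in tokens if t != key and not t.startswith(key + "=")]
    kept.append(f"{key}={value}")
    # Re-join with single spaces: the result stays on one line.
    return " ".join(kept)


# Hypothetical starting command line, for demonstration only.
example = "console=serial0,115200 root=/dev/mmcblk0p2 rootwait"
updated = set_cmdline_param(example, "dwc_otg.host_channels", "4")
print(updated)
```

Writing the result back to /boot/cmdline.txt (and keeping a backup of the original) is left out of the sketch on purpose; the change only takes effect after a reboot.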

@andersthomson
Author

Alright, I'll take that for a spin.

@andersthomson
Author

Tried this, and the observable effect is that, under load, it appeared to drop packets (visible on the TVs). However, it did not trigger a hang, reboot, backtrace or anything of the sort.

@P33M
Contributor

P33M commented Oct 17, 2016

People have bumped into this before in rare cases: somehow the list of host channels gets corrupted, which results in the null pointer dereference. Grepping for where the list is manipulated turns up code that is all under the global driver spinlock. One theory was that repeatedly running up against an empty-list condition somehow triggered the bug.

Reducing the number of host channels to 4 would artificially provoke the empty-list case far more often (in your case visible by dropped packets). If the bug goes away when the maximum number of available host channels is reduced, this is the opposite of what I would expect.
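To make the empty-list argument concrete, here is a toy model of a free list of host channels. It is not the dwc_otg driver code: the `ChannelPool` class and `count_starved` helper are invented for illustration, and the pool size of 8 used for comparison is an assumption. It only shows why a smaller pool hits the empty case more often for the same number of concurrent requests.

```python
from collections import deque
from typing import Optional


class ChannelPool:
    """Toy free list of host-channel numbers (illustrative, not driver code)."""

    def __init__(self, count: int) -> None:
        self.free = deque(range(count))

    def acquire(self) -> Optional[int]:
        # Empty-list case: nothing to hand out, so the caller gets None
        # and must be prepared to handle it.
        return self.free.popleft() if self.free else None

    def release(self, channel: int) -> None:
        self.free.append(channel)


def count_starved(pool_size: int, requests: int) -> int:
    """Issue `requests` acquires with no releases and count those that find the list empty."""
    pool = ChannelPool(pool_size)
    return sum(1 for _ in range(requests) if pool.acquire() is None)


# Six concurrent transfers against an assumed 8-channel pool and a 4-channel pool.
print(count_starved(8, 6), count_starved(4, 6))
```

In this model a caller that ignores the `None` result is the analogue of dereferencing a null pointer; a bug tied to the empty case would therefore be expected to show up more often, not less, with 4 channels.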

Given that you have mostly high-speed devices, can you try without the FSM FIQ? Remove dwc_otg.host_channels=4 and instead add dwc_otg.fiq_enable=1 dwc_otg.fiq_fsm_enable=0 to /boot/cmdline.txt. You may get intermittent behaviour from the card reader/FTDI device though.
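After rebooting, it can be worth confirming which dwc_otg parameters the kernel actually received. The sketch below parses a command line string for `dwc_otg.*` settings; the function name `dwc_otg_params` and the sample string are hypothetical, and on a running system the text would be read from /proc/cmdline instead.

```python
def dwc_otg_params(cmdline: str) -> dict:
    """Collect dwc_otg.* key=value pairs from a kernel command line string."""
    params = {}
    for token in cmdline.split():
        if token.startswith("dwc_otg.") and "=" in token:
            key, _, value = token.partition("=")
            params[key] = value
    return params


# Hypothetical command line for demonstration; on the Pi this would be
# the contents of /proc/cmdline.
sample = "console=tty1 dwc_otg.fiq_enable=1 dwc_otg.fiq_fsm_enable=0 rootwait"
active = dwc_otg_params(sample)
print(active)
```

For the test suggested here, the expected result is that `dwc_otg.fiq_enable` is 1, `dwc_otg.fiq_fsm_enable` is 0, and `dwc_otg.host_channels` is no longer present.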

@andersthomson
Author

I won't go as far as to say that the bug goes away with 4 channels. I've just been able to observe that it didn't make things visibly worse.

To get real confidence, it needs to have exceeded 5-7 days of uptime at least.

Should I keep at it with 4 channels (the family will not like it), or should I go with the new fiq thing?

@P33M
Contributor

P33M commented Oct 17, 2016

Restart with the FIQ parameters altered. It should provide a better differential diagnosis.

@andersthomson
Author

andersthomson commented Oct 20, 2016

NULL_Pointer_dereference_329112.txt
Got an oops and restart this afternoon (no TV watching active). It too seems to be USB related. The full log is attached.

The i2c errors prior to the oops are a regular occurrence. The effect is that the tuner in question becomes unusable, but the machine does not normally oops. (I have a script which catches this in dmesg and forces a reboot.) In this case, though, they seem suspiciously close to the oops in the timeline...
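The script itself is not shown in this thread, but the idea can be sketched roughly as below. Everything here is an assumption made for illustration: the error pattern, the threshold and the sample lines are invented, since the real i2c error text is only in the attached log.

```python
import re

# Hypothetical pattern: the real i2c error text is in the attached log,
# not reproduced in this thread.
I2C_ERROR = re.compile(r"i2c.*(error|failed)", re.IGNORECASE)


def needs_reboot(dmesg_lines, threshold: int = 3) -> bool:
    """Return True once at least `threshold` lines match the error pattern."""
    hits = sum(1 for line in dmesg_lines if I2C_ERROR.search(line))
    return hits >= threshold


# Invented sample lines, for demonstration only.
sample = [
    "[100.000001] usb 1-1.2: new high-speed USB device",
    "[200.000002] tuner: i2c transfer failed",
    "[200.100003] tuner: i2c transfer failed",
    "[200.200004] tuner: i2c transfer failed",
]
print(needs_reboot(sample))
```

In real use the lines would come from the output of `dmesg`, and the reboot would be a separate step taken when the function returns True; both are left out so the sketch has no side effects.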

I'll keep using the FIQ thing...

@andersthomson
Author

I had to drop the FIQ thing because it introduced massive drops on HD reception (family complained).

So, with that off, I got another reboot tonight. This one has slightly more info to it:

[261294.489044] WARN::dwc_otg_hcd_handle_hc_fsm:2619: Unexpected state received on hc=4 fsm=1 on transfer to device 10 ep 0x1
[286062.972080] Unable to handle kernel NULL pointer dereference at virtual address 00000034
[286062.980261] pgd = 80004000
[286062.983051] [00000034] *pgd=00000000
[286062.986723] Internal error: Oops: 805 [#1] SMP ARM
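Since dmesg timestamps are seconds since boot, the gap between the warning and the oops can be read straight off the two lines above. A small sketch of that arithmetic (the regular expression and helper name are just one way to do it):

```python
import re

# The two relevant lines from the excerpt above.
LOG = """\
[261294.489044] WARN::dwc_otg_hcd_handle_hc_fsm:2619: Unexpected state received on hc=4 fsm=1 on transfer to device 10 ep 0x1
[286062.972080] Unable to handle kernel NULL pointer dereference at virtual address 00000034
"""

# Matches the leading "[seconds.microseconds]" stamp of a dmesg line.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def timestamps(text: str) -> list:
    """Extract the leading timestamp (in seconds) from each stamped line."""
    return [float(m.group(1)) for line in text.splitlines() if (m := STAMP.match(line))]


warn_t, oops_t = timestamps(LOG)
gap_hours = (oops_t - warn_t) / 3600
print(f"{gap_hours:.1f} hours between the warning and the oops")
```

The two events are close to seven hours apart, so if the warning is related to the crash, the damage would have to sit dormant for a long time before being hit.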

Full log attached.
NULL_Pointer_dereference_286076.txt

Did that give any useful clues?

@andersthomson
Author

Hi @P33M
I've now run with and without the fiq cmdline.txt option and I'm as certain as can be that you are on to something here. I've had one fiq run of 8 days (the most uptime ever on this rpi2!), and it only got rebooted due to a house-wide power outage.

The only downside to running with fiq is that there are some packet drops (enough to bother the family).

What would be the next step in isolating this?

@P33M
Contributor

P33M commented Nov 15, 2016

I've had another look at the logs and it seems like there's a pattern here.

In 3 of the logs you've provided, things had already gone wrong "for other reasons" before the panic: one "SYN flood detected (?)", one unexpected FSM state and one flurry of broken i2c transfers (which you say happens quite often). Each of these has resulted in dmesg errors, some of them, as you say, suspiciously close in time to the null pointer dereference.

I'd wager a broken error-path that only rarely gets triggered is causing the null pointer later on - gives an avenue of investigation at least.

@andersthomson
Author

The SYN flood thing can be traced to tvheadend having an odd GUI. On every first connection to it after a reboot, I get that warning. I suspect it terminates the TCP connection after each HTTP response or some such, and the object-rich GUI yields a ton of connect(2)s in a short time.

I tend to agree with the guess of a broken error path. I have no idea what paths (error or otherwise) are changed by the fiq cmdline hack, though. Would moving to a newer (kernel.org) kernel change the relevant paths? If I read rpi-kernel@ right, there has been substantial work on the USB host side (swapping one driver for another or some such).

Any other ideas on how to narrow things down?

@P33M
Contributor

P33M commented May 18, 2017

Dupe of #830 - latest rpi-update has a fix for the crash. Reopen if this still occurs with latest updated kernel.

@P33M P33M closed this as completed May 18, 2017

4 participants