Skip to content

HQ Camera stream crashed with a kernel stack trace - vchiq_core.c:843 recycle_func+0x13c/0x4f8 #5133

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
olokos opened this issue Aug 19, 2022 · 14 comments

Comments

@olokos
Copy link

olokos commented Aug 19, 2022

Describe the bug

I have a Pi 4B 4GB with HQ Camera hooked up via the ribbon cable
I am doing almost 24/7 recording with libcamera-vid timeout being 0

This setup works very stable with hours on end, but today the stream crashed and there was a stack trace in dmesg

Steps to reproduce the behaviour

libcamera-vid --denoise cdn_off --mode 1332:990:10 --framerate 90 --width 1332 --height 990 --level 4.2 -n -t 0 --tuning-file /usr/share/libcamera/ipa/raspberrypi/imx477.json --inline -o - | ffmpeg -i pipe: -codec copy -f rtsp rtsp://127.0.0.1:8554/mystream

which utilizes
https://github.com/aler9/rtsp-simple-server/ in order to get a valid RTSP stream out of this.
and then just wait until it crashes.

Device (s)

Raspberry Pi 4 Mod. B

System

Raspberry Pi reference 2022-04-04
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 226b479f8d32919c9fe36dd5b4c20c02682f8180, stage4

Aug 9 2022 13:44:40
Copyright (c) 2012 Broadcom
version 273b410636cf8854ca35af91fd738a3d5f8b39b6 (clean) (release) (start)

Linux pi 5.15.56-v7l+ #1575 SMP Fri Jul 22 20:29:46 BST 2022 armv7l GNU/Linux

Logs

messages-cropped.log
The error message is in line 717.

rtsp.log
Above is the log from the rtsp-simple-server which also provided an error message, but that was the only error message, previous lines are normal operation, until that error comes.

Additional context

I did one reboot with sudo reboot but I think that's the last reboot visible in the logs above

libcamera-apps build: e1beb4512178-intree 14-08-2022 (22:37:46)
libcamera build: v0.0.0+3830-dfc6d711

If logs from

@pelwell
Copy link
Contributor

pelwell commented Aug 19, 2022

Has this happened more than once? I only see one stack trace in the attached log.

@olokos
Copy link
Author

olokos commented Aug 19, 2022

This has happened only once, just today and I haven't really experienced a similar bug earlier at all, but it will most likely come up eventually again.

@pelwell
Copy link
Contributor

pelwell commented Aug 19, 2022

Aug 19 13:47:49 pi kernel: [ 4086.511832] vchiq: 0: sf ???@f6755625 (1052->1823) - invalid/closed service 1823
Aug 19 13:47:49 pi kernel: [ 4086.511891] vchiq: pfq - pos 2000008: header c00260bb, msgid 98393bc0, header->msgid 98393bc0, header->size 2000000

The bit of code in which the error occurred - the VCHIQ communications protocol - has been in all Raspberry Pis since day 1, largely unaltered (at least in the basic message handling), and I've never seen an error message like this before. The nature of the error smacks of memory corruption, which means that VCHIQ is likely to be an innocent victim - collateral damage - rather than the culprit. I say this because it implies that the cause of the crash will be much harder to discover, but I'd put money on it being the CSI peripheral.

@olokos
Copy link
Author

olokos commented Aug 19, 2022

The only thing that comes to my mind that could cause a collateral bug, would be heat and overheating in general.

Any ideas what are the downclocking and critical temps for Pi4B?
Is there perhaps some way to efficiently monitor the thermals of Pi4B?

I know there's a way to just display if it overheated in general since boot or not, but is there something that would allow me to link the overheating moment with specific temperature or at least time of throttling?

Perhaps there are some traces of overheating in this dmesg log above?

@pelwell
Copy link
Contributor

pelwell commented Aug 19, 2022

Overheating and under-voltage are possible explanations for memory corruption, but without any evidence for either I stand by my earlier comment:

I'd put money on it being the CSI peripheral.

by which I mean that data arriving from the camera is like water from a firehose which is being liberally sprayed by DMA into system memory, Any hiccups could easily cause image data to be written to the wrong location. Stream shutdown is the most likely scenario for a problem, which I don't think is the case here, but for me it's still the number one suspect.

@olokos
Copy link
Author

olokos commented Aug 19, 2022

pi-crash.zip
Here are all of the logs that I've been able to gather regarding this incident, including a massive messages.log from multiple days

@olokos
Copy link
Author

olokos commented Aug 19, 2022

In addition, there is some not yet discovered issue, which causes around 1% frames drop at 120FPS mode.

More details here:
raspberrypi/rpicam-apps#336 (comment)

CTRL+F 120 fps and drop around 1% of frames for fast-forwarding to the exact moment I'm reffering to.

I've had RTSP stream opened on another PC on LAN network and it reports some dropped frames:
image

But keep in mind that the above dropped frames on my screenshot, might be because of my network, using a wifi repeater that then connects to Pi via ethernet and it won't get much better than that, because of the placement of Pi being very far away from the original router and pulling RJ45 isn't viable right now.

Maybe there's some command that I could use with libcamera-vid to quickly get dropped frames (in case of an issue), but without using networking or preview?

I was thinking of testing this with
libcamera-vid --mode 1332:990:10 --framerate 120 --width 1332 --height 990 -n -t 5000 -o -
but maybe there's a better way to test for dropped frames, skipping networking and I/O and pushing it even harder?

@olokos
Copy link
Author

olokos commented Aug 22, 2022

It has just crashed once again for me around 2am, with basically the same error message as before.

Same place in code where it crashed aswell:
PC is at recycle_func+0x1e4/0x4f8

Here are fresh logs again:
https://pastebin.com/b7SeaGn0

What makes me wonder about those crashes is this:
vc4-drm gpu: [drm] Cannot find any crtc or sizes
imx477 10-001a: Device found is imx477
imx477 10-001a: Consider updating driver imx477 to match on endpoints

What does the dmesg refer to when mentioning "driver imx477"?
I am running the latest code from github for libcamera-apps and I'm around 10 commits behind libcamera itself.

I am most likely running the most recent kernel release, also on the most modern camera stack, RPi eeprom is also up-to-date.

Is the "imx477 driver" referring to the driver within the kernel source code, which is out of my reach, unless I build it myself?
Maybe I've missed something to update?

I think we can also dismiss thermals causing this error, since this time it happened at 02:45am, so it was most likely the coldest time of the day, while it was running perfectly fine at 70-80*C max and did not crash. (don't worry, I will soon get an additional fan, that will move the air around the heatsink, so those temps will drop)

I noticed something very interesting, as somebody who used to do kernel development for Android (Xperia Z1), namely when there's a kernel panic, the entire system should lock-up unless power cycled.
This isn't the case here (which is very good) as the system restarts the entire Pi once this bug happens, but the interesting part here, is that unless I do sudo reboot, the system won't automatically bring up the libcamera-vid powered RTSP stream, so the app consuming this stream would run as expected.

Since I didn't have much time in the morning, I did sudo reboot when I noticed that the camera crashed again and it's not live-streaming anymore, which works around the problem, but considering there's a full OS reboot during this crash, everything should be back to normal and my services should start as they always do, but they don't.

I might be a bit wrong about my last 2 paragraphs above, it might be due to some different issue that the camera doesn't come back up as expected after the crash, but when the system runs as normal and I do sudo reboot then it will every single time bring up the camera stack and the app to consume the stream, which I THINK isn't the case after this bug happens and the Pi reboots itself, despite having access to SSH I still had to sudo reboot after the crash to get back to normal state.

@olokos
Copy link
Author

olokos commented Aug 30, 2022

For several days the system hasn't crashed yet, despite running at 100FPS now, there was 0.9% frames dropped, but I think it's still a good result. I couldn't get a stream above 100fps to output any data

@pelwell
Copy link
Contributor

pelwell commented Sep 2, 2022

I'd put money on it being the CSI peripheral.

I should have placed that bet. See d331f45 for the fix.

@olokos
Copy link
Author

olokos commented Sep 2, 2022

Awesome news, I was just reading on the other issue! Good job @pelwell !
Will simply sudo raspi-update update my kernel to a version with this fix or it's not released yet?

@pelwell
Copy link
Contributor

pelwell commented Sep 2, 2022

I just whipped the other engineers until they fixed it.

We haven't yet released a kernel build including this fix, but there should be one in the next few days.

@olokos
Copy link
Author

olokos commented Sep 9, 2022

I just whipped the other engineers until they fixed it.

We haven't yet released a kernel build including this fix, but there should be one in the next few days.

Thank You! :D
I am now running
Linux pi 5.15.65-v7l+ #1583 SMP Wed Sep 7 15:41:50 BST 2022 armv7l
Since this kernel was built 2 days ago, I assume that the fix is already merged into it's source code, right? @pelwell
If it's indeed already merged, feel free to close this issue.

@pelwell
Copy link
Contributor

pelwell commented Sep 9, 2022

Yes, the fix went in on the 5th - see "Bump to 5.15.63" in this list of releases: https://github.com/raspberrypi/rpi-firmware/commits/master

@pelwell pelwell closed this as completed Sep 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants