Skip to content

Video decoder crash every few minutes #1627

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Malvineous opened this issue Sep 19, 2021 · 10 comments
Closed

Video decoder crash every few minutes #1627

Malvineous opened this issue Sep 19, 2021 · 10 comments

Comments

@Malvineous
Copy link

There seems to be some bugs in the H264 decoder. I am streaming video from another Raspberry Pi, and it was working fine, until I made the video a few pixels taller so I could add the time, date and other info in a bar along the bottom of the video. (The video went from 1296x972 to 1280x1008. I had to reduce the width due to #1608 but I don't think it's related.)

Since then, the playback stops randomly. I am playing two video streams at the same time (only one of which has had its size changed and started this problem), and when this bug is triggered, both streams break. Sometimes part of the picture from one video ends up showing in the other video's window, sometimes the video becomes very badly corrupted (turns green and magenta) but mostly the video just stops updating leaving only the last frame visible.

In order to fix it I have to exit both video players and restart - closing only one player is not enough. Sometimes only restarting one player will show no video, but sometimes it comes up with the last stuck frame still visible. Exiting all players and reloading them all gets this working again, but only for another few minutes until it crashes again.

I get various messages in the logs when this happens. Here is one exerpt:

[ 1645.031404] bcm2835_mmal_vchiq: timed out waiting for sync completion
[ 1648.103401] bcm2835_mmal_vchiq: timed out waiting for sync completion
[ 1648.103439] bcm2835-codec bcm2835-codec: bcm2835_codec_stop_streaming: Failed disabling o/p port, ret -62
[ 1650.119420] bcm2835-codec bcm2835-codec: bcm2835_codec_stop_streaming: Timeout waiting for buffers to be returned - 19 outstanding
[ 1650.119534] ------------[ cut here ]------------
[ 1650.119577] WARNING: CPU: 2 PID: 773 at drivers/media/common/videobuf2/videobuf2-core.c:1945 __vb2_queue_cancel+0x1b0/0x220 [videobuf2_common]
[ 1650.119585] Modules linked in: rpcsec_gss_krb5 option usb_wwan vc4 usbserial btsdio cec bcm2835_codec(C) bcm2835_isp(C) bluetooth bcm2835_v4l2(C) bcm2835_mmal_vchiq(C) drm_kms_helper v4l2_mem2mem ecdh_generic videobuf2_vmalloc ecc videobuf2_dma_contig snd_soc_core videobuf2_memops videobuf2_v4l2 snd_compress videobuf2_common snd_pcm_dmaengine videodev v3d gpu_sched raspberrypi_hwmon iproc_rng200 mc snd_pcm snd_timer snd syscopyarea sysfillrect sysimgblt fb_sys_fops rpivid_mem uio_pdrv_genirq nvmem_rmem uio sch_fq_codel drm fuse drm_panel_orientation_quirks ip_tables x_tables ipv6 sr_mod cdrom brcmfmac cfg80211 brcmutil
[ 1650.119832] CPU: 2 PID: 773 Comm: video_decoder Tainted: G         C        5.10.52-4-ARCH #1
[ 1650.119842] Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT)
[ 1650.119855] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
[ 1650.119874] pc : __vb2_queue_cancel+0x1b0/0x220 [videobuf2_common]
[ 1650.119893] lr : __vb2_queue_cancel+0x34/0x220 [videobuf2_common]
[ 1650.119901] sp : ffffffc01a61baf0
[ 1650.119909] x29: ffffffc01a61baf0 x28: 0000007f94b91750 
[ 1650.119925] x27: ffffffc010fdc000 x26: ffffffc01a61be50 
[ 1650.119941] x25: ffffff8043dab868 x24: ffffff8069c05f40 
[ 1650.119955] x23: ffffff8043d602a0 x22: ffffff8043baa300 
[ 1650.119971] x21: ffffff80428d7028 x20: ffffff80428d70d0 
[ 1650.119985] x19: ffffff80428d7028 x18: 0000000000000030 
[ 1650.120000] x17: 0000000000000000 x16: 0000000000000000 
[ 1650.120015] x15: ffffff8069c05e80 x14: 00000000000001d2 
[ 1650.120029] x13: ffffff8069c05e80 x12: ffffffc01a61b760 
[ 1650.120043] x11: ffffffc01123bae0 x10: 00000000fffff000 
[ 1650.120058] x9 : ffffffc01123bae0 x8 : ffffffc0111e3ae0 
[ 1650.120072] x7 : ffffffc01123bae0 x6 : 0000000000000000 
[ 1650.120086] x5 : 0000000000000000 x4 : ffffff80fb85d6c8 
[ 1650.120101] x3 : ffffffc010fdc000 x2 : 6c8354effb15d400 
[ 1650.120116] x1 : 00000000000055ac x0 : 0000000000000013 
[ 1650.120132] Call trace:
[ 1650.120152]  __vb2_queue_cancel+0x1b0/0x220 [videobuf2_common]
[ 1650.120170]  vb2_core_queue_release+0x24/0x50 [videobuf2_common]
[ 1650.120188]  vb2_queue_release+0x10/0x20 [videobuf2_v4l2]
[ 1650.120210]  v4l2_m2m_ctx_release+0x20/0x40 [v4l2_mem2mem]
[ 1650.120225]  bcm2835_codec_release+0x5c/0xf4 [bcm2835_codec]
[ 1650.120288]  v4l2_release+0xbc/0xf0 [videodev]
[ 1650.120304]  __fput+0x78/0x230
[ 1650.120314]  ____fput+0x10/0x20
[ 1650.120327]  task_work_run+0x80/0x140
[ 1650.120339]  do_exit+0x32c/0xa10
[ 1650.120348]  do_group_exit+0x38/0xa0
[ 1650.120359]  get_signal+0x10c/0x844
[ 1650.120372]  do_notify_resume+0x188/0x9d0
[ 1650.120381]  work_pending+0xc/0x43c
[ 1650.120390] ---[ end trace 5fd4ec3580a88085 ]---
[ 1650.120406] videobuf2_common: driver bug: stop_streaming operation is leaving buf 00000000ac96b1de in active state
[ 1650.120420] videobuf2_common: driver bug: stop_streaming operation is leaving buf 000000003219955b in active state
[ 1650.120432] videobuf2_common: driver bug: stop_streaming operation is leaving buf 00000000e7922934 in active state
[ 1650.120444] videobuf2_common: driver bug: stop_streaming operation is leaving buf 00000000253d780c in active state
[ 1650.120454] videobuf2_common: driver bug: stop_streaming operation is leaving buf 000000003380b69f in active state
[ 1650.120463] videobuf2_common: driver bug: stop_streaming operation is leaving buf 000000002566c086 in active state
[ 1650.120474] videobuf2_common: driver bug: stop_streaming operation is leaving buf 000000005c8d5e2a in active state
[ 1650.120485] videobuf2_common: driver bug: stop_streaming operation is leaving buf 000000009293a3da in active state
[ 1650.120494] videobuf2_common: driver bug: stop_streaming operation is leaving buf 00000000f9455a5d in active state
[ 1650.120504] videobuf2_common: driver bug: stop_streaming operation is leaving buf 00000000b31fb82e in active state
[ 1650.120515] videobuf2_common: driver bug: stop_streaming operation is leaving buf 0000000060988941 in active state
[ 1650.120525] videobuf2_common: driver bug: stop_streaming operation is leaving buf 000000001d57d7a5 in active state
[ 1650.120535] videobuf2_common: driver bug: stop_streaming operation is leaving buf 000000000688a4ea in active state
[ 1650.120544] videobuf2_common: driver bug: stop_streaming operation is leaving buf 000000007eb7ac4f in active state
[ 1650.120556] videobuf2_common: driver bug: stop_streaming operation is leaving buf 00000000f938963d in active state
[ 1650.120565] videobuf2_common: driver bug: stop_streaming operation is leaving buf 0000000089ad923c in active state
[ 1650.120576] videobuf2_common: driver bug: stop_streaming operation is leaving buf 000000000a274fa8 in active state
[ 1650.120586] videobuf2_common: driver bug: stop_streaming operation is leaving buf 000000006ccfd1f0 in active state
[ 1650.120596] videobuf2_common: driver bug: stop_streaming operation is leaving buf 000000000995f2d6 in active state
[ 1650.728701] bcm2835_mmal_vchiq: received invalid message context 25!
[ 1650.728741] bcm2835_mmal_vchiq: drvbuf.client_context(80) is invalid
[ 1650.728755] bcm2835_mmal_vchiq: drvbuf.client_context(81) is invalid
[ 1650.728769] bcm2835_mmal_vchiq: drvbuf.client_context(82) is invalid
[ 1650.728782] bcm2835_mmal_vchiq: drvbuf.client_context(63) is invalid
[ 1650.728796] bcm2835_mmal_vchiq: drvbuf.client_context(64) is invalid
[ 1650.728810] bcm2835_mmal_vchiq: drvbuf.client_context(65) is invalid
[ 1650.728824] bcm2835_mmal_vchiq: drvbuf.client_context(66) is invalid
[ 1650.728837] bcm2835_mmal_vchiq: drvbuf.client_context(67) is invalid
[ 1650.728886] bcm2835_mmal_vchiq: drvbuf.client_context(68) is invalid
[ 1650.728903] bcm2835_mmal_vchiq: drvbuf.client_context(70) is invalid
[ 1650.728917] bcm2835_mmal_vchiq: drvbuf.client_context(69) is invalid
[ 1650.728980] bcm2835_mmal_vchiq: drvbuf.client_context(71) is invalid
[ 1650.728995] bcm2835_mmal_vchiq: drvbuf.client_context(72) is invalid
[ 1650.729010] bcm2835_mmal_vchiq: drvbuf.client_context(73) is invalid
[ 1650.729022] bcm2835_mmal_vchiq: drvbuf.client_context(74) is invalid
[ 1650.729037] bcm2835_mmal_vchiq: drvbuf.client_context(75) is invalid
[ 1650.729051] bcm2835_mmal_vchiq: drvbuf.client_context(76) is invalid
[ 1650.729065] bcm2835_mmal_vchiq: drvbuf.client_context(77) is invalid
[ 1650.729078] bcm2835_mmal_vchiq: drvbuf.client_context(78) is invalid
[ 1650.828863] bcm2835-codec bcm2835-codec: bcm2835_codec_stop_streaming: Failed enabling component, ret -22
[ 1650.828961] bcm2835_mmal_vchiq: received invalid message context 21!
[ 1653.991452] bcm2835_mmal_vchiq: timed out waiting for sync completion
[ 1657.063441] bcm2835_mmal_vchiq: timed out waiting for sync completion
[ 1657.063481] bcm2835_mmal_vchiq: vchiq_mmal_component_init: failed to create component -62 (Not enough GPU mem?)
[ 1657.063495] bcm2835-codec bcm2835-codec: bcm2835_codec_create_component: failed to create component ril.video_decode
[ 1660.139442] bcm2835_mmal_vchiq: timed out waiting for sync completion
[ 1660.139483] bcm2835_mmal_vchiq: vchiq_mmal_component_init: failed to create component -62 (Not enough GPU mem?)
[ 1660.139496] bcm2835-codec bcm2835-codec: bcm2835_codec_create_component: failed to create component ril.video_decode
[ 1663.207476] bcm2835_mmal_vchiq: timed out waiting for sync completion
[ 1663.207516] bcm2835_mmal_vchiq: vchiq_mmal_component_init: failed to create component -62 (Not enough GPU mem?)
[ 1663.207535] bcm2835-codec bcm2835-codec: bcm2835_codec_create_component: failed to create component ril.video_decode
[ 1663.626347] bcm2835_mmal_vchiq: received invalid message context 5!
[ 1663.628625] bcm2835-codec bcm2835-codec: bcm2835_codec_create_component: failed to create component ril.video_decode
[ 1663.629371] bcm2835_mmal_vchiq: received invalid message context 1!
[ 1663.629395] bcm2835_mmal_vchiq: received invalid message context 1!
[ 1663.629626] bcm2835_mmal_vchiq: received invalid message context 1!
[ 1668.839518] bcm2835-codec bcm2835-codec: bcm2835_codec_stop_streaming: Timeout waiting for buffers to be returned - 19 outstanding

I tried increasing the GPU mem as suggested. It was initially gpu_mem=128 which was fine until I enlarged the video, now I get the same messages even if I specify gpu_mem=384. I can't go large than this or the Pi won't boot. I have cma=300M but changing this value doesn't seem to do anything.

System

  • Which model of Raspberry Pi? Pi 4
  • Which OS and version (cat /etc/rpi-issue)? Arch Linux ARM 64-bit
  • Which firmware version (vcgencmd version)?
Sep 14 2021 16:51:17 
Copyright (c) 2012 Broadcom
version 1f74fc33fa785915fe0d7bd6d1cdfa220bc1ed7b (clean) (release) (start)
  • Which kernel version (uname -a)?
    Linux 5.10.63-8-ARCH #1 SMP PREEMPT Thu Sep 16 14:24:31 UTC 2021 aarch64 GNU/Linux
@hailfinger
Copy link

This may be related to raspberrypi/linux#4592
Setting over_voltage=2 in config.txt may help. For me, over_voltage=2 helped on some devices, and on others it increased the time until hang. Increasing over_voltage up to 6 may be necessary.

Please check if "vcdbg log msg" contains messages with "recycling bust".

@Malvineous
Copy link
Author

Thanks for the tip, will give it a go! I've actually spent the last 12 hours replacing the Pi4 with an old Intel NUC because the problem was getting so bad. Not only was the video freezing but the Pi would get into a reboot loop and it would only display maybe 10 seconds of video before rebooting. After five or six reboots it would freeze at the rainbow screen, and since the watchdog can't be activated during boot, it was frozen until it was power cycled. I'm currently investigating watchdog alternatives as it seems it is very easy for the Pi to freeze during boot - I've lost count of the number of times I have had to power cycle it over the last few days. I tried to compare the UART output from a frozen and successful boot but there wasn't anything glaringly obvious, although sometimes it would randomly request a different file from the TFTP server which is odd.

I never thought to try over_voltage but already now we're approaching summer I've noticed it's started flashing up over-temperature warnings all the time so I imagine over_voltage won't help with that. I think I'll have to find a fan for it before proceeding.

@hailfinger
Copy link

Ah. In my tests with 100+ Pi 4B, I've noticed that you absolutely want a case with sufficient thermal capacity or active cooling. Try a metal case with pillars on the hot chips. The Pi will slow down heavily if it gets too hot. Try to stay below 80°C if you want any reasonable performance. For my other use case (video capture + encoding) the Pi in the default plastic case did reach >80°C within 5-10 minutes and then the encoded frame rate went from 25 fps to 0.5 fps. A good metal case lowered the peak temperature to 65°C even in a 30°C room. Decoding is generating less heat than encoding, but the advice about overheating applies in either case.

You can monitor temperature by running vcgencmd measure_temp and check if any throttling applies (should be 0x0) by running vcgencmd get_throttled .

@Malvineous
Copy link
Author

That's very interesting! I hadn't noticed much of a slowdown with the decoding (no visible framerate drop) but it was definitely throttling as the temperature was reported at 85°C which I believe is the max. Possibly it was hitting this so quickly (also 30°C ambient) that I was only ever seeing the throttled framerate.

I did buy a PoE hat for it since it had a fan on it but never got a chance to install it. I read somewhere online someone from the Pi team saying it should be impossible to damage the Pi from overheating as it will just throttle itself, so I presumed the video freezing and reboots were firmware/driver bugs rather than thermal issues, however now I'm beginning to wonder.

It's pretty good you have over 100 devices to test with, that gives a lot of valuable information. What are you using them for, out of curiosity? I'm using ffplay for video and Chromium in kiosk mode for displaying web page content, with command-line parameters to position the windows so I don't need to run a window manager.

@JamesH65
Copy link
Contributor

The 50x slowdown under throttling doesn't make a lot of sense. The ARM clock speed goes down from 1500 to 700, which is only a x2 change, so not sure where the rest of the speed drop is coming from. Odd.

As for @Malvineous issues, to me, that sounds like a Pi that is failing/defective. Getting that hot that quickly sounds odd, although 30 ambient is pretty warm. I would not expect Pi freezes under throttling conditions, it's just slows down. And since most people are not seeing the level of video lock up etc I doubt its bugs in the decoder stack. Worth trying a fan or heat shedding case. Fan more likely to help with ambient that high.

@hailfinger
Copy link

@JamesH65 The 50x slowdown in my setup may be caused by GStreamer dropping buffers which took too long to encode.

@Malvineous Each device in my fleet hosts a GStreamer-based open source bidirectional (either transmit or receive, not both at the same time) video live streaming solution (including image/text overlays for transmission) for club lectures, churches, anyone with low budget. Centrally managed (for stream start/stop) with local interaction (e.g. speaker name as text overlay, slide display, other useful stuff). Right now only churches have adopted that stuff, but a lot of them. The whole thing is nonprofit because I believe that nonprofits should not pay through the nose for remote participation when they're trying to use physical/social distancing. I plan to submit a talk about it to the next FOSDEM conference.

@Malvineous
Copy link
Author

I ordered some metal cases with "pillars" pressing down on the chips and thermal pads to fit between the Pi chips and the pillars. The first one (a box shape with grooves in the top) lowered the temperature from 85°C down to a maximum of 53°C and completely stopped the freezing problem.

The second case had a fan as well and dropped the temperature by a further 10°C to 43°C and likewise the Pi has been running for over 24 hours now without a single unexpected reboot.

So this means the problem is definitely due to overheating, and apparently the throttling not being able to throttle back enough.

I know the ARM core gets throttled when the temperature is too high, but does the VideoCore get throttled? In my case the ARM is sitting at around 22% CPU (~50% CPU on two cores, 0% on the other two cores) however I am using V4L2M2M to hardware decode two H264 streams. It would seem the H264 decoding is what is contributing most of the heat, and possibly that isn't getting throttled which is why it kept overheating?

@popcornmix
Copy link
Contributor

@Malvineous have you tried with over_voltage=2 as suggested earlier?
If your h264 silicon is borderline at default voltage, it may fail when temperature is higher.

Note: the h264 hardware creates negligible heat when compared to the arm.

@Malvineous
Copy link
Author

Sorry no I never got that far. I took the top of the heatsink case off to try it - 32°C ambient today so good day to test. I left it as-is to confirm it would reboot without any heatsink which it did, once it got to around 80°C.

I then switched it off, set over_voltage=2, then switched it back on again. Despite being off for 30 seconds or so, it still locked up on the rainbow screen. I then set over_voltage=6 and that time it booted successfully, said 76°C just after the H264 playback started, then about 15 seconds later it rebooted. When it rebooted it got past the rainbow screen but then froze on a black screen, before any kernel output appeared.

I then removed over_voltage and put the heatsink case back on, and it's been behaving normally for around an hour now, reporting just under 44°C.

I had a pedestal fan in the room for the first test, and this accidentally provided enough of a breeze to keep the temperature at around 72°C and it was behaving normally there too. It wasn't until I blocked off enough airflow for it to get above 80°C that it rebooted.

So looks like it's down to the throttling not being aggressive enough (or this unit being sensitive above 80°C rather than 85°C) and it's just overheating.

@Malvineous
Copy link
Author

Going to close this one now as it definitely looks like it was due to overheating. I suppose you could argue that it's still a bug in that the thermal throttling isn't fully working, but then you can also make the case that running it right up past the 80 degree mark is not very wise either.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants