mmc0: timeout waiting for hardware interrupt. #3446

olifre · 2020-02-02T13:19:25Z

Describe the bug
Randomly, I encounter this error message followed by a lot of debug traces. Just now, this corrupted the FS.

To reproduce
It happens randomly, not even bound to high load.

System

Which model of Raspberry Pi? Pi3B+
Which OS and version? Raspbian Buster, `Linux raspiserv 4.19.75-v7+ #1270 SMP Tue Sep 24 18:45:11 BST 2019 armv7l GNU/Linux`

$ vcgencmd version
Sep 24 2019 17:37:19 
Copyright (c) 2012 Broadcom
version 6820edeee4ef3891b95fc01cf02a7abd7ca52f17 (clean) (release) (start)

Logs
mmcerr.txt

Additional context
The SD card is 3 years old now, it could be that the card is just starting to fail (or the 3 year old power supply, but nothing concerning power supply shows up in dmesg).
I hope the logs are more conclusive to an expert to identify whether this is a HW failure or a kernel bug.


pelwell · 2020-02-02T14:48:06Z

The trace shows two separate 10 second timeouts waiting to write a group of sectors. It's impossible to know what the cause was - whether the card failed, or whether the DMA operation somehow went wrong.

What you can do is attempt to rule out DMA by disabling it and seeing if the problem recurs. Try adding dtparam=sd_pio_limit=999999, a ridiculously large value that will force all transactions of fewer sectors (i.e. all transactions) to use PIO instead of DMA. You may notice a performance penalty, but it shouldn't be awful.
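
A minimal sketch of adding that line, assuming the setting goes into `/boot/config.txt` (the usual location on Raspbian; the path and the helper name `ensure_dtparam` are illustrative assumptions, not something stated above). It only edits text in memory, so nothing changes until the result is written back:

```python
def ensure_dtparam(config_text, setting="dtparam=sd_pio_limit=999999"):
    """Return config_text with `setting` present on its own line.

    The setting is appended only if no line matches it exactly, so a
    commented-out copy (e.g. "#dtparam=...") does not count as present.
    """
    if any(line.strip() == setting for line in config_text.splitlines()):
        return config_text
    # Make sure the new setting starts on a fresh line.
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + setting + "\n"
```

Writing the returned text back to the file needs root, and the parameter only takes effect after a reboot.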

olifre · 2020-02-02T20:28:22Z

@pelwell Thanks for the insight!
In an attempt to rule out the card being faulty, I have taken it out and done a full read with ddrescue in a PCIe SD card reader in a Linux machine. While everything was readable, read speed dropped from the usual 60 MB/s to 2 MB/s during prolonged periods. So I tend to indeed blame the card.
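
A rough sketch of how such speed drops could be located without ddrescue, assuming the card shows up as a readable block device (the function names and the chunk size are illustrative assumptions). It does no error recovery, it only times sequential reads:

```python
import time


def measure_read_speed(path, chunk_size=4 * 1024 * 1024):
    """Read `path` sequentially and return a list of (offset, MB/s) samples."""
    samples = []
    offset = 0
    # buffering=0 avoids Python-level buffering; the OS page cache can still
    # make a second pass look faster than the device really is.
    with open(path, "rb", buffering=0) as dev:
        while True:
            start = time.perf_counter()
            data = dev.read(chunk_size)
            elapsed = time.perf_counter() - start
            if not data:
                break
            speed = len(data) / 1e6 / elapsed if elapsed > 0 else float("inf")
            samples.append((offset, speed))
            offset += len(data)
    return samples


def slow_regions(samples, threshold_mb_s):
    """Keep only the samples that fell below the given throughput."""
    return [(offset, speed) for offset, speed in samples if speed < threshold_mb_s]
```

With the figures above, something like `slow_regions(measure_read_speed("/dev/sdX"), 10.0)` (device name hypothetical) would list the offsets where the usual 60 MB/s collapsed towards 2 MB/s.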

Rereading the card, these speed drops became rarer (almost gone). Assuming some intelligence in the microSD (EVO+, 64 GB), maybe that means it reallocated sectors? Is such behaviour known?
If so, feel free to close, then it's clear this is not a kernel bug at all.

pelwell · 2020-02-02T20:33:49Z

SD cards are meant to have a number of spare sectors to allow faulty ones to be mapped out, so some kind of recovery might be possible. If we don't hear from you in a month or so we'll close the issue.

olifre · 2020-02-02T20:38:08Z

@pelwell Thanks, then I presume this happened. I will replace the card in any case in my production system and then perform tests on the potentially broken one, so if the issue reoccurs within the next 30 days in the same system with a new card, we know it should be software.
I had hoped the card would last longer due to regular fstrim (presuming DISCARD is propagated to the card), but maybe 3 years is already good enough.

P33M · 2020-02-02T22:27:10Z

Discards aren't propagated to the card. They get translated into block erasures for the unallocated holes in the filesystem.

Regularly running fstrim may have unintended side effects - by forcing the card to erase sectors you're likely messing with the wear-leveling/hot-cold page algorithms that the card is using. The card's internal erase block sizes are huge compared to the size of a 4k filesystem sector, so there's a large amplification of the number of flash cells that get forcibly erased when trimming small discontiguous ranges.
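
To put rough numbers on that amplification, here is a sketch assuming a 4 MiB erase block (an illustrative figure; cards rarely document their real geometry) and the worst case in which every erase block touched by a trim is erased whole. The function names are hypothetical:

```python
def worst_case_erased_bytes(ranges, erase_block_bytes):
    """Bytes of flash covered by every erase block that a trimmed range touches.

    `ranges` is a list of (start, length) byte ranges being trimmed.
    """
    touched = set()
    for start, length in ranges:
        if length <= 0:
            continue
        first = start // erase_block_bytes
        last = (start + length - 1) // erase_block_bytes
        touched.update(range(first, last + 1))
    return len(touched) * erase_block_bytes


def erase_amplification(ranges, erase_block_bytes):
    """Ratio of flash erased (worst case) to bytes actually trimmed."""
    trimmed = sum(length for _, length in ranges if length > 0)
    return worst_case_erased_bytes(ranges, erase_block_bytes) / trimmed


ERASE_BLOCK = 4 * 1024 * 1024  # assumed erase block size, not a measured value
FS_BLOCK = 4096                # 4k filesystem block

# Three 4k holes, each landing in a different erase block.
scattered = [(i * ERASE_BLOCK, FS_BLOCK) for i in range(3)]
# Three adjacent 4k holes inside a single erase block.
contiguous = [(i * FS_BLOCK, FS_BLOCK) for i in range(3)]
```

Under these assumptions `erase_amplification(scattered, ERASE_BLOCK)` comes out at 1024, while the contiguous case touches only one erase block and lands at roughly a third of that.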

olifre · 2020-02-03T00:25:07Z

@P33M Sorry for asking back on the issue tracker, will try to keep it short (feel free to direct this discussion to a forum or another place you are active in)...
Could you let me know which of the following is true:

Rare fstrim (monthly?) is still expected to be beneficial to lifetime, since that's the only way to inform the card of unused sectors (i.e. sectors that can be remapped without a rewrite, while cold sectors may still contain important data)? For the record, I did this weekly, since I believed possibly misguided information in the Raspbian forums.
There is no concept of "unused sectors" at all in the world of SD cards and wear-leveling is purely done by hot/coldness and moving data in any case, since all blocks are considered used?

Thanks in advance!

P33M · 2020-02-03T10:35:58Z

Flash erase blocks generally have 3 states - erased, partially written and fully written.

Erased blocks have been untouched since being erased
Partially-written blocks have some flash pages used, and can still accept more writes
Full blocks can't accept any more writes, but the point at which this occurs may be less than the total number of flash pages in a block due to an effect known as Program Disturb.

The SD card's flash translation layer manages which blocks get written to via a logical-to-physical mapping and manages the reclaim of "full" blocks back to "erased" blocks by doing copy-on-write.

Doing a block erase will likely force the card into doing copy-on-write for the allocated flash pages if there are remapped sectors in there. It also counts towards the total lifetime erasure count for the underlying flash.
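
The behaviour described above can be sketched with a toy model. Everything in it is an assumption made for illustration (the class name `ToyFTL`, the tiny geometry, first-fit allocation); real card firmware is undocumented, may handle erases differently, and also has to deal with effects such as Program Disturb that are ignored here:

```python
class ToyFTL:
    """Toy flash translation layer: logical pages mapped onto append-only blocks."""

    def __init__(self, num_blocks=4, pages_per_block=4):
        self.pages_per_block = pages_per_block
        # Each block lists the logical page held by every written flash page;
        # None marks a stale page whose data has been superseded.
        self.blocks = [[] for _ in range(num_blocks)]
        self.mapping = {}                    # logical page -> (block, page index)
        self.erase_counts = [0] * num_blocks
        self.copied_pages = 0                # pages moved by copy-on-write

    def state(self, block):
        used = len(self.blocks[block])
        if used == 0:
            return "erased"
        return "full" if used == self.pages_per_block else "partial"

    def _block_with_space(self, exclude=None):
        for index, pages in enumerate(self.blocks):
            if index != exclude and len(pages) < self.pages_per_block:
                return index
        raise RuntimeError("no writable block; a real FTL would reclaim one here")

    def write(self, lpn, exclude=None):
        old = self.mapping.pop(lpn, None)
        if old is not None:
            self.blocks[old[0]][old[1]] = None   # the old copy becomes stale
        target = self._block_with_space(exclude)
        self.blocks[target].append(lpn)
        self.mapping[lpn] = (target, len(self.blocks[target]) - 1)

    def discard(self, lpns):
        """Erase the blocks holding `lpns`, relocating other live pages first."""
        lpns = set(lpns)
        victims = sorted({self.mapping[l][0] for l in lpns if l in self.mapping})
        for block in victims:
            survivors = [l for l in self.blocks[block]
                         if l is not None and l not in lpns]
            for lpn in survivors:
                self.write(lpn, exclude=block)   # copy-on-write elsewhere
                self.copied_pages += 1
            self.blocks[block] = []
            self.erase_counts[block] += 1        # counts against flash lifetime
        for lpn in lpns:
            self.mapping.pop(lpn, None)
```

In this model, filling one four-page block and then discarding a single logical page from it relocates the other three pages and costs one erase cycle, which is the amplification effect in miniature.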

In older versions of the SD spec there is no such thing as a) a background operation or b) a discard operation - cards are either busy doing stuff as a result of a host-initiated command or they are idle[0], because the interface is designed to be hot-swappable. This means that all flash maintenance operations happen during a read or write command.

[0] Recent versions of the eMMC and SD spec introduce apps-class performance categories and support for background maintenance tasks - but these are intended to be implemented on hosts where the card is captive (e.g. inside a phone).
https://www.sdcard.org/press/thoughtleadership/170321Applications_in_Action_Introducing_the_Newest_Application_Performance_Class.html

olifre · 2020-02-03T17:07:02Z

@P33M Very enlightening indeed!
While I believe it might be advantageous to have support for the new background maintenance tasks inside the kernel in general one day (so they can be used for effectively-captive situations such as in a Raspberry Pi), I would interpret your statements such that in general, "erase" is never beneficial for lifetime, since the card in any case does CoW of full blocks.

This would mean for an almost-full card (from the point of view of the card, i.e. after filling it with data once, never doing "erase"), data is effectively CoWed to less-used blocks when writing and there is no real gain from having many "erased blocks" available. Only now, I understand the heaviness of the write amplification of this model. That might be one of the many reasons for UFS and other standards coming up. Thanks for enlightening me!

JamesH65 · 2020-02-13T16:21:56Z

Happy to be closed?

olifre · 2020-02-13T19:14:35Z

@JamesH65 Yes, it did not happen again at least up to now, and I learnt a lot from this thread 😄.

pelwell added the "Close within 30 days" label (issue will be closed within 30 days unless requested to stay open) on Feb 2, 2020

olifre closed this as completed Feb 13, 2020
