mmc0: timeout waiting for hardware interrupt. #3446

Closed
olifre opened this issue Feb 2, 2020 · 10 comments
Labels
Close within 30 days Issue will be closed within 30 days unless requested to stay open

Comments

@olifre
Contributor

olifre commented Feb 2, 2020

Describe the bug
Randomly, I encounter this error message followed by a lot of debug traces. Just now, this corrupted the FS.

To reproduce
It happens randomly, not even bound to high load.

System

$ vcgencmd version
Sep 24 2019 17:37:19 
Copyright (c) 2012 Broadcom
version 6820edeee4ef3891b95fc01cf02a7abd7ca52f17 (clean) (release) (start)

Logs
mmcerr.txt

Additional context
The SD card is 3 years old now, it could be that the card is just starting to fail (or the 3 year old power supply, but nothing concerning power supply shows up in dmesg).
I hope the logs are more conclusive to an expert to identify whether this is a HW failure or a kernel bug.

@pelwell
Contributor

pelwell commented Feb 2, 2020

The trace shows two separate 10 second timeouts waiting to write a group of sectors. It's impossible to know what the cause was - whether the card failed, or whether the DMA operation somehow went wrong.

What you can do is attempt to rule out DMA by disabling it and seeing if the problem recurs. Try adding dtparam=sd_pio_limit=999999, a ridiculously large value that will force all transactions of fewer sectors (i.e. all transactions) to use PIO instead of DMA. You may notice a performance penalty, but it shouldn't be awful.
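As a rough illustration of the threshold described above (a sketch of the described behaviour only, not the driver's actual code; `uses_pio` and its parameters are hypothetical names):

```python
# Hypothetical model of the sd_pio_limit threshold as described in this
# comment. This is NOT the driver's actual code.


def uses_pio(sector_count: int, pio_limit: int) -> bool:
    """Return True if a transfer of `sector_count` sectors would use PIO.

    Assumption: transfers with fewer sectors than `pio_limit` use PIO,
    everything else uses DMA.
    """
    return sector_count < pio_limit


if __name__ == "__main__":
    huge_limit = 999999  # the value suggested above
    for sectors in (1, 8, 1024, 65535):
        mode = "PIO" if uses_pio(sectors, huge_limit) else "DMA"
        print(f"{sectors:>6} sectors -> {mode}")
```

With a limit that large, every realistic transfer size falls below it, so DMA is effectively taken out of the picture for the test.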

@olifre
Contributor Author

olifre commented Feb 2, 2020

@pelwell Thanks for the insight!
In an attempt to rule out the card being faulty, I have taken it out and done a full read with ddrescue in a PCIe SD card reader in a Linux machine. While everything was readable, read speed dropped from the usual 60 MB/s to 2 MB/s during prolonged periods. So I tend to indeed blame the card.

On rereading the card, these speed drops became less frequent (almost gone). Assuming some intelligence in the microSD (evo+, 64 GB), maybe that means it reallocated sectors? Is such behaviour known?
If so, feel free to close, then it's clear this is not a kernel bug at all.
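For anyone who wants to repeat this kind of check without ddrescue, here is a minimal sketch of a sequential read that records per-chunk throughput. The chunk size and the "slow" threshold are arbitrary assumptions, and `scan_read_speed` / `slow_regions` are hypothetical helper names, not part of any tool mentioned here.

```python
import time

CHUNK_BYTES = 4 * 1024 * 1024  # ASSUMPTION: 4 MiB per read
SLOW_MB_PER_S = 10.0           # ASSUMPTION: arbitrary "slow" threshold


def scan_read_speed(path, chunk_bytes=CHUNK_BYTES):
    """Read `path` sequentially; return a list of (offset, MB/s) per chunk."""
    results = []
    offset = 0
    # buffering=0 avoids Python-level buffering; the OS page cache can still
    # make repeated reads look faster than the medium really is.
    with open(path, "rb", buffering=0) as f:
        while True:
            start = time.perf_counter()
            data = f.read(chunk_bytes)
            elapsed = max(time.perf_counter() - start, 1e-9)
            if not data:
                break
            results.append((offset, len(data) / elapsed / 1e6))
            offset += len(data)
    return results


def slow_regions(results, threshold=SLOW_MB_PER_S):
    """Return only the chunks that were read slower than `threshold` MB/s."""
    return [(off, speed) for off, speed in results if speed < threshold]
```

Pointing it at a block device (a hypothetical `/dev/sdX`) usually needs root; pointing it at a large file works for a quick trial.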

@pelwell
Contributor

pelwell commented Feb 2, 2020

SD cards are meant to have a number of spare sectors to allow faulty ones to be mapped out, so some kind of recovery might be possible. If we don't hear from you in a month or so we'll close the issue.

@pelwell pelwell added the Close within 30 days Issue will be closed within 30 days unless requested to stay open label Feb 2, 2020
@olifre
Contributor Author

olifre commented Feb 2, 2020

@pelwell Thanks, then I presume this happened. I will replace the card in any case in my production system and then perform tests on the potentially broken one, so if the issue reoccurs within the next 30 days in the same system with a new card, we know it should be software.
I had hoped the card would last longer due to regular fstrim (presuming DISCARD is propagated to the card), but maybe 3 years is already good enough.

@P33M
Contributor

P33M commented Feb 2, 2020

Discards aren't propagated to the card. They get translated into block erasures for the unallocated holes in the filesystem.

Regularly running fstrim may have unintended side effects - by forcing the card to erase sectors you're likely messing with the wear-leveling/hot-cold page algorithms that the card is using. The card's internal erase block sizes are huge compared to the size of a 4k filesystem sector, so there's a large amplification of the number of flash cells that get forcibly erased when trimming small discontiguous ranges.
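To put a number on that amplification, here is a back-of-the-envelope sketch. The 4 MiB erase block size is purely an assumption (cards differ and rarely document it), and the model is a worst case in which every erase block overlapped by a trimmed range counts as fully affected.

```python
# Worst-case amplification when small, scattered ranges are erased.

FS_BLOCK_BYTES = 4 * 1024            # 4k filesystem block, as in the comment
ERASE_BLOCK_BYTES = 4 * 1024 * 1024  # ASSUMPTION: 4 MiB erase block


def erase_amplification(trimmed_ranges, erase_block=ERASE_BLOCK_BYTES):
    """Return (bytes_requested, bytes_touched) for a list of (offset, length).

    Every erase block overlapped by a trimmed range is counted in full.
    """
    requested = 0
    touched_blocks = set()
    for offset, length in trimmed_ranges:
        if length <= 0:
            continue
        requested += length
        first = offset // erase_block
        last = (offset + length - 1) // erase_block
        touched_blocks.update(range(first, last + 1))
    return requested, len(touched_blocks) * erase_block


if __name__ == "__main__":
    # 100 isolated 4k holes, each landing in a different erase block.
    holes = [(i * ERASE_BLOCK_BYTES, FS_BLOCK_BYTES) for i in range(100)]
    requested, touched = erase_amplification(holes)
    print(f"{touched / requested:.0f}x")  # prints 1024x
```

Contiguous free space behaves much better in this model: one 4 MiB aligned range touches exactly one erase block.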

@olifre
Contributor Author

olifre commented Feb 3, 2020

@P33M Sorry for asking back on the issue tracker, will try to keep it short (feel free to direct this discussion to a forum or another place you are active in)...
Could you let me know which of the following is true:

  1. Rare fstrim (monthly?) is still expected to be beneficial to lifetime, since that's the only way to inform the card of unused sectors (i.e. can be remapped without rewrite, while cold sectors may still contain important data)? For the record, I did this weekly, since I believed in misguided(?) information in the Raspbian forums.
  2. There is no concept of "unused sectors" at all in the world of SD cards and wear-leveling is purely done by hot/coldness and moving data in any case, since all blocks are considered used?

Thanks in advance!

@P33M
Contributor

P33M commented Feb 3, 2020

Flash erase blocks generally have 3 states - erased, partially written and fully written.

  • Erased blocks have been untouched since being erased
  • Partially-written blocks have some flash pages used, and can still accept more writes
  • Full blocks can't accept any more writes, but the point at which this occurs may be less than the total number of flash pages in a block due to an effect known as Program Disturb.

The SD card's flash translation layer manages which blocks get written to via a logical-to-physical mapping and manages the reclaim of "full" blocks back to "erased" blocks by doing copy-on-write.

Doing a block erase will likely force the card into doing copy-on-write for the allocated flash pages if there are remapped sectors in there. It also counts towards the total lifetime erasure count for the underlying flash.
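The block states, the logical-to-physical mapping and the copy-before-erase step can be sketched with a toy model. Everything here is an illustrative assumption (the `ToyFTL` class, its method names, four pages per block); real card firmware is proprietary and far more complex.

```python
# Toy flash translation layer (FTL); names and sizes are hypothetical.

PAGES_PER_BLOCK = 4  # ASSUMPTION: tiny value to keep the example readable
STALE = object()     # marks a flash page whose data has been superseded


class ToyFTL:
    def __init__(self, num_blocks):
        # Each slot holds None (free), a logical page number, or STALE.
        self.blocks = [[None] * PAGES_PER_BLOCK for _ in range(num_blocks)]
        self.erase_counts = [0] * num_blocks
        self.l2p = {}  # logical page -> (physical block, slot)

    def state(self, block):
        used = sum(slot is not None for slot in self.blocks[block])
        if used == 0:
            return "erased"
        return "full" if used == PAGES_PER_BLOCK else "partial"

    def _free_slot(self, exclude=None):
        # Prefer topping up partially written blocks over opening erased ones.
        for wanted in ("partial", "erased"):
            for b in range(len(self.blocks)):
                if b != exclude and self.state(b) == wanted:
                    return b, self.blocks[b].index(None)
        raise RuntimeError("no free flash pages left")

    def _place(self, logical_page, exclude=None):
        b, s = self._free_slot(exclude)
        self.blocks[b][s] = logical_page
        self.l2p[logical_page] = (b, s)

    def write(self, logical_page):
        # Flash pages cannot be overwritten in place: mark the old copy
        # stale and put the new data somewhere else.
        if logical_page in self.l2p:
            b, s = self.l2p[logical_page]
            self.blocks[b][s] = STALE
        self._place(logical_page)

    def reclaim(self, victim):
        """Copy live pages out of `victim`, then erase it. Returns pages copied."""
        copied = 0
        for content in self.blocks[victim]:
            if content is not None and content is not STALE:
                self._place(content, exclude=victim)
                copied += 1
        self.blocks[victim] = [None] * PAGES_PER_BLOCK
        self.erase_counts[victim] += 1
        return copied


if __name__ == "__main__":
    ftl = ToyFTL(num_blocks=2)
    for page in range(4):
        ftl.write(page)          # fills block 0
    ftl.write(0)
    ftl.write(1)                 # rewrites land in block 1
    print(ftl.reclaim(0), ftl.state(0), ftl.erase_counts)
```

In this model, erasing a block that still holds live data costs extra page copies on top of the erase itself, and every erase increments that block's lifetime count.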

In older versions of the SD spec there is no such thing as a) a background operation or b) a discard operation - cards are either busy doing stuff as a result of a host-initiated command or they are idle[0], because the interface is designed to be hot-swappable. This means that all flash maintenance operations happen during a read or write command.

[0] Recent versions of the eMMC and SD spec introduce apps-class performance categories and support for background maintenance tasks - but these are intended to be implemented on hosts where the card is captive (e.g. inside a phone).
https://www.sdcard.org/press/thoughtleadership/170321Applications_in_Action_Introducing_the_Newest_Application_Performance_Class.html

@olifre
Contributor Author

olifre commented Feb 3, 2020

@P33M Very enlightening indeed!
While I believe it might be advantageous to have support for the new background maintenance tasks inside the kernel in general one day (so they can be used for effectively-captive situations such as in a Raspberry Pi), I would interpret your statements such that in general, "erase" is never beneficial for lifetime, since the card in any case does CoW of full blocks.

This would mean for an almost-full card (from the point of view of the card, i.e. after filling it with data once, never doing "erase"), data is effectively CoWed to less-used blocks when writing and there is no real gain from having many "erased blocks" available. Only now, I understand the heaviness of the write amplification of this model. That might be one of the many reasons for UFS and other standards coming up. Thanks for enlightening me!

@JamesH65
Contributor

Happy to be closed?

@olifre
Contributor Author

olifre commented Feb 13, 2020

@JamesH65 Yes, it did not happen again at least up to now, and I learnt a lot from this thread 😄.

@olifre olifre closed this as completed Feb 13, 2020
4 participants