-
Notifications
You must be signed in to change notification settings - Fork 5.2k
SPI transfers sometimes corrupt last received byte #2200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Interesting. Can you clarify a few points?:
|
I did not test that initially, but I have just now verified that a 4.9.50 kernel (built from 12bafda, non-mainline) on a Raspberry Pi 2 B is also affected (for 5 MHz, 64 bytes).
No, connecting BCM 9 and 10 (pins 19 and 21 on the header) is sufficient.
Sorry, forgot to mention that. The Raspberry Pi 3 is running an AArch64 mainline kernel and the dtparam does not seem to work there. It works fine for the Raspberry Pi 2 I used in the test above. |
Thanks for the clarifications. This is a very niche issue so is going to be a low priority, but the easier it is to reproduce the more likely it is to get looked at. How long does it typically take to see a failure? |
It depends. The average for all my tests is probably several ten thousand iterations with the script above. Unfortunately, the real-world application with which I first hit the problem, seems to be more likely to see it. I use those Raspberry Pis to test kernel drivers for TPMs (tpm_tis_spi in this case). For those devices, the protocol specifies a maximum transfer size of 64 bytes, which lies perfectly in the affected range. There I see the problem with only a few hundred high-level commands (that might be split up into ~10 SPI transfers each). Is there anything else I could test? Is there an easy way to get the Linux kernel out of the way and run code directly on the board, so that we can distinguish between a kernel bug and a hardware bug? There was an attempt to port MicroPython to a (bare metal) Raspberry Pi, but I am not sure whether this is usable yet. |
My gut feel is that this is a hardware problem, but in which block? All data is transferred by DMA, and the transfer completion interrupt should only occur once the requested number of bytes has been transferred, so an apparent short transfer with old data left behind can only be explained by DMA miscounting or by a cache problem. If only the last byte is lost and the buffers are at least word aligned, a cache maintenance problem seems unlikely. |
I have seen the error many times now, and it was always the last byte and only the last byte that was wrong. When you are talking about DMA you mean the hardware peripheral, not the kernel driver, right? The kernel driver uses DMA only for transfers of 96 bytes and more. So if this is indeed a hardware bug, the question is: can we fix it in software? This reminds me of a separate problem we had with the Raspberry Pi's SPI master: Sometimes it seemed to release the CS line too early, while the last bit was still in transmission. We worked around that in the TPM driver (5cc0101), but never investigated the problem further. Should I create another issue for that? |
|
One thought: with smaller data sizes the CPU is reading the FIFO rather than DMA. Is the CPU running in polled mode or irq-driven mode when reading the last byte? What happens if a (small) delay is inserted in the reading loop in bcm2835_rd_fifo(), say 1uS or so? |
That seems to fix the issue, at least for my test of 64-byte transfers at 5 MHz. It also makes the whole test slower by about 25% (the 500k iterations of the script above need roughly 2 minutes on my Raspberry Pi 2, with the fix it's 2.5 minutes). 128-byte transfers are also fixed with this change, so there seems to be something else that prevents the driver from using DMA, even for larger transfers. This is the patch I used: diff --git a/drivers/spi/spi-bcm2835.c b/drivers/spi/spi-bcm2835.c
index f35cc10..73aa9c1 100644
--- a/drivers/spi/spi-bcm2835.c
+++ b/drivers/spi/spi-bcm2835.c
@@ -107,6 +107,7 @@ static inline void bcm2835_rd_fifo(struct bcm2835_spi *bs)
while ((bs->rx_len) &&
(bcm2835_rd(bs, BCM2835_SPI_CS) & BCM2835_SPI_CS_RXD)) {
+ udelay(1);
byte = bcm2835_rd(bs, BCM2835_SPI_FIFO);
if (bs->rx_buf)
*bs->rx_buf++ = byte; |
If adding a small delay fixes the issue in the definite polling case, then it's probably a case of FIFO pointers not getting updated before FIFO flags getting updated. It may be sufficient to just do a dummy read (or several) of another register inside the BCM2835 SPI register block in place of the udelay() call in order to force a few cycles of idle time. Maybe the CDIV register? |
https://github.com/raspberrypi/linux/blob/rpi-4.9.y/drivers/spi/spi-bcm2835.c#L94 - bcm2835_rd|wr use readl() and writel() respectively - so there should be barriers in the right places. |
readl and writel on ARM use rmb and wmb respectively. Theoretically rmb should only provided synchronisation with respect to reads, and wmb to writes, but AFAICT on ARM they are stronger than that and the SPI driver should be safe in that regard. I'm not convinced the rx_len and rx_buf handling is safe. I can't see a failure mechanism yet, but you could add a "wmb();" after
|
I did some more tests:
|
And just to be sure, does sprinkling a few
|
@pelwell: No, that does not help. |
Thanks for trying. |
This appears to have stalled, but clearly is an issue. Anyone have any further ideas/comments/fixes? |
I tried everything that has been suggested so far, but nothing seems to help. I'm open to try more ideas, if there are any. Did anyone try to reproduce the issue? For now I'm running the SPI clock outside of the affected range, though that is not a nice solution. |
@pelwell Anything changed on SPI since the last post on here? |
No, the driver hasn't really changed in the last two years. |
Two users have, independently, discovered and fixed the driver bug that is almost certainly responsible for this data loss. @niklasekstrom submitted a Pull Request with a fix, but instead we are waiting for Lukas Wunner's set of improvements to be accepted upstream. |
@pelwell Was that fix ever upstreamed? |
They're in 5.0 and later. It isn't a clean back-port, but could be done. |
@pelwell I guess depends on when we will be moving to 5.x |
I use several Raspberry Pis to communicate with other devices via SPI. There, I observed a strange data corruption that I was able to reproduce with a Raspberry Pi 3 B alone, i.e. with SPI in loopback mode (MOSI connected to MISO). I did several experiments to narrow down the problem:
For my experiments on the Raspberry Pi 3 (with kernel 4.13-rc2) I used the following Python script:
When running this script, you should see that as soon as it detects a mismatch, the wrong byte belongs to the previous transfer (for example, when it transferred 64 times the value 111, the last byte received will still have the value 110 from the previous transfer).
And for reference, this is the device tree overlay used to activate spidev:
The text was updated successfully, but these errors were encountered: