Skip to content

SPI transfers sometimes corrupt last received byte #2200

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
webmeister opened this issue Sep 18, 2017 · 25 comments
Closed

SPI transfers sometimes corrupt last received byte #2200

webmeister opened this issue Sep 18, 2017 · 25 comments
Labels
Waiting for external input Waiting for a comment from the originator of the issue, or a collaborator. Waiting for internal comment Waiting for comment from a member of the Raspberry Pi engineering team

Comments

@webmeister
Copy link
Contributor

I use several Raspberry Pis to communicate with other devices via SPI. There, I observed a strange data corruption that I was able to reproduce with a Raspberry Pi 3 B alone, i.e. with SPI in loopback mode (MOSI connected to MISO). I did several experiments to narrow down the problem:

  1. The problem happens only for data that is received, not for data that is sent.
  2. The data on the line looks fine, my logic analyzer shows the correct data.
  3. It is always the last byte of the transfer that is affected. It seems that this byte is not received at all but instead taken from the previous transfer.
  4. Only certain transfer sizes seem to be affected (at 5 MHz). I tested 1 to 150 byte transfers and saw the problem only for 56 to 72 byte transfers and for 111 to 136 byte transfers. Notice how these are roughly the same ranges, just twice the size. For all other sizes 500k transfers of random data did not show the problem.
  5. Only certain clock frequencies seem to be affected (for 64 byte transfers). I tested 1 to 62.5 MHz (in steps of 500 kHz; with 63 MHz and beyond no reliable communication seems possible) and saw the problem only for 4 to 9.5 MHz.
  6. Raspberry Pi 2 B is affected as well.
  7. (Almost) current mainline kernel (4.13-rc2) is affected, as well as some older version (4.11-rc4).

For my experiments on the Raspberry Pi 3 (with kernel 4.13-rc2) I used the following Python script:

import spidev

spi = spidev.SpiDev()
spi.open(32766, 1)

spi.max_speed_hz = 5000000
size = 64

for iteration in range(500000):
	data_send = size * [iteration % 256]
	data_recv = spi.xfer(data_send)
	if data_send != data_recv:
		print(iteration)
		print(data_send)
		print(data_recv)
		break

When running this script, you should see that as soon as it detects a mismatch, the wrong byte belongs to the previous transfer (for example, when it transferred 64 times the value 111, the last byte received will still have the value 110 from the previous transfer).

And for reference, this is the device tree overlay used to activate spidev:

/dts-v1/;
/plugin/;

/ {
	fragment@0 {
		target-path = "/soc/gpio@7e200000";
		__overlay__ {
			spi0_pins: spi0_pins {
				brcm,pins = <9 10 11>;
				brcm,function = <4>;
			};

			spi0_cs_pins: spi0_cs_pins {
				brcm,pins = <8 7>;
				brcm,function = <1>;
			};
		};
	};

	fragment@1 {
		target-path = "/soc/spi@7e204000";
		__overlay__ {
			#address-cells = <1>;
			#size-cells = <0>;
			pinctrl-names = "default";
			pinctrl-0 = <&spi0_pins &spi0_cs_pins>;
			status = "okay";

		        spidev1: spi@1{
				compatible = "spidev";
				reg = <1>;
				spi-max-frequency = <5000000>;
			};
		};
	};
};
@pelwell
Copy link
Contributor

pelwell commented Sep 18, 2017

Interesting. Can you clarify a few points?:

  1. Is it the case that the current 4.9 kernels don't exhibit this corruption?
  2. Does your test need any external wiring other than a patch cable between GPIOs 9 and 10?
  3. Why do you use an overlay to enable SPI when "dtparam=spi=on" ought to do the same thing?

@webmeister
Copy link
Contributor Author

  1. Is it the case that the current 4.9 kernels don't exhibit this corruption?

I did not test that initially, but I have just now verified that a 4.9.50 kernel (built from 12bafda, non-mainline) on a Raspberry Pi 2 B is also affected (for 5 MHz, 64 bytes).

  1. Does your test need any external wiring other than a patch cable between GPIOs 9 and 10?

No, connecting BCM 9 and 10 (pins 19 and 21 on the header) is sufficient.

  1. Why do you use an overlay to enable SPI when "dtparam=spi=on" ought to do the same thing?

Sorry, forgot to mention that. The Raspberry Pi 3 is running an AArch64 mainline kernel and the dtparam does not seem to work there. It works fine for the Raspberry Pi 2 I used in the test above.

@pelwell
Copy link
Contributor

pelwell commented Sep 18, 2017

Thanks for the clarifications. This is a very niche issue so is going to be a low priority, but the easier it is to reproduce the more likely it is to get looked at. How long does it typically take to see a failure?

@webmeister
Copy link
Contributor Author

How long does it typically take to see a failure?

It depends. The average for all my tests is probably several ten thousand iterations with the script above.

Unfortunately, the real-world application with which I first hit the problem, seems to be more likely to see it. I use those Raspberry Pis to test kernel drivers for TPMs (tpm_tis_spi in this case). For those devices, the protocol specifies a maximum transfer size of 64 bytes, which lies perfectly in the affected range. There I see the problem with only a few hundred high-level commands (that might be split up into ~10 SPI transfers each).

Is there anything else I could test? Is there an easy way to get the Linux kernel out of the way and run code directly on the board, so that we can distinguish between a kernel bug and a hardware bug? There was an attempt to port MicroPython to a (bare metal) Raspberry Pi, but I am not sure whether this is usable yet.

@pelwell
Copy link
Contributor

pelwell commented Sep 18, 2017

My gut feel is that this is a hardware problem, but in which block? All data is transferred by DMA, and the transfer completion interrupt should only occur once the requested number of bytes has been transferred, so an apparent short transfer with old data left behind can only be explained by DMA miscounting or by a cache problem. If only the last byte is lost and the buffers are at least word aligned, a cache maintenance problem seems unlikely.

@webmeister
Copy link
Contributor Author

I have seen the error many times now, and it was always the last byte and only the last byte that was wrong.

When you are talking about DMA you mean the hardware peripheral, not the kernel driver, right? The kernel driver uses DMA only for transfers of 96 bytes and more.

So if this is indeed a hardware bug, the question is: can we fix it in software?

This reminds me of a separate problem we had with the Raspberry Pi's SPI master: Sometimes it seemed to release the CS line too early, while the last bit was still in transmission. We worked around that in the TPM driver (5cc0101), but never investigated the problem further. Should I create another issue for that?

@P33M
Copy link
Contributor

P33M commented Oct 9, 2017

Assuming this smells like a clock-crossing bug (which seems likely), can we force synchronous clocks to the SPI peripheral in the same manner as we do for the eMMC host? It may be sufficient to just run the SPI clock from the same PLL source as core clock.
Edit: the divisor is internal to the SPI peripheral and runs from the core clock, so it's synchronous.

@P33M
Copy link
Contributor

P33M commented Oct 9, 2017

One thought: with smaller data sizes the CPU is reading the FIFO rather than DMA. Is the CPU running in polled mode or irq-driven mode when reading the last byte? What happens if a (small) delay is inserted in the reading loop in bcm2835_rd_fifo(), say 1uS or so?

@webmeister
Copy link
Contributor Author

What happens if a (small) delay is inserted in the reading loop in bcm2835_rd_fifo(), say 1uS or so?

That seems to fix the issue, at least for my test of 64-byte transfers at 5 MHz. It also makes the whole test slower by about 25% (the 500k iterations of the script above need roughly 2 minutes on my Raspberry Pi 2, with the fix it's 2.5 minutes). 128-byte transfers are also fixed with this change, so there seems to be something else that prevents the driver from using DMA, even for larger transfers.

This is the patch I used:

diff --git a/drivers/spi/spi-bcm2835.c b/drivers/spi/spi-bcm2835.c
index f35cc10..73aa9c1 100644
--- a/drivers/spi/spi-bcm2835.c
+++ b/drivers/spi/spi-bcm2835.c
@@ -107,6 +107,7 @@ static inline void bcm2835_rd_fifo(struct bcm2835_spi *bs)
 
 	while ((bs->rx_len) &&
 	       (bcm2835_rd(bs, BCM2835_SPI_CS) & BCM2835_SPI_CS_RXD)) {
+		udelay(1);
 		byte = bcm2835_rd(bs, BCM2835_SPI_FIFO);
 		if (bs->rx_buf)
 			*bs->rx_buf++ = byte;

@P33M
Copy link
Contributor

P33M commented Oct 10, 2017

If adding a small delay fixes the issue in the definite polling case, then it's probably a case of FIFO pointers not getting updated before FIFO flags getting updated.

It may be sufficient to just do a dummy read (or several) of another register inside the BCM2835 SPI register block in place of the udelay() call in order to force a few cycles of idle time. Maybe the CDIV register?

@popcornmix
Copy link
Collaborator

@pelwell could there be a missing memory barrier in driver?
e.g. like b84ce14

@P33M
Copy link
Contributor

P33M commented Oct 10, 2017

https://github.com/raspberrypi/linux/blob/rpi-4.9.y/drivers/spi/spi-bcm2835.c#L94 - bcm2835_rd|wr use readl() and writel() respectively - so there should be barriers in the right places.

@pelwell
Copy link
Contributor

pelwell commented Oct 11, 2017

readl and writel on ARM use rmb and wmb respectively. Theoretically rmb should only provided synchronisation with respect to reads, and wmb to writes, but AFAICT on ARM they are stronger than that and the SPI driver should be safe in that regard.

I'm not convinced the rx_len and rx_buf handling is safe. I can't see a failure mechanism yet, but you could add a "wmb();" after *bs->rx_buf++ = byte;:

		byte = bcm2835_rd(bs, BCM2835_SPI_FIFO);
		if (bs->rx_buf) {
			*bs->rx_buf++ = byte;
			wmb();
		}
		bs->rx_len--;

@webmeister
Copy link
Contributor Author

I did some more tests:

  • ndelay(10) does not seem to be really different from udelay(1), i.e. it also fixes the problem, but the test takes roughly the same time.
  • Adding the udelay(1) call only for the last byte (prefix call with if (bs->rxlen == 1)), does not fix the problem.
  • Reading the same register twice in the condition of the while loop does not fix the problem.
  • The wmb() call does not fix the problem.

@pelwell
Copy link
Contributor

pelwell commented Oct 13, 2017

And just to be sure, does sprinkling a few mb()s do anything?

		byte = bcm2835_rd(bs, BCM2835_SPI_FIFO);
		mb();
		if (bs->rx_buf) {
			*bs->rx_buf++ = byte;
			mb();
		}
		bs->rx_len--;
		mb();

@webmeister
Copy link
Contributor Author

@pelwell: No, that does not help.

@pelwell
Copy link
Contributor

pelwell commented Oct 13, 2017

Thanks for trying.

@JamesH65
Copy link
Contributor

JamesH65 commented Dec 5, 2017

This appears to have stalled, but clearly is an issue. Anyone have any further ideas/comments/fixes?

@JamesH65 JamesH65 added Waiting for internal comment Waiting for comment from a member of the Raspberry Pi engineering team Waiting for external input Waiting for a comment from the originator of the issue, or a collaborator. labels Dec 5, 2017
@webmeister
Copy link
Contributor Author

I tried everything that has been suggested so far, but nothing seems to help. I'm open to try more ideas, if there are any. Did anyone try to reproduce the issue?

For now I'm running the SPI clock outside of the affected range, though that is not a nice solution.

@JamesH65
Copy link
Contributor

@pelwell Anything changed on SPI since the last post on here?

@pelwell
Copy link
Contributor

pelwell commented Jun 27, 2018

No, the driver hasn't really changed in the last two years.

@pelwell
Copy link
Contributor

pelwell commented Nov 15, 2018

Two users have, independently, discovered and fixed the driver bug that is almost certainly responsible for this data loss. @niklasekstrom submitted a Pull Request with a fix, but instead we are waiting for Lukas Wunner's set of improvements to be accepted upstream.

@JamesH65
Copy link
Contributor

@pelwell Was that fix ever upstreamed?

@pelwell
Copy link
Contributor

pelwell commented Jul 31, 2019

They're in 5.0 and later. It isn't a clean back-port, but could be done.

@JamesH65
Copy link
Contributor

@pelwell I guess depends on when we will be moving to 5.x

@pelwell pelwell closed this as completed Apr 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Waiting for external input Waiting for a comment from the originator of the issue, or a collaborator. Waiting for internal comment Waiting for comment from a member of the Raspberry Pi engineering team
Projects
None yet
Development

No branches or pull requests

5 participants