ttyAMA0 (PL011) driver data corruption #4453
Comments
Your results are consistent with FIFO overflow - when data is received while the UART RX FIFO is full the UART sets the overflow flag and drops the data. The inserted 00 is the point where the overflow is detected by the UART driver. Whether you appear to lose, gain or simply change data bytes depends on the number of contiguous overflowed bytes and the alignment with respect to the start of your test blocks. Overflow is always going to be a possibility unless you connect the RTS and CTS lines and enable flow control on the port (CRTSCTS). |
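As a rough illustration of the CRTSCTS suggestion, here is a minimal sketch that uses Python's standard `termios` module to turn on hardware flow control for an already-open port. The helper names `with_rtscts` and `enable_rtscts` are made up for this example, and it assumes the RTS and CTS lines are physically wired and routed to the UART; setting the flag alone does nothing useful otherwise.

```python
import termios


def with_rtscts(attrs):
    """Return a copy of a termios attribute list with CRTSCTS set in c_cflag."""
    new_attrs = list(attrs)
    # Index 2 of the list returned by tcgetattr() is c_cflag.
    new_attrs[2] = new_attrs[2] | termios.CRTSCTS
    return new_attrs


def enable_rtscts(fd):
    """Enable RTS/CTS hardware flow control on an already-open tty."""
    attrs = termios.tcgetattr(fd)
    termios.tcsetattr(fd, termios.TCSANOW, with_rtscts(attrs))
```

Typical use would be something like `fd = os.open("/dev/ttyAMA0", os.O_RDWR | os.O_NOCTTY)` followed by `enable_rtscts(fd)`; the device path is only an example.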
Thanks for your comment pelwell. That would make sense in terms of the situations when this is more/less likely to occur. Is there some way to influence that, i.e. can I compile the driver with a setting that increases the FIFO space? I have no clue where in the driver code I would start looking for the details and how to find out what to change and how. Ultimately this should not be a show stopper for me, since I am working on my own driver for the HAT I have developed. Using the tty driver is just a stopgap measure until I have implemented direct UART access in the driver proper. Handshake lines are no solution in this case, as data will fall on the floor one way or the other (I can't stop or otherwise throttle my ultimate data sources). The only upside would be that I might be able to avoid packet destruction. But my data stream is resistant to that and will automatically re-synchronize. |
The FIFO is in the UART hardware, so there is no possibility to increase its size. I'm not convinced that writing your own driver is going to help significantly - the issue here is interrupt latency. At 3Mbaud it takes approximately 100us to fill the FIFO. Once one of the cores is in the interrupt handler it will be able to drain the FIFO pretty quickly, but if it misses that 100us window then it's game over. You might be able to use the interrupt affinity mechanism to allocate the UART interrupt (and only the UART interrupt) to one of the ARM cores; take a look at /proc/interrupts and /proc/irq//smp_affinity*:
Notice how core 3 has started to service UART interrupts. I suggest you write a short script that searches through /proc/irq/*/smp_affinity_list, replacing "0-3" with "0-2", then changes smp_affinity_list of the uart-pl011 interrupt to 3. |
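The script suggested above could look roughly like the following sketch. It assumes the UART shows up in /proc/interrupts on a line containing "uart-pl011" and that the default affinity list is exactly "0-3". The function names and the `proc_root` parameter are invented for this example (the parameter only exists so the logic can be tried against a scratch directory). It has to run as root, and the settings do not survive a reboot.

```python
import glob
import os


def find_uart_irq(interrupts_text, name="uart-pl011"):
    """Return the IRQ number (as a string) of the first line mentioning `name`."""
    for line in interrupts_text.splitlines():
        if name in line:
            irq = line.split(":", 1)[0].strip()
            if irq.isdigit():
                return irq
    return None


def isolate_uart_irq(proc_root="/proc", uart_core="3", old="0-3", new="0-2"):
    """Move every IRQ with affinity `old` to `new`, then pin the UART IRQ to `uart_core`."""
    with open(os.path.join(proc_root, "interrupts")) as f:
        uart_irq = find_uart_irq(f.read())
    if uart_irq is None:
        raise RuntimeError("no uart-pl011 line found in interrupts")

    pattern = os.path.join(proc_root, "irq", "*", "smp_affinity_list")
    for path in sorted(glob.glob(pattern)):
        irq = os.path.basename(os.path.dirname(path))
        if irq == uart_irq:
            continue
        try:
            with open(path) as f:
                current = f.read().strip()
            if current == old:
                with open(path, "w") as f:
                    f.write(new + "\n")
        except OSError:
            # Some interrupts refuse affinity changes; skip those.
            continue

    # Errors here are not swallowed: failing to pin the UART IRQ matters.
    uart_path = os.path.join(proc_root, "irq", uart_irq, "smp_affinity_list")
    with open(uart_path, "w") as f:
        f.write(uart_core + "\n")
    return uart_irq
```

Calling `isolate_uart_irq()` with the defaults applies the change to the live system; checking /proc/interrupts afterwards should show the UART count increasing only in the chosen core's column.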
Thanks for the suggestion. I'll have a look at that. At least at this point I know what I am up against. |
Note that flow control might help you if the sending device has some additional buffering beyond whatever UART FIFOs it might have - the problem for the receiving Pi isn't that it can't keep up with the overall data rate but that it is occasionally slightly too late in responding. |
In fact it's better than that - the latency tolerance of a flow controlled system is based on the total size of the FIFOs on both sides (for the direction in question, i.e. source TX FIFO and sink RX FIFO), so even a small TX FIFO should help provided the source side is writing in small chunks rather than a whole FIFO at a time. |
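To put numbers on the latency argument, here is a small back-of-the-envelope sketch. The FIFO depths are parameters and the values in the usage note are assumptions, not figures from this thread (the thread only gives roughly 100us at 3Mbaud). It also assumes 10 bits on the wire per character (8N1).

```python
def char_time_us(baud, bits_per_char=10):
    """Time on the wire for one character, in microseconds (8N1 = 10 bits)."""
    return bits_per_char * 1_000_000 / baud


def latency_budget_us(baud, rx_fifo, tx_fifo=0, bits_per_char=10):
    """How long the receiver may leave the UART unserviced before data is lost.

    Without flow control only the receiver's RX FIFO counts (tx_fifo=0).
    With flow control, the sender's TX FIFO absorbs data too, so the two
    depths add up for the direction in question.
    """
    return (rx_fifo + tx_fifo) * char_time_us(baud, bits_per_char)
```

With an assumed RX depth of 32 characters, `latency_budget_us(3_000_000, 32)` comes out at about 107us, in line with the figure quoted earlier; adding a hypothetical 16-character TX FIFO on the sending side with flow control enabled raises it to 160us.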
Yes, I hear you. As almost always, the real world is a bit more involved though. First of all, I am writing the driver not because I think I can do UART better than someone else. There are numerous technical reasons why I need that driver. For one thing, one of my ultimate data sources/sinks on the HAT is CAN. I need to present that on the Pi as SocketCAN. That alone demands that I implement my interface as a driver. Besides, the driver is written and working. Using the tty driver is just a stopgap measure until I have the UART interface fully integrated in the driver. At the moment it's actually quite awkward routing all the data through user space. In fact, I suspect that is part of the FIFO problem. Knowing now what the problem is, perhaps I can improve on that. But mostly this path is a bit of a time sink. I am actually surprised that the tty driver is not using DMA to clear out the FIFO. So far, in my mind, the FIFO was in RAM. That's why I was assuming it could be enlarged. The processing on my HAT is implemented on an M3 core. I am not using interrupts there, it's all handled with DMA. And yes, I can make that a bit larger. I had in mind to use DMA in interfacing with the UART. One thing I was thinking of was implementing a sort of XON/XOFF software handshake using my exchange protocol. But that feels more like a last resort. It might still end up dropping data. The only upside would be that it would drop complete transmission packets, instead of randomly destroying them. |
In fact, I am just thinking I might be able to tune the way I am transmitting data on the HAT to make the job a bit easier for the Pi by modifying how I do bursts. I have to see... |
I am experiencing a similar issue on the Raspberry Pi 3B+ board running Raspberry Pi OS Lite with kernel 5.4.83-v7+.
It looks like the baud rate or the UART clock rate of ttyAMA0 becomes unstable. Is there any workaround? Any information would be appreciated. Test programs used for reproducing the issue:
Sender program:
Receiver program:
Log from the receiver program when the issue was reproduced:
|
ttyAMA0 has a dedicated clock, so doesn't suffer from variable baud rates. What is the sending device? Do you have a ground connection between the two? |
Thanks for your comment pelwell. The sending device is also a RPi 3B+ board, and a ground is connected between the two boards. I just tested the same programs on a CM3 I/O board where ttyAMA0 (pin14, pin15) and ttyS0 (pin32, pin33) are wired together; data corruption occurred when sending data from ttyS0 to ttyAMA0 but did not occur if the direction was reversed. |
The clock of UART1 (which appears as ttyS0) is dependent on the VPU core clock. The firmware should prevent the core clock from changing if UART1 is enabled in the Device Tree (
Add those lines to the config.txt file on the devices where ttyS0 is being used. |
Thanks for your suggestion about locking the clock of ttyS0, but that did not solve the problem. It occurs even if not involving ttyS0, i.e. sending data from an external RPi through ttyAMA0 or ttyUSB0 (a USB serial adapter), of course a ground is connected. It would be appreciated if you could try the test programs and see what's happening. |
I'm seeing a problem similar to what @ktgoto reported in #4453 (comment) This is with kernel 6.1.21-v8+ on a Raspberry Pi CM4. My scenario is that I have the UART output of a GPS board connected to
the output starts with some corrupted data. The amount of corrupted data varies from time to time: it can be anything from 0 to hundreds of bytes. But it always eventually recovers and starts delivering correct data. If I do
before and after, I can see the number of framing errors has increased when there is corruption. The problem occurs when I have The board I'm seeing it on right now is the TimeBeat LEA-M8F module: https://store.timebeat.app/products/gnss-raspberry-pi-cm4-module?variant=42280855699627 (from @lasselj) I have connected lots of other modules without seeing this problem. But the problem is not specific to this board. I've seen it with a cheap $10 GPS also at a speed of 9600. If I change the speed to 38400, the problem happens but significantly less frequently. Here's an example of corrupted output (with
The |
|
@pelwell I tried doing as you suggest, and I get the same problem. When doing one stty command, then multiple cat commands, each of the cat commands can produce corrupted output at the start. |
I wonder if something else is holding the UART open. Try this:
|
|
It's strange that ttyS0 is better than ttyAMA0 - in my experience the break detection (and therefore synchronisation) is better on ttyAMA0. It's also strange that you are the first person with this problem. A number of Timebeat devs have been active here - @lasselj, @chronosfin - and I'm sure they would have reported this issue if they'd come across it. Have you looked at the signal on a scope or logic analyser? It should be echoed to pins 8 & 10 on the 40-pin header. I was going to suggest trying to starting the
Is there any corruption when invoked this way? Feel free to change the way you schedule the various tasks - multiple parallel shells, background jobs etc. - as long as the order is the same. One difference running at 38400 baud is going to be the larger gaps between the characters (or groups of characters). The larger the gap the easier it is for UART to detect a break and resynchronise, and I'm wondering how much downtime there is in the 9600 baud case - it may only be between lines (NMEA sentences). I don't think it's coincidence that the first clear character is the I suspect you are expected to drain everything up to the first |
The Bug
Data received on ttyAMA0 (on the Raspberry Pi 4B GPIO header) is corrupted. The OS is Raspbian with kernel version 5.4.72-v7l+ (Linux raspberrypi 5.4.72-v7l+ #1356 SMP Thu Oct 22 13:57:51 BST 2020 armv7l GNU/Linux). I am seeing this at high data rates (3Mb/s) but I am sure it is not a hardware issue (see details below). The corruption does not consist of changed values due to flipped bits or the like; it involves missing and added bytes.
To reproduce
Send data to the serial port. This may be done from an external device or simply by looping the port back to itself. Compare the received data against the original data. I have verified the signal quality with an oscilloscope and do not see any issues. The signal is very clean. I also use a USB serial adapter to simultaneously receive from the same line. The USB adapter (FTDI based) receives the data clean, while ttyAMA0 shows the problem. Note that the occurrence is unpredictable. Sometimes it shows up after only a few transmissions, sometimes it requires well over 100k transmissions before it shows.
The occurrence seems to be much higher with a high transmission frequency (every few ms) and data packets of varying sizes (random data). But this might simply be a matter of time. Other activity on the machine also seems to have an influence: the more activity, the more likely data corruption is to occur. It is my impression that a race condition in the driver may be involved.
In situations where transmissions are single byte or very short, this might look like the transmission was not received or the data byte was corrupted and might be mistaken by an observer for a hardware issue.
I am attaching C source code for an application that can run the test, including simultaneous reception from a second serial port. If started with -s on the command line it will automatically stop upon encountering the problem and produce a dump of the faulty data. To run the test, jumper the serial port on the GPIO adapter to loop it back to itself. To verify good data reception, connect a 3.3V logic level serial/USB adapter to the Raspberry Pi and list the path for that adapter (probably /dev/ttyUSB0) on the command line.
Expected behavior
Data is received exactly as sent.
Actual behavior
The received data may be corrupted in all sorts of ways. There may be more data than sent (quite rare), there may be less data than sent with changed bytes (quite frequent), or there might be the same amount of data but with changed bytes.
A pattern as the following is quite frequent.
^ indicates the location where the (first) error was detected.
T |44|AE|93|2E|F7|49|47|90|1C|48|DD|BE|06|40|82|48|85|43|BF|16|DD|27|79|53|9F|08|6D|F4|9D|03|74|61|30|87|0E|A6|CF|54|B5|6A|
P |44|AE|93|2E|F7|49|47|90|1C|48|DD|BE|06|40|82|48|85|43|BF|16|DD|27|79|53|9F|08|6D|F4|9D|03|74|61|87|00|0E|A6|CF|54|B5|6A|
R |44|AE|93|2E|F7|49|47|90|1C|48|DD|BE|06|40|82|48|85|43|BF|16|DD|27|79|53|9F|08|6D|F4|9D|03|74|61|30|87|0E|A6|CF|54|B5|6A|
^ (first mismatch in P at byte offset 32, counting from 0)
Note that 30h is missing; 87h is correct but appears in place of the missing 30h, followed by a 00h that does not exist in the original data. After that the data stream continues without error. The reference data received from the USB adapter matches the transmitted packet. This pattern (a missing byte, then the next byte followed by 00h) shows up a lot, also in combination with other types of corruption, such as fewer total bytes received.
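For anyone wanting to check their own dumps for this signature, here is a small sketch that finds the first mismatch and tests for the "dropped byte(s), surviving byte, inserted 00h, then back in sync" pattern. The function names are invented for this example and are not taken from UARTTest.c; the `T` and `P` strings are the transmitted and primary-received lines from the dump above.

```python
def parse_dump(line):
    """Parse a dump row such as '|44|AE|93|' into a list of byte values."""
    return [int(tok, 16) for tok in line.strip().strip("|").split("|") if tok]


def first_mismatch(sent, received):
    """Return the index of the first differing byte, or None if identical."""
    for i, (a, b) in enumerate(zip(sent, received)):
        if a != b:
            return i
    if len(sent) != len(received):
        return min(len(sent), len(received))
    return None


def dropped_before_zero(sent, received, max_drop=16):
    """Return how many bytes were dropped if the dump matches the pattern
    'N bytes lost, next byte intact, 00 inserted, rest in sync'; else None."""
    i = first_mismatch(sent, received)
    if i is None or i + 1 >= len(received) or received[i + 1] != 0x00:
        return None
    for dropped in range(1, max_drop + 1):
        j = i + dropped  # index in `sent` of the byte that survived
        if j < len(sent) and sent[j] == received[i] \
                and sent[j + 1:] == received[i + 2:]:
            return dropped
    return None


T = "|44|AE|93|2E|F7|49|47|90|1C|48|DD|BE|06|40|82|48|85|43|BF|16|DD|27|79|53|9F|08|6D|F4|9D|03|74|61|30|87|0E|A6|CF|54|B5|6A|"
P = "|44|AE|93|2E|F7|49|47|90|1C|48|DD|BE|06|40|82|48|85|43|BF|16|DD|27|79|53|9F|08|6D|F4|9D|03|74|61|87|00|0E|A6|CF|54|B5|6A|"

sent, received = parse_dump(T), parse_dump(P)
print(first_mismatch(sent, received))       # 32
print(dropped_before_zero(sent, received))  # 1
```

This is only a heuristic for the one pattern described here; a dump where the 00h appears without any byte going missing will return None.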
Logs
The following logs show a few select scenarios.
Error showed after 119845 packet transmissions. In this case data was received from the tty driver in two fragments.
Transmission count = 119845
Error count = 1
Max skip count = 16
Error type = 5
Bytes transmitted = 45
Primary received = 44
Reference received = 45
Data:
T |91|D8|C6|4B|F2|8E|50|0A|19|B0|F2|5B|D0|C8|7C|50|B4|74|30|8F|03|39|83|6E|13|8E|97|FA|E8|4F|A4|F8|A6|E9|C3|99|F6|92|22|10|42|14|6A|92|5B|
P |91|D8|C6|4B|F2|8E|50|0A|19|B0|F2|5B|D0|C8|7C|50|B4|74|30|8F|03|39|83|6E|13|8E|97|FA|E8|4F|A4|F8|C3|00|99|F6|92|22|10|42|14|6A|92|5B|
R |91|D8|C6|4B|F2|8E|50|0A|19|B0|F2|5B|D0|C8|7C|50|B4|74|30|8F|03|39|83|6E|13|8E|97|FA|E8|4F|A4|F8|A6|E9|C3|99|F6|92|22|10|42|14|6A|92|5B|
^ (first mismatch in P at byte offset 32, counting from 0)
Primary channel fragments
|91|D8|C6|4B|F2|8E|50|0A|19|B0|F2|5B|D0|C8|7C|50|B4|74|30|8F|03|39|83|6E|13|8E|97|FA|E8|4F|A4|F8|C3|00|99|
|F6|92|22|10|42|14|6A|92|5B|
Transmission count = 118443
Error count = 1
Max skip count = 16
Error type = 5
Bytes transmitted = 59
Primary received = 57
Reference received = 59
Data:
T |E8|8B|82|B6|24|D9|4E|2C|38|C3|B8|DE|C1|7C|57|43|0B|50|BF|2A|E5|2E|B9|BD|BE|15|1B|39|08|39|23|EF|43|A4|26|67|FD|F2|12|35|B5|C9|93|76|C4|E9|B8|CF|39|F6|F8|1F|25|31|5B|62|46|75|1A|
P |E8|8B|82|B6|24|D9|4E|2C|38|C3|B8|DE|C1|7C|57|43|0B|50|BF|2A|E5|2E|B9|BD|BE|15|1B|39|08|39|23|EF|43|A4|26|67|FD|F2|12|35|B5|C9|93|76|C4|E9|B8|CF|39|F6|F8|5B|00|62|46|75|1A|
R |E8|8B|82|B6|24|D9|4E|2C|38|C3|B8|DE|C1|7C|57|43|0B|50|BF|2A|E5|2E|B9|BD|BE|15|1B|39|08|39|23|EF|43|A4|26|67|FD|F2|12|35|B5|C9|93|76|C4|E9|B8|CF|39|F6|F8|1F|25|31|5B|62|46|75|1A|
^ (first mismatch in P at byte offset 51, counting from 0)
Primary channel fragments
|E8|8B|82|B6|24|D9|4E|2C|38|C3|B8|DE|C1|7C|57|43|0B|50|BF|
|2A|E5|2E|B9|BD|BE|15|1B|39|08|39|23|EF|43|A4|26|67|FD|F2|12|35|B5|C9|93|76|C4|E9|B8|CF|39|F6|F8|5B|00|62|46|75|1A|
Transmission count = 1
Error count = 1
Max skip count = 0
Error type = 1
Bytes transmitted = 29
Primary received = 30
Reference received = 29
Data:
T |98|A3|56|54|BF|F2|FD|FA|7A|6C|53|15|14|EA|E3|2E|52|8F|20|57|09|58|28|A8|06|D5|D1|53|83|
P |98|00|A3|56|54|BF|F2|FD|FA|7A|6C|53|15|14|EA|E3|2E|52|8F|20|57|09|58|28|A8|06|D5|D1|53|83|
R |98|A3|56|54|BF|F2|FD|FA|7A|6C|53|15|14|EA|E3|2E|52|8F|20|57|09|58|28|A8|06|D5|D1|53|83|
^ (first mismatch in P at byte offset 1, counting from 0)
UARTTest.c.txt