4.14.27-v7+ / 3+ VLAN hw csum failure #2458
Comments
I'm using VLANs as well on up-to-date Raspbian and think I'm affected by the same issue. Additional symptoms include high CPU usage (because of an interrupt storm?) and a very bad network connection. Thank you for sharing your workarounds. I will try to implement them tomorrow on a couple of 3B+'s that are kind of useless because of this issue :( If there are any tests I can run or any information I can contribute, please let me know. Kind regards, Ruben
top on rb3plus system:
2796 packets transmitted, 260 packets received, 90.7% packet loss 2743 packets transmitted, 773 packets received, +1 duplicates, 71.8% packet loss (ping stats from other systems on network) |
Same problem here. Thanks for sharing the workaround. But I cannot confirm the workaround. Setting MTU to 1496 on the vlan still produces hw csum error. |
Sorry, I haven't got a VLAN environment set up at the moment (I'll try to remember next week), but in looking at the throughput issues I've stumbled across the setup for the watchdog to abort frames that are too long.
to give a small amount more headroom before frames get aborted. Decreasing the MTU likewise returns you to a default packet size, which is therefore within the standard timeout. |
Upgraded my raspbian-stretch-lite image to kernel version 4.14.31-v7+. But that does not change the problem. |
I wouldn't expect it to seeing as the issue is reported on 4.14.27 and there have been no significant changes between that and 4.14.31. |
same problem. works fine with raspberry pi 3 b but doesn't work with 3 b+ |
The frame abort timeout being set by lan78xx_set_rx_max_frame_length didn't account for any VLAN headers, resulting in very low throughput if used with tagged VLANs. Use VLAN_ETH_HLEN instead of ETH_HLEN to correct for this. See raspberrypi#2458 Signed-off-by: Dave Stevenson <[email protected]>
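To make the arithmetic behind this commit message concrete, here is a minimal user-space sketch. The three header-length constants match the values in the Linux headers; `max_frame_len` and `tagged_frame_fits` are hypothetical helpers for illustration only, not driver functions, and the FCS is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Same values as <linux/if_ether.h> and <linux/if_vlan.h>, redefined
 * here so the sketch builds in user space. */
#define ETH_HLEN      14  /* dst MAC + src MAC + EtherType */
#define VLAN_HLEN      4  /* one 802.1Q tag: TPID + TCI */
#define VLAN_ETH_HLEN 18  /* Ethernet header including one VLAN tag */

static_assert(VLAN_ETH_HLEN == ETH_HLEN + VLAN_HLEN,
              "a single VLAN tag adds 4 bytes to the header");

/* Hypothetical helper: longest frame (FCS ignored) for a given MTU
 * and link-layer header length. */
static size_t max_frame_len(size_t mtu, size_t hdr_len)
{
    return mtu + hdr_len;
}

/* Hypothetical helper: non-zero if a VLAN-tagged frame carrying a
 * full-MTU payload stays within the given frame length limit. */
static int tagged_frame_fits(size_t mtu, size_t limit)
{
    return max_frame_len(mtu, VLAN_ETH_HLEN) <= limit;
}
```

With a limit derived from `ETH_HLEN` at MTU 1500 (1514 bytes), a full-size tagged frame needs 1518 bytes and does not fit. Lowering the VLAN MTU to 1496 gives 1496 + 18 = 1514, which is consistent with the MTU workaround reported for the stalled transfers.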
I've confirmed to my own satisfaction that the timeout I suspected was to blame. |
See: raspberrypi/linux#2458 kernel: Revert lan78xx: Simple patch to prevent some crashes kernel: lan78xx: Connect phy early kernel: lan78xx: Don't reset the interface on open See: raspberrypi/linux#2437 See: raspberrypi/linux#2442 See: raspberrypi/linux#2457 firmware: clockman: Don't use OSC for pixel clock See: https://www.raspberrypi.org/forums/viewtopic.php?f=29&t=24679&start=150#p1297298
Latest rpi-update kernel has a potential fix for this issue. Please test. |
Thanks for the fix. I've upgraded to kernel 4.14.32-v7+ and tested it on a Pi 3B+. Part of the problem seems to be fixed: the frequency of the error is very much reduced, but it still appears from time to time. I found that I can trigger the error by pinging the VLAN interface from the same VLAN. I see no dropped packets or transmit errors on the interface. This is the dmesg output for an error that was not triggered by ping:
|
@kgottschalk Can you try @sinistermidget Could you retest under your original conditions please? |
i have the same behavior as @kgottschalk with |
for some reason dns does not work over the vlan interface |
I confirm that I just run a plain ping, with no other options and the standard 64-byte packet size. Both incoming and outgoing pings produce the hw csum error with the dmesg output listed above. The interface does not report dropped packets, and ping does not report packet loss. eth0.40: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 Communication over the VLAN seems to be broken. I have phones connected to the VLAN and they are not able to register with the server. It's probably the same reason why @tilosp found DNS not working over the VLAN. |
Can you try disabling IPv6 and retesting? The 3B LAN chip had an issue with IPv6 checksum offload when IPv4 was OK. |
I've disabled IPv6. No change. The ping still triggers the hw csum error. eth0.40: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 [ 221.483287] eth0.40: hw csum failure |
I have disabled ipv6 using |
Thanks for testing - it was just a hunch. Now people are reporting no DNS (UDP) or ping (ICMP). Were those working before and we've got a regression by increasing the timeout, or were they failing before and nobody noticed? If a regression then I'm very puzzled by it. I tested with iperf3 over a tagged VLAN (as well as untagged as a second interface) to a Pi3B and got ~92Mbit/s happily. Admittedly I was on static IP addresses and iperf uses TCP, but to get nothing is bizarre. Is DHCP working? Anyway, I'll investigate more tomorrow. |
I tried to disable all csum offloads using the ethtool. |
I think there were two problems that are specific to VLANs: a high rate of dropped packets and the hw csum error. It could be that they have different causes. @sinistermidget opened the issue and reported that a reduced MTU of 1496 is a workaround. In @fuxjezz's post I see that he experienced packet loss on the VLAN interface. His ping output shows a 70% packet loss. It seems that this problem was solved by using VLAN_ETH_HLEN instead of ETH_HLEN. I don't see any dropped packets on my VLAN interface anymore. But the hw csum error must be caused by something else. I wrote in my first post that I'm not able to confirm that a reduced MTU works as a workaround. I've always tested with ping from a host connected to the same VLAN and I always saw a hw csum error for each ping packet. This problem still persists. Ping packets are only 64 bytes + header, so with these we never hit the MTU. The same goes for DNS packets. I'm just guessing: maybe the VLAN header part is left in the payload of the packet. That would explain the checksum error. |
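The guess above can be illustrated with a small sketch: if a ones' complement sum (the RFC 1071 style used for IP, UDP, TCP and ICMP checksums) starts 4 bytes early on a tagged frame, the extra words change the result. This only demonstrates the hypothesis; it says nothing about what the LAN78xx actually does. The frame contents, `ones_complement_sum` and `sum_from_offset` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones' complement sum over a byte range, folded to 16 bits
 * (not inverted). Odd trailing bytes are padded with zero. */
static uint16_t ones_complement_sum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (i < len)
        sum += (uint32_t)(data[i] << 8);
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Arbitrary example frame: 12 bytes of MAC addresses, an 802.1Q tag
 * (TPID 0x8100, TCI 0x0028 = VLAN 40), EtherType 0x0800, and 8
 * payload bytes chosen only for the demonstration. */
static const uint8_t sample_tagged_frame[] = {
    0x02, 0x00, 0x00, 0x00, 0x00, 0x01,
    0x02, 0x00, 0x00, 0x00, 0x00, 0x02,
    0x81, 0x00, 0x00, 0x28,
    0x08, 0x00,
    0x45, 0x00, 0x00, 0x08, 0x12, 0x34, 0x00, 0x00,
};

static_assert(sizeof(sample_tagged_frame) == 26, "12 + 4 + 2 + 8 bytes");

/* Sum everything from a given byte offset to the end of the frame. */
static uint16_t sum_from_offset(const uint8_t *frame, size_t len, size_t off)
{
    return ones_complement_sum(frame + off, len - off);
}
```

Summing from offset 18 (just past the tagged header) gives 0x573C for this frame, while summing from offset 14 (where the payload would begin in an untagged frame) gives 0x5F64. The difference is exactly the TCI plus the inner EtherType, 0x0028 + 0x0800.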
I see a lot of errors related to csum only when the traffic goes through vlan. I guess you're right - vlan tag is in the payload. That's why we have the csum error (only for tagged packets). |
If they really have blundered on what is included in the csum calc, then that is a big blunder in the hardware design. Seeing as it is correct in other earlier chips I'd be surprised if it were wrong here. |
Has anyone verified the hardware-calculated checksum? Is there a pattern (wrong endianness, alignment issue, static values or something else)? |
I've replicated the issue now. The pings are working fine, just logging this message in the kernel log. Generally the VLANs all seem to be running fine for me.
|
So, we only have the issue with ipv6 on vlan then. |
Statement or question? Your second line then says DHCPv4 isn't working, so I'm totally confused by your post. |
There appears to be some issue in the LAN78xx where the checksum computed on a VLAN tagged packet is incorrect, or at least not in the form that the kernel is after. This is most easily shown by pinging a device via a VLAN tagged interface and it will dump out the error message and stack trace from netdev_rx_csum_fault. It has also been seen with standard TCP and UDP packets. Until this is fully understood, request that the network stack computes the checksum on packets signalled as having a VLAN tag applied. See #2458 Signed-off-by: Dave Stevenson <[email protected]>
HW_VLAN_CTAG_FILTER was partially implemented, but not fully to Linux. Complete the implementation of this. See #2458. Signed-off-by: Dave Stevenson <[email protected]>
The chip supports stripping the VLAN tag and reporting it in metadata. Implement this as it also appears to solve the issues observed in checksum computation. See #2458. Signed-off-by: Dave Stevenson <[email protected]>
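For readers unfamiliar with what "stripping the VLAN tag and reporting it in metadata" means, here is a software sketch of the same idea: remove the 4-byte 802.1Q tag from the frame and hand back the TCI separately. The chip does this in hardware; `strip_vlan_tag` is a hypothetical helper, not code from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN      6       /* bytes per MAC address */
#define ETH_P_8021Q   0x8100  /* TPID of an 802.1Q C-tag */
#define VLAN_HLEN     4       /* TPID + TCI */
#define VLAN_VID_MASK 0x0fff  /* low 12 bits of the TCI are the VLAN ID */

/* Hypothetical helper: if the frame carries an 802.1Q tag, remove it
 * in place, store the TCI in *tci and return the shortened length.
 * Otherwise return len unchanged and leave *tci untouched. */
static size_t strip_vlan_tag(uint8_t *frame, size_t len, uint16_t *tci)
{
    const size_t tag_off = 2 * ETH_ALEN;
    uint16_t tpid;

    assert(frame != NULL && tci != NULL);

    /* Need both MACs, the tag, and the inner EtherType. */
    if (len < tag_off + VLAN_HLEN + 2)
        return len;

    tpid = (uint16_t)((frame[tag_off] << 8) | frame[tag_off + 1]);
    if (tpid != ETH_P_8021Q)
        return len;

    *tci = (uint16_t)((frame[tag_off + 2] << 8) | frame[tag_off + 3]);

    /* Close the gap so the inner EtherType follows the MAC addresses. */
    memmove(frame + tag_off, frame + tag_off + VLAN_HLEN,
            len - tag_off - VLAN_HLEN);
    return len - VLAN_HLEN;
}
```

After stripping, the frame looks like an ordinary untagged frame, and the VLAN ID is available as `tci & VLAN_VID_MASK`.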
With HW_VLAN_CTAG_RX enabled we don't observe the checksum issue, so amend the workaround to only drop back to s/w checksums if VLAN offload is disabled. See #2458. Signed-off-by: Dave Stevenson <[email protected]>
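The decision described by the checksum workaround commits can be summarised as a small truth table. The sketch below is an assumption-laden illustration: the `rx_info` struct, the `csum_mode` enum and `choose_csum_mode` are hypothetical names, and the real driver works on descriptor bits and netdev feature flags rather than booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for "let the stack verify the checksum"
 * versus "trust the value computed by the NIC". */
enum csum_mode {
    CSUM_SOFTWARE,
    CSUM_HARDWARE,
};

/* Hypothetical per-frame summary of the conditions the commits mention. */
struct rx_info {
    bool hw_csum_enabled;    /* RX checksum offload switched on */
    bool vlan_tagged;        /* frame arrived with an 802.1Q tag */
    bool vlan_strip_enabled; /* HW_VLAN_CTAG_RX (tag stripping) switched on */
};

/* Fall back to software checksums for tagged frames unless the
 * hardware strips the tag, as described in the commit messages. */
static enum csum_mode choose_csum_mode(const struct rx_info *rx)
{
    assert(rx != NULL);

    if (!rx->hw_csum_enabled)
        return CSUM_SOFTWARE;
    if (rx->vlan_tagged && !rx->vlan_strip_enabled)
        return CSUM_SOFTWARE;
    return CSUM_HARDWARE;
}
```

Untagged traffic keeps hardware checksums, and tagged traffic does too once tag stripping is enabled. Only the combination "tagged and not stripped" drops back to software.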
BugLink: http://bugs.launchpad.net/bugs/1784025 The frame abort timeout being set by lan78xx_set_rx_max_frame_length didn't account for any VLAN headers, resulting in very low throughput if used with tagged VLANs. Use VLAN_ETH_HLEN instead of ETH_HLEN to correct for this. See raspberrypi/linux#2458 Signed-off-by: Dave Stevenson <[email protected]> (cherry picked from commit 2259b7a64d71f27311a19fd7a5bed47413d75985) Signed-off-by: Paolo Pisati <[email protected]> Acked-by: Stefan Bader <[email protected]> Acked-by: Kleber Sacilotto de Souza <[email protected]> Signed-off-by: Kleber Sacilotto de Souza <[email protected]>
Adding a VLAN to eth0 and then putting any traffic over it results in the following error regularly repeating:
Using ethtool to turn off hw csum offload as a workaround stops the message from reappearing.
Furthermore, attempting to copy a large file over the VLAN interface causes scp to stall. Adjusting the interface's MTU from 1500 to 1496 as a workaround resolves that issue.
There are no issues using the same SD card on an original Pi 3B (no +).
The issue has been replicated on two separate 3B+ boards using multiple power supplies, including the newly sanctioned 2.5A model.
These results have been replicated with the following system configurations: