4.14.27-v7+ / 3+ VLAN hw csum failure #2458

Closed
sinistermidget opened this issue Mar 21, 2018 · 58 comments
Comments

@sinistermidget

sinistermidget commented Mar 21, 2018

Adding a VLAN to eth0 and then putting any traffic over it results in the following error regularly repeating:

[ 1349.736843] eth0.20: hw csum failure
[ 1349.736865] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G         C      4.14.27-v7+ #1100
[ 1349.736870] Hardware name: BCM2835
[ 1349.736904] [<8010fff8>] (unwind_backtrace) from [<8010c260>] (show_stack+0x20/0x24)
[ 1349.736922] [<8010c260>] (show_stack) from [<8076dd84>] (dump_stack+0xd4/0x118)
[ 1349.736943] [<8076dd84>] (dump_stack) from [<8067665c>] (netdev_rx_csum_fault+0x44/0x48)
[ 1349.736962] [<8067665c>] (netdev_rx_csum_fault) from [<80668f7c>] (__skb_checksum_complete+0xb4/0xb8)
[ 1349.736980] [<80668f7c>] (__skb_checksum_complete) from [<806f997c>] (icmp_rcv+0xd0/0x388)
[ 1349.736997] [<806f997c>] (icmp_rcv) from [<806bfaf4>] (ip_local_deliver_finish+0xe4/0x330)
[ 1349.737012] [<806bfaf4>] (ip_local_deliver_finish) from [<806c0360>] (ip_local_deliver+0x54/0xdc)
[ 1349.737026] [<806c0360>] (ip_local_deliver) from [<806bff7c>] (ip_rcv_finish+0x23c/0x494)
[ 1349.737038] [<806bff7c>] (ip_rcv_finish) from [<806c0704>] (ip_rcv+0x31c/0x554)
[ 1349.737053] [<806c0704>] (ip_rcv) from [<80673c24>] (__netif_receive_skb_core+0x340/0xc84)
[ 1349.737068] [<80673c24>] (__netif_receive_skb_core) from [<806767d0>] (__netif_receive_skb+0x20/0x7c)
[ 1349.737083] [<806767d0>] (__netif_receive_skb) from [<806768c4>] (process_backlog+0x98/0x148)
[ 1349.737100] [<806768c4>] (process_backlog) from [<8067aba0>] (net_rx_action+0x2e8/0x45c)
[ 1349.737116] [<8067aba0>] (net_rx_action) from [<80101694>] (__do_softirq+0x18c/0x3d8)
[ 1349.737132] [<80101694>] (__do_softirq) from [<801237b4>] (irq_exit+0xe0/0x144)
[ 1349.737149] [<801237b4>] (irq_exit) from [<801754d8>] (__handle_domain_irq+0x70/0xc4)
[ 1349.737165] [<801754d8>] (__handle_domain_irq) from [<80101504>] (bcm2836_arm_irqchip_handle_irq+0xa8/0xac)
[ 1349.737181] [<80101504>] (bcm2836_arm_irqchip_handle_irq) from [<807899bc>] (__irq_svc+0x5c/0x7c)
[ 1349.737188] Exception stack(0x80c01ef0 to 0x80c01f38)
[ 1349.737198] 1ee0:                                     00000000 0574e8c0 397c6000 00000000
[ 1349.737210] 1f00: 80c00000 80c03dcc 80c03d68 80c81ffe 00000001 80b5fa30 babffa40 80c01f4c
[ 1349.737221] 1f20: 80c04174 80c01f40 80108a6c 80108a70 60000013 ffffffff
[ 1349.737239] [<807899bc>] (__irq_svc) from [<80108a70>] (arch_cpu_idle+0x34/0x4c)
[ 1349.737255] [<80108a70>] (arch_cpu_idle) from [<80789114>] (default_idle_call+0x34/0x48)
[ 1349.737271] [<80789114>] (default_idle_call) from [<80161170>] (do_idle+0xd8/0x150)
[ 1349.737285] [<80161170>] (do_idle) from [<80161484>] (cpu_startup_entry+0x28/0x2c)
[ 1349.737299] [<80161484>] (cpu_startup_entry) from [<80782e64>] (rest_init+0xbc/0xc0)
[ 1349.737317] [<80782e64>] (rest_init) from [<80b00df8>] (start_kernel+0x3d4/0x3e0)

Using ethtool to turn off hw csum offload as a workaround stops the message from reappearing.

Furthermore, attempting to copy a large file over the VLAN interface causes scp to stall. Adjusting the interface's MTU from 1500 to 1496 as a workaround resolves that issue.

There are no issues using the same SD card on an original Pi 3B (no +).

The issue has been replicated on two separate 3B+ boards using multiple power supplies, including the newly sanctioned 2.5A model.

These results have been replicated with the following system configurations:

  • 2018-03-13-raspbian-stretch-lite image with no modifications (kernel version 4.9.80-v7+)
  • 2018-03-13-raspbian-stretch-lite after running rpi-update (kernel version 4.14.27-v7+)
  • Gentoo with kernel version 4.9.80-v7
  • Gentoo with kernel version 4.15.10-v7
@fuxjezz

fuxjezz commented Mar 24, 2018

I'm using VLANs as well on an up-to-date Raspbian and think I'm affected by the same issue. Additional symptoms include high CPU usage (because of an interrupt storm?) and a very bad network connection.
Putting the SD cards back into "regular" Pi 3Bs gives no trouble whatsoever with those same installations/SD cards.

Thank you for sharing your workarounds. I will try to implement them tomorrow on a couple of 3B+'s that are kind of useless because of this issue :(

If there are any tests or if there is any information I can contribute: please let me know.

Kind regards,

Ruben

[15610.881778] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W       4.14.29-v7+ #1101
[15610.881781] Hardware name: BCM2835
[15610.881797] [<8010fff8>] (unwind_backtrace) from [<8010c260>] (show_stack+0x20/0x24)
[15610.881810] [<8010c260>] (show_stack) from [<8076e024>] (dump_stack+0xd4/0x118)
[15610.881824] [<8076e024>] (dump_stack) from [<80676898>] (netdev_rx_csum_fault+0x44/0x48)
[15610.881839] [<80676898>] (netdev_rx_csum_fault) from [<806691b8>] (__skb_checksum_complete+0xb4/0xb8)
[15610.881853] [<806691b8>] (__skb_checksum_complete) from [<806e8680>] (tcp_v4_rcv+0x648/0xca4)
[15610.881866] [<806e8680>] (tcp_v4_rcv) from [<806bfd2c>] (ip_local_deliver_finish+0xe4/0x330)
[15610.881878] [<806bfd2c>] (ip_local_deliver_finish) from [<806c0598>] (ip_local_deliver+0x54/0xdc)
[15610.881890] [<806c0598>] (ip_local_deliver) from [<806c01b4>] (ip_rcv_finish+0x23c/0x494)
[15610.881901] [<806c01b4>] (ip_rcv_finish) from [<806c093c>] (ip_rcv+0x31c/0x554)
[15610.881913] [<806c093c>] (ip_rcv) from [<80673e60>] (__netif_receive_skb_core+0x340/0xc84)
[15610.881926] [<80673e60>] (__netif_receive_skb_core) from [<80676a0c>] (__netif_receive_skb+0x20/0x7c)
[15610.881941] [<80676a0c>] (__netif_receive_skb) from [<8067a6c8>] (netif_receive_skb_internal+0x30/0xe0)
[15610.881956] [<8067a6c8>] (netif_receive_skb_internal) from [<8067a79c>] (netif_receive_skb+0x24/0x98)
[15610.882015] [<8067a79c>] (netif_receive_skb) from [<7f87140c>] (br_netif_receive_skb+0x48/0x60 [bridge])
[15610.882117] [<7f87140c>] (br_netif_receive_skb [bridge]) from [<7f8714dc>] (br_pass_frame_up+0xb8/0x11c [bridge])
[15610.882219] [<7f8714dc>] (br_pass_frame_up [bridge]) from [<7f871694>] (br_handle_frame_finish+0x110/0x4e0 [bridge])
[15610.882320] [<7f871694>] (br_handle_frame_finish [bridge]) from [<7f871bdc>] (br_handle_frame+0x178/0x2d4 [bridge])
[15610.882379] [<7f871bdc>] (br_handle_frame [bridge]) from [<80673ef0>] (__netif_receive_skb_core+0x3d0/0xc84)
[15610.882393] [<80673ef0>] (__netif_receive_skb_core) from [<80676a0c>] (__netif_receive_skb+0x20/0x7c)
[15610.882407] [<80676a0c>] (__netif_receive_skb) from [<80676b00>] (process_backlog+0x98/0x148)
[15610.882421] [<80676b00>] (process_backlog) from [<8067addc>] (net_rx_action+0x2e8/0x45c)
[15610.882435] [<8067addc>] (net_rx_action) from [<80101694>] (__do_softirq+0x18c/0x3d8)
[15610.882446] [<80101694>] (__do_softirq) from [<801237b4>] (irq_exit+0xe0/0x144)
[15610.882460] [<801237b4>] (irq_exit) from [<80175554>] (__handle_domain_irq+0x70/0xc4)
[15610.882473] [<80175554>] (__handle_domain_irq) from [<80101504>] (bcm2836_arm_irqchip_handle_irq+0xa8/0xac)
[15610.882486] [<80101504>] (bcm2836_arm_irqchip_handle_irq) from [<80789c3c>] (__irq_svc+0x5c/0x7c)
[15610.882491] Exception stack(0x80c01ef0 to 0x80c01f38)
[15610.882499] 1ee0:                                     00000000 26bdd9d0 3c363000 00000000

top on rb3plus system:

top - 20:38:27 up  4:20,  1 user,  load average: 3.85, 1.89, 1.27
Tasks:  85 total,   3 running,  46 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.2 us,  0.1 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem :   994136 total,   768520 free,    40664 used,   184952 buff/cache
KiB Swap:  1106812 total,  1106812 free,        0 used.   836096 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    7 root      20   0       0      0      0 R  89.1  0.0  21:23.75 ksoftirqd/0

2796 packets transmitted, 260 packets received, 90.7% packet loss
round-trip min/avg/max/stddev = 1.425/333511.060/1076825.483/310975.815 ms

2743 packets transmitted, 773 packets received, +1 duplicates, 71.8% packet loss
round-trip min/avg/max/stddev = 0.268/418017.065/1241204.172/344407.543 ms

(ping stats from other systems on network)

@klpgo

klpgo commented Mar 29, 2018

Same problem here. Thanks for sharing the workarounds, but I cannot confirm the MTU one: setting the MTU to 1496 on the VLAN still produces the hw csum error.
My configuration is: raspbian-stretch-lite image with no modifications (kernel version 4.9.80-v7+)

@6by9
Contributor

6by9 commented Mar 29, 2018

Sorry, I haven't got a VLAN environment set up at the moment (I'll try to remember next week), but in looking at the throughput issues I've stumbled across the setup for the watchdog to abort frames that are too long.
https://github.com/raspberrypi/linux/blob/rpi-4.14.y/drivers/net/usb/lan78xx.c#L2155
VLAN tagging will add 4 bytes to your packet, taking you right up to the abort watchdog time.
If you can rebuild your own kernel, then could you try increasing the value to

buf |= (((size + 12) << MAC_RX_MAX_SIZE_SHIFT_) & MAC_RX_MAX_SIZE_MASK_);

to give a little more headroom before frames get aborted. Decreasing the MTU likewise brings tagged frames back down to the default maximum frame size, so they stay within the standard timeout.
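
As a rough illustration of the change suggested above, the following userspace sketch packs a maximum frame size into a register field the same way the quoted line does. The shift and mask values, the pack_rx_max_size and unpack_rx_max_size helpers, and the assumption that the unmodified driver adds 4 bytes for the FCS are illustrative guesses, not taken from lan78xx.c; check the driver header before relying on them.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field layout, for illustration only; the real values live in lan78xx.h. */
#define MAC_RX_MAX_SIZE_SHIFT_ 16
#define MAC_RX_MAX_SIZE_MASK_  UINT32_C(0x3FFF0000)

/* Hypothetical helper: store (size + headroom) in the max-size field of a MAC_RX value. */
static uint32_t pack_rx_max_size(uint32_t mac_rx, uint32_t size, uint32_t headroom)
{
    uint32_t field = size + headroom;

    /* The value must fit in the masked bits, otherwise it would be silently truncated. */
    assert(field <= (MAC_RX_MAX_SIZE_MASK_ >> MAC_RX_MAX_SIZE_SHIFT_));

    mac_rx &= ~MAC_RX_MAX_SIZE_MASK_;                       /* clear the old limit */
    mac_rx |= (field << MAC_RX_MAX_SIZE_SHIFT_) & MAC_RX_MAX_SIZE_MASK_;
    return mac_rx;
}

/* Read the limit back out of the register value. */
static uint32_t unpack_rx_max_size(uint32_t mac_rx)
{
    return (mac_rx & MAC_RX_MAX_SIZE_MASK_) >> MAC_RX_MAX_SIZE_SHIFT_;
}
```

With size = 1500 + 14, a headroom of 4 yields a limit of 1518 bytes, which a full-size tagged frame (1522 bytes) exceeds; a headroom of 12 raises the limit to 1526.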

@klpgo

klpgo commented Mar 29, 2018

Upgraded my raspbian-stretch-lite image to kernel version 4.14.31-v7+. But that does not change the problem.

@6by9
Contributor

6by9 commented Mar 29, 2018

Upgraded my raspbian-stretch-lite image to kernel version 4.14.31-v7+. But that does not change the problem.

I wouldn't expect it to seeing as the issue is reported on 4.14.27 and there have been no significant changes between that and 4.14.31.

@vintozver

Same problem. It works fine with a Raspberry Pi 3B but doesn't work with a 3B+.

6by9 added a commit to 6by9/linux that referenced this issue Apr 4, 2018
The frame abort timeout being set by lan78xx_set_rx_max_frame_length
didn't account for any VLAN headers, resulting in very low
throughput if used with tagged VLANs.
Use VLAN_ETH_HLEN instead of ETH_HLEN to correct for this.

See raspberrypi#2458

Signed-off-by: Dave Stevenson <[email protected]>
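
The effect of that one-constant change can be checked with a little arithmetic. This is a sketch, not driver code: the header lengths match the kernel's ETH_HLEN, VLAN_HLEN, VLAN_ETH_HLEN and ETH_FCS_LEN constants, while wire_frame_len, rx_limit and frame_fits are hypothetical helpers, and it assumes the receive limit is programmed as MTU plus header allowance plus FCS.

```c
#include <assert.h>
#include <stdbool.h>

#define ETH_HLEN      14  /* destination MAC + source MAC + EtherType */
#define VLAN_HLEN      4  /* 802.1Q tag: TPID + TCI */
#define VLAN_ETH_HLEN 18  /* ETH_HLEN + VLAN_HLEN */
#define ETH_FCS_LEN    4  /* trailing frame check sequence */

/* Largest frame on the wire for a given MTU, tagged or untagged. */
static int wire_frame_len(int mtu, bool tagged)
{
    assert(mtu > 0);
    return mtu + (tagged ? VLAN_ETH_HLEN : ETH_HLEN) + ETH_FCS_LEN;
}

/* Receive limit for a given device MTU and header allowance (assumed formula). */
static int rx_limit(int mtu, int hdr_len)
{
    return mtu + hdr_len + ETH_FCS_LEN;
}

/* Hypothetical check: does a full-size frame stay under the receive limit? */
static bool frame_fits(int frame_mtu, bool tagged, int limit)
{
    return wire_frame_len(frame_mtu, tagged) <= limit;
}
```

Under these assumptions a tagged frame at MTU 1500 is 1522 bytes and overruns a 1518-byte limit, a tagged frame at MTU 1496 is exactly 1518 bytes and fits, and switching the allowance to VLAN_ETH_HLEN raises the limit to 1522.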
@6by9
Contributor

6by9 commented Apr 4, 2018

I've confirmed to my own satisfaction that the timeout I suspected was to blame.
Pull request created; hopefully VLAN-tagged data will then work as expected.

pelwell pushed a commit that referenced this issue Apr 4, 2018
The frame abort timeout being set by lan78xx_set_rx_max_frame_length
didn't account for any VLAN headers, resulting in very low
throughput if used with tagged VLANs.
Use VLAN_ETH_HLEN instead of ETH_HLEN to correct for this.

See #2458

Signed-off-by: Dave Stevenson <[email protected]>
popcornmix added a commit to raspberrypi/firmware that referenced this issue Apr 4, 2018
See: raspberrypi/linux#2458

kernel: Revert lan78xx: Simple patch to prevent some crashes
kernel: lan78xx: Connect phy early
kernel: lan78xx: Don't reset the interface on open
See: raspberrypi/linux#2437
See: raspberrypi/linux#2442
See: raspberrypi/linux#2457

firmware: clockman: Don't use OSC for pixel clock
See: https://www.raspberrypi.org/forums/viewtopic.php?f=29&t=24679&start=150#p1297298
popcornmix added a commit to Hexxeh/rpi-firmware that referenced this issue Apr 4, 2018
See: raspberrypi/linux#2458

kernel: Revert lan78xx: Simple patch to prevent some crashes
kernel: lan78xx: Connect phy early
kernel: lan78xx: Don't reset the interface on open
See: raspberrypi/linux#2437
See: raspberrypi/linux#2442
See: raspberrypi/linux#2457

firmware: clockman: Don't use OSC for pixel clock
See: https://www.raspberrypi.org/forums/viewtopic.php?f=29&t=24679&start=150#p1297298
@popcornmix
Collaborator

Latest rpi-update kernel has a potential fix for this issue. Please test.

@klpgo

klpgo commented Apr 4, 2018

Thanks for the fix. I've upgraded to kernel 4.14.32-v7+ and tested it on a Pi 3B+. Part of the problem seems to be fixed: the frequency of the error is very much reduced, but it still appears from time to time.

I found that I can trigger the error by pinging the VLAN interface from the same VLAN. I see no dropped packets or transmit errors on the interface.

This is the dmesg output for an error that was not triggered by ping:

[  279.552533] eth0.40: hw csum failure
[  279.552565] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G         C      4.14.32-v7+ #1106
[  279.552571] Hardware name: BCM2835
[  279.552619] [<8010fff8>] (unwind_backtrace) from [<8010c260>] (show_stack+0x20/0x24)
[  279.552645] [<8010c260>] (show_stack) from [<80783ca4>] (dump_stack+0xd4/0x118)
[  279.552678] [<80783ca4>] (dump_stack) from [<8068a898>] (netdev_rx_csum_fault+0x44/0x48)
[  279.552701] [<8068a898>] (netdev_rx_csum_fault) from [<8067d1b8>] (__skb_checksum_complete+0xb4/0xb8)
[  279.552720] [<8067d1b8>] (__skb_checksum_complete) from [<8070e2ec>] (icmp_rcv+0xd0/0x388)
[  279.552739] [<8070e2ec>] (icmp_rcv) from [<806d3e34>] (ip_local_deliver_finish+0xe4/0x330)
[  279.552758] [<806d3e34>] (ip_local_deliver_finish) from [<806d46ec>] (ip_local_deliver+0x54/0xdc)
[  279.552780] [<806d46ec>] (ip_local_deliver) from [<806d42fc>] (ip_rcv_finish+0x27c/0x4e0)
[  279.552800] [<806d42fc>] (ip_rcv_finish) from [<806d4a90>] (ip_rcv+0x31c/0x554)
[  279.552822] [<806d4a90>] (ip_rcv) from [<80687e60>] (__netif_receive_skb_core+0x340/0xc84)
[  279.552844] [<80687e60>] (__netif_receive_skb_core) from [<8068aa0c>] (__netif_receive_skb+0x20/0x7c)
[  279.552866] [<8068aa0c>] (__netif_receive_skb) from [<8068ab00>] (process_backlog+0x98/0x148)
[  279.552887] [<8068ab00>] (process_backlog) from [<8068eddc>] (net_rx_action+0x2e8/0x45c)
[  279.552905] [<8068eddc>] (net_rx_action) from [<80101694>] (__do_softirq+0x18c/0x3d8)
[  279.552922] [<80101694>] (__do_softirq) from [<801237b4>] (irq_exit+0xe0/0x144)
[  279.552941] [<801237b4>] (irq_exit) from [<80175554>] (__handle_domain_irq+0x70/0xc4)
[  279.552958] [<80175554>] (__handle_domain_irq) from [<80101504>] (bcm2836_arm_irqchip_handle_irq+0xa8/0xac)
[  279.552975] [<80101504>] (bcm2836_arm_irqchip_handle_irq) from [<8079f8bc>] (__irq_svc+0x5c/0x7c)
[  279.552990] Exception stack(0x80c01ef0 to 0x80c01f38)
[  279.553009] 1ee0:                                     00000000 01027c3c 397c4000 00000000
[  279.553027] 1f00: 80c00000 80c03dcc 80c03d68 80c88172 00000001 80b60a30 babffa40 80c01f4c
[  279.553041] 1f20: 80c04174 80c01f40 80108a6c 80108a70 60000013 ffffffff
[  279.553061] [<8079f8bc>] (__irq_svc) from [<80108a70>] (arch_cpu_idle+0x34/0x4c)
[  279.553078] [<80108a70>] (arch_cpu_idle) from [<8079f034>] (default_idle_call+0x34/0x48)
[  279.553095] [<8079f034>] (default_idle_call) from [<801611ec>] (do_idle+0xd8/0x150)
[  279.553109] [<801611ec>] (do_idle) from [<80161500>] (cpu_startup_entry+0x28/0x2c)
[  279.553130] [<80161500>] (cpu_startup_entry) from [<80798d84>] (rest_init+0xbc/0xc0)
[  279.553151] [<80798d84>] (rest_init) from [<80b00df8>] (start_kernel+0x3d4/0x3e0)

@6by9
Contributor

6by9 commented Apr 5, 2018

@kgottschalk Can you try sudo ethtool -K eth0 rx off to confirm that disabling hw rx checksumming solves the problem? I think you are seeing something unrelated, so need to diagnose that in isolation now that we have (hopefully) resolved the first issue.
Pings are normally never anywhere near the full MTU, so your issue won't be the rx jabber watchdog firing. Are you really doing just ping <Pi IP addr>, or is there something else involved?

@sinistermidget Could you retest under your original conditions please?

@tilosp

tilosp commented Apr 5, 2018

I have the same behavior as @kgottschalk with 4.14.32-v7+,
and I can confirm that sudo ethtool -K eth0 rx off solves the problem.

@tilosp

tilosp commented Apr 5, 2018

For some reason DNS does not work over the VLAN interface.

@klpgo

klpgo commented Apr 5, 2018

I confirm that I run just ping with the Pi's IP address, with no other options and the standard 64-byte packet size. Both incoming and outgoing pings produce the hw csum error with the dmesg output listed above. The interface does not report dropped packets, and ping does not report packet loss.

eth0.40: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.40.71 netmask 255.255.255.0 broadcast 192.168.40.255
inet6 fe80::a241:257f:2878:580d prefixlen 64 scopeid 0x20
ether b8:27:eb:b1:26:e4 txqueuelen 1000 (Ethernet)
RX packets 1267 bytes 754672 (736.9 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 987 bytes 420166 (410.3 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Communication over the VLAN seems to be broken. I have phones connected to the VLAN and they are not able to register with the server. That is probably the same reason why @tilosp found DNS not working over the VLAN.

@6by9
Contributor

6by9 commented Apr 5, 2018

Can you try disabling IPv6 and retesting? The 3B LAN chip had an issue with IPv6 checksum offload when IPv4 was OK.

@klpgo

klpgo commented Apr 5, 2018

I've disabled IPv6. No change. The ping still triggers the hw csum error.

eth0.40: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.40.71 netmask 255.255.255.0 broadcast 192.168.40.255
ether b8:27:eb:b1:26:e4 txqueuelen 1000 (Ethernet)
RX packets 564 bytes 327373 (319.7 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 602 bytes 234489 (228.9 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

[ 221.483287] eth0.40: hw csum failure
[ 221.483299] CPU: 0 PID: 1827 Comm: tmux: server Tainted: G C 4.14.32-v7+ #1106
[ 221.483301] Hardware name: BCM2835
[ 221.483321] [<8010fff8>] (unwind_backtrace) from [<8010c260>] (show_stack+0x20/0x24)
[ 221.483332] [<8010c260>] (show_stack) from [<80783ca4>] (dump_stack+0xd4/0x118)
[ 221.483343] [<80783ca4>] (dump_stack) from [<8068a898>] (netdev_rx_csum_fault+0x44/0x48)
[ 221.483352] [<8068a898>] (netdev_rx_csum_fault) from [<8067d1b8>] (__skb_checksum_complete+0xb4/0xb8)
[ 221.483361] [<8067d1b8>] (__skb_checksum_complete) from [<8070e2ec>] (icmp_rcv+0xd0/0x388)
[ 221.483370] [<8070e2ec>] (icmp_rcv) from [<806d3e34>] (ip_local_deliver_finish+0xe4/0x330)
[ 221.483378] [<806d3e34>] (ip_local_deliver_finish) from [<806d46ec>] (ip_local_deliver+0x54/0xdc)
[ 221.483384] [<806d46ec>] (ip_local_deliver) from [<806d42fc>] (ip_rcv_finish+0x27c/0x4e0)
[ 221.483390] [<806d42fc>] (ip_rcv_finish) from [<806d4a90>] (ip_rcv+0x31c/0x554)
[ 221.483397] [<806d4a90>] (ip_rcv) from [<80687e60>] (__netif_receive_skb_core+0x340/0xc84)
[ 221.483404] [<80687e60>] (__netif_receive_skb_core) from [<8068aa0c>] (__netif_receive_skb+0x20/0x7c)
[ 221.483411] [<8068aa0c>] (__netif_receive_skb) from [<8068ab00>] (process_backlog+0x98/0x148)
[ 221.483419] [<8068ab00>] (process_backlog) from [<8068eddc>] (net_rx_action+0x2e8/0x45c)
[ 221.483427] [<8068eddc>] (net_rx_action) from [<80101694>] (__do_softirq+0x18c/0x3d8)
[ 221.483434] [<80101694>] (__do_softirq) from [<801237b4>] (irq_exit+0xe0/0x144)
[ 221.483442] [<801237b4>] (irq_exit) from [<80175554>] (__handle_domain_irq+0x70/0xc4)
[ 221.483450] [<80175554>] (__handle_domain_irq) from [<80101504>] (bcm2836_arm_irqchip_handle_irq+0xa8/0xac)
[ 221.483459] [<80101504>] (bcm2836_arm_irqchip_handle_irq) from [<8079fc0c>] (__irq_usr+0x4c/0x60)
[ 221.483462] Exception stack(0x9d3fdfb0 to 0x9d3fdff8)
[ 221.483467] dfa0: 0050e635 008a0fc9 0000006c 00000073
[ 221.483472] dfc0: 008a0f80 0050e634 76f98ce8 004f6ed8 00000032 0000000c 00527b90 7ebfcecc
[ 221.483477] dfe0: 00527bc0 7ebfce18 004e178c 76db2414 20000010 ffffffff

@tilosp

tilosp commented Apr 5, 2018

I have disabled IPv6 using sysctl -w net.ipv6.conf.all.disable_ipv6=1 and sysctl -w net.ipv6.conf.default.disable_ipv6=1.
Even without IPv6, DNS still does not work. It works only after running ethtool -K eth0 rx off.

@6by9
Contributor

6by9 commented Apr 5, 2018

Thanks for testing - it was just a hunch.
Please could you clarify the situation. Originally the report was issues when using the full MTU of 1500. Was everything else working OK?

Now people are reporting no DNS (UDP) or ping (ICMP). Were those working before and we've got a regression by increasing the timeout, or were they failing before and nobody noticed?

If a regression then I'm very puzzled by it. I tested with iperf3 over a tagged VLAN (as well as untagged as a second interface) to a Pi3B and got ~92Mbit/s happily. Admittedly I was on static IP addresses and iperf uses TCP, but to get nothing is bizarre. Is DHCP working? Anyway, I'll investigate more tomorrow.

@vintozver

I tried to disable all csum offloads using ethtool.
Result: the errors disappear, but traffic doesn't go through either. IPv6 and UDP are completely busted.

@klpgo

klpgo commented Apr 5, 2018

I think there were two problems that are specific to VLANs: a high rate of dropped packets and the hw csum error. They may well have different causes.

@sinistermidget opened this issue and reported that a reduced MTU of 1496 is a workaround. In @fuxjezz's post I see that he experienced packet loss on the VLAN interface; his ping output shows 70 to 90% packet loss. It seems that this problem was solved by using VLAN_ETH_HLEN instead of ETH_HLEN. I don't see any dropped packets on my VLAN interface anymore.

But the hw csum error must be caused by something else. I wrote in my first post that I'm not able to confirm that a reduced MTU works as a workaround. I've always tested with ping from a host connected to the same VLAN and I always saw a hw csum error for each ping packet. This problem still persists. Ping packets are only 64 bytes plus headers, so with these we never hit the MTU. The same is true of DNS packets.

I'm just guessing: Maybe the vlan header part is left in the payload of the packet. That would explain the checksum error.
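
The guess above can be illustrated with the Internet checksum (RFC 1071) itself. The sketch below is only a model of that hypothesis, not of what the LAN78xx actually does: it sums an ICMP message once correctly and once with four stray bytes (a VLAN TCI and the inner EtherType) folded in, and all function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones' complement sum of big-endian 16-bit words (RFC 1071), folded to 16 bits. */
static uint16_t ocsum(const uint8_t *p, size_t len, uint32_t sum)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)p[i] << 8) | p[i + 1];
    if (i < len)                      /* odd trailing byte is padded with zero */
        sum += (uint32_t)p[i] << 8;
    while (sum >> 16)                 /* fold carries back into the low 16 bits */
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Fill in the ICMP checksum field (bytes 2 and 3) of an ICMP message. */
static void icmp_set_csum(uint8_t *icmp, size_t len)
{
    uint16_t c;

    assert(len >= 8);                 /* an ICMP header is at least 8 bytes */
    icmp[2] = 0;
    icmp[3] = 0;
    c = (uint16_t)~ocsum(icmp, len, 0);
    icmp[2] = (uint8_t)(c >> 8);
    icmp[3] = (uint8_t)(c & 0xFF);
}

/* A correctly checksummed message sums to 0xFFFF. */
static int csum_ok(uint16_t sum)
{
    return sum == 0xFFFFu;
}

/* Sum the message as the hypothesis describes: tag bytes included by mistake. */
static uint16_t sum_with_tag(const uint8_t tag[4], const uint8_t *icmp, size_t len)
{
    return ocsum(icmp, len, ocsum(tag, 4, 0));
}

/* Remove the tag contribution again (ones' complement subtraction). */
static uint16_t sum_minus_tag(uint16_t sum, const uint8_t tag[4])
{
    uint16_t t = ocsum(tag, 4, 0);
    uint32_t s = (uint32_t)sum + (uint16_t)~t;

    while (s >> 16)
        s = (s & 0xFFFFu) + (s >> 16);
    return (uint16_t)s;
}
```

For an echo request prepared with icmp_set_csum and the tag bytes 00 28 08 00 (VLAN 40, then the IPv4 EtherType), the plain sum passes csum_ok, the sum from sum_with_tag does not, and sum_minus_tag makes it pass again. If something like this were happening, it would fit the observation that every tagged ping logs a failure regardless of packet size.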

@vintozver

I see a lot of errors related to csum only when the traffic goes through vlan.
Bringing the vlan down results in stopping the traces. Traffic goes normally (both ipv4 & ipv6) through the interface.

I guess you're right - vlan tag is in the payload. That's why we have the csum error (only for tagged packets).

@6by9
Contributor

6by9 commented Apr 5, 2018

If they really have blundered on what is included in the csum calc, then that is a big blunder in the hardware design. Seeing as it is correct in other earlier chips I'd be surprised if it were wrong here.
Microchip are the only ones who can really answer that question, but I will be doing some further experimentation tomorrow to see if we can understand the problem better.

@lategoodbye
Contributor

Has anyone verified the hardware-calculated checksum? Is there a pattern (wrong endianness, alignment issue, static values or something else)?

@6by9
Contributor

6by9 commented Apr 6, 2018

I've replicated the issue now. The pings are working fine, just logging this message in the kernel log.
The LAN78xx can flag that it has been unable to compute the checksum (rx_cmd_a & RX_CMD_A_ICSM_ condition in lan78xx_rx_csum_offload), and I am seeing that flag set a moderate amount, including (I believe) on the ping frames.
I'm trying to confirm the exact conditions for the ping packets at the moment.

Generally the VLANs all seem to be running fine for me.
My setup is:

  • a TP-Link TL-SG108E switch.
  • a Draytek 2830 off a port configured with 4 VLANs as tagged, set up for DHCP on each VLAN.
  • a Pi3B+ off a port configured with 4 VLANs as tagged.
  • Four other ports on the switch set as untagged on each of those 4 VLANs. Connect a Pi3B (not 3B+) to each of those ports in turn. Draytek is handing out an IP address on the different subnets correctly, and then I can run iperf3 and get ~92Mbit/s in either direction between the 3B and 3B+. iperf3 also tested in UDP mode. Ping is working (though logging the csum failure in dmesg).

@vintozver

So, we only have the issue with ipv6 on vlan then.
I have Netgear GS108T, Pi is connected to the port with tagging for vlan 101. both dhcpv4 and dhcpv6 are inoperable, ipv6 ping traffic doesn't go through.

@6by9
Contributor

6by9 commented Apr 6, 2018

So, we only have the issue with ipv6 on vlan then.

Statement or question? Your second line then says DHCPv4 isn't working, so I'm totally confused by your post.

popcornmix pushed a commit that referenced this issue Jul 30, 2018
There appears to be some issue in the LAN78xx where the checksum
computed on a VLAN tagged packet is incorrect, or at least not
in the form that the kernel is after. This is most easily shown
by pinging a device via a VLAN tagged interface and it will dump
out the error message and stack trace from netdev_rx_csum_fault.
It has also been seen with standard TCP and UDP packets.

Until this is fully understood, request that the network stack
computes the checksum on packets signalled as having a VLAN tag
applied.

See #2458

Signed-off-by: Dave Stevenson <[email protected]>
popcornmix pushed a commit that referenced this issue Jul 30, 2018
HW_VLAN_CTAG_FILTER was partially implemented, but not fully to Linux.
Complete the implementation of this.

See #2458.

Signed-off-by: Dave Stevenson <[email protected]>
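
The thread does not say how the filter is implemented. A common design, assumed here, is a 4096-bit table with one bit per VLAN ID that the driver updates when a VLAN is added or removed (in Linux, via the ndo_vlan_rx_add_vid and ndo_vlan_rx_kill_vid callbacks). The struct and function names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VLAN_N_VID 4096               /* 12-bit VLAN ID space */
#define VLAN_WORDS (VLAN_N_VID / 32)  /* 128 words of 32 bits each */

/* Hypothetical software copy of a per-VID hardware filter table. */
struct vlan_filter {
    uint32_t table[VLAN_WORDS];
};

/* Accept frames tagged with this VLAN ID. */
static void vlan_filter_add(struct vlan_filter *f, uint16_t vid)
{
    assert(vid < VLAN_N_VID);
    f->table[vid >> 5] |= UINT32_C(1) << (vid & 0x1F);
}

/* Stop accepting frames tagged with this VLAN ID. */
static void vlan_filter_kill(struct vlan_filter *f, uint16_t vid)
{
    assert(vid < VLAN_N_VID);
    f->table[vid >> 5] &= ~(UINT32_C(1) << (vid & 0x1F));
}

/* Would a frame with this VLAN ID pass the filter? */
static bool vlan_filter_match(const struct vlan_filter *f, uint16_t vid)
{
    assert(vid < VLAN_N_VID);
    return (f->table[vid >> 5] >> (vid & 0x1F)) & 1u;
}
```

In this model, creating eth0.40 would call vlan_filter_add with 40, which sets bit 8 of word 1; frames for VLANs that were never added would be dropped before reaching the stack.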
popcornmix pushed a commit that referenced this issue Jul 30, 2018
The chip supports stripping the VLAN tag and reporting it
in metadata. Implement this as it also appears to solve the
issues observed in checksum computation.

See #2458.

Signed-off-by: Dave Stevenson <[email protected]>
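
For readers unfamiliar with tag stripping, this is roughly what the operation amounts to, written as a software sketch. In the real driver the chip does the work and reports the tag in its receive metadata; struct vlan_meta and the helper names here are stand-ins invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN    6
#define ETH_P_8021Q 0x8100  /* TPID of an 802.1Q C-tag */
#define VLAN_HLEN   4

/* Hypothetical out-of-band metadata, standing in for what a NIC would report. */
struct vlan_meta {
    int      tagged;  /* 1 if a tag was found and removed */
    uint16_t tci;     /* priority (3 bits), DEI (1 bit), VLAN ID (12 bits) */
};

/* Strip one 802.1Q tag in place and return the new frame length. */
static size_t vlan_strip(uint8_t *frame, size_t len, struct vlan_meta *meta)
{
    const size_t tag_off = 2 * ETH_ALEN;  /* the tag follows the two MAC addresses */

    assert(frame != NULL && meta != NULL);
    meta->tagged = 0;
    meta->tci = 0;

    if (len < tag_off + VLAN_HLEN + 2)
        return len;  /* too short to carry a tag plus an EtherType */
    if ((((uint16_t)frame[tag_off] << 8) | frame[tag_off + 1]) != ETH_P_8021Q)
        return len;  /* untagged frame, leave it alone */

    meta->tagged = 1;
    meta->tci = (uint16_t)((frame[tag_off + 2] << 8) | frame[tag_off + 3]);

    /* Close the 4-byte gap so the inner EtherType follows the source MAC. */
    memmove(frame + tag_off, frame + tag_off + VLAN_HLEN,
            len - tag_off - VLAN_HLEN);
    return len - VLAN_HLEN;
}

/* Extract the 12-bit VLAN ID from a TCI value. */
static uint16_t vlan_vid(uint16_t tci)
{
    return tci & 0x0FFF;
}
```

After stripping, the frame looks like an ordinary untagged Ethernet frame, which may be why the checksum the chip reports then lines up with what the stack expects; the thread itself only records that the problem is no longer observed.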
popcornmix pushed a commit that referenced this issue Jul 30, 2018
With HW_VLAN_CTAG_RX enabled we don't observe the checksum
issue, so amend the workaround to only drop back to s/w
checksums if VLAN offload is disabled.

See #2458.

Signed-off-by: Dave Stevenson <[email protected]>
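
Taken together, the checksum workaround and this amendment describe a decision along the following lines. This is a userspace model under stated assumptions, not the driver source: RX_CMD_A_ICSM_ is named earlier in the thread, but RX_CMD_A_FVTG_ (frame carried a VLAN tag) is an assumed name, both bit values are placeholders, and the feature booleans stand in for NETIF_F_RXCSUM and NETIF_F_HW_VLAN_CTAG_RX.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit positions for illustration; the real definitions are in lan78xx.h. */
#define RX_CMD_A_ICSM_ UINT32_C(0x00004000)  /* hardware could not compute the checksum */
#define RX_CMD_A_FVTG_ UINT32_C(0x00100000)  /* assumed name: frame carried a VLAN tag */

/* Stand-ins for the kernel's CHECKSUM_NONE and CHECKSUM_COMPLETE. */
enum csum_mode { CSUM_NONE, CSUM_COMPLETE };

struct features {
    bool rxcsum;        /* stand-in for NETIF_F_RXCSUM */
    bool vlan_ctag_rx;  /* stand-in for NETIF_F_HW_VLAN_CTAG_RX */
};

/* Decide whether to trust the hardware checksum for one received frame. */
static enum csum_mode rx_csum_mode(const struct features *f, uint32_t rx_cmd_a)
{
    assert(f != NULL);

    if (!f->rxcsum)
        return CSUM_NONE;      /* rx checksum offload switched off (ethtool -K eth0 rx off) */
    if (rx_cmd_a & RX_CMD_A_ICSM_)
        return CSUM_NONE;      /* chip flagged that it could not compute the checksum */
    if ((rx_cmd_a & RX_CMD_A_FVTG_) && !f->vlan_ctag_rx)
        return CSUM_NONE;      /* tagged frame and the tag was not stripped */
    return CSUM_COMPLETE;      /* hardware checksum is used */
}
```

Returning CSUM_NONE corresponds to asking the network stack to compute the checksum in software, which costs some CPU time but avoids the hw csum failure trace.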
popcornmix pushed five commits that referenced this issue Aug 7, 2018: the frame abort timeout fix (VLAN_ETH_HLEN instead of ETH_HLEN), the software checksum fallback for VLAN-tagged packets, the completed HW_VLAN_CTAG_FILTER implementation, VLAN tag stripping, and the amended checksum workaround. The commit messages are identical to those quoted above.
popcornmix pushed five commits that referenced this issue Aug 14, 2018: the frame abort timeout fix (VLAN_ETH_HLEN instead of ETH_HLEN), the software checksum fallback for VLAN-tagged packets, the completed HW_VLAN_CTAG_FILTER implementation, VLAN tag stripping, and the amended checksum workaround. The commit messages are identical to those quoted above.
popcornmix pushed five commits that referenced this issue Aug 22, 2018: the frame abort timeout fix (VLAN_ETH_HLEN instead of ETH_HLEN), the software checksum fallback for VLAN-tagged packets, the completed HW_VLAN_CTAG_FILTER implementation, VLAN tag stripping, and the amended checksum workaround. The commit messages are identical to those quoted above.
popcornmix pushed five commits that referenced this issue Aug 29, 2018: the frame abort timeout fix (VLAN_ETH_HLEN instead of ETH_HLEN), the software checksum fallback for VLAN-tagged packets, the completed HW_VLAN_CTAG_FILTER implementation, VLAN tag stripping, and the amended checksum workaround. The commit messages are identical to those quoted above.
APokorny pushed a commit to ubports/ubuntu_kernel_xenial that referenced this issue Oct 11, 2018
BugLink: http://bugs.launchpad.net/bugs/1784025

The frame abort timeout being set by lan78xx_set_rx_max_frame_length
didn't account for any VLAN headers, resulting in very low
throughput if used with tagged VLANs.
Use VLAN_ETH_HLEN instead of ETH_HLEN to correct for this.

See raspberrypi/linux#2458

Signed-off-by: Dave Stevenson <[email protected]>
(cherry picked from commit 2259b7a64d71f27311a19fd7a5bed47413d75985)
Signed-off-by: Paolo Pisati <[email protected]>
Acked-by: Stefan Bader <[email protected]>
Acked-by: Kleber Sacilotto de Souza <[email protected]>
Signed-off-by: Kleber Sacilotto de Souza <[email protected]>
jai-raptee pushed a commit to jai-raptee/iliteck1 that referenced this issue Apr 30, 2024
The frame abort timeout being set by lan78xx_set_rx_max_frame_length
didn't account for any VLAN headers, resulting in very low
throughput if used with tagged VLANs.
Use VLAN_ETH_HLEN instead of ETH_HLEN to correct for this.

See raspberrypi/linux#2458

Signed-off-by: Dave Stevenson <[email protected]>
jai-raptee pushed a commit to jai-raptee/iliteck1 that referenced this issue Apr 30, 2024
There appears to be some issue in the LAN78xx where the checksum
computed on a VLAN tagged packet is incorrect, or at least not
in the form that the kernel is after. This is most easily shown
by pinging a device via a VLAN tagged interface and it will dump
out the error message and stack trace from netdev_rx_csum_fault.
It has also been seen with standard TCP and UDP packets.

Until this is fully understood, request that the network stack
computes the checksum on packets signalled as having a VLAN tag
applied.

See raspberrypi/linux#2458

Signed-off-by: Dave Stevenson <[email protected]>