
4.14.74-v7+ 3b+ VLAN udp IPv4/6 - hw csum failure #2713

Closed
Wireheadbe opened this issue Oct 12, 2018 · 54 comments
Labels
Waiting for internal comment (Waiting for comment from a member of the Raspberry Pi engineering team)

Comments

@Wireheadbe
Contributor

Wireheadbe commented Oct 12, 2018

Hi All - seems like issue #2458 is popping up again.

Setup is eth0 on the local LAN, with eth0.30 carrying tagged data. Running IPv6 as well (both over eth0.30 and over tun0 for OpenVPN).
Having OpenVPN traffic coming in over eth0.30 causes lots of these errors:

[  862.694409] eth0: hw csum failure
[  862.694452] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G         C      4.14.74-v7+ #1149
[  862.694458] Hardware name: BCM2835
[  862.694495] [<8010ffd4>] (unwind_backtrace) from [<8010c240>] (show_stack+0x20/0x24)
[  862.694514] [<8010c240>] (show_stack) from [<80788604>] (dump_stack+0xd4/0x118)
[  862.694537] [<80788604>] (dump_stack) from [<8068e870>] (netdev_rx_csum_fault+0x44/0x48)
[  862.694567] [<8068e870>] (netdev_rx_csum_fault) from [<80681144>] (__skb_checksum_complete+0xbc/0xc0)
[  862.694587] [<80681144>] (__skb_checksum_complete) from [<807331c8>] (nf_ip_checksum+0xd4/0x130)
[  862.694704] [<807331c8>] (nf_ip_checksum) from [<7f558a4c>] (udp_error+0x138/0x1c8 [nf_conntrack])
[  862.694894] [<7f558a4c>] (udp_error [nf_conntrack]) from [<7f551994>] (nf_conntrack_in+0xec/0x560 [nf_conntrack])
[  862.695004] [<7f551994>] (nf_conntrack_in [nf_conntrack]) from [<7f59f2dc>] (ipv4_conntrack_in+0x28/0x2c [nf_conntrack_ipv4])
[  862.695031] [<7f59f2dc>] (ipv4_conntrack_in [nf_conntrack_ipv4]) from [<806d04dc>] (nf_hook_slow+0x4c/0xd0)
[  862.695056] [<806d04dc>] (nf_hook_slow) from [<806d8df4>] (ip_rcv+0x460/0x514)
[  862.695075] [<806d8df4>] (ip_rcv) from [<8068be4c>] (__netif_receive_skb_core+0x340/0xc84)
[  862.695094] [<8068be4c>] (__netif_receive_skb_core) from [<8068e9e4>] (__netif_receive_skb+0x20/0x7c)
[  862.695112] [<8068e9e4>] (__netif_receive_skb) from [<8068ead8>] (process_backlog+0x98/0x148)
[  862.695128] [<8068ead8>] (process_backlog) from [<80692df0>] (net_rx_action+0x2e8/0x45c)
[  862.695144] [<80692df0>] (net_rx_action) from [<80101694>] (__do_softirq+0x18c/0x3d8)
[  862.695161] [<80101694>] (__do_softirq) from [<80123870>] (irq_exit+0x108/0x164)
[  862.695179] [<80123870>] (irq_exit) from [<801759a8>] (__handle_domain_irq+0x70/0xc4)
[  862.695195] [<801759a8>] (__handle_domain_irq) from [<80101504>] (bcm2836_arm_irqchip_handle_irq+0xa8/0xac)
[  862.695211] [<80101504>] (bcm2836_arm_irqchip_handle_irq) from [<807a41bc>] (__irq_svc+0x5c/0x7c)
[  862.695218] Exception stack(0x80c01ef0 to 0x80c01f38)
[  862.695228] 1ee0:                                     00000000 045b6fe8 397c2000 00000000
[  862.695241] 1f00: 80c00000 80c03dcc 80c03d68 80c885b2 00000001 80b60a30 babff9c0 80c01f4c
[  862.695252] 1f20: 80c04174 80c01f40 80108a4c 80108a50 60000013 ffffffff
[  862.695268] [<807a41bc>] (__irq_svc) from [<80108a50>] (arch_cpu_idle+0x34/0x4c)
[  862.695291] [<80108a50>] (arch_cpu_idle) from [<807a393c>] (default_idle_call+0x34/0x48)
[  862.695309] [<807a393c>] (default_idle_call) from [<801614b8>] (do_idle+0xd8/0x150)
[  862.695324] [<801614b8>] (do_idle) from [<801617cc>] (cpu_startup_entry+0x28/0x2c)
[  862.695339] [<801617cc>] (cpu_startup_entry) from [<8079d664>] (rest_init+0xbc/0xc0)
[  862.695359] [<8079d664>] (rest_init) from [<80b00df8>] (start_kernel+0x3d4/0x3e0)

Seems like issue 2458 is popping up again or wasn't fixed completely. Running

ethtool -K eth0 gro off 
ethtool -K eth0 rx-vlan-hw-parse off

Still results in these errors being generated.
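
For anyone trying to reproduce, a minimal sketch of how to check the offload settings and watch for the failures (assuming the interface is eth0):

# list the checksum/GRO/VLAN offload features currently enabled
ethtool -k eth0 | grep -E 'checksum|gro|vlan'

# optionally disable receive checksum offload as a further test
sudo ethtool -K eth0 rx off

# follow the kernel log for new "hw csum failure" entries
dmesg -w | grep --line-buffered 'hw csum failure'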

@Wireheadbe
Contributor Author

#2659 seems similar - OpenVPN and iptables are involved there as well

@Wireheadbe
Contributor Author

Looking at the trace, it does look like it's coming in via IPv4 instead

@6by9
Contributor

6by9 commented Oct 12, 2018

@Wireheadbe
Do you have an easy way to reproduce?
I'm assuming that this has been running without logging errors in the past, or has this never worked cleanly for you?

Can you try reverting to an earlier kernel and retesting? sudo rpi-update <hash> where hash is the commit ID from https://github.com/Hexxeh/rpi-firmware/commits/master. Even a rough idea of when this has started failing would be useful.

We haven't made any changes since #2458 regarding the lan78xx driver, but it's possible something in the core has been backported to stable and is causing grief.
The fact that we're also seeing hw csum errors being reported without VLANs involved, and on SMSC95xx (#2712), implies it is something more generic and not #2458 having returned.
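
The rollback flow described above is roughly (the hash being whichever rpi-firmware commit is under test):

# roll the kernel/firmware back to a specific rpi-firmware commit
sudo rpi-update <firmware-commit-hash>
sudo reboot
# after rebooting, confirm which kernel is now running
uname -a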

@Wireheadbe
Contributor Author

It's a pretty recent install. The only thing I added yesterday was DHCPv6 next to radvd. This morning I connected via OpenVPN and noticed the errors. That's the only change I made in terms of functionality. I was getting them on an earlier kernel, so I actually ran rpi-update; the issue remained. Disabling the offloading does make them happen less often.

@Wireheadbe
Contributor Author

Went back to ec9d84e - the errors seem to be gone at the moment. I'll keep an eye on it before confirming 100%.

@Wireheadbe
Contributor Author

Wireheadbe commented Oct 12, 2018

Did some speed tests over IPv4 and IPv6, VPN traffic, etc. Still quiet at the moment on ec9d84e - 4.14.54-v7+ SMP Mon Jul 9 16:41:01 BST 2018 armv7l GNU/Linux

@6by9
Contributor

6by9 commented Oct 12, 2018

Thank you - that's a brilliant data point.

If you find that is stable, please can you try a couple of further bisections? Trying a5b781c (Sept 5) and then either e880e62 (Sept 18) if that is good, or 911147a (Aug 16) if not would be wonderful.
If you can narrow it down to a single commit then there's an even better chance of us finding the issue, but I do recognise that is quite an ask.
Currently we're still in the position of not being able to reproduce it, so we're totally stuck.

@Wireheadbe
Contributor Author

Tried every single version from that list - it seems to start happening from c919d63

@Wireheadbe changed the title from "4.14.74-v7+ 3b+ VLAN udp IPv6 eth0: hw csum failure" to "4.14.74-v7+ 3b+ VLAN udp IPv4/6 - hw csum failure" on Oct 13, 2018
@6by9
Contributor

6by9 commented Oct 13, 2018

Thank you. We can have a look at what changed with 4.14.71.

@6by9
Contributor

6by9 commented Oct 13, 2018

Hmm, 4.14.70 to 4.14.71 is 128 commits of which around 35 appear to be network related.

Can you give some more details of your setup? OpenVPN is running on the Pi, with the tunnel coming in on eth0.30? We need some way to reproduce this unless you fancy bisecting the kernel commits (not as bad as it sounds, but a tad involved).

@Wireheadbe
Contributor Author

I don't even have to run the tunnel. I'll see if I can get around to pinpointing it with Wireshark. Bisecting the commits - please elaborate. Maybe this is the easiest way, although somewhat repetitive ;)

@Wireheadbe
Contributor Author

Wireheadbe commented Oct 13, 2018

I think I have it narrowed down to packets sent from one device - were there any changes in how IGMPv2, STP or ICMPv6 packets are handled? Those are the three big red flags I have.

@6by9
Contributor

6by9 commented Oct 13, 2018

See https://www.raspberrypi.org/documentation/linux/kernel/building.md for details of how to compile your own kernel.

git is the source control system used for the kernel source. It includes a mechanism for bisecting the tree to close in on a regression as quickly as possible. See https://git-scm.com/docs/git-bisect for a description.
4.14.71 was commit 1244bbb, 4.14.70 was commit 1244bbb. Your comment that "seems to happen from c919d63" says that it should be one of the commits between the two that caused the issue. Bisecting 128 commits should take 7 builds if I've got my numbers right.
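
A rough sketch of that bisect loop (the commit IDs are placeholders for the known-good and known-bad ends):

git bisect start
git bisect bad  <first-bad-commit>     # e.g. the 4.14.71 state
git bisect good <last-good-commit>     # e.g. the 4.14.70 state
# git checks out a commit roughly half way between the two;
# build, install and test that kernel, then report the result:
git bisect good    # or: git bisect bad
# repeat until git names the first bad commit, then clean up:
git bisect reset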

@6by9
Contributor

6by9 commented Oct 13, 2018

You can see the commit history from that time for yourself via https://github.com/raspberrypi/linux/commits/rpi-4.14.y?before=1244bbb3e92135d247e2dddfa6fe5e3e171a9635+35

6bf32cd looks like it might have the potential for changing the behaviour on checksums. If you were to build your own kernel then it'd be worth a quick test having done git revert 6bf32cda46ebfbaf13da3c48a0a009adae925703
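
A sketch of that test, assuming a cross-compile setup as per the kernel-building documentation for a 3B+ (bcm2709_defconfig, kernel7):

# revert the suspect commit on a local rpi-4.14.y checkout
git revert 6bf32cda46ebfbaf13da3c48a0a009adae925703

# cross-compile the kernel, modules and device tree blobs
make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- bcm2709_defconfig
make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- -j$(nproc) zImage modules dtbs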

@Wireheadbe
Contributor Author

I'll try to get a build environment set up to cross-compile - I don't want to ruin the SD card. In the meantime I've attached a pcap with a selection of packets from that device at a time when the issue occurred. Set Wireshark to absolute timing; the last 3 packets are candidates (20:33:24* local time).

packets.zip

@pelwell
Contributor

pelwell commented Oct 13, 2018

A quick warning: git bisect (and other operations) does not work well on merge commits - you end up on a pure upstream tree with no clue as to what has gone wrong. The best workflow I have found is to use git revert to work backwards from a point of failure until it starts to work. You could also use git cherry-pick to pull in commits from upstream to an earlier downstream commit, but I found the reversion model easier.
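
In other words, something along these lines rather than bisecting across the merge (commit IDs are placeholders):

git checkout rpi-4.14.y              # known-bad downstream state
git revert <newest-suspect-commit>   # undo one candidate
# rebuild, install and test; if still failing, revert the next candidate
git revert <next-suspect-commit>
# the first revert after which the failure disappears points at the culprit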

@Wireheadbe
Contributor Author

Yeah, I'm all new to this; currently struggling with the fact that git revert gives me a fatal: bad object upon running
git revert 6bf32cd
This is with a clone of the 4.14 branch of the repo as specified in https://www.raspberrypi.org/documentation/linux/kernel/building.md

@pelwell
Contributor

pelwell commented Oct 13, 2018

That's a message I haven't seen before - one that suggests corruption. Try running git gc to garbage collect and verify the repo.

@6by9
Contributor

6by9 commented Oct 13, 2018

Thanks Phil. Reality is I've never used it, but I know that it is there.

@Wireheadbe
Contributor Author

The default in the docs says depth=1... d'oh. Cloned it fully now, reverted that one commit, building :)
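
For anyone hitting the same "fatal: bad object": the documented clone is shallow (--depth=1), so older objects such as 6bf32cd simply aren't present locally. Either clone in full or deepen the existing checkout, roughly:

# option 1: a full clone of the branch
git clone --branch rpi-4.14.y https://github.com/raspberrypi/linux
# option 2: turn an existing shallow clone into a full one
git fetch --unshallow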

@DougieLawson

@Wireheadbe
Do you have an easy way to reproduce?
I'm assuming that this has been running without logging errors in the past, or has this never worked cleanly for you?

@6by9 I've seen this on three of my seventeen raspberries. They've been filling the kern.log since I updated to 4.14.74 #1149

That's on a B+, 2B & 1B.
The common thing with those three is that they're wired rather than WiFi.

All systems run dual-stack IPv4/IPv6 (radvd runs on my TP-Link router with a 6to4 tunnel).

@Wireheadbe
Contributor Author

Wireheadbe commented Oct 14, 2018

OK - so I pulled the full Linux tree, did a git revert 6bf32cda46ebfbaf13da3c48a0a009adae925703, compiled and installed... and everything seems quiet at the moment.

@6by9
Contributor

6by9 commented Oct 14, 2018

@Wireheadbe Thank you. We'll try to get a couple of other confirmations and report it upstream.

6by9 added a commit to 6by9/linux that referenced this issue Oct 15, 2018
This reverts commit 88078d9.

Various people have been reporting seeing "eth0: hw csum failure"
and callstacks dumped in the kernel log on 4.18, and since 4.14.71,
on both SMSC9514 and LAN7800 adapters.
This commit appears to be the reason, but potentially due to an
issue further down the stack. Revert whilst investigating the
trigger.

raspberrypi#2713
raspberrypi#2659
raspberrypi#2712

Signed-off-by: Dave Stevenson <[email protected]>
6by9 added a commit to 6by9/linux that referenced this issue Oct 15, 2018
This reverts commit 6bf32cd.

Various people have been reporting seeing "eth0: hw csum failure"
and callstacks dumped in the kernel log on 4.18, and since 4.14.71,
on both SMSC9514 and LAN7800 adapters.
This commit appears to be the reason, but potentially due to an
issue further down the stack. Revert whilst investigating the
trigger.

raspberrypi#2713
raspberrypi#2659
raspberrypi#2712

Signed-off-by: Dave Stevenson <[email protected]>
@6by9
Contributor

6by9 commented Oct 15, 2018

We are looking at reverting it in 4.14 to deal with the majority of users.

4.18 may retain the patch for now because it isn't the main kernel version and we still haven't managed to reproduce this. If those people hitting the issue can provide descriptions of their systems and what they are doing at the time (the simpler the better), then it would help.
The commit text points in the direction of issues when receiving fragmented IP packets, but the packet captures that have been provided so far don't appear to contain fragmented packets.
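
One way to push fragmented UDP at an affected Pi while watching its kernel log, should anyone want to test that theory (address and port are placeholders):

# send a UDP datagram larger than the MTU so it is fragmented on the wire
dd if=/dev/zero bs=4000 count=1 | nc -u -w1 <pi-address> 9999
# or force fragmented ICMP echo requests
ping -s 3000 <pi-address>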

@DougieLawson

I'm running rpi-update on my three affected (B+, 2B, 1B) machines right now.
I'll follow up with that rpi-update on the other fourteen later.

Can you build the 4.18 version for BRANCH=next rpi-update? I'll happily test that on all seventeen raspberries.

@pelwell
Contributor

pelwell commented Oct 15, 2018

@Wireheadbe rpi-update will overwrite the kernel, its modules, the Device Tree files+overlays, and the firmware.

@Wireheadbe
Contributor Author

Thanks @pelwell - much appreciated. I did a manual clean-up of old modules in /lib.

If any testing needs to be done in the future, let me know. For now, the latest rpi-update solves it, due to 6bf32cda46ebfbaf13da3c48a0a009adae925703 being reverted.

@6by9
Contributor

6by9 commented Oct 15, 2018

@Wireheadbe

no that patch didn't fix it.

Thanks for trying it. We're working blind here and it sounded plausible.

A thread appeared on the net-dev mailing list a couple of hours ago that implies the patch that was first implicated has caused issues on other platforms too. We're looking into it, but the implications are that there is something in checksum offload in the driver that is broken and this patch has exposed it. Trying to determine what may not be trivial.
https://marc.info/?l=linux-netdev&m=153961652520511&w=2

@Wireheadbe
Contributor Author

I can share a dump of some moments of traffic if you want - maybe there's a way to replay it and simulate it?

@pelwell
Contributor

pelwell commented Oct 15, 2018

Yes please - we should be able to fake (or change) the MAC address and get the replay to work.

@Wireheadbe
Contributor Author

@pelwell
capture.zip Here you go - it contains some traffic (over 802.1Q VLAN, via OpenVPN and the external network), some gratuitous ARP, and a mix of IPv4/6.

@6by9
Contributor

6by9 commented Oct 16, 2018

Thank you - now trying to replay it.

@6by9
Contributor

6by9 commented Oct 16, 2018

A simple replay shows nothing. Connected directly (without a switch) to a 3B or 3B+, using bittwist to replay the capture file, and nothing shows up in the kernel logs (4.14.76 with the revert reverted).

Following the suggestion that lots of UDP traffic causes issues likewise shows nothing using netcat, and that includes with fragmentation.

Looking at the discussion / patch on the sungem, that implicates FCS (Frame Check Sequence) checking and similar, but that all appears fine on lan78xx or smsc95xx.
It's lovely that there appears to be a requirement to run net-next to avoid regressions - https://marc.info/?l=linux-netdev&m=152949377629403&w=2

I'll implement some logging along the same lines as in https://marc.info/?l=linux-netdev&m=152945035611057&w=2 to dump out the buffers that fail checksum at the low level (if that is possible).
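
For reference, the rewrite-and-replay step mentioned above is roughly this (interface, file names and MAC are placeholders; bittwiste rewrites the destination MAC so the Pi accepts the frames):

# rewrite the destination MAC in the capture to the Pi's address
bittwiste -I capture.pcap -O capture-rewritten.pcap -T eth -d <pi-mac-address>
# replay the rewritten capture out of the test machine's wired interface
sudo bittwist -i eth0 capture-rewritten.pcap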

@6by9
Contributor

6by9 commented Oct 17, 2018

@Wireheadbe I've totally failed to get replaying your captures to show any issue :-(

Could you rebuild a kernel with some extra logging in please?

git remote add 6by9 https://github.com/6by9/linux
git fetch 6by9
git checkout -t -b 6by9_4.14.y-net 6by9/rpi-4.14.y-net

and then the normal build commands. You should get bumped up to 4.14.76 as well as my couple of patches. It may well take a while to build again.
When run, any packets that fail checksum validation within the driver should get logged in the kernel log.
(I don't know how good your git skills are. You could cherry-pick the top 2 commits from 6by9/rpi-4.14.y-net instead and rebuild just that, which should be substantially faster).
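
If going the cherry-pick route instead, roughly (with the same remote added as above; the SHAs are whatever the top two commits on that branch turn out to be):

# identify the two debug commits at the tip of the branch
git log --oneline -2 6by9/rpi-4.14.y-net
# apply them onto the existing 4.14 checkout, oldest first
git cherry-pick <older-sha> <newer-sha>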

@6by9
Contributor

6by9 commented Oct 22, 2018

The root cause has been traced upstream and a fix should ripple down fairly quickly, at which point we'll revert the reverts that were put in place.

Thanks for all your help.

@popcornmix
Collaborator

Latest rpi-update kernel removes the revert and makes use of the upstream fix.
It would be helpful if affected users can report that everything is still okay.

@DougieLawson

Can you build that on BRANCH=next rpi-update (4.19 kernel)?

@popcornmix
Collaborator

The previous 4.19 kernel already contained the upstream fix (it took a while to reach 4.14).
There is a 4.19.1 kernel currently being built which will be pushed soon, but I suspect its behaviour will be the same as 4.19.

@Wireheadbe
Contributor Author

Did an rpi-update - all stays quiet at the moment. Seems fixed. Sorry I was AFK for a bit - busy with work & family.

@suthernfriend

suthernfriend commented Dec 10, 2018

I'm getting a similar message on kernel 4.14.86 on a Pi 3B:

[  249.511572] eth0: hw csum failure
[  249.511584] CPU: 0 PID: 1852 Comm: mandb Tainted: G         C      4.14.86-1-ARCH #1
[  249.511586] Hardware name: BCM2835
[  249.511607] [<8010eda8>] (unwind_backtrace) from [<8010b878>] (show_stack+0x10/0x14)
[  249.511616] [<8010b878>] (show_stack) from [<80a8619c>] (dump_stack+0x9c/0xc8)
[  249.511628] [<80a8619c>] (dump_stack) from [<8097f808>] (__skb_checksum_complete+0xb4/0xb8)
[  249.511637] [<8097f808>] (__skb_checksum_complete) from [<809fddf0>] (tcp_v4_rcv+0x7d0/0xf08)
[  249.511646] [<809fddf0>] (tcp_v4_rcv) from [<809d42d0>] (ip_local_deliver_finish+0xd0/0x348)
[  249.511654] [<809d42d0>] (ip_local_deliver_finish) from [<809d4be4>] (ip_local_deliver+0x50/0xec)
[  249.511661] [<809d4be4>] (ip_local_deliver) from [<809d4ec0>] (ip_rcv+0x240/0x594)
[  249.511670] [<809d4ec0>] (ip_rcv) from [<8098a5b0>] (__netif_receive_skb_core+0x998/0xd04)
[  249.511678] [<8098a5b0>] (__netif_receive_skb_core) from [<8098ca34>] (process_backlog+0x94/0x144)
[  249.511686] [<8098ca34>] (process_backlog) from [<80990c2c>] (net_rx_action+0x168/0x448)
[  249.511694] [<80990c2c>] (net_rx_action) from [<8010157c>] (__do_softirq+0xd4/0x32c)
[  249.511701] [<8010157c>] (__do_softirq) from [<80134e6c>] (irq_exit+0x8c/0x148)
[  249.511708] [<80134e6c>] (irq_exit) from [<801852f4>] (__handle_domain_irq+0x58/0xb8)
[  249.511717] [<801852f4>] (__handle_domain_irq) from [<80aa0644>] (__irq_usr+0x44/0x60)

TCP connections keep randomly dying and general packet loss occurs. Usually the above is displayed when it happens.

However, I only get this on 1 of the dozens of Pis I have in use.

Is this related, or do I have a hardware problem? The device has been in use for 2 years now.

@popcornmix
Collaborator

However, I only get this on 1 of the dozens of Pis I have in use.

Can you swap the problematic Pi with another (but keep the same location, Ethernet cable, SD card etc.)?
Does the problem occur on the replacement Pi?

@suthernfriend

suthernfriend commented Dec 10, 2018

Hmm, just tested with the same display/image/cables but a different Pi (same revision etc.), and I don't have that issue. So it's probably hardware-related (and unrelated to this GitHub issue).

@JamesH65
Contributor

@6by9 This looks like the checksum issue, which I think has been fixed?

@JamesH65 added the "Waiting for internal comment" label on Jul 31, 2019
@6by9
Contributor

6by9 commented Jul 31, 2019

Yes, the upstream issue caused by 6bf32cd has been fixed upstream with ac65fd7

@JamesH65
Contributor

Closing this issue as questions answered/issue resolved.

@tomkcook

Could someone please comment on how ac65fd7 fixes this for the Raspberry Pi platform? As far as I can tell, it only changes the Mellanox mlx5 driver.

@pelwell
Contributor

pelwell commented Apr 19, 2021

I don't think that commit can have fixed it, but @6by9 may be able to explain.

@6by9
Contributor

6by9 commented Apr 19, 2021

Looks to be an incorrect hash.
d55bef5 looks more likely, although that appears to have already been cherry-picked back to 4.14.

My memory says all of the issues were in mainline and not in the specific LAN drivers, so they should have been cherry-picked back to the stable branches, assuming they got the relevant "Fixes:" tags.
Our rpi-4.14.y has been dead for quite a while, so it's possible that there are some fixes in the mainline stable branches which aren't in our branch. Have you rebased all mainline patches into your branch? I don't fancy going hunting for it.

@pelwell
Contributor

pelwell commented Apr 19, 2021

Yes, that looks like the patch. In the absence of a new Issue we attribute the question to curiosity.

@tomkcook

Sorry, I'm working on a system that's stuck for the time being on 4.14.77 + patches and is seeing hw csum failure stack traces in dmesg, so I'm trying to figure out what to add as a patch to my build system. Thanks.

@6by9
Contributor

6by9 commented Apr 19, 2021

The LTS release is now at 4.14.231. rpi-4.14.y went up to 4.14.114 (April 2019).
This is obviously a network-connected system, seeing as you're worrying about errors from the network stack.
To be stuck on a kernel from October 2018 seems a little risky, to say the least.

The other solution to these errors is to revert 6bf32cd, which was the trigger for seeing them.
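
If the build system consumes patch files, one way to carry that revert (a sketch, assuming a local checkout at the matching 4.14 version):

# create a revert commit for the trigger, then export it as a patch file
git revert 6bf32cd
git format-patch -1 HEAD -o patches/
# add the resulting 0001-Revert-*.patch to the build system's patch series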
