4.14.74-v7+ 3b+ VLAN udp IPv4/6 - hw csum failure #2713
#2659 seems similar - openvpn and iptables involved as well |
Looking at the trace, it does look like it's coming in via IPv4 instead. |
@Wireheadbe Can you try reverting to an earlier kernel and retesting? We haven't made any changes since #2458 regarding the lan78xx driver, but it's possible something in the core has been backported to stable and is causing grief. |
It's a pretty recent install. The only thing I added yesterday was DHCPv6 alongside radvd. This morning I connected via OpenVPN and noticed them. That's the only change I had in terms of functionality. I was getting them on an earlier kernel, so I actually did an rpi-update; the issue remained. Disabling offloading does make them happen less often. |
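(For reference: the "offloading disable" mentioned above is normally done with ethtool. A minimal sketch, assuming the interface is eth0 and a stock ethtool; the exact feature names on a given kernel can be listed with `ethtool -k eth0`.)

```bash
# Turn off RX/TX checksum offload on eth0 (assumed interface name)
sudo ethtool -K eth0 rx off tx off
# Confirm the current checksum-offload state
ethtool -k eth0 | grep -i checksum
```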
went back to ec9d84e - seems to be gone at the moment? I'll keep an eye on it before confirming 100% |
Did some speedtests / ipv4 / ipv6. VPN traffic etc etc.. still quiet at the moment on ec9d84e - 4.14.54-v7+ SMP Mon Jul 9 16:41:01 BST 2018 armv7l GNU/Linux |
Thank you - that's a brilliant data point. If you find that is stable, please can you try a couple of further bisections? Trying a5b781c (Sept 5) and then either e880e62 (Sept 18) if that is good, or 911147a (Aug 16) if not would be wonderful. |
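(For anyone following along: rpi-update can install a specific firmware/kernel revision when given a commit hash, which is how the bisection between the releases suggested above is done. A short sketch using the hashes from this thread.)

```bash
# Install a specific firmware/kernel revision by hash, then reboot and retest
sudo rpi-update ec9d84e    # 4.14.54, reported good above
sudo reboot
# repeat with a5b781c, e880e62 or 911147a as suggested, noting good/bad for each
```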
tried every single version from that - seems to happen from c919d63 |
Thank you. We can have a look at what changed with 4.14.71. |
Hmm, 4.14.70 to 4.14.71 is 128 commits of which around 35 appear to be network related. Can you give some more details of your setup? OpenVPN is running on the Pi, with the tunnel coming in on eth0.30? We need some way to reproduce this unless you fancy bisecting the kernel commits (not as bad as it sounds, but a tad involved). |
I don't even have to run the tunnel. I'll see if i can get around to pinpointing it with wireshark. Bisecting the commits - please elaborate. Maybe this is the easiest way, although being somewhat repetitive ;) |
I think I have it narrowed down to packets sent from one device - were there any changes in how IGMPv2, STP or ICMPv6 packets are handled? Those are the big three red flags I have. |
See https://www.raspberrypi.org/documentation/linux/kernel/building.md for details of how to compile your own kernel. git is the source control system used for the kernel source. It includes a mechanism for bisecting the tree to close in on a regression as quickly as possible. See https://git-scm.com/docs/git-bisect for a description. |
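(A minimal sketch of that bisect flow against the rpi-4.14.y branch; the good/bad commits below are placeholders to be replaced with the 4.14.70/4.14.71 release commits, and note the caveat about merge commits a couple of comments further down.)

```bash
git clone https://github.com/raspberrypi/linux.git
cd linux && git checkout rpi-4.14.y
git bisect start
git bisect bad  <first-bad-commit>     # placeholder: e.g. the 4.14.71 release commit
git bisect good <last-good-commit>     # placeholder: e.g. the 4.14.70 release commit
# build and test the revision git checks out, then report the result:
git bisect good        # or: git bisect bad
# repeat until git names the first bad commit, then clean up:
git bisect reset
```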
You can see the commit history from that time for yourself via https://github.com/raspberrypi/linux/commits/rpi-4.14.y?before=1244bbb3e92135d247e2dddfa6fe5e3e171a9635+35. 6bf32cd looks like it might have the potential for changing the behaviour on checksums. If you were to build your own kernel then it'd be worth a quick test having done a revert of that commit. |
I'll try to get a build environment set up to cross compile - don't want to ruin the SD card. In the meantime - I've attached a pcap with a selection of packets from that device at a time where the issue occurred. Set wireshark to absolute timing. The last 3 packets are candidates. 20:33:24* local time |
A quick warning: git bisect (and other operations) does not work well on merge commits - you end up on a pure upstream tree with no clue as to what has gone wrong. The best workflow I have found is to use git revert to work backwards from a point of failure until it starts to work. You could also use git cherry-pick to pull in commits from upstream to an earlier downstream commit, but I found the reversion model easier. |
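(A sketch of that revert-based workflow, assuming a full, non-shallow clone; the commit names are placeholders.)

```bash
# Starting from the known-bad tip, revert suspect commits one at a time,
# rebuilding and retesting after each revert until the problem disappears.
git checkout rpi-4.14.y
git revert --no-edit <suspect-commit>      # e.g. 6bf32cd from the list above
# rebuild, test; if the failure is still there, revert the next candidate:
git revert --no-edit <next-suspect-commit>
```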
Yeah, all new to this, currently struggling with the fact that git revert gives me a fatal: bad object upon running |
That's a message I haven't seen before - one that suggests corruption. Try running git gc to garbage collect and verify the repo. |
Thanks Phil. Reality is I've never used it, but I know that it is there. |
default doc says depth=1.. d'oh.. cloned it fully now, reverted that one commit, building :) |
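(For anyone else hitting the same "fatal: bad object": the clone in the kernel-building docs is shallow (--depth=1), so older commits aren't present locally. Two usual fixes, sketched below.)

```bash
# Option 1: turn an existing shallow clone into a full one
git fetch --unshallow

# Option 2: clone the full history from the start (a much larger download)
git clone https://github.com/raspberrypi/linux.git
```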
@6by9 I've seen this on three of my seventeen Raspberries. They've been filling the kern.log since I updated to 4.14.74 #1149. That's on a B+, 2B & 1B. All systems run a dual stack IPv4/IPv6 (radvd runs on my TP-Link router with a 6to4 tunnel). |
OK- so I pulled the full linux tree, did a |
@Wireheadbe Thank you. We'll try to get a couple of other confirmations and report it upstream. |
This reverts commit 88078d9. Various people have been reporting seeing "eth0: hw csum failure" and callstacks dumped in the kernel log on 4.18, and since 4.14.71, on both SMSC9514 and LAN7800 adapters. This commit appears to be the reason, but potentially due to an issue further down the stack. Revert whilst investigating the trigger. raspberrypi#2713 raspberrypi#2659 raspberrypi#2712 Signed-off-by: Dave Stevenson <[email protected]>
This reverts commit 6bf32cd. Various people have been reporting seeing "eth0: hw csum failure" and callstacks dumped in the kernel log on 4.18, and since 4.14.71, on both SMSC9514 and LAN7800 adapters. This commit appears to be the reason, but potentially due to an issue further down the stack. Revert whilst investigating the trigger. raspberrypi#2713 raspberrypi#2659 raspberrypi#2712 Signed-off-by: Dave Stevenson <[email protected]>
We are looking at reverting it in 4.14 to deal with the majority of users. 4.18 may retain the patch for now because it isn't the main kernel version and we still haven't managed to reproduce this. If those people hitting the issue can provide descriptions of their systems and what they are doing at the time (the simpler the better), then it would help. |
I'm running … Can you build the 4.18 version as …? |
@Wireheadbe rpi-update will overwrite the kernel, its modules, the Device Tree files+overlays, and the firmware. |
thanks @pelwell - much appreciated. Did a manual cleaning of old modules in /lib. If any testing needs to be done in the future - let me know. For now, the latest rpi-update solves it; due to |
Thanks for trying it. We're working blind here and it sounded plausible. A thread appeared on the net-dev mailing list a couple of hours ago that implies the patch that was first implicated has caused issues on other platforms too. We're looking in to it, but implications are there is something in checksum offload in the driver that is broken and this patch has exposed it. Trying to determine what may not be trivial. |
I can share a dump of some moments of traffic if you want - maybe there's a way to replay it and simulate it? |
Yes please - we should be able to fake (or change) the MAC address and get the replay to work. |
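(One possible way to rewrite the MACs before replaying - not necessarily what was used here - is tcprewrite/tcpreplay from the tcpreplay suite; file names, interface and MAC addresses below are placeholders.)

```bash
# Rewrite source/destination MACs in the capture so the replayed frames are
# addressed to the test Pi's interface (all addresses are placeholders)
tcprewrite --infile=capture.pcap --outfile=replay.pcap \
           --enet-dmac=b8:27:eb:00:00:01 --enet-smac=02:00:00:00:00:02
# Replay the rewritten capture out of eth0 towards the Pi under test
sudo tcpreplay --intf1=eth0 replay.pcap
```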
@pelwell |
Thank you - now trying to replay it. |
A simple replay shows nothing. Connected directly (no switch) to a 3B or 3B+ and used bittwist to replay the capture file, and nothing shows up in the kernel logs (4.14.76 with the revert reverted). Running with the suggestion of lots of UDP causing issues likewise shows nothing using netcat, and that includes with fragmentation. Looking at the discussion / patch on the sungem, that implicates FCS (Frame Check Sequence) checking and similar, but that all appears fine on lan78xx or smsc95xx. I'll implement some logging along the same lines as in https://marc.info/?l=linux-netdev&m=152945035611057&w=2 to dump out the buffers that fail checksum at the low level (if that is possible). |
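(For completeness, a rough sketch of the kind of UDP load test described above; host addresses, port and sizes are arbitrary, and the fragmentation test uses ICMP rather than UDP simply because it's the easiest way to guarantee fragmented packets.)

```bash
# On the Pi under test (OpenBSD-style netcat; traditional nc wants `-l -p 9999`)
nc -u -l 9999 > /dev/null

# From another host: a sustained stream of UDP datagrams at the Pi
# (placeholder IP; Ctrl-C when done)
dd if=/dev/urandom bs=1200 count=100000 | nc -u 192.168.1.50 9999

# Fragmented traffic: ICMP echo requests larger than the 1500-byte MTU
ping -s 2000 -c 1000 192.168.1.50
```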
@Wireheadbe I've totally failed to get replaying your captures to show any issue :-( Could you rebuild a kernel with some extra logging in please?
and then the normal build commands. You should get bumped up to 4.14.76 as well as my couple of patches. It may well take a while to build again. |
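("The normal build commands" presumably means the cross-compile sequence from the kernel-building docs linked earlier. A sketch for a Pi 2/3 (kernel7) build, assuming the arm-linux-gnueabihf- toolchain is already installed.)

```bash
cd linux
# Configure and cross-compile for Pi 2/3 (adjust the defconfig for other models)
make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- bcm2709_defconfig
make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- -j"$(nproc)" zImage modules dtbs
# then install the modules and copy zImage/dtbs/overlays to the Pi's boot
# partition as described in the raspberrypi.org kernel-building documentation
```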
The root cause has been traced upstream and a fix should ripple down fairly quickly, at which point we'll revert the reverts that were put in place. Thanks for all your help. |
Latest rpi-update kernel removes the revert and makes use of the upstream fix. |
Can you build that into the BRANCH=next rpi-update (4.19) kernel? |
Previous 4.19 kernel already contained the upstream fix (it took a while to reach 4.14). |
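(For reference, switching a Pi onto the next-branch kernel is just a matter of setting BRANCH when running rpi-update.)

```bash
# Install the "next" branch firmware/kernel (4.19 at the time of this thread)
sudo BRANCH=next rpi-update
sudo reboot
```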
Did an rpi-update - all stays quiet at the moment.. Seems fixed. Sorry I was afk for a bit.. busy with work & family.. |
I'm getting a similar message on kernel 4.14.86 on a Pi 3B
TCP connections keep randomly dying and general packet loss occurs. Usually the above is displayed when it happens. However, I only get this on 1 of the dozens of Pis I have in use. Is this related, or do I have a hardware problem? The device has been in use for 2 years now. |
Can you swap the problematic Pi with another (but keep the same location, Ethernet cable, SD card, etc.)? |
Hmm. Just tested with the same display/image/cables but a different Pi (same revision etc.), and I don't have that issue. So it's probably hardware related (and unrelated to this GitHub issue). |
@6by9 This looks like the checksum issue, which I think has been fixed? |
Closing this issue as questions answered/issue resolved. |
Could someone please comment on how ac65fd7 fixes this for the raspberry pi platform? As far as I can tell, it only changes the mellanox mlx5 driver. |
I don't think that commit can have fixed it, but @6by9 may be able to explain. |
Looks to be an incorrect hash. My memory says all of the issues were in mainline and not the specific LAN drivers, so should have been cherry-picked back to the stable branches assuming they got the relevant "Fixes:" tags. |
Yes, that looks like the patch. In the absence of a new Issue we attribute the question to curiosity. |
Sorry, I'm working on a system that's stuck for the time being on 4.14.77 + patches, and I'm seeing the same hw csum failure errors. |
The LTS release is now at 4.14.231. rpi-4.14.y went up to 4.14.114 (April 2019). The other solution to those errors is to revert 6bf32cd, which was the trigger for these issues. |
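(A minimal sketch of that local revert for a tree pinned at 4.14.77; it assumes a full checkout of the corresponding rpi-4.14.y sources.)

```bash
# Revert the commit that triggers the hw csum failures, then rebuild as usual
git revert --no-edit 6bf32cd
```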
Hi All - seems like issue #2458 is popping up again.
Setup is eth0 on the local LAN, with eth0.30 carrying tagged data. Running IPv6 as well, both over eth0.30 and over tun0 for OpenVPN.
Having OpenVPN traffic coming in over eth0.30 causes lots of these errors:
Seems like issue 2458 is popping up again or wasn't fixed completely. Running
Still results in these errors being generated.