Skip to content

unregister_netdevice: waiting for tun11 to become free #6

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
cron2 opened this issue May 14, 2025 · 7 comments
Open

unregister_netdevice: waiting for tun11 to become free #6

cron2 opened this issue May 14, 2025 · 7 comments

Comments

@cron2
Copy link

cron2 commented May 14, 2025

Ubuntu 20.04 running some backported version of kernel DCO (20250505 + 4 patches) to test OpenVPN server side with DCO.

One of the servers is a "point to point tls-server" instance (11), which is known to upset things :-) - today it upset something reference counting, it seems. Every 10 seconds, I get

May 14 18:58:29 ubuntu2004 kernel: [630402.815743] unregister_netdevice: waiting for tun11 to become free. Usage count = 1
May 14 18:58:39 ubuntu2004 kernel: [630412.895476] unregister_netdevice: waiting for tun11 to become free. Usage count = 1

... even if that process isn't actually doing anything right now (waiting for an incoming connection).

There is an interface of this name, and it looks healthy (well, it's down, because no peer is connected, I think)

gert@ubuntu2004:~/openvpn.dco$ ip link show dev tun11
1039: tun11: <NO-CARRIER,POINTOPOINT,NOARP,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/none 
gert@ubuntu2004:~/openvpn.dco$ ip addr show dev tun11
1039: tun11: <NO-CARRIER,POINTOPOINT,NOARP,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/none 
    inet 10.220.11.1/24 scope global tun11
       valid_lft forever preferred_lft forever
    inet6 fd00:abcd:220:11::1/64 scope global 
       valid_lft forever preferred_lft forever

the spamming started at yesterday morning's test run - here are the last entries from the relevant process, and the first "unregister_netdevice" entries in syslog

May 13 07:26:28 ubuntu2004 tun-udp-p2p-tls-sha256[178563]: Data Channel: cipher 'AES-256-GCM', peer-id: 0
May 13 07:26:28 ubuntu2004 tun-udp-p2p-tls-sha256[178563]: Timers: ping 10, ping-restart 30
May 13 07:26:53 ubuntu2004 tun-udp-p2p-tls-sha256[178563]: read UDPv6 [ECONNREFUSED]: Connection refused (fd=5,code=111)
May 13 07:26:53 ubuntu2004 tun-udp-p2p-tls-sha256[178563]: UDPv6 READ [0] from [AF_INET6][undef]:32146: DATA UNDEF len=-1
May 13 07:27:03 ubuntu2004 tun-udp-p2p-tls-sha256[178563]: read UDPv6 [ECONNREFUSED]: Connection refused (fd=5,code=111)
May 13 07:27:03 ubuntu2004 tun-udp-p2p-tls-sha256[178563]: UDPv6 READ [0] from [AF_INET6][undef]:32146: DATA UNDEF len=-1
May 13 07:27:14 ubuntu2004 tun-udp-p2p-tls-sha256[178563]: dco_do_read
May 13 07:27:14 ubuntu2004 tun-udp-p2p-tls-sha256[178563]: ovpn-dco: received CMD_PEER_DEL_NTF, ifindex: 911, peer-id 7762030, reason: 2
May 13 07:27:14 ubuntu2004 tun-udp-p2p-tls-sha256[178563]: process_incoming_dco: received peer expired notification of for peer-id 7762030
May 13 07:27:14 ubuntu2004 tun-udp-p2p-tls-sha256[178563]: [freebsd-14-amd64] Inactivity timeout (--ping-restart), restarting
May 13 07:27:14 ubuntu2004 tun-udp-p2p-tls-sha256[178563]: register signal: SIGUSR1 (ping-restart)
May 13 07:27:14 ubuntu2004 tun-udp-p2p-tls-sha256[178563]: TCP/UDP: Closing socket
May 13 07:27:14 ubuntu2004 tun-udp-p2p-tls-sha256[178563]: Closing ovpn-dco interface
May 13 07:27:14 ubuntu2004 tun-udp-p2p-tls-sha256[178563]: net_addr_v4_del: 10.220.11.1 dev tun11
May 13 07:27:14 ubuntu2004 tun-udp-p2p-tls-sha256[178563]: sitnl_send: checking for received messages
May 13 07:27:14 ubuntu2004 tun-udp-p2p-tls-sha256[178563]: sitnl_send: rtnl: received 36 bytes
May 13 07:27:14 ubuntu2004 tun-udp-p2p-tls-sha256[178563]: net_addr_v6_del: fd00:abcd:220:11::1/64 dev tun11
May 13 07:27:14 ubuntu2004 tun-udp-p2p-tls-sha256[178563]: sitnl_send: checking 
for received messages
May 13 07:27:14 ubuntu2004 tun-udp-p2p-tls-sha256[178563]: sitnl_send: rtnl: received 36 bytes
May 13 07:27:14 ubuntu2004 tun-udp-p2p-tls-sha256[178563]: close_tun_dco
May 13 07:27:14 ubuntu2004 tun-udp-p2p-tls-sha256[178563]: net_iface_del: delete tun11
May 13 07:27:24 ubuntu2004 kernel: [502541.937262] unregister_netdevice: waiting for tun11 to become free. Usage count = 1
May 13 07:27:34 ubuntu2004 kernel: [502552.016989] unregister_netdevice: waiting for tun11 to become free. Usage count = 1
May 13 07:27:44 ubuntu2004 kernel: [502562.096668] unregister_netdevice: waiting for tun11 to become free. Usage count = 1
May 13 07:27:54 ubuntu2004 kernel: [502572.176302] unregister_netdevice: waiting for tun11 to become free. Usage count = 1
May 13 07:28:04 ubuntu2004 kernel: [502582.256023] unregister_netdevice: waiting for tun11 to become free. Usage count = 1
May 13 07:28:14 ubuntu2004 kernel: [502592.335700] unregister_netdevice: waiting for tun11 to become free. Usage count = 1

And it seems that this also zombified the openvpn process (and we now have two of them)

root      178563  0.0  0.1   5064  3364 ?        Ds   May13   0:37 ./bin/openvpn --daemon tun-udp-p2p-tls-sha256 --cd tun-udp-p2p-tls-sha256 --config server.conf --writepid ../openvpn-tun-udp-p2p-tls-sha256.pid
root      213167  0.0  0.3  10260  7812 ?        Ss   07:02   0:02 ./bin/openvpn --daemon tun-udp-p2p-tls-sha256 --cd tun-udp-p2p-tls-sha256 --config server.conf --writepid ../openvpn-tun-udp-p2p-tls-sha256.pid

ending the newer process will remove the "visible" tun11 interface, but the syslog spam still goes on, and the other process also is still stuck:

gert@ubuntu2004:~/openvpn.dco$ ip link show dev tun11
Device "tun11" does not exist.
Message from syslogd@ubuntu2004 at May 14 19:14:06 ...
 kernel:[631340.225770] unregister_netdevice: waiting for tun11 to become free. Usage count = 1

This does not happen on every run of the p2p tls server, and the machine did not crash, but is not exactly healthy either. I leave it running in case you want me to gather more info.

@ordex
Copy link
Member

ordex commented May 14, 2025

Thanks for the report.
It's quite hard (impossible?) to gather any extra info once the spamming starts.
Either I have to reproduce this somewhat deterministically, or I have to fill ovpn with debug messages so that we can understand what is happening when the spam shows up.

Basically the "spam" starts when userspace requests the interface to be deleted, but its refcounter hasn't dropped to zero yet.
So that message is just telling you that the cleanup routine is still waiting for this "user" to drop the reference.
Normally this will never happen, because it's likely that we ended up there due to some bug.

Does it happen fairly often?
When it happens, it is always after a peer timeout?

@cron2
Copy link
Author

cron2 commented May 14, 2025

I have seen it once, so can't say... (usually the machine crashed hard somewhere in the middle of the tests, not surviving so much interface+socket creation/deletion activity ;-) - but the crashes are gone with the "skb" patch).

The one time I've seen it, userland logs are above - so a --ping timeout reported by the kernel.

Who would be holding a reference to that interface anyway? It can only be userland and ovpn.ko, no?

@ordex
Copy link
Member

ordex commented May 14, 2025

it's ovpn holding it, that due to some bug did not properly release it.

@ordex
Copy link
Member

ordex commented May 14, 2025

However, the "skb" patch was still deemed somewhat racy, therefore it is being reworked.
We can't rule out that what you have seen was exactly due to the "raciness" of that patch.

Ideally you could drop that one and apply these two:

commit 11ce184a1e61252a91f46149a9d71cb66309b754
Author: Antonio Quartulli <[email protected]>
Date:   Tue May 13 16:43:59 2025 +0200

    ovpn: properly deconfigure UDP-tunnel
commit 3336cc1f46bc4d67babe722af0c6bdf88c991901
Author: Antonio Quartulli <[email protected]>
Date:   Wed Apr 30 02:26:49 2025 +0200

    ovpn: ensure sk is still valid during cleanup

both are in the development branch

@cron2
Copy link
Author

cron2 commented May 15, 2025

Testing with DCO version: ovpn-net-next/development-sources-6.15.0-rc5-0f03aa5 now... so far, haven't been able to reproduce... but tests are running.

@ordex
Copy link
Member

ordex commented May 20, 2025

any luck reproducing this issue with the latest code?

@cron2
Copy link
Author

cron2 commented May 21, 2025

With DCO version: ovpn-net-next/development-sources-6.15.0-rc5-0f03aa5 I could not reproduce this or the other kernel panics anymore. Lots of clients connecting and disconnecting, p2p interfaces being opened and closed, etc.

Will re-test with the next development snapshot...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants