Workspacekit and seccomp-notify #3019

csweichel · 2021-01-26T10:31:41Z

This PR fixes #2845, #2123.
To make this work, this PR does some foundational work on the workspace runtime:

it separates the user-namespace setup from supervisor: the rings start a user-namespaced workspace have no dependency to supervisor per-se. We just added them there because it was convenient. Now that the rings require a CGO dependency (libseccomp and if we don't build with Go 1.16 also libcap) it made even more sense to separate supervisor and the rings. Now we have workspacekit which sets things up inside the workspace container, and supervisor who just resumes it's "init task" duty.
introduce seccomp-notify (added in Linux 5.0): workspacekit ring2 now loads its own seccomp filter that makes use of seccomp notify. It then passes the FD for the notifications back to ring1 (which is not subject to that seccomp filter). Ring1 handles the following syscalls:
- mount for intercepting proc mounts
- umount/umount2 to prevent unmounting proc masks
- bind in preparation for better workspace port notifications
move proc mounts using open_tree/move_mount (added in Linux 5.2), not through mount propagation: when a workspace wants to mount proc, we prepare that proc mount off-site (i.e. mount proc and add masks) and move it into place. Prior to this PR, we could only move that new mount into ring2 before it did its pivot_root. After that operation there's no connection between the ring2 process' mount namespace and ring1 anymore, hence we cannot move proc using mount propagation alone. Instead, we now use open_tree and move_mount which were added in Linux 5.2 for just that purpose.
prevent proc mask unmounts: when the only proc mount in a workspace was that of ring2 (living in the ring1 mount ns), the mount namespace in ring2 would lock the masks. Because ring2's mount NS is a child of ring1, anyone inside ring2 could not umount something from ring1. Now that the proc mount lives in the actual child mount namespace that's no longer the case. E.g. when a runc process inside the workspace container calls mount, we'd end up with the proc and mask mounts inside that runc process' mount namespace. Hence any other process in that namespace could just umount the masks. To prevent that, we seccomp-notify umount and umount2 calls - and prohibit open_tree/move_mount inside the workspace. Our seccomp handler can then decide if the umount should be allowed or not.
make ws-daemon nsenter properly: prior to this PR ws-daemon would call nsenter ... directly. Now, ws-daemon ships with its own little helper called nsinsider who can act on ws-daemon's behalf in another workspace. To this end we use runc's nsenter.

How to test

start a workspace with feature preview enabled
try to umount a proc mask, e.g. sudo umount /proc/kcore
start a Docker container (docker run --rm -it --name foo alpine:latest) and run cat /proc/1/cmdline. Notice it's the correct command line.
In that Docker container run mount | grep proc. Notice the proc masks. Try to umount a proc mask - it should fail.
Run docker exec -it foo sh - you should have a shell in the Docker container.
Stop the Docker container and notice that it does indeed stop. Call mount | grep proc in the workspace container. There should be no trace of the Docker container's mounts (rootfs or proc).

Future Work

docker exec still doesn't work and it's not quite clear why. I've fixed an issue in the runc-facade that might have broken things. All my attempts to strace the issue have been futile so far. The issue in error: OCI runtime exec - when running a docker command with healthcheck #2845 still stands.
umount("/some/proc/mount") - at the moment you cannot umount any proc mount, except for letting the kernel lazy-umount it when a mount namespace is closed. That's because we need to atomically umount the masks and proc. This PR already contains infrastructure for open_tree/move_mount'ing proc mounts from within a workspace using ws-daemon. That code does not work yet however.
bind notifications: again, we have preparations for getting bind syscall notifications, but fail to read the arguments of that syscall. Once we've figured that out, we can pass those notifications on to supervisor who then doesn't have to poll /net/proc/tcp anymore.
we're not making the appropriate calls to NotifIDValid to prevent time-of-check-time-of-use attacks. That function does not work reliably for some reason and more investigation is needed.

Notes and Caveats

this PR introduces a dependency to Linux 5.2. Prior to this PR we had no hard kernel version dependency. Ubuntu 18.04 supports 5.4, as do GKE and EKS. We must clarify what the requirements are in [self-hosted] Platform Support Matrix #2657
this PR uses a fork of libseccomp-go which is currently being upstreamed. Once that's happened, we should move back to the mainline.

To keep supervisor free from CGO e.g. libcap or libseccomp

svenefftinge

I have tested the changes. Only skimmed over the code, though.

geropl · 2021-01-29T15:44:51Z

components/workspacekit/cmd/rings.go

+			log.WithError(err).Error("cannot get parent socket fd")
+			failed = true
+			return
+		}


Why not defer conf.Close() here? Maybe in addition, as safeguard?

geropl · 2021-01-29T15:46:46Z

components/workspacekit/cmd/rings.go

+			failed = true
+			return
+		}
+		connf, err := conn.File()


I don't fully understand why conn - a connection build on an Unix socket that is basically a special kernel file (or not...?) - still works after pivotRoot :thinking_face:

because we already have the FD open. If we wanted to connect to the socket post pivot_root that wouldn't work. But because we've done that beforehand, we're fine :)

geropl · 2021-01-29T15:53:38Z

components/ws-manager/pkg/manager/create.go

@@ -432,7 +432,7 @@ func (m *Manager) createDefiniteWorkspacePod(startContext *startWorkspaceContext
 						MountPropagation: &mountPropagation,
 					},
 				)
-				pod.Spec.Containers[i].Command = []string{pod.Spec.Containers[i].Command[0], "ring0"}
+				pod.Spec.Containers[i].Command = []string{"/.supervisor/workspacekit", "ring0"}


Is this for all types of workspaces from the moment we deploy?

All types that use user-namespaces. Basically we're tying user namespaces to registry facade:

registry_facade FF can run by itself

user_namespace FF mandates the registry_facade FF

geropl · 2021-01-29T15:56:32Z

components/workspacekit/cmd/rings.go

@@ -261,21 +267,21 @@ var ring1Cmd = &cobra.Command{
 			}
 		}



In general the (very) methods in this file would benefit from a few, short comments to give some structure. In this case maybe:
// starting ring2

I've added a few comments where I think they'd help.

In general I like to avoid comments that just say what the code is doing - that should be readable from the code and tends to get outdated quickly. Commenting how or why we're doing it, that's super valuable.

geropl · 2021-01-29T15:57:05Z

components/workspacekit/cmd/rings.go

@@ -288,7 +294,7 @@ var ring1Cmd = &cobra.Command{
 		sigc := sigproxy.ForwardAllSignals(context.Background(), cmd.Process.Pid)
 		defer sigproxysignal.StopCatch(sigc)



// mount /proc into ring2

geropl · 2021-01-29T15:57:46Z

components/workspacekit/cmd/rings.go

@@ -305,17 +311,97 @@ var ring1Cmd = &cobra.Command{
 			return
 		}



// setup connection ring1<->ring2

geropl · 2021-01-29T15:58:24Z

components/workspacekit/cmd/rings.go

+			failed = true
+			return
+		}
+


// trigger ring2 to issue seccomp.LoadFilter(...)

geropl

Awesome, really mindblowing to be able to docker exec in Gitpod... 🚀

(comments are nits/questions)

csweichel requested review from svenefftinge, corneliusludmann and geropl January 26, 2021 10:31

csweichel force-pushed the cw/workspacekit branch 5 times, most recently from 8955ba7 to 77a5c09 Compare January 26, 2021 17:22

csweichel marked this pull request as ready for review January 28, 2021 08:37

csweichel mentioned this pull request Jan 29, 2021

[docker] user-namespaced Docker exec doesn't work #2123

Closed

csweichel linked an issue Jan 29, 2021 that may be closed by this pull request

[docker] user-namespaced Docker exec doesn't work #2123

Closed

Christian Weichel added 3 commits January 29, 2021 08:23

[workspacekit] Move all user namespace setup stuff to its own component

958a4ee

To keep supervisor free from CGO e.g. libcap or libseccomp

[dev-image] Add libseccomp 2.5.1 for building workspacekit

1a79228

[workspacekit] Support mount proc in a workspace using seccomp-notify

1835532

csweichel force-pushed the cw/workspacekit branch from 928ca4a to fd9c10c Compare January 29, 2021 08:23

svenefftinge approved these changes Jan 29, 2021

View reviewed changes

geropl reviewed Jan 29, 2021

View reviewed changes

components/workspacekit/cmd/rings.go

@@ -305,17 +311,97 @@ var ring1Cmd = &cobra.Command{

return

}

Copy link

Member

geropl Jan 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// setup connection ring1<->ring2

geropl reviewed Jan 29, 2021

View reviewed changes

components/workspacekit/cmd/rings.go

failed = true

return

}

Copy link

Member

geropl Jan 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// trigger ring2 to issue seccomp.LoadFilter(...)

geropl approved these changes Jan 29, 2021

View reviewed changes

Christian Weichel added 5 commits January 29, 2021 16:38

[ws-daemon] Support proc mounts using open_tree/move_mount

0e3610b

[ws-daemon] Support umount proc

34ffe22

[runc-facade] Don't replace proc mount

d12438b

[docker-up] Enable proper verbose logging

b7138a1

[seccomp-profile] Allow unshare(0) to make nsexec work

738bc4a

csweichel force-pushed the cw/workspacekit branch from fd9c10c to 738bc4a Compare January 29, 2021 16:39

csweichel merged commit 13a5f8e into master Jan 29, 2021

csweichel deleted the cw/workspacekit branch January 29, 2021 16:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Workspacekit and seccomp-notify #3019

Workspacekit and seccomp-notify #3019

Uh oh!

csweichel commented Jan 26, 2021 •

edited

Loading

Uh oh!

svenefftinge left a comment

Uh oh!

geropl Jan 29, 2021

Uh oh!

geropl Jan 29, 2021

Uh oh!

csweichel Jan 29, 2021

Uh oh!

geropl Jan 29, 2021

Uh oh!

csweichel Jan 29, 2021

Uh oh!

geropl Jan 29, 2021

Uh oh!

csweichel Jan 29, 2021

Uh oh!

geropl Jan 29, 2021

Uh oh!

geropl Jan 29, 2021

Uh oh!

geropl Jan 29, 2021

Uh oh!

geropl left a comment •

edited

Loading

Uh oh!

Uh oh!

		@@ -288,7 +294,7 @@ var ring1Cmd = &cobra.Command{
		sigc := sigproxy.ForwardAllSignals(context.Background(), cmd.Process.Pid)
		defer sigproxysignal.StopCatch(sigc)

Workspacekit and seccomp-notify #3019

Workspacekit and seccomp-notify #3019

Uh oh!

Conversation

csweichel commented Jan 26, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

How to test

Future Work

Notes and Caveats

Uh oh!

svenefftinge left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

geropl left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

csweichel commented Jan 26, 2021 •

edited

Loading

geropl left a comment •

edited

Loading