Workspacekit and seccomp-notify #3019
8955ba7 to 77a5c09
To keep supervisor free from CGO e.g. libcap or libseccomp
928ca4a to fd9c10c
I have tested the changes. Only skimmed over the code, though.
	log.WithError(err).Error("cannot get parent socket fd")
	failed = true
	return
}
Why not `defer conf.Close()` here? Maybe in addition, as a safeguard?
	failed = true
	return
}
connf, err := conn.File()
I don't fully understand why `conn` - a connection built on a Unix socket that is basically a special kernel file (or not...?) - still works after `pivotRoot` :thinking_face:
Because we already have the FD open. If we wanted to connect to the socket post `pivot_root`, that wouldn't work. But because we've done that beforehand, we're fine :)
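The point about already-open file descriptors can be tried out without any namespaces. The following is a minimal sketch, not the PR's code: it uses removal of the socket path as a stand-in for `pivot_root` making the path unreachable, and the helper name `echoOverOrphanedSocket` is made up for illustration. The connection that was established beforehand keeps working, while a fresh dial by path fails.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
)

// echoOverOrphanedSocket (hypothetical helper) connects to a Unix socket,
// removes the socket's path from the filesystem, and then still exchanges a
// message over the already-open connection. It also reports whether a fresh
// dial by path fails afterwards.
func echoOverOrphanedSocket(msg string) (reply string, redialFailed bool, err error) {
	dir, err := os.MkdirTemp("", "fd-demo")
	if err != nil {
		return "", false, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "parent.sock")

	ln, err := net.Listen("unix", path)
	if err != nil {
		return "", false, err
	}
	defer ln.Close()

	// Echo server: sends back whatever it receives on the first connection.
	go func() {
		c, err := ln.Accept()
		if err != nil {
			return
		}
		defer c.Close()
		buf := make([]byte, 64)
		n, _ := c.Read(buf)
		c.Write(buf[:n])
	}()

	conn, err := net.Dial("unix", path)
	if err != nil {
		return "", false, err
	}
	defer conn.Close()

	// Stand-in for pivot_root: the path is gone, the open FD is not.
	if err := os.Remove(path); err != nil {
		return "", false, err
	}

	if _, err := conn.Write([]byte(msg)); err != nil {
		return "", false, err
	}
	buf := make([]byte, 64)
	n, err := conn.Read(buf)
	if err != nil {
		return "", false, err
	}

	// A new connection by path can no longer be established.
	_, dialErr := net.Dial("unix", path)
	return string(buf[:n]), dialErr != nil, nil
}

func main() {
	reply, redialFailed, err := echoOverOrphanedSocket("ping")
	fmt.Println(reply, redialFailed, err)
}
```

The analogy is loose: `pivot_root` changes what paths resolve to rather than deleting anything, but in both cases only path lookups are affected and the descriptor held by the process stays valid.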
@@ -432,7 +432,7 @@ func (m *Manager) createDefiniteWorkspacePod(startContext *startWorkspaceContext
 MountPropagation: &mountPropagation,
 },
 )
-pod.Spec.Containers[i].Command = []string{pod.Spec.Containers[i].Command[0], "ring0"}
+pod.Spec.Containers[i].Command = []string{"/.supervisor/workspacekit", "ring0"}
Is this for all types of workspaces from the moment we deploy?
All types that use user-namespaces. Basically we're tying user namespaces to registry facade:
- `registry_facade` FF can run by itself
- `user_namespace` FF mandates the `registry_facade` FF
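The dependency between the two feature flags can be written down as a small check. This is only a sketch under assumptions: the flag names are taken from the comment above, while `validateFlags` and the map-based representation are hypothetical and not how the workspace manager actually stores feature flags.

```go
package main

import "fmt"

// Feature flag names as they appear in the discussion above.
const (
	flagRegistryFacade = "registry_facade"
	flagUserNamespace  = "user_namespace"
)

// validateFlags is a hypothetical helper: it rejects a flag set that enables
// user_namespace without also enabling registry_facade.
func validateFlags(flags map[string]bool) error {
	if flags[flagUserNamespace] && !flags[flagRegistryFacade] {
		return fmt.Errorf("%s requires %s", flagUserNamespace, flagRegistryFacade)
	}
	return nil
}

func main() {
	// registry_facade on its own is fine.
	fmt.Println(validateFlags(map[string]bool{flagRegistryFacade: true}))
	// user_namespace without registry_facade is rejected.
	fmt.Println(validateFlags(map[string]bool{flagUserNamespace: true}))
}
```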
@@ -261,21 +267,21 @@ var ring1Cmd = &cobra.Command{
}
}
In general, the (very long) methods in this file would benefit from a few short comments to give some structure. In this case maybe:
// starting ring2
I've added a few comments where I think they'd help.
In general I like to avoid comments that just say what the code is doing - that should be readable from the code and tends to get outdated quickly. Commenting how or why we're doing it, that's super valuable.
@@ -288,7 +294,7 @@ var ring1Cmd = &cobra.Command{
sigc := sigproxy.ForwardAllSignals(context.Background(), cmd.Process.Pid)
defer sigproxysignal.StopCatch(sigc)
// mount /proc into ring2
@@ -305,17 +311,97 @@ var ring1Cmd = &cobra.Command{
	return
}
// setup connection ring1<->ring2
	failed = true
	return
}
// trigger ring2 to issue seccomp.LoadFilter(...)
Awesome, really mindblowing to be able to `docker exec` in Gitpod... 🚀
(comments are nits/questions)
fd9c10c to 738bc4a
This PR fixes #2845, #2123.
To make this work, this PR does some foundational work on the workspace runtime:
- [...] `workspacekit` which sets things up inside the workspace container, and supervisor who just resumes its "init task" duty.
- `mount` for intercepting `proc` mounts
- `umount`/`umount2` to prevent unmounting proc masks
- `bind` in preparation for better workspace port notifications
- [...] `open_tree`/`move_mount` (added in Linux 5.2), not through mount propagation: when a workspace wants to mount proc, we prepare that proc mount off-site (i.e. mount proc and add masks) and move it into place. Prior to this PR, we could only move that new mount into ring2 before it did its `pivot_root`. After that operation there's no connection between the ring2 process' mount namespace and ring1 anymore, hence we cannot move proc using mount propagation alone. Instead, we now use open_tree and move_mount which were added in Linux 5.2 for just that purpose.
- [...] `mount`, we'd end up with the proc and mask mounts inside that runc process' mount namespace. Hence any other process in that namespace could just umount the masks. To prevent that, we seccomp-notify `umount` and `umount2` calls - and prohibit `open_tree`/`move_mount` inside the workspace. Our seccomp handler can then decide if the umount should be allowed or not.
- [...] `nsenter ...` directly. Now, ws-daemon ships with its own little helper called `nsinsider` who can act on ws-daemon's behalf in another workspace. To this end we use runc's nsenter.

How to test
- `sudo umount /proc/kcore`
- [...] (`docker run --rm -it --name foo alpine:latest`) and run `cat /proc/1/cmdline`. Notice it's the correct command line.
- `mount | grep proc`. Notice the proc masks. Try to umount a proc mask - it should fail.
- `docker exec -it foo sh` - you should have a shell in the Docker container.
- `mount | grep proc` in the workspace container. There should be no trace of the Docker container's mounts (rootfs or proc).

Future Work
- [...] `docker exec` still doesn't work and it's not quite clear why. I've fixed an issue in the `runc-facade` that might have broken things. All my attempts to `strace` the issue have been futile so far. The issue in "error: OCI runtime exec - when running a docker command with healthcheck" (#2845) still stands.
- `umount("/some/proc/mount")` - at the moment you cannot umount any proc mount, except for letting the kernel lazy-umount it when a mount namespace is closed. That's because we need to atomically umount the masks and proc. This PR already contains infrastructure for `open_tree`/`move_mount`'ing proc mounts from within a workspace using ws-daemon. That code does not work yet however.
- `bind` notifications: again, we have preparations for getting `bind` syscall notifications, but fail to read the arguments of that syscall. Once we've figured that out, we can pass those notifications on to supervisor who then doesn't have to poll `/proc/net/tcp` anymore.
- [...] `NotifIDValid` to prevent time-of-check-time-of-use attacks. That function does not work reliably for some reason and more investigation is needed.

Notes and Caveats
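As a rough illustration of the umount handling described in this PR (the seccomp handler decides whether a umount is allowed, and for now no proc mount or proc mask can be unmounted), the decision itself could look like the sketch below. It is an assumption-laden simplification: `umountPolicy`, its `procMounts` field and `allow` are hypothetical names, and the real handler additionally has to read the target path out of the calling process' memory, which is omitted here.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// umountPolicy is a hypothetical policy object. procMounts lists the proc
// mount points that were set up for the workspace; the masks live beneath them.
type umountPolicy struct {
	procMounts []string
}

// allow reports whether a umount of target may proceed. Anything at or below
// a tracked proc mount is denied, everything else is allowed.
func (p umountPolicy) allow(target string) bool {
	t := filepath.Clean(target)
	for _, m := range p.procMounts {
		m = filepath.Clean(m)
		if t == m || strings.HasPrefix(t, m+"/") {
			return false
		}
	}
	return true
}

func main() {
	p := umountPolicy{procMounts: []string{"/proc"}}
	for _, target := range []string{"/proc/kcore", "/proc", "/mnt/data"} {
		fmt.Printf("umount %s allowed: %v\n", target, p.allow(target))
	}
}
```

Matching on a cleaned path prefix keeps the sketch short; a real implementation would more likely compare against the actual mount table of the target mount namespace.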