Skip to content

Workspacekit and seccomp-notify #3019

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jan 29, 2021
Merged

Workspacekit and seccomp-notify #3019

merged 8 commits into from
Jan 29, 2021

Conversation

csweichel
Copy link
Contributor

@csweichel csweichel commented Jan 26, 2021

This PR fixes #2845, #2123.
To make this work, this PR does some foundational work on the workspace runtime:

  • it separates the user-namespace setup from supervisor: the rings start a user-namespaced workspace have no dependency to supervisor per-se. We just added them there because it was convenient. Now that the rings require a CGO dependency (libseccomp and if we don't build with Go 1.16 also libcap) it made even more sense to separate supervisor and the rings. Now we have workspacekit which sets things up inside the workspace container, and supervisor who just resumes it's "init task" duty.
  • introduce seccomp-notify (added in Linux 5.0): workspacekit ring2 now loads its own seccomp filter that makes use of seccomp notify. It then passes the FD for the notifications back to ring1 (which is not subject to that seccomp filter). Ring1 handles the following syscalls:
    • mount for intercepting proc mounts
    • umount/umount2 to prevent unmounting proc masks
    • bind in preparation for better workspace port notifications
  • move proc mounts using open_tree/move_mount (added in Linux 5.2), not through mount propagation: when a workspace wants to mount proc, we prepare that proc mount off-site (i.e. mount proc and add masks) and move it into place. Prior to this PR, we could only move that new mount into ring2 before it did its pivot_root. After that operation there's no connection between the ring2 process' mount namespace and ring1 anymore, hence we cannot move proc using mount propagation alone. Instead, we now use open_tree and move_mount which were added in Linux 5.2 for just that purpose.
  • prevent proc mask unmounts: when the only proc mount in a workspace was that of ring2 (living in the ring1 mount ns), the mount namespace in ring2 would lock the masks. Because ring2's mount NS is a child of ring1, anyone inside ring2 could not umount something from ring1. Now that the proc mount lives in the actual child mount namespace that's no longer the case. E.g. when a runc process inside the workspace container calls mount, we'd end up with the proc and mask mounts inside that runc process' mount namespace. Hence any other process in that namespace could just umount the masks. To prevent that, we seccomp-notify umount and umount2 calls - and prohibit open_tree/move_mount inside the workspace. Our seccomp handler can then decide if the umount should be allowed or not.
  • make ws-daemon nsenter properly: prior to this PR ws-daemon would call nsenter ... directly. Now, ws-daemon ships with its own little helper called nsinsider who can act on ws-daemon's behalf in another workspace. To this end we use runc's nsenter.

How to test

  • start a workspace with feature preview enabled
  • try to umount a proc mask, e.g. sudo umount /proc/kcore
  • start a Docker container (docker run --rm -it --name foo alpine:latest) and run cat /proc/1/cmdline. Notice it's the correct command line.
  • In that Docker container run mount | grep proc. Notice the proc masks. Try to umount a proc mask - it should fail.
  • Run docker exec -it foo sh - you should have a shell in the Docker container.
  • Stop the Docker container and notice that it does indeed stop. Call mount | grep proc in the workspace container. There should be no trace of the Docker container's mounts (rootfs or proc).

Future Work

  • docker exec still doesn't work and it's not quite clear why. I've fixed an issue in the runc-facade that might have broken things. All my attempts to strace the issue have been futile so far. The issue in error: OCI runtime exec - when running a docker command with healthcheck #2845 still stands.
  • umount("/some/proc/mount") - at the moment you cannot umount any proc mount, except for letting the kernel lazy-umount it when a mount namespace is closed. That's because we need to atomically umount the masks and proc. This PR already contains infrastructure for open_tree/move_mount'ing proc mounts from within a workspace using ws-daemon. That code does not work yet however.
  • bind notifications: again, we have preparations for getting bind syscall notifications, but fail to read the arguments of that syscall. Once we've figured that out, we can pass those notifications on to supervisor who then doesn't have to poll /net/proc/tcp anymore.
  • we're not making the appropriate calls to NotifIDValid to prevent time-of-check-time-of-use attacks. That function does not work reliably for some reason and more investigation is needed.

Notes and Caveats

  • this PR introduces a dependency to Linux 5.2. Prior to this PR we had no hard kernel version dependency. Ubuntu 18.04 supports 5.4, as do GKE and EKS. We must clarify what the requirements are in [self-hosted] Platform Support Matrix #2657
  • this PR uses a fork of libseccomp-go which is currently being upstreamed. Once that's happened, we should move back to the mainline.

@csweichel csweichel force-pushed the cw/workspacekit branch 5 times, most recently from 8955ba7 to 77a5c09 Compare January 26, 2021 17:22
@csweichel csweichel marked this pull request as ready for review January 28, 2021 08:37
@csweichel csweichel linked an issue Jan 29, 2021 that may be closed by this pull request
Copy link
Member

@svenefftinge svenefftinge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tested the changes. Only skimmed over the code, though.

log.WithError(err).Error("cannot get parent socket fd")
failed = true
return
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not defer conf.Close() here? Maybe in addition, as safeguard?

failed = true
return
}
connf, err := conn.File()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand why conn - a connection build on an Unix socket that is basically a special kernel file (or not...?) - still works after pivotRoot :thinking_face:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because we already have the FD open. If we wanted to connect to the socket post pivot_root that wouldn't work. But because we've done that beforehand, we're fine :)

@@ -432,7 +432,7 @@ func (m *Manager) createDefiniteWorkspacePod(startContext *startWorkspaceContext
MountPropagation: &mountPropagation,
},
)
pod.Spec.Containers[i].Command = []string{pod.Spec.Containers[i].Command[0], "ring0"}
pod.Spec.Containers[i].Command = []string{"/.supervisor/workspacekit", "ring0"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for all types of workspaces from the moment we deploy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All types that use user-namespaces. Basically we're tying user namespaces to registry facade:

  • registry_facade FF can run by itself
  • user_namespace FF mandates the registry_facade FF

@@ -261,21 +267,21 @@ var ring1Cmd = &cobra.Command{
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general the (very) methods in this file would benefit from a few, short comments to give some structure. In this case maybe:
// starting ring2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a few comments where I think they'd help.

In general I like to avoid comments that just say what the code is doing - that should be readable from the code and tends to get outdated quickly. Commenting how or why we're doing it, that's super valuable.

@@ -288,7 +294,7 @@ var ring1Cmd = &cobra.Command{
sigc := sigproxy.ForwardAllSignals(context.Background(), cmd.Process.Pid)
defer sigproxysignal.StopCatch(sigc)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// mount /proc into ring2

@@ -305,17 +311,97 @@ var ring1Cmd = &cobra.Command{
return
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// setup connection ring1<->ring2

failed = true
return
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// trigger ring2 to issue seccomp.LoadFilter(...)

Copy link
Member

@geropl geropl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, really mindblowing to be able to docker exec in Gitpod... 🚀

(comments are nits/questions)

@csweichel csweichel merged commit 13a5f8e into master Jan 29, 2021
@csweichel csweichel deleted the cw/workspacekit branch January 29, 2021 16:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

error: OCI runtime exec - when running a docker command with healthcheck [docker] user-namespaced Docker exec doesn't work
3 participants