Skip to content

TestMultipleVMs failure - "exit status 148" and "race detected during execution of test" #358

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
sipsma opened this issue Dec 6, 2019 · 7 comments
Labels
kind/bug Something isn't working upstream/firecracker

Comments

@sipsma
Copy link
Contributor

sipsma commented Dec 6, 2019

There was an ephemeral failure of TestMultipleVMs here (run as part of #356 ): https://buildkite.com/firecracker-microvm/firecracker-containerd/builds/1315#1dd8d6e8-953c-417b-97f7-e029aea0a82f

I haven't looked in depth yet, but a few lines in the logs for vmID 78 (which got one of the "os: process already finished" errors), shows some odd looking lines:

time="2019-12-06T00:08:44.729560758Z" level=warning msg="firecracker exited: exit status 148" runtime=aws.firecracker vmID=78
...
time="2019-12-06T00:08:44.729847979Z" level=error msg="error returned from VM wait" error="1 error occurred:\n\t* exit status 148\n\n" runtime=aws.firecracker vmID=78
...
time="2019-12-06T00:08:44.729855148Z" level=error msg="the VM returns unknown error" error="1 error occurred:\n\t* exit status 148\n\n" runtime=aws.firecracker vmID=78
...
time="2019-12-06T00:08:44.730000243Z" level=debug msg="stopVMM(): sending sigterm to firecracker" runtime=aws.firecracker vmID=78
...
time="2019-12-06T00:08:44.730102402Z" level=error error="rpc error: code = Unknown desc = os: process already finished"

I'm not sure off the top of my head what caused the exit code to be 148. We also shouldn't need to send SIGTERM if the VM is already exited (but that may be unrelated).

@sipsma sipsma added the kind/bug Something isn't working label Dec 6, 2019
@sipsma
Copy link
Contributor Author

sipsma commented Dec 6, 2019

@xibz indicated he previously encountered exit code 148 when Firecracker hit seccomp violations: firecracker-microvm/firecracker#897

So that's something to look into as a possibility.

EDIT: I also found this explicitly mentioned in their changelogs too: https://github.com/firecracker-microvm/firecracker/blob/master/CHANGELOG.md#changed-5

@sipsma
Copy link
Contributor Author

sipsma commented Dec 6, 2019

I can reproduce this locally. I saw there is an unreleased fix for an issue that will cause a seccomp violation during a firecracker panic, so I tried with 0.20 (from here), but still see VMs exiting with 148.

The exit 148 always happens right as the VM is supposed to shutdown and often (but not always) includes these logs about a bad syscall:

time="2019-12-06T19:44:42.723991805Z" level=debug msg="2019-12-06T19:44:42.723905694 [anonymous-instance:INFO] Received KVM_EXIT_SHUTDOWN signal" jailer=runc ociBundlePath=/run/firecracker-containerd/default/83 runcBinaryPath=/usr/local/bin/runc runtime=aws.firecracker vmID=83 vmm_stream=stdout
time="2019-12-06T19:44:42.724113235Z" level=debug msg="2019-12-06T19:44:42.724033902 [anonymous-instance:ERROR] Shutting down VM after intercepting a bad syscall (14)." jailer=runc ociBundlePath=/run/firecracker-containerd/default/83 runcBinaryPath=/usr/local/bin/runc runtime=aws.firecracker vmID=83 vmm_stream=stderr
time="2019-12-06T19:44:43.033965500Z" level=warning msg="firecracker exited: exit status 148" runtime=aws.firecracker vmID=83

According to /usr/include/x86_64-linux-gnu/asm/unistd_64.h, syscall 14 is rt_sigprocmask. I will see if this is a known issue and, if not, open an issue w/ Firecracker about this.

@sipsma
Copy link
Contributor Author

sipsma commented Dec 6, 2019

Opened an issue with Firecracker here w/ some more details I found: firecracker-microvm/firecracker#1456

In the meantime, I think we can clean up the error about "os: process already finished" as it should ideally be more like "VMM exited with status 148" or something like that.

@sipsma
Copy link
Contributor Author

sipsma commented Dec 6, 2019

Also, I haven't reproduced the race detected during execution of test line when reproducing the other errors locally, so it may be an independent issue.

@sipsma sipsma changed the title TestMultipleVMs failure - "os: process already finished" and "race detected during execution of test" TestMultipleVMs failure - "exit status 148" and "race detected during execution of test" Dec 10, 2019
@sipsma
Copy link
Contributor Author

sipsma commented Dec 10, 2019

I merged #362, so while the issue persists, the error is no longer "os: process already finished", it's now:

VMM exit errors: 1 error occurred:
    * exit status 148

and we no longer send SIGTERM to the already dead process.

@sipsma
Copy link
Contributor Author

sipsma commented Dec 11, 2019

The Firecracker team confirmed here there is a bug around handling KVM_EXIT_SHUTDOWN, so we'll need to wait for a fix there for this to be fully resolved. In the mean time, there isn't any apparent harm being caused by it other than errors being returned by StopVM occasionally

@kzys
Copy link
Contributor

kzys commented Feb 19, 2020

Firecracker team has fixed the issue and we have upgraded our _submodules/firecracker to include the fix.

@kzys kzys closed this as completed Feb 19, 2020
fangn2 pushed a commit to fangn2/firecracker-containerd that referenced this issue Mar 23, 2023
…ependabot/go_modules/github.com/go-openapi/errors-0.20.1

Bump github.com/go-openapi/errors from 0.20.0 to 0.20.1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working upstream/firecracker
Projects
None yet
Development

No branches or pull requests

2 participants