Occasional "vCPUs resume failed" #1555
Comments
Hi @kzys, if the vCPU threads don't start within 100 ms, the host must be heavily overloaded. That said, raising the timeout is fine; there is no point in optimizing the error case with a small timeout. Feel free to open a PR for that. More interesting to me: why is this happening with as few as 100 microVMs?
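For readers following along, here is a minimal sketch of the general pattern under discussion: a coordinator waits on a channel for each worker thread to report in and gives up after a fixed timeout. This is not Firecracker's actual code. `wait_for_workers` and its parameters are hypothetical names made up for illustration, and the sleep only simulates slow thread startup on a busy host.

```rust
use std::sync::mpsc::{channel, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical helper (not part of Firecracker): spawns `workers` threads
/// that each sleep for `startup_delay` before reporting in, then waits up to
/// `timeout` for each report.
fn wait_for_workers(
    workers: usize,
    startup_delay: Duration,
    timeout: Duration,
) -> Result<(), RecvTimeoutError> {
    let (tx, rx) = channel();
    for id in 0..workers {
        let tx = tx.clone();
        thread::spawn(move || {
            // Simulates a thread that is slow to start on an overloaded host.
            thread::sleep(startup_delay);
            // The receiver may already have given up; ignore send errors.
            let _ = tx.send(id);
        });
    }
    // Drop the original sender so only the worker clones keep the channel open.
    drop(tx);
    for _ in 0..workers {
        // Each receive gets its own `timeout` budget; a late worker
        // surfaces as `RecvTimeoutError::Timeout`.
        rx.recv_timeout(timeout)?;
    }
    Ok(())
}
```

Calling `wait_for_workers(4, Duration::from_millis(10), Duration::from_millis(100))` would normally succeed, while a startup delay longer than the timeout returns `Err(RecvTimeoutError::Timeout)`. With a short timeout, any scheduling delay on a loaded host turns into a hard failure, which is why a more generous value costs little on the success path.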
Thanks. I'm going to open a PR that changes the timeout. We are running our tests on m5d.metal, but unlike Firecracker, we are not creating a fresh host per build, so there is a chance that the host is running multiple builds at the same time. Let me double-check the load at the time we hit the issue. The test tries to start all 100 at the exact same time, to make sure we don't have any race conditions in our logic.
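To illustrate what "start all 100 at the exact same time" can look like, here is a rough sketch that uses a barrier so every worker is released together. It is an assumption-laden stand-in, not the firecracker-containerd test itself: `start_simultaneously` is a hypothetical name, and returning the worker id stands in for launching one microVM.

```rust
use std::sync::{Arc, Barrier};
use std::thread;

/// Hypothetical helper: spawns `count` threads and releases them together,
/// returning each worker's id in spawn order.
fn start_simultaneously(count: usize) -> Vec<usize> {
    // Every thread blocks on the barrier until all `count` have arrived.
    let barrier = Arc::new(Barrier::new(count));
    let handles: Vec<_> = (0..count)
        .map(|id| {
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                barrier.wait();
                // Placeholder for the real work, e.g. launching one microVM.
                id
            })
        })
        .collect();
    // Joining in spawn order keeps the results ordered by id.
    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker panicked"))
        .collect()
}
```

Releasing everything at once maximizes contention, which is useful for flushing out races but also makes the host momentarily much busier than a staggered start would.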
Starting 100 VMs on an EC2 m5d.metal host sometimes didn't work due to the timeout. Fixes firecracker-microvm#1555. Signed-off-by: Kazuyoshi Kato <[email protected]>
Hello folks,
I'm upgrading Firecracker from 0.19.0 to 0.20.0 in firecracker-containerd (firecracker-microvm/firecracker-containerd#383). One of our tests launches 100 microVMs, and it occasionally fails with a "vCPUs resume failed" error.
Apparently the test was hitting the receive timeout Firecracker has internally (firecracker/src/vmm/src/lib.rs, line 941 in 53cf1ba).
But I'm not so sure what would be the right way to fix the issue;
Thanks,