Occasional "vCPUs resume failed" #1555

Closed
kzys opened this issue Jan 28, 2020 · 2 comments · Fixed by #1574
Comments

@kzys
Contributor

kzys commented Jan 28, 2020

Hello folks,

I'm upgrading Firecracker from 0.19.0 to 0.20.0 on firecracker-containerd (firecracker-microvm/firecracker-containerd#383). One of our tests launches 100 microVMs, and it occasionally fails with a "vCPUs resume failed" error.

Apparently the test was hitting the receive timeout that Firecracker has internally, in this function:

fn resume_vcpus(&mut self) -> std::result::Result<(), StartMicrovmError> {

Changing the timeout from 100ms to 1000ms mitigated the issue.

But I'm not sure what the right way to fix the issue would be:

  • Change the timeout from 100ms to 1000ms or something longer? It worked for us, but there is no guarantee that 1000ms is enough for everyone.
  • No timeout? We could let clients handle the timeout. At least that is possible for firecracker-containerd.
  • Don't start vCPUs in Paused mode? I don't know whether this is technically possible.
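
To make the trade-off behind the first two options concrete, here is a minimal sketch of a caller waiting on a channel for worker threads to acknowledge a resume, with the timeout passed in as a parameter. This is not Firecracker's actual code: VcpuResponse, resume_all, and the simulated startup delay are all hypothetical names and assumptions used only for illustration.

```rust
use std::sync::mpsc::{channel, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the message a vCPU thread sends once it has resumed.
#[derive(Debug)]
enum VcpuResponse {
    Resumed,
}

/// Spawns `count` worker threads that each wait `startup_delay` before
/// acknowledging (simulating a loaded host), then waits up to `timeout`
/// for each acknowledgement in turn.
fn resume_all(
    count: usize,
    startup_delay: Duration,
    timeout: Duration,
) -> Result<(), RecvTimeoutError> {
    let (tx, rx) = channel();
    for _ in 0..count {
        let tx = tx.clone();
        thread::spawn(move || {
            thread::sleep(startup_delay);
            // The receiver may already have given up; ignore a send error.
            let _ = tx.send(VcpuResponse::Resumed);
        });
    }
    // Drop the original sender so only the worker threads hold one.
    drop(tx);

    for _ in 0..count {
        // A slow worker surfaces here as RecvTimeoutError::Timeout.
        rx.recv_timeout(timeout)?;
    }
    Ok(())
}
```

With a short timeout, a worker that is merely slow to be scheduled is reported as a failure. A longer timeout only costs extra time in the error case, and dropping the timeout entirely would correspond to a plain blocking recv() with the deadline enforced by the client.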

Thanks,

@acatangiu
Contributor

Hi @kzys, if the vcpu threads don't start in 100ms, the host must be highly overloaded.

Of course, changing the timeout to something larger is fine; there is no point in optimizing for the error case with a small timeout. Feel free to open a PR for that.

More interesting to me is why this is happening with as few as 100 microVMs.

  • What type of host machine are you using (specs)?
  • Are you starting all 100 at the exact same time or did you allow a bit of spacing between them to avoid a thundering herd?
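
As an illustration of what "a bit of spacing" could look like on the client side, the sketch below computes a per-microVM launch delay so that starts are spread out instead of issued simultaneously. The function name staggered_delays and the idea of a fixed spacing are assumptions made for this example, not something Firecracker or firecracker-containerd provides.

```rust
use std::time::Duration;

/// Returns the delay to wait before launching each of `count` microVMs,
/// with consecutive launches separated by `spacing`.
/// The first launch has a delay of zero.
fn staggered_delays(count: u32, spacing: Duration) -> Vec<Duration> {
    (0..count).map(|i| spacing * i).collect()
}
```

A launcher would sleep for the corresponding delay before starting each microVM. A test that deliberately starts everything at once to look for race conditions would simply pass a zero spacing.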

@kzys
Contributor Author

kzys commented Feb 3, 2020

Thanks. I'm going to open a PR that changes the timeout.

We are running our tests on m5d.metal, but unlike Firecracker, we are not creating a fresh host per build. There is a chance that the host was running multiple builds at the same time. Let me double-check the load at the time we hit the issue.

The test is trying to start all 100 at the exact same time, to make sure we don't have any race condition around our logic.

kzys added a commit to kzys/firecracker that referenced this issue Feb 3, 2020
Starting 100 VMs on a EC2 m5d.metal host sometimes didn't work due to
the timeout.

Fixes firecracker-microvm#1555.

Signed-off-by: Kazuyoshi Kato <[email protected]>
alxiord pushed a commit that referenced this issue Feb 5, 2020
Starting 100 VMs on a EC2 m5d.metal host sometimes didn't work due to
the timeout.

Fixes #1555.

Signed-off-by: Kazuyoshi Kato <[email protected]>
iulianbarbu pushed a commit to iulianbarbu/firecracker that referenced this issue Mar 18, 2020
Starting 100 VMs on a EC2 m5d.metal host sometimes didn't work due to
the timeout.

Fixes firecracker-microvm#1555.

Signed-off-by: Kazuyoshi Kato <[email protected]>
2 participants