Occasional "vCPUs resume failed" #1555

Hello folks,

I'm upgrading Firecracker from 0.19.0 to 0.20.0 in firecracker-containerd (https://github.com/firecracker-microvm/firecracker-containerd/pull/383). One of our tests launches 100 microVMs, and it occasionally fails with a "vCPUs resume failed" error.

The test appears to be hitting Firecracker's internal receive timeout (https://github.com/firecracker-microvm/firecracker/blob/53cf1bacadc23a21df361a894ac2887cafcb7139/src/vmm/src/lib.rs#L941). Changing the timeout from 100ms to 1000ms mitigated the issue.
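To make the failure mode concrete, here is a minimal sketch of the pattern as I understand it. This is not Firecracker's actual code: `resume_with_timeout` and the 300ms acknowledgement delay are made up for illustration, and only `std::sync::mpsc` is used. It shows how a thread that answers slowly on a loaded host trips a 100ms receive timeout but not a 1000ms one.

```rust
use std::sync::mpsc::{channel, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a vCPU thread acknowledging a resume request.
/// `ack_delay` models how long the thread takes to respond on a busy host.
fn resume_with_timeout(ack_delay: Duration, timeout: Duration) -> Result<(), RecvTimeoutError> {
    let (ack_tx, ack_rx) = channel::<()>();
    thread::spawn(move || {
        thread::sleep(ack_delay);
        // Ignore the send error: the receiver may already have given up.
        let _ = ack_tx.send(());
    });
    // The caller waits at most `timeout` for the acknowledgement.
    ack_rx.recv_timeout(timeout)
}

fn main() {
    // Assumed delay, chosen only to sit between the two timeouts.
    let slow_ack = Duration::from_millis(300);
    println!("{:?}", resume_with_timeout(slow_ack, Duration::from_millis(100)));
    println!("{:?}", resume_with_timeout(slow_ack, Duration::from_millis(1000)));
}
```

With these assumed numbers the first call times out and the second succeeds, which matches what we saw when raising the timeout.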

However, I'm not sure what the right fix would be:

- Change the timeout from 100ms to 1000ms or something longer? It worked for us, but there is no guarantee that 1000ms is enough for everyone.
- Remove the timeout? We could let clients handle timeouts. That is at least possible for firecracker-containerd.
- Don't start vCPUs in Paused mode? I don't know whether this is technically possible.
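For the first two options, one possible shape is to let the caller choose the waiting policy. The sketch below is only an illustration under my own assumptions: `wait_for_ack` and `spawn_slow_ack` are hypothetical names, not Firecracker APIs. Passing `None` means no internal timeout, so the client decides when to give up. Passing `Some(limit)` keeps a bounded wait.

```rust
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical helper: wait for an acknowledgement with a caller-chosen policy.
/// `None` blocks until the sender answers or hangs up; `Some(d)` gives up after `d`.
fn wait_for_ack(ack_rx: &Receiver<()>, timeout: Option<Duration>) -> Result<(), RecvTimeoutError> {
    match timeout {
        Some(limit) => ack_rx.recv_timeout(limit),
        // A plain recv() only fails when the sender has been dropped.
        None => ack_rx.recv().map_err(|_| RecvTimeoutError::Disconnected),
    }
}

/// Hypothetical stand-in for a vCPU thread that acknowledges after `ack_delay`.
fn spawn_slow_ack(ack_delay: Duration) -> Receiver<()> {
    let (ack_tx, ack_rx) = channel();
    thread::spawn(move || {
        thread::sleep(ack_delay);
        // Ignore the send error: the receiver may already have given up.
        let _ = ack_tx.send(());
    });
    ack_rx
}

fn main() {
    let delay = Duration::from_millis(300);
    // No internal timeout: the call waits as long as the acknowledgement takes.
    println!("{:?}", wait_for_ack(&spawn_slow_ack(delay), None));
    // Caller-chosen limit shorter than the delay: the caller sees a timeout.
    println!("{:?}", wait_for_ack(&spawn_slow_ack(delay), Some(Duration::from_millis(100))));
}
```

The trade-off is that with `None` a stuck thread blocks the caller until the channel is closed, so the client would need its own deadline.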

Thanks,
