Occasional "vCPUs resume failed" #1555

Closed
@kzys

Description

Hello folks,

I'm upgrading Firecracker from 0.19.0 to 0.20.0 in firecracker-containerd (firecracker-microvm/firecracker-containerd#383). One of our tests launches 100 microVMs, and it occasionally fails with a "vCPUs resume failed" error.

Apparently the test was hitting the receive timeout that Firecracker has internally (see fn resume_vcpus(&mut self) -> std::result::Result<(), StartMicrovmError>), and changing the timeout from 100ms to 1000ms mitigated the issue.

But I'm not sure what the right way to fix the issue would be:

  • Change the timeout from 100ms to 1000ms or something longer? It worked for us, but there is no guarantee that 1000ms is enough for everyone.
  • No timeout? We could let clients handle the timeout. At least that is possible for firecracker-containerd.
  • Don't start vCPUs in Paused mode? I don't know whether this is technically possible.
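
To illustrate the first two options, here is a minimal sketch that uses only the Rust standard library. It is not Firecracker's actual code: wait_for_ack and spawn_acker are hypothetical names, and the worker thread merely stands in for a vCPU thread that is slow to acknowledge the resume request on a busy host.

```rust
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Waits for one acknowledgement on `rx` (hypothetical helper).
/// `Some(limit)` keeps an internal timeout; `None` removes it and
/// leaves any deadline to the caller.
fn wait_for_ack(rx: &Receiver<()>, timeout: Option<Duration>) -> Result<(), RecvTimeoutError> {
    match timeout {
        Some(limit) => rx.recv_timeout(limit),
        // Blocks until the ack arrives or the sender is dropped.
        None => rx.recv().map_err(|_| RecvTimeoutError::Disconnected),
    }
}

/// Spawns a worker that acknowledges after `delay`, simulating a
/// thread that is slow to get scheduled (hypothetical helper).
fn spawn_acker(delay: Duration) -> Receiver<()> {
    let (tx, rx) = channel();
    thread::spawn(move || {
        thread::sleep(delay);
        // The receiver may already have given up; ignore the send error.
        let _ = tx.send(());
    });
    rx
}

fn main() {
    let delay = Duration::from_millis(300);
    // A 100ms budget is shorter than the simulated 300ms delay.
    println!("{:?}", wait_for_ack(&spawn_acker(delay), Some(Duration::from_millis(100))));
    // A 1000ms budget absorbs the same delay.
    println!("{:?}", wait_for_ack(&spawn_acker(delay), Some(Duration::from_millis(1000))));
    // With no internal timeout, the call simply waits for the ack.
    println!("{:?}", wait_for_ack(&spawn_acker(delay), None));
}
```

With the None variant, a hung worker would block the caller indefinitely, so the client would need its own deadline around the call.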

Thanks,
