-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Load Microvm snapshot support #1907
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a506af4
to
982de5a
Compare
.map(|handle| { | ||
handle | ||
.response_receiver() | ||
.recv_timeout(Duration::from_millis(1000)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we expect that restoring Vcpus takes longer than saving them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've increased all timeouts to 1000ms to give a better chance of success to highly oversubscribed scenarios/machines.
We've seen scenarios in the past where people were running stress tests and where channel timeouts would happen simply because CPUs were so contended that threads of a particular Firecracker process were scheduled to run very sparsely.
We do not enforce in Firecracker a time-limit for these operations, they will run only as fast as the host CPU runs them. The timeout's purpose here is to catch error cases where things go wrong and the thread on the other end of the channel is not responding anymore.
I'm not sure even 1000ms is enough, but it is what was used to fix #1555 so we can continue with that for now.
ddf710b
to
120c4d5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few comments, will continue the review tomorrow.
e03d718
to
28ac5d8
Compare
src/vmm/src/builder.rs
Outdated
vmm.start_vcpus(vcpus, seccomp_filter).map_err(Internal)?; | ||
|
||
// Restore vcpus kvm state. | ||
vmm.restore_vcpu_states(microvm_state.vcpu_states).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to explicitly call resume from the API to resume the VCPUs ? Can't we just do that here so a new API call is not necessary. IMO the only logical action after load is to resume.
I guess this unwrap is not in the final version of the PR. Error needs to travel up the stack and presented to the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to explicitly call resume from the API to resume the VCPUs ? Can't we just do that here so a new API call is not necessary. IMO the only logical action after load is to resume.
Hmm, it's a good point and would simplify things a bit. But since CreateSnapshot
involves pausing and resuming to be done explicitly, we would lose I guess some kind of a symmetry this way :-??.
Maybe we should start thinking more about the config file option also for load, as you already proposed in another thread. We could add a new cmd line parameter, --snapshot-file
or --snapshot-config-file
which would include the load config and also the optional logger and metrics configs.
I guess this unwrap is not in the final version of the PR. Error needs to travel up the stack and presented to the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to explicitly call resume from the API to resume the VCPUs ? Can't we just do that here so a new API call is not necessary. IMO the only logical action after load is to resume.
Yes, that is the proposed and approved API flow.
We can change that outside this PR, as it implies getting buy-in from users and updating documentation as well.
src/vmm/src/builder.rs
Outdated
// Restore vcpus kvm state. | ||
vmm.restore_vcpu_states(microvm_state.vcpu_states).unwrap(); | ||
|
||
// Load seccomp filters for the VMM thread. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to persist the seccomp filters as well ? And allow the user to override it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess there could be usecases for that. But let's treat it as an enhancement. It is not required for our current milestone.
d389b0d
to
daf3cfc
Compare
@sandreim @lauralt @ioanachirca I've addressed all your comments, please review. |
6b37662
to
a30bb29
Compare
@sandreim @lauralt @ioanachirca this PR is up-to-date. Please review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM.
a1a4796
to
4935b94
Compare
Add support to restore specific kvm states to the owned Vcpus. Signed-off-by: Adrian Catangiu <[email protected]>
The trait exposes methods to dump the guest memory to file, describe the guest memory regions and restore the guest memory. This commit also renames the memory_dump module to memory_snapshot. Signed-off-by: Ioana Chirca <[email protected]>
Split microVM building logic in boot-specific code and generic code. The latter is to be reused in the snapshot path. Signed-off-by: Adrian Catangiu <[email protected]>
Signed-off-by: Adrian Catangiu <[email protected]>
Signed-off-by: Adrian Catangiu <[email protected]>
Adds the missing plumbing to support the '/snapshot/load' command. Signed-off-by: Adrian Catangiu <[email protected]>
Signed-off-by: Adrian Catangiu <[email protected]>
Added KVM Vm state and Vcpu state related ioctl numbers to the allowed list of ioctls. Added 'libc::ftruncate' used by 'File::set_len'. Signed-off-by: Adrian Catangiu <[email protected]>
Signed-off-by: Adrian Catangiu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed this PR was merged while i was reviewing :(, it was an approve anyway, but I had 2 nits (or questions) and I will add them though, maybe we will keep them in mind for future PRs (especially the coverage one).
Reason for This PR
Fixes #1847
Fixes #1870
Description of Changes
Splits microvm building logic in boot-specific code and generic code. The latter is to be reused in the snapshot path.
Adds code to load microvm state from snapshot file and to build microvm from state definition.
Implements missing plumbing to have end-to-end functionality.
Integration tests will be added as part of #1850
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license.
PR Checklist
[Author TODO: Meet these criteria.]
[Reviewer TODO: Verify that these criteria are met. Request changes if not]
git commit -s
).unsafe
code is properly documented.firecracker/swagger.yaml
.CHANGELOG.md
.