Skip to content

Load Microvm snapshot support #1907

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jun 5, 2020

Conversation

acatangiu
Copy link
Contributor

@acatangiu acatangiu commented May 18, 2020

Reason for This PR

Fixes #1847
Fixes #1870

Description of Changes

Splits microvm building logic in boot-specific code and generic code. The latter is to be reused in the snapshot path.

Adds code to load microvm state from snapshot file and to build microvm from state definition.

Implements missing plumbing to have end-to-end functionality.

Integration tests will be added as part of #1850

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license.

PR Checklist

[Author TODO: Meet these criteria.]
[Reviewer TODO: Verify that these criteria are met. Request changes if not]

  • All commits in this PR are signed (git commit -s).
  • The reason for this PR is clearly provided (issue no. or explanation).
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this PR.
  • Any newly added unsafe code is properly documented.
  • Any API changes are reflected in firecracker/swagger.yaml.
  • Any user-facing changes are mentioned in CHANGELOG.md.

@acatangiu acatangiu self-assigned this May 18, 2020
@acatangiu acatangiu force-pushed the load_snap branch 2 times, most recently from a506af4 to 982de5a Compare May 19, 2020 17:02
@lauralt lauralt mentioned this pull request May 20, 2020
8 tasks
.map(|handle| {
handle
.response_receiver()
.recv_timeout(Duration::from_millis(1000))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we expect that restoring Vcpus takes longer than saving them?

Copy link
Contributor Author

@acatangiu acatangiu May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've increased all timeouts to 1000ms to give a better chance of success to highly oversubscribed scenarios/machines.

We've seen scenarios in the past where people were running stress tests and where channel timeouts would happen simply because CPUs were so contended that threads of a particular Firecracker process were scheduled to run very sparsely.

We do not enforce in Firecracker a time-limit for these operations, they will run only as fast as the host CPU runs them. The timeout's purpose here is to catch error cases where things go wrong and the thread on the other end of the channel is not responding anymore.

I'm not sure even 1000ms is enough, but it is what was used to fix #1555 so we can continue with that for now.

@acatangiu acatangiu changed the title [RFC] Load Microvm snapshot support Load Microvm snapshot support May 21, 2020
@acatangiu acatangiu marked this pull request as ready for review May 21, 2020 10:05
@acatangiu acatangiu force-pushed the load_snap branch 2 times, most recently from ddf710b to 120c4d5 Compare May 21, 2020 13:17
@lauralt lauralt self-requested a review May 21, 2020 13:56
Copy link

@lauralt lauralt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments, will continue the review tomorrow.

@acatangiu acatangiu added Status: Author Status: Awaiting review Indicates that a pull request is ready to be reviewed labels May 22, 2020
@lauralt lauralt mentioned this pull request May 25, 2020
8 tasks
@acatangiu acatangiu force-pushed the load_snap branch 2 times, most recently from e03d718 to 28ac5d8 Compare May 26, 2020 13:59
vmm.start_vcpus(vcpus, seccomp_filter).map_err(Internal)?;

// Restore vcpus kvm state.
vmm.restore_vcpu_states(microvm_state.vcpu_states).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to explicitly call resume from the API to resume the VCPUs ? Can't we just do that here so a new API call is not necessary. IMO the only logical action after load is to resume.

I guess this unwrap is not in the final version of the PR. Error needs to travel up the stack and presented to the user.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to explicitly call resume from the API to resume the VCPUs ? Can't we just do that here so a new API call is not necessary. IMO the only logical action after load is to resume.

Hmm, it's a good point and would simplify things a bit. But since CreateSnapshot involves pausing and resuming to be done explicitly, we would lose I guess some kind of a symmetry this way :-??.
Maybe we should start thinking more about the config file option also for load, as you already proposed in another thread. We could add a new cmd line parameter, --snapshot-file or --snapshot-config-file which would include the load config and also the optional logger and metrics configs.

I guess this unwrap is not in the final version of the PR. Error needs to travel up the stack and presented to the user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to explicitly call resume from the API to resume the VCPUs ? Can't we just do that here so a new API call is not necessary. IMO the only logical action after load is to resume.

Yes, that is the proposed and approved API flow.

We can change that outside this PR, as it implies getting buy-in from users and updating documentation as well.

// Restore vcpus kvm state.
vmm.restore_vcpu_states(microvm_state.vcpu_states).unwrap();

// Load seccomp filters for the VMM thread.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to persist the seccomp filters as well ? And allow the user to override it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess there could be usecases for that. But let's treat it as an enhancement. It is not required for our current milestone.

@acatangiu acatangiu force-pushed the load_snap branch 2 times, most recently from d389b0d to daf3cfc Compare May 27, 2020 10:33
@acatangiu
Copy link
Contributor Author

@sandreim @lauralt @ioanachirca I've addressed all your comments, please review.

@acatangiu acatangiu force-pushed the load_snap branch 2 times, most recently from 6b37662 to a30bb29 Compare May 29, 2020 16:16
@acatangiu
Copy link
Contributor Author

@sandreim @lauralt @ioanachirca this PR is up-to-date. Please review.

Copy link
Contributor

@ioanachirca ioanachirca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM.

@acatangiu acatangiu force-pushed the load_snap branch 3 times, most recently from a1a4796 to 4935b94 Compare June 3, 2020 16:06
sandreim
sandreim previously approved these changes Jun 4, 2020
acatangiu and others added 9 commits June 5, 2020 13:26
Add support to restore specific kvm states to the owned Vcpus.

Signed-off-by: Adrian Catangiu <[email protected]>
The trait exposes methods to dump the guest memory to file,
describe the guest memory regions and restore the guest memory.

This commit also renames the memory_dump module to memory_snapshot.

Signed-off-by: Ioana Chirca <[email protected]>
Split microVM building logic in boot-specific code and generic code.
The latter is to be reused in the snapshot path.

Signed-off-by: Adrian Catangiu <[email protected]>
Signed-off-by: Adrian Catangiu <[email protected]>
Adds the missing plumbing to support the '/snapshot/load' command.

Signed-off-by: Adrian Catangiu <[email protected]>
Added KVM Vm state and Vcpu state related ioctl numbers
to the allowed list of ioctls.

Added 'libc::ftruncate' used by 'File::set_len'.

Signed-off-by: Adrian Catangiu <[email protected]>
@acatangiu
Copy link
Contributor Author

I removed that dumb integ test from this PR. A highly improved version of it will be added by @sandreim as part of #1850

@sandreim sandreim merged commit 3fecd27 into firecracker-microvm:master Jun 5, 2020
Copy link

@lauralt lauralt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed this PR was merged while i was reviewing :(, it was an approve anyway, but I had 2 nits (or questions) and I will add them though, maybe we will keep them in mind for future PRs (especially the coverage one).

@acatangiu acatangiu deleted the load_snap branch July 27, 2020 09:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Awaiting review Indicates that a pull request is ready to be reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extend snapshotted memory state Vmm support for restoring a microVM from snapshot
4 participants