Skip to content

CPUID (1/3): Error type expansion #3104

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

JonathanWoollett-Light
Copy link
Contributor

@JonathanWoollett-Light JonathanWoollett-Light commented Aug 24, 2022

Reason for This PR

Error handling around snapshotting and CPUID relies on large error incomplete error types that can be difficult to understand and use appropriately. In addition, these larger types make distinguishing from which function an error arose difficult and make it unclear the possible errors functions can produce.

Description of Changes

Adjusts error handling around snapshotting and CPUID to be more explicit and allow easier expansion.

Dependency justification: thiserror

thiserror is a convenience crate exposing a single derive proc-macro to help with defining errors.

thiserror is used here and typically to reduce repetitive code. If this was widely applied throughout the Firecracker code base it could remove thousands of lines of repetitive code, reducing manually written code and thus the likelihood of errors within this code. thiserror is extensively tested although has no documented testing or security policy,

thiserror is (at the time of writing) the 84th most popular crate on crates.io with ~51 million downloads. The 1st version was released ~3 years ago, and the most recent version was released 22 days ago.

We already depend on thiserror as a transitive dependency of userfaultfd and vm-allocator.

Test coverage decrease

The decrease in test coverage is a result of the increased code generated for the std::error::Error implementations for error types. Some of this is unused as not all error types are printed in testing.

Binary size increase

The binary size increase is a result of the additional code generated and the static strings associated with this code. This size difference is as small as ~12kb.

  • This functionality can be added in rust-vmm.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license.

PR Checklist

  • All commits in this PR are signed (git commit -s).
  • The issue which led to this PR has a clear conclusion.
  • This PR follows the solution outlined in the related issue.
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this PR.
  • Any newly added unsafe code is properly documented.
  • Any API changes follow the Runbook for Firecracker API changes.
  • Any user-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.

@JonathanWoollett-Light JonathanWoollett-Light changed the title CPUID (1/3): Error type expansion CPUID templates (1/3): Error type expansion Aug 24, 2022
@JonathanWoollett-Light JonathanWoollett-Light changed the title CPUID templates (1/3): Error type expansion CPUID (1/3): Error type expansion Aug 24, 2022
@JonathanWoollett-Light JonathanWoollett-Light changed the base branch from main to feature/cpu-templates August 24, 2022 18:08
@JonathanWoollett-Light JonathanWoollett-Light force-pushed the error-type-expansion branch 9 times, most recently from aa315b8 to 98557e0 Compare August 25, 2022 15:33
@JonathanWoollett-Light JonathanWoollett-Light changed the base branch from feature/cpu-templates to main August 25, 2022 15:33
@JonathanWoollett-Light JonathanWoollett-Light force-pushed the error-type-expansion branch 5 times, most recently from c3fb66c to cb95b31 Compare August 30, 2022 17:33
InconsistentVcpuCount,
/// The VgicSysRegsState is invalid
#[error("The VgicSysRegsState is invalid.")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible to add a parameter for what the state is?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would need to change the underlying function for this. Where practical I have tried not to change too much functionality.

#[derive(Debug, thiserror::Error)]
pub enum SetMSRsError {
/// Failed to create [`vmm_sys_util::fam::FamStructWrapper`] for MSRs.
#[error("Could not create `vmm_sys_util::fam::FamStructWrapper` for MSRs")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the context for this error may give a hint as to the reason why this may happen, but I don't think this would be a useful error to see in a log. Some context as to why the error happened would be ideal, otherwise dropping this error in favour of the more generic error would seem more appropriate to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue with giving further context here is that since fam::Error doesn't implement std::error::Error we can't display it nicely (I recently got a PR merged fixing this rust-vmm/vmm-sys-util#168 but we would need to update the version for this).

In this context I'd argue that we leave it in the most descriptive state, considering this will form an error message like:

Failed to boot microVM: Failed to initialize vCPU: Failed to set MSRs: Could not create vmm_sys_util::fam::FamStructWrapper for MSRs

When we update our version we can update here to add further context.

It would be useful to remind someone in the future to add this context. Do you know if there is there any recommended approach here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this context I'd argue that we leave it in the most descriptive state

My concern is that for a Firecracker user, this error isn't descriptive. Parameterising a more specific message from where the error occurs so that the user can troubleshoot would be a better way to go here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is as specific as we can go, all we know from Msrs::from_entries(&msr_entries) is that an error occurred. fam::Error contains this description, but for now, we cannot display it (we need there to be a new version of the crate containing fam::Error so as to expose the std::error::Error implementation, so we can then display it).

While it would be ideal that every error would be descriptive enough to highlight solutions I think for the moment here it cannot be well done and this is still an improvement over the existing even less specific error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the context of the error here might clarify the usefulness of it. A less specific but general error like "Internal error" for example would at least indicate a system error versus a user error. The current description is for a developer's use, not for the user of Firecracker.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A less specific but general error like "Internal error" for example would at least indicate a system error versus a user error. The current description is for a developer's use, not for the user of Firecracker.

I think this distinction is difficult to draw and somewhat unnecessary. If we have a descriptive error we can then on the front-end present the user with a simple "Internal error". I do not see the argument for having a less descriptive error (in the same sense I would not see an argument for having less descriptive documentation).

Apologies if I am misunderstanding.

@@ -282,20 +282,44 @@ pub fn create_boot_msr_entries() -> Vec<kvm_msr_entry> {
]
}

/// Error type for [`set_msrs`].
#[derive(Debug, thiserror::Error)]
pub enum SetMSRsError {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I suggest reframing the error naming that this type encapsulates? My recommendation would be to rename the enum to MSRConfigurationError? (or MSRAccessError, then it could contain errors for reading and writing, unless you anticipate a large variety of error types for each category of cases)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My current pattern at the moment is to prefix with the function name. I tend to prefer this as it best illustrates they are bound (and this error is not used elsewhere).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this mean that there would potentially be a GetMSRsError eventually as well?

The naming here of an error class (enum) associated with each single function seems like overkill?

Copy link
Contributor Author

@JonathanWoollett-Light JonathanWoollett-Light Sep 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this mean that there would potentially be a GetMSRsError eventually as well?

Most likely.

The naming here of an error class (enum) associated with each single function seems like overkill?

This is a somewhat opinionated area. In the Rust standard library we see instances of both function specific error types and shared error types. I lean towards specific error types as this provides better documentation and traceability. There is a balance here, if functions return mostly similar errors it may make sense to use a shared error type, otherwise function specific error types may make sense. I find error types act like documentation here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO If SetMSR and GetMSR have identical/similar errors, they should use the same Error type as you suggest also, so maybe something MSRConfigError?

Copy link
Contributor Author

@JonathanWoollett-Light JonathanWoollett-Light Sep 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth noting here, there wouldn't be a GetMSRsError in Firecracker since there is no get_msrs in Firecracker (this function is within rust-vmm).

@@ -95,7 +95,7 @@ pub enum StartMicrovmError {
/// Unable to set VmResources.
SetVmResources(VmConfigError),
}

impl std::error::Error for StartMicrovmError {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: Is the builder.rs file the appropriate place for this type? I have not looked at this specific crate's builder file before, but it seems to be bloated with a lot of code that may be best put elsewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -441,6 +442,65 @@ pub fn build_microvm_for_boot(
Ok(vmm)
}

/// Error type for [`build_microvm_from_snapshot`].
#[derive(Debug, thiserror::Error)]
pub enum BuildMicrovmFromSnapshotError {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: I realize there are other error types in this builder.rs file already, but I can't help but wonder if there isn't a better spot for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like having error types be closely bound to the function/s they correspond to, I find it increases readability.

I think the issue here is rather that builder.rs is just too big to begin with, potentially some functions could be moved into sub-modules (although I think this may be outside the scope of this PR).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, builder is a monster. I think it needs re-working and there are some ideas orthogonal to this PR that might do exactly that.

@@ -65,7 +65,36 @@ pub enum Error {
RestoreGic(arch::aarch64::gic::Error),
}

impl Display for Error {
/// Error type for [`Vm::restore_state`]
#[cfg(target_arch = "x86_64")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obviously not too keen on differentiating coding styles based on the architecture to be built for. Is this not generalizable for x86_64 and aarch64?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unforgettably here since the function is entirely different between architectures so must the error types be different (it could be generalized via having the error type having all variants for both architectures, but this would result in unused code and be less clear).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function wasn't different previously, I don't follow why the function should be different given the scope of the change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function restore_store did have an x86_64 and aarch64 variant before:

    #[cfg(target_arch = "x86_64")]
    /// Restores the Kvm Vm state.
    pub fn restore_state(&self, state: &VmState) -> Result<()> {
        self.fd
            .set_pit2(&state.pitstate)
            .map_err(Error::VmSetPit2)?;
        self.fd.set_clock(&state.clock).map_err(Error::VmSetClock)?;
        self.fd
            .set_irqchip(&state.pic_master)
            .map_err(Error::VmSetIrqChip)?;
        self.fd
            .set_irqchip(&state.pic_slave)
            .map_err(Error::VmSetIrqChip)?;
        self.fd
            .set_irqchip(&state.ioapic)
            .map_err(Error::VmSetIrqChip)?;
        Ok(())
    }

    #[cfg(target_arch = "aarch64")]
    pub fn save_state(&self, mpidrs: &[u64]) -> Result<VmState> {
        Ok(VmState {
            gic: self
                .get_irqchip()
                .save_device(mpidrs)
                .map_err(Error::SaveGic)?,
        })
    }

    #[cfg(target_arch = "aarch64")]
    pub fn restore_state(&self, mpidrs: &[u64], state: &VmState) -> Result<()> {
        self.get_irqchip()
            .restore_device(mpidrs, &state.gic)
            .map_err(Error::RestoreGic)
    }

"JAILER_BINARY_SIZE_TARGET": 797632,
"FC_BINARY_SIZE_LIMIT": 2485719,
"FC_BINARY_SIZE_LIMIT": 2646663,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This increase is significant (2646663 − 2485719 = 160944 bytes (~157.7 KB)). This equates to roughly 6% increase in the binary size. I'm assuming that is all due to thiserror being added as a dependency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thiserror is already included as a transitive dependency via userfaultfd and vm-allocator so this wouldn't contribute additionally to the binary size. The increase is a result of the static strings uses for the std::fmt::Display implementations for the error types and some of the additional code generated with std::error::Error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of a library in a dependency wouldn't automatically result in the library being linked in. I would have expected the net delta in the binary size as a result of the code change in the immediate repository to be better not worse. The change here should be entirely related to the use of thiserror locally.

Regardless, as discussed offline, the change to the size target/limit here is not exclusively owed to the use of thiserror due to the 5% tolerance margin. It is just that the change here finally breached the threshold.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the test_binary_size.py locally and came up with a change of (2520632 - 2455280) = 65352 which is less serious.

@JonathanWoollett-Light JonathanWoollett-Light force-pushed the error-type-expansion branch 2 times, most recently from 195a973 to bf81dcd Compare September 1, 2022 09:35
@JonathanWoollett-Light JonathanWoollett-Light added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Sep 1, 2022
Copy link
Contributor

@pb8o pb8o left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This touches far more than CPUID. General comment is reducing the scope to maybe just CPUID crate would have been a simpler review, and make future introduction of thiserror more gradual.

@@ -11,6 +11,7 @@ serde = { version = ">=1.0.27", features = ["derive"] }
serde_derive = ">=1.0.27"
serde_json = ">=1.0.9"
derive_more = { version = "0.99.17", default-features = false, features = ["from"] }
thiserror = "1.0.32"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to make this >= ?

Copy link
Contributor Author

@JonathanWoollett-Light JonathanWoollett-Light Sep 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With >= we cannot guarantee an author won't make a change that breaks our build. = "1.0.32" is the only way to fully guarantee this, we could loosen it with = "1" instead if we trust they stick to SemVer.

I'm not strongly opinionated here, I'm not sure if we have a policy on this (we should probably make one if not). What do you think is best?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do people tend to break semver? I would assume that if 1 does not become 2 we should be fine.

Copy link
Contributor Author

@JonathanWoollett-Light JonathanWoollett-Light Sep 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, quite often. Even if they intend to abide SemVer (which they often don't) since their is no common automated checking mechanism (rust-vmm broke it recently). I have no strong opinion here, do you think we should go with thiserror="1"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't have a strong opinion. Maybe we could play safe and leave it as it is here.

pb8o
pb8o previously approved these changes Sep 2, 2022
Copy link
Contributor

@bchalios bchalios left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow. This was big 😛

I really like the changes apart from minor comments.

@@ -11,6 +11,7 @@ serde = { version = ">=1.0.27", features = ["derive"] }
serde_derive = ">=1.0.27"
serde_json = ">=1.0.9"
derive_more = { version = "0.99.17", default-features = false, features = ["from"] }
thiserror = "1.0.32"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do people tend to break semver? I would assume that if 1 does not become 2 we should be fine.

@@ -9,11 +9,13 @@ use kvm_bindings::kvm_lapic_state;
use kvm_ioctls::VcpuFd;
use utils::byte_order;
/// Errors thrown while configuring the LAPIC.
#[derive(Debug)]
#[derive(Debug, thiserror::Error, PartialEq)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the PartialEq here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and in other Error types

Copy link
Contributor Author

@JonathanWoollett-Light JonathanWoollett-Light Sep 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kvm_ioctls::Error doesn't implement std::cmp::Eq unfortunately, so we cannot derive it on types which contain it (deriving PartialEq is still useful so we can do ErrorX == ErrorY).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need it in the first place? Does thiserror require it?

@@ -282,20 +282,44 @@ pub fn create_boot_msr_entries() -> Vec<kvm_msr_entry> {
]
}

/// Error type for [`set_msrs`].
#[derive(Debug, thiserror::Error)]
pub enum SetMSRsError {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO If SetMSR and GetMSR have identical/similar errors, they should use the same Error type as you suggest also, so maybe something MSRConfigError?

/// Error type for [`setup_fpu`].
#[derive(Debug, derive_more::From, PartialEq)]
pub struct SetupFpuError(utils::errno::Error);
impl std::error::Error for SetupFpuError {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not thiserror here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thiserror doesn't seem to be able to derive on new types (new types being struct X(Y)).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

link?

@@ -441,6 +442,65 @@ pub fn build_microvm_for_boot(
Ok(vmm)
}

/// Error type for [`build_microvm_from_snapshot`].
#[derive(Debug, thiserror::Error)]
pub enum BuildMicrovmFromSnapshotError {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, builder is a monster. I think it needs re-working and there are some ideas orthogonal to this PR that might do exactly that.

pb8o
pb8o previously approved these changes Sep 5, 2022
bchalios
bchalios previously approved these changes Sep 5, 2022
Jonathan Woollett-Light added 3 commits September 5, 2022 10:19
Signed-off-by: Jonathan Woollett-Light <[email protected]>
Signed-off-by: Jonathan Woollett-Light <[email protected]>
Signed-off-by: Jonathan Woollett-Light <[email protected]>
@JonathanWoollett-Light JonathanWoollett-Light merged commit 196ac5a into firecracker-microvm:main Sep 5, 2022
@JonathanWoollett-Light JonathanWoollett-Light deleted the error-type-expansion branch September 5, 2022 10:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Awaiting review Indicates that a pull request is ready to be reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants