Skip to content

Commit 05e7885

Browse files
Sean Christophersongregkh
Sean Christopherson
authored andcommitted
KVM: nVMX: restore host state in nested_vmx_vmexit for VMFail
[ Upstream commit bd18bff ] A VMEnter that VMFails (as opposed to VMExits) does not touch host state beyond registers that are explicitly noted in the VMFail path, e.g. EFLAGS. Host state does not need to be loaded because VMFail is only signaled for consistency checks that occur before the CPU starts to load guest state, i.e. there is no need to restore any state as nothing has been modified. But in the case where a VMFail is detected by hardware and not by KVM (due to deferring consistency checks to hardware), KVM has already loaded some amount of guest state. Luckily, "loaded" only means loaded to KVM's software model, i.e. vmcs01 has not been modified. So, unwind our software model to the pre-VMEntry host state. Not restoring host state in this VMFail path leads to a variety of failures because we end up with stale data in vcpu->arch, e.g. CR0, CR4, EFER, etc... will all be out of sync relative to vmcs01. Any significant delta in the stale data is all but guaranteed to crash L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong. An alternative to this "soft" reload would be to load host state from vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is wildly inconsistent with respect to the VMX architecture, e.g. an L1 VMM with separate VMExit and VMFail paths would explode. Note that this approach does not mean KVM is 100% accurate with respect to VMX hardware behavior, even at an architectural level (the exact order of consistency checks is microarchitecture specific). But 100% emulation accuracy isn't the goal (with this patch), rather the goal is to be consistent in the information delivered to L1, e.g. a VMExit should not fall-through VMENTER, and a VMFail should not jump to HOST_RIP. This technically reverts commit "5af4157388ad (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure)", but retains the core aspects of that patch, just in an open coded form due to the need to pull state from vmcs01 instead of vmcs12. Restoring host state resolves a variety of issues introduced by commit "4f350c6dbcb9 (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)", which remedied the incorrect behavior of treating VMFail like VMExit but in doing so neglected to restore arch state that had been modified prior to attempting nested VMEnter. A sample failure that occurs due to stale vcpu.arch state is a fault of some form while emulating an LGDT (due to emulated UMIP) from L1 after a failed VMEntry to L3, in this case when running the KVM unit test test_tpr_threshold_values in L1. L0 also hits a WARN in this case due to a stale arch.cr4.UMIP. L1: BUG: unable to handle kernel paging request at ffffc90000663b9e PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163 Oops: 0009 [#1] SMP CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G W 4.18.0-rc2+ #2 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:native_load_gdt+0x0/0x10 ... Call Trace: load_fixmap_gdt+0x22/0x30 __vmx_load_host_state+0x10e/0x1c0 [kvm_intel] vmx_switch_vmcs+0x2d/0x50 [kvm_intel] nested_vmx_vmexit+0x222/0x9c0 [kvm_intel] vmx_handle_exit+0x246/0x15a0 [kvm_intel] kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm] kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm] do_vfs_ioctl+0x9f/0x600 ksys_ioctl+0x66/0x70 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x4f/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9 L0: WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel] ... CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76 Hardware name: Intel Corporation Kabylake Client platform/KBL S RIP: 0010:handle_desc+0x28/0x30 [kvm_intel] ... Call Trace: kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm] kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm] do_vfs_ioctl+0x9f/0x5e0 ksys_ioctl+0x66/0x70 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x49/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 5af4157 (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure) Fixes: 4f350c6 (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly) Cc: Jim Mattson <[email protected]> Cc: Krish Sadhukhan <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Radim KrÄmář <[email protected]> Cc: Wanpeng Li <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
1 parent 00254ad commit 05e7885

File tree

1 file changed

+153
-20
lines changed

1 file changed

+153
-20
lines changed

arch/x86/kvm/vmx.c

Lines changed: 153 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -11846,24 +11846,6 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
1184611846
kvm_clear_interrupt_queue(vcpu);
1184711847
}
1184811848

11849-
static void load_vmcs12_mmu_host_state(struct kvm_vcpu *vcpu,
11850-
struct vmcs12 *vmcs12)
11851-
{
11852-
u32 entry_failure_code;
11853-
11854-
nested_ept_uninit_mmu_context(vcpu);
11855-
11856-
/*
11857-
* Only PDPTE load can fail as the value of cr3 was checked on entry and
11858-
* couldn't have changed.
11859-
*/
11860-
if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
11861-
nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
11862-
11863-
if (!enable_ept)
11864-
vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
11865-
}
11866-
1186711849
/*
1186811850
* A part of what we need to when the nested L2 guest exits and we want to
1186911851
* run its L1 parent, is to reset L1's guest state to the host state specified
@@ -11877,6 +11859,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
1187711859
struct vmcs12 *vmcs12)
1187811860
{
1187911861
struct kvm_segment seg;
11862+
u32 entry_failure_code;
1188011863

1188111864
if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
1188211865
vcpu->arch.efer = vmcs12->host_ia32_efer;
@@ -11903,7 +11886,17 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
1190311886
vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
1190411887
vmx_set_cr4(vcpu, vmcs12->host_cr4);
1190511888

11906-
load_vmcs12_mmu_host_state(vcpu, vmcs12);
11889+
nested_ept_uninit_mmu_context(vcpu);
11890+
11891+
/*
11892+
* Only PDPTE load can fail as the value of cr3 was checked on entry and
11893+
* couldn't have changed.
11894+
*/
11895+
if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
11896+
nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
11897+
11898+
if (!enable_ept)
11899+
vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
1190711900

1190811901
if (enable_vpid) {
1190911902
/*
@@ -11994,6 +11987,140 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
1199411987
nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
1199511988
}
1199611989

11990+
static inline u64 nested_vmx_get_vmcs01_guest_efer(struct vcpu_vmx *vmx)
11991+
{
11992+
struct shared_msr_entry *efer_msr;
11993+
unsigned int i;
11994+
11995+
if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_EFER)
11996+
return vmcs_read64(GUEST_IA32_EFER);
11997+
11998+
if (cpu_has_load_ia32_efer)
11999+
return host_efer;
12000+
12001+
for (i = 0; i < vmx->msr_autoload.guest.nr; ++i) {
12002+
if (vmx->msr_autoload.guest.val[i].index == MSR_EFER)
12003+
return vmx->msr_autoload.guest.val[i].value;
12004+
}
12005+
12006+
efer_msr = find_msr_entry(vmx, MSR_EFER);
12007+
if (efer_msr)
12008+
return efer_msr->data;
12009+
12010+
return host_efer;
12011+
}
12012+
12013+
static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
12014+
{
12015+
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
12016+
struct vcpu_vmx *vmx = to_vmx(vcpu);
12017+
struct vmx_msr_entry g, h;
12018+
struct msr_data msr;
12019+
gpa_t gpa;
12020+
u32 i, j;
12021+
12022+
vcpu->arch.pat = vmcs_read64(GUEST_IA32_PAT);
12023+
12024+
if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS) {
12025+
/*
12026+
* L1's host DR7 is lost if KVM_GUESTDBG_USE_HW_BP is set
12027+
* as vmcs01.GUEST_DR7 contains a userspace defined value
12028+
* and vcpu->arch.dr7 is not squirreled away before the
12029+
* nested VMENTER (not worth adding a variable in nested_vmx).
12030+
*/
12031+
if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)
12032+
kvm_set_dr(vcpu, 7, DR7_FIXED_1);
12033+
else
12034+
WARN_ON(kvm_set_dr(vcpu, 7, vmcs_readl(GUEST_DR7)));
12035+
}
12036+
12037+
/*
12038+
* Note that calling vmx_set_{efer,cr0,cr4} is important as they
12039+
* handle a variety of side effects to KVM's software model.
12040+
*/
12041+
vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx));
12042+
12043+
vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS;
12044+
vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW));
12045+
12046+
vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
12047+
vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));
12048+
12049+
nested_ept_uninit_mmu_context(vcpu);
12050+
vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
12051+
__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
12052+
12053+
/*
12054+
* Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs
12055+
* from vmcs01 (if necessary). The PDPTRs are not loaded on
12056+
* VMFail, like everything else we just need to ensure our
12057+
* software model is up-to-date.
12058+
*/
12059+
ept_save_pdptrs(vcpu);
12060+
12061+
kvm_mmu_reset_context(vcpu);
12062+
12063+
if (cpu_has_vmx_msr_bitmap())
12064+
vmx_update_msr_bitmap(vcpu);
12065+
12066+
/*
12067+
* This nasty bit of open coding is a compromise between blindly
12068+
* loading L1's MSRs using the exit load lists (incorrect emulation
12069+
* of VMFail), leaving the nested VM's MSRs in the software model
12070+
* (incorrect behavior) and snapshotting the modified MSRs (too
12071+
* expensive since the lists are unbound by hardware). For each
12072+
* MSR that was (prematurely) loaded from the nested VMEntry load
12073+
* list, reload it from the exit load list if it exists and differs
12074+
* from the guest value. The intent is to stuff host state as
12075+
* silently as possible, not to fully process the exit load list.
12076+
*/
12077+
msr.host_initiated = false;
12078+
for (i = 0; i < vmcs12->vm_entry_msr_load_count; i++) {
12079+
gpa = vmcs12->vm_entry_msr_load_addr + (i * sizeof(g));
12080+
if (kvm_vcpu_read_guest(vcpu, gpa, &g, sizeof(g))) {
12081+
pr_debug_ratelimited(
12082+
"%s read MSR index failed (%u, 0x%08llx)\n",
12083+
__func__, i, gpa);
12084+
goto vmabort;
12085+
}
12086+
12087+
for (j = 0; j < vmcs12->vm_exit_msr_load_count; j++) {
12088+
gpa = vmcs12->vm_exit_msr_load_addr + (j * sizeof(h));
12089+
if (kvm_vcpu_read_guest(vcpu, gpa, &h, sizeof(h))) {
12090+
pr_debug_ratelimited(
12091+
"%s read MSR failed (%u, 0x%08llx)\n",
12092+
__func__, j, gpa);
12093+
goto vmabort;
12094+
}
12095+
if (h.index != g.index)
12096+
continue;
12097+
if (h.value == g.value)
12098+
break;
12099+
12100+
if (nested_vmx_load_msr_check(vcpu, &h)) {
12101+
pr_debug_ratelimited(
12102+
"%s check failed (%u, 0x%x, 0x%x)\n",
12103+
__func__, j, h.index, h.reserved);
12104+
goto vmabort;
12105+
}
12106+
12107+
msr.index = h.index;
12108+
msr.data = h.value;
12109+
if (kvm_set_msr(vcpu, &msr)) {
12110+
pr_debug_ratelimited(
12111+
"%s WRMSR failed (%u, 0x%x, 0x%llx)\n",
12112+
__func__, j, h.index, h.value);
12113+
goto vmabort;
12114+
}
12115+
}
12116+
}
12117+
12118+
return;
12119+
12120+
vmabort:
12121+
nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
12122+
}
12123+
1199712124
/*
1199812125
* Emulate an exit from nested guest (L2) to L1, i.e., prepare to run L1
1199912126
* and modify vmcs12 to make it see what it would expect to see there if
@@ -12126,7 +12253,13 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
1212612253
*/
1212712254
nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
1212812255

12129-
load_vmcs12_mmu_host_state(vcpu, vmcs12);
12256+
/*
12257+
* Restore L1's host state to KVM's software model. We're here
12258+
* because a consistency check was caught by hardware, which
12259+
* means some amount of guest state has been propagated to KVM's
12260+
* model and needs to be unwound to the host's state.
12261+
*/
12262+
nested_vmx_restore_host_state(vcpu);
1213012263

1213112264
/*
1213212265
* The emulated instruction was already skipped in

0 commit comments

Comments
 (0)