VMware Cloud Community
simitel
Contributor

Help needed: Failed to enter protected mode in nested virtualization.

I am running a Linux VM (kernel 4.4.0) in VMware.

I have developed a VMX-based hypervisor that runs inside that VM.

It loads and runs real-mode guest code correctly.

I have now changed the guest so that it switches from real mode to protected mode, but the switch fails.

Here is the guest code:

#define SEG_KCODE 1  // kernel code
#define SEG_KDATA 2  // kernel data+stack
#define SEG_KCPU  3  // kernel per-cpu data
#define SEG_UCODE 4  // user code
#define SEG_UDATA 5  // user data+stack
#define SEG_TSS   6  // this process's task state

#define CR0_PE          0x00000001      // Protection Enable

#define SEG_NULLASM                                             \
        .word 0, 0;                                             \
        .byte 0, 0, 0, 0

// The 0xC0 means the limit is in 4096-byte units
// and (for executable segments) 32-bit mode.
#define SEG_ASM(type,base,lim)                                  \
        .word (((lim) >> 12) & 0xffff), ((base) & 0xffff);      \
        .byte (((base) >> 16) & 0xff), (0x90 | (type)),         \
                (0xC0 | (((lim) >> 28) & 0xf)), (((base) >> 24) & 0xff)

#define STA_X     0x8       // Executable segment
#define STA_E     0x4       // Expand down (non-executable segments)
#define STA_C     0x4       // Conforming code segment (executable only)
#define STA_W     0x2       // Writeable (non-executable segments)
#define STA_R     0x2       // Readable (executable segments)
#define STA_A     0x1       // Accessed

# Start the first CPU: switch to 32-bit protected mode, jump into C.
        .code16
        .global code16, code16_end
code16:
        xor %ecx, %ecx
        mov %cr3, %eax
        mov %eax, %cr3

seta20.1:
        inb     $0x64,%al               # Wait for not busy
        testb   $0x2,%al
        jnz     seta20.1

        movb    $0xd1,%al               # 0xd1 -> port 0x64
        outb    %al,$0x64

seta20.2:
        inb     $0x64,%al               # Wait for not busy
        testb   $0x2,%al
        jnz     seta20.2

        movb    $0xdf,%al               # 0xdf -> port 0x60
        outb    %al,$0x60

        wrmsr
        lgdt    gdtdesc
        movl    %cr0, %eax
        orl     $CR0_PE, %eax
        movl    %eax, %cr0
        rdmsr      <======

//PAGEBREAK!
# Complete transition to 32-bit protected mode by using long jmp
# to reload %cs and %eip.  The segment descriptors are set up with no
# translation, so that the mapping is still the identity mapping.
        ljmp    $(SEG_KCODE<<3), $start32

        .code32  # Tell assembler to generate 32-bit code now.
start32:
cid:
        rdmsr
        cpuid

        # Bootstrap GDT
        .p2align 2                                # force 4 byte alignment
gdt:
        SEG_NULLASM                              # NULL seg
        SEG_ASM(STA_X|STA_R, 0x0, 0xffffffff)   # code seg
        SEG_ASM(STA_W, 0x0, 0xffffffff)         # data seg

gdtdesc:
        .word   (gdtdesc - gdt - 1)             # sizeof(gdt) - 1
        .long   gdt
code16_end:
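
To see which bytes the SEG_ASM macro emits, its arithmetic can be mirrored in C. This is only an illustrative sketch: the function name seg_asm_bytes is hypothetical, and it assumes the little-endian byte order that .word produces on x86.

```c
#include <stdint.h>

/* Mirror of the SEG_ASM(type, base, lim) macro: fill out[0..7] with the
 * eight descriptor bytes in the order the assembler would emit them. */
static void seg_asm_bytes(uint8_t out[8], uint8_t type, uint32_t base, uint32_t lim)
{
    uint16_t lim_lo  = (lim >> 12) & 0xffff;   /* limit bits 15:0, in 4 KiB units */
    uint16_t base_lo = base & 0xffff;          /* base bits 15:0 */

    out[0] = lim_lo & 0xff;                    /* .word, low byte first */
    out[1] = lim_lo >> 8;
    out[2] = base_lo & 0xff;
    out[3] = base_lo >> 8;
    out[4] = (base >> 16) & 0xff;              /* base bits 23:16 */
    out[5] = 0x90 | type;                      /* present, S=1 (code/data), DPL 0 */
    out[6] = 0xC0 | ((lim >> 28) & 0xf);       /* G=1, D/B=1, limit bits 19:16 */
    out[7] = (base >> 24) & 0xff;              /* base bits 31:24 */
}
```

For the code segment, seg_asm_bytes(buf, 0x8 | 0x2, 0x0, 0xffffffff) gives ff ff 00 00 00 9a cf 00. The data segment (type 0x2) differs only in the access byte, which becomes 0x92.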

The code is built on Linux as follows:

G_CFLAGS = -fno-pic -static -fno-builtin -fno-strict-aliasing -Wall -MD -ggdb -m32 -Werror -fno-omit-frame-pointer
G_CFLAGS += $(shell $(CC) -fno-stack-protector -E -x c /dev/null >/dev/null 2>&1 && echo -fno-stack-protector)
G_LDFLAGS += -m $(shell $(LD) -V | grep elf_i386 2>/dev/null)

        $(CC) $(G_CFLAGS) -fno-pic -nostdinc -I. -c code16.S
        $(LD) $(G_LDFLAGS) -N -e start -Ttext 0x7C00 -o bootblock.o code16.o
        $(OBJCOPY) -S -O binary -j .text bootblock.o bootblock.bin

The first rdmsr triggers a VM exit as expected. The guest state and VMCS at that moment are as follows:

VMCS fields:

0x0000003F = control_VMX_pin_based
0xA501E1F2 = control_VMX_cpu_based
0x00000082 = control_VMX_proc2_based
0x00000000 = control_exception_bitmap
0x00000000 = control_pagefault_errorcode_mask
0xFFFFFFFF = control_pagefault_errorcode_match
0x00000000 = control_CR3_target_count
0x00036FFB = control_VM_exit_controls
0x000011FB = control_VM_entry_controls
0x00000000 = control_VM_entry_interruption_information
0x00000000 = control_VM_entry_exception_errorcode
0x00000000 = control_VM_entry_instruction_length
0xFFFFFFFFFFFFFFF7 = control_CR0_mask
0xFFFFFFFFFFFFF871 = control_CR4_mask
0x0000000060000010 = control_CR0_shadow
0x0000000000000000 = control_CR4_shadow
0x0000000000000000 = control_CR3_target0
0x00000000B7934000 = control_CR3_target1
0x0000000000000000 = control_CR3_target2
0x0000000000000000 = control_CR3_target3

Guest state:

CR0=0000000000000031  CR3=0000000000000000  CR4=0000000000002050
RSP=0000000000007BFA  SYSENTER_ESP=0000000000000000
RIP=0000000000007C2E  SYSENTER_EIP=0000000000000000
DR7=0000000000000400  SYSENTER_CS=00000000  RFLAGS=0000000000000006
   ES=0000  [ base=0000000000000000 limit=0000FFFF rights=00000093 ]
   CS=0000  [ base=0000000000000000 limit=0000FFFF rights=0000009B ]
   SS=0000  [ base=0000000000000000 limit=0000FFFF rights=00000093 ]
   DS=0000  [ base=0000000000000000 limit=0000FFFF rights=00000093 ]
   FS=0000  [ base=0000000000000000 limit=0000FFFF rights=00000093 ]
   GS=0000  [ base=0000000000000000 limit=0000FFFF rights=00000093 ]
 LDTR=0000  [ base=0000000000000000 limit=0000FFFF rights=00000082 ]
   TR=0000  [ base=0000000000000000 limit=0000FFFF rights=0000008B ]
      GDTR  [ base=0000000000007C3C limit=00000017 ]
      IDTR  [ base=0000000000000000 limit=0000FFFF ]
EAX=60000011  ECX=00000000  ESI=00000000  ESP=00007BFA
EBX=00000000  EDX=00000000  EDI=00000000  EBP=00000000

The cpuinfo of the Linux VM that hosts my hypervisor is as follows:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
stepping        : 2
microcode       : 0x3c
cpu MHz         : 2397.291
cache size      : 15360 KB
physical id     : 2
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 2
initial apicid  : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 15
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm tpr_shadow vnmi ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid xsaveopt arat
bugs            :
bogomips        : 4801.89
clflush size    : 64
cache_alignment : 64
address sizes   : 43 bits physical, 48 bits virtual
power management:

I don't know what I missed.

Any help would be appreciated.

Thanks,

dariusd
VMware Employee

This is a bit above my level of knowledge of nested virtualization and HV in general, but I'll take a quick look anyway just to see if I can get things started.  Apologies if my questions are off the mark.  ;)

1. Does your code successfully switch to protected mode when running in your own hypervisor on "bare metal" (i.e. without running it inside VMware Workstation/ESXi/Fusion)?  (Or has this not been tried?)

2. Does your code successfully switch to protected mode when running directly in the VMware hypervisor (i.e. modify it to do something visible on screen from inside protected mode, then build bootblock.bin and attach it as a floppy image directly to a VMware virtual machine)?  (Or has this not been tried?)

3. What happens after the ljmp when executed in the nested scenario?  (Hang?  Crash?  Incorrect execution?  ... in your hypervisor or in the inner guest?)

4. Is there some part of the guest state at the first rdmsr which you know to be incorrect?  There's a lot of information there, so if you are aware of a problem or inconsistency, it would be best to point it out to help us along, since we haven't been looking at your code for as many hours as you have. ;)

I'm still trying to figure out how the CR0 mask, CR0 shadow and guest CR0 combine to make a value consistent with the guest having just set PE in CR0... It looks like the vmx guest will be running in protected mode (because PE is set in guest CR0) but would not actually see PE set in CR0 if the guest were to read it back (since the corresponding bit is set in the CR0 mask and clear in the CR0 shadow).  That is all if I'm understanding correctly, which is highly questionable right now...  This is my first foray into vmx programming.
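
How the three values combine on a read can be sketched in C. This assumes the rule the Intel SDM gives for MOV from CR0 in VMX non-root operation: a bit set in the mask is host-owned and is read from the shadow, while a clear bit is guest-owned and is read from the real guest CR0. The function name guest_visible_cr0 is made up for illustration.

```c
#include <stdint.h>

/* Value the guest gets back from "mov %cr0, %eax":
 * host-owned bits (mask bit = 1) come from the read shadow,
 * guest-owned bits (mask bit = 0) come from the real guest CR0. */
static uint64_t guest_visible_cr0(uint64_t guest_cr0, uint64_t mask, uint64_t shadow)
{
    return (shadow & mask) | (guest_cr0 & ~mask);
}
```

With the dumped values (guest CR0 0x31, mask 0xFFFFFFFFFFFFFFF7, shadow 0x60000010) this returns 0x60000010, so the guest would not see PE even though the real guest CR0 has it set.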

Thanks,

--

Darius

simitel
Contributor

Darius,

Thank you very much for your reply.

You gave me an important hint about where the failure comes from.

And you are right: CR0_shadow and CR0_mask need to be changed to make it work.

I changed CR0_mask to 0xFFFFFFFFFFFFFFF0, which means bit 0 (PE) is now owned by the guest, so the guest can set it as it wishes.

With this change, the ljmp to start32 really works, which is amazing.

But I have one more question.

At the rdmsr VM exit in start32, I checked the guest state and found that CR0 is 0x30, not 0x31. Is that expected?

I am reading the SDM and googling for more details on CR0_mask/CR0_shadow.

Thanks,

-Thai

dariusd
VMware Employee

"At the rdmsr VM exit in start32, I checked the guest state and found that CR0 is 0x30, not 0x31. Is that expected?"

Interesting... I don't think it is expected.  Can you provide updated dumps of the VMCS and guest state from the rdmsr instructions, both from before the ljmp and after the ljmp?

Also, so that I can be a little bit lazy, could you attach your bootblock.bin to a forum post so that I can examine the resulting binary?

Thanks,

--

Darius

simitel
Contributor

Hi Darius,

I found a mistake in my hypervisor: guest_CR0 was not being updated correctly.

After fixing that error, guest_CR0 is now 0x31.

I also added a check of the PE bit (bit 0) to the guest code, as follows:

    7c1d:       0f 01 16 5c 7c          lgdtw  0x7c5c           <=== load GDT
    7c22:       0f 20 c0                mov    %cr0,%eax
    7c25:       66 83 c8 01             or     $0x1,%eax
    7c29:       0f 22 c0                mov    %eax,%cr0        <=== set bit 0 (PE) in CR0
    7c2c:       0f 20 c0                mov    %cr0,%eax        <=== read CR0 back into EAX
    7c2f:       a8 01                   test   $0x1,%al         <=== check whether bit 0 (PE) is 1
    7c31:       75 02                   jne    7c35 <go_pe>     <=== if 1, go to the ljmp
    7c33:       0f 30                   wrmsr                   <=== else VM exit via wrmsr (in fact, this did NOT happen)

00007c35 <go_pe>:
    7c35:       66 ea 3d 7c 00 00 08    ljmpl  $0x8,$0x7c3d
    7c3c:       00

00007c3d <start32>:
    7c3d:       0f 32                   rdmsr

00007c3f <spin>:
    7c3f:       f4                      hlt
    7c40:       eb fd                   jmp    7c3f <spin>
    7c42:       0f a2                   cpuid

00007c44 <gdt>:
        ...
    7c4c:       ff                      (bad)
    7c4d:       ff 00                   incw   (%bx,%si)
    7c4f:       00 00                   add    %al,(%bx,%si)
    7c51:       9a cf 00 ff ff          lcall  $0xffff,$0xcf
    7c56:       00 00                   add    %al,(%bx,%si)
    7c58:       00 92 cf 00             add    %dl,0xcf(%bp,%si)

00007c5c <gdtdesc>:
    7c5c:       17                      pop    %ss
    7c5d:       00 44 7c                add    %al,0x7c(%si)
        ...

VMX Execution Controls

0x0000003F = control_VMX_pin_based
0xA501E1F2 = control_VMX_cpu_based
0x00000082 = control_VMX_proc2_based
0x00000000 = control_exception_bitmap
0x00000000 = control_pagefault_errorcode_mask
0xFFFFFFFF = control_pagefault_errorcode_match
0x00000000 = control_CR3_target_count
0x00036FFB = control_VM_exit_controls
0x000011FB = control_VM_entry_controls
0x00000000 = control_VM_entry_interruption_information
0x00000000 = control_VM_entry_exception_errorcode
0x00000000 = control_VM_entry_instruction_length
0xFFFFFFFFFFFFFFF0 = control_CR0_mask
0xFFFFFFFFFFFFF871 = control_CR4_mask
0x0000000060000010 = control_CR0_shadow
0x0000000000000000 = control_CR4_shadow
0x0000000000000000 = control_CR3_target0
0x00000000B31C0000 = control_CR3_target1
0x0000000000000000 = control_CR3_target2
0x0000000000000000 = control_CR3_target3

As you can see, the lowest 4 bits of CR0_mask are 0, so the guest can read and write them directly.

I am still checking why the following configuration does NOT work:

0xFFFFFFFFFFFFFFF7 = control_CR0_mask
0x0000000060000010 = control_CR0_shadow

Here bit 0 (PE) is owned by the host, so when the guest reads CR0, that bit comes from CR0_shadow, where it is 0.

When the guest sets PE, a VM exit happens, and control_CR0_shadow is then changed to 0x0000000060000011.

So the guest should be put into protected mode and be able to ljmp to start32.
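
The difference between the two mask values on a write can be sketched in C. This assumes the SDM rule that MOV to CR0 causes a VM exit when, for any bit set in the mask, the value being written differs from the shadow. The function name mov_to_cr0_causes_exit is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if writing new_value to CR0 would cause a VM exit: some host-owned
 * bit (mask bit = 1) in new_value differs from the same bit in the shadow. */
static bool mov_to_cr0_causes_exit(uint64_t new_value, uint64_t mask, uint64_t shadow)
{
    return ((new_value ^ shadow) & mask) != 0;
}
```

Writing 0x60000011 (the EAX value in the first dump) against shadow 0x60000010 causes an exit with mask 0xFFFFFFFFFFFFFFF7, because PE is host-owned. With mask 0xFFFFFFFFFFFFFFF0 there is no exit and the write to PE goes straight into the guest CR0.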

I am still checking it.

Thanks,

dariusd
VMware Employee

From my reading of the Intel docs (which might well be incorrect!), control_CR0_mask and control_CR0_shadow only come into play when the guest explicitly accesses CR0: they determine what a read returns and which writes cause a VM exit. The guest CR0 is what is actually set into the CPU's CR0 while it is running guest code.

So I think that if PE is set in control_CR0_mask (i.e. CR0.PE is owned by the host), your VM exit handler will need to correspondingly set/clear PE in guest CR0 (so that the running guest is actually put into protected mode) as well as in CR0_shadow (so that the guest can *see* that it is in protected mode when it reads CR0).

At least I think that is how it works...
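
A minimal sketch of such a handler in C, assuming a software copy of the three fields. The struct cr0_state and the function handle_mov_to_cr0 are hypothetical names. A real handler would go through VMREAD/VMWRITE, advance the guest RIP past the instruction, and check the new value against the CPU's fixed CR0 bits.

```c
#include <stdint.h>

#define CR0_PE 0x1ULL   /* Protection Enable, bit 0 */

/* Hypothetical software copy of the three CR0-related VMCS fields. */
struct cr0_state {
    uint64_t guest_cr0;    /* what the CPU really runs the guest with */
    uint64_t cr0_mask;     /* 1 = host-owned bit */
    uint64_t cr0_shadow;   /* what the guest reads for host-owned bits */
};

/* Handle a VM exit caused by the guest writing new_value to CR0.
 * Only PE is propagated to the real CR0 here; other bits are left alone. */
static void handle_mov_to_cr0(struct cr0_state *s, uint64_t new_value)
{
    /* Make the change real: copy PE into the guest CR0 field. */
    s->guest_cr0 = (s->guest_cr0 & ~CR0_PE) | (new_value & CR0_PE);

    /* Make the change visible: host-owned bits of the shadow take the
     * value the guest tried to write. */
    s->cr0_shadow = (s->cr0_shadow & ~s->cr0_mask) | (new_value & s->cr0_mask);
}
```

Starting from guest CR0 0x30, mask 0xFFFFFFFFFFFFFFF7 and shadow 0x60000010, a guest write of 0x60000011 leaves guest CR0 at 0x31 and the shadow at 0x60000011.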

Cheers,

--

Darius

simitel
Contributor

Yep, it is working now, and I can move on with my VMX study.

Thank you for your help.

Thai
