Solved: Help needed: Failed to enter protected mode in nested virtualization

simitel · ‎04-19-2018

I am running a Linux VM (kernel 4.4.0) in VMware.

I have developed a VMX-based hypervisor.

It loads and runs real-mode guest code without problems.

Now I have changed the guest to make it switch from real mode to protected mode, but the switch fails.

Here is the guest code:

#define SEG_KCODE 1  // kernel code
#define SEG_KDATA 2  // kernel data+stack
#define SEG_KCPU  3  // kernel per-cpu data
#define SEG_UCODE 4  // user code
#define SEG_UDATA 5  // user data+stack
#define SEG_TSS   6  // this process's task state

#define CR0_PE 0x00000001  // Protection Enable

#define SEG_NULLASM \
        .word 0, 0; \
        .byte 0, 0, 0, 0

// The 0xC0 means the limit is in 4096-byte units
// and (for executable segments) 32-bit mode.
#define SEG_ASM(type,base,lim) \
        .word (((lim) >> 12) & 0xffff), ((base) & 0xffff); \
        .byte (((base) >> 16) & 0xff), (0x90 | (type)), \
              (0xC0 | (((lim) >> 28) & 0xf)), (((base) >> 24) & 0xff)

#define STA_X 0x8  // Executable segment
#define STA_E 0x4  // Expand down (non-executable segments)
#define STA_C 0x4  // Conforming code segment (executable only)
#define STA_W 0x2  // Writeable (non-executable segments)
#define STA_R 0x2  // Readable (executable segments)
#define STA_A 0x1  // Accessed

# Start the first CPU: switch to 32-bit protected mode, jump into C.
.code16
.global code16, code16_end
code16:
        xor     %ecx, %ecx
        mov     %cr3, %eax
        mov     %eax, %cr3

seta20.1:
        inb     $0x64,%al               # Wait for not busy
        testb   $0x2,%al
        jnz     seta20.1

        movb    $0xd1,%al               # 0xd1 -> port 0x64
        outb    %al,$0x64

seta20.2:
        inb     $0x64,%al               # Wait for not busy
        testb   $0x2,%al
        jnz     seta20.2

        movb    $0xdf,%al               # 0xdf -> port 0x60
        outb    %al,$0x60

        wrmsr

        lgdt    gdtdesc
        movl    %cr0, %eax
        orl     $CR0_PE, %eax
        movl    %eax, %cr0
        rdmsr                           # <====== first rdmsr

//PAGEBREAK!
        # Complete transition to 32-bit protected mode by using long jmp
        # to reload %cs and %eip. The segment descriptors are set up with no
        # translation, so that the mapping is still the identity mapping.
        ljmp    $(SEG_KCODE<<3), $start32

.code32  # Tell assembler to generate 32-bit code now.
start32:
cid:
        rdmsr
        cpuid

# Bootstrap GDT
.p2align 2                              # force 4 byte alignment
gdt:
        SEG_NULLASM                             # NULL seg
        SEG_ASM(STA_X|STA_R, 0x0, 0xffffffff)   # code seg
        SEG_ASM(STA_W, 0x0, 0xffffffff)         # data seg

gdtdesc:
        .word   (gdtdesc - gdt - 1)             # sizeof(gdt) - 1
        .long   gdt

code16_end:
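
For anyone checking the bootstrap GDT by hand, here is a small Python sketch (not part of the guest code) that mirrors the arithmetic of the SEG_ASM macro. The function name seg_asm is made up for this illustration. It packs the same fields in the same order, so the bytes can be compared with the gdt bytes in the disassembly posted later in this thread and with the GDTR limit in the guest state dump.

```python
import struct

STA_X = 0x8  # executable segment
STA_R = 0x2  # readable (executable segments)
STA_W = 0x2  # writeable (non-executable segments)


def seg_asm(seg_type, base, lim):
    """Return the 8 descriptor bytes that SEG_ASM(type, base, lim) emits."""
    return struct.pack(
        "<HHBBBB",
        (lim >> 12) & 0xFFFF,        # limit bits 15:0, in 4096-byte units
        base & 0xFFFF,               # base bits 15:0
        (base >> 16) & 0xFF,         # base bits 23:16
        0x90 | seg_type,             # present, DPL 0, code/data segment, type
        0xC0 | ((lim >> 28) & 0xF),  # G=1, D/B=1, limit bits 19:16
        (base >> 24) & 0xFF,         # base bits 31:24
    )


code_seg = seg_asm(STA_X | STA_R, 0x0, 0xFFFFFFFF)
data_seg = seg_asm(STA_W, 0x0, 0xFFFFFFFF)
gdt = bytes(8) + code_seg + data_seg  # null descriptor comes first

print(code_seg.hex(" "))   # ff ff 00 00 00 9a cf 00
print(data_seg.hex(" "))   # ff ff 00 00 00 92 cf 00
print(hex(len(gdt) - 1))   # 0x17, the value gdtdesc stores as the limit
```

The 0x17 limit agrees with the GDTR limit in the guest state dump, which suggests the GDT itself is not the problem.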

The code is built on Linux as follows:

G_CFLAGS = -fno-pic -static -fno-builtin -fno-strict-aliasing -Wall -MD -ggdb -m32 -Werror -fno-omit-frame-pointer

G_CFLAGS += $(shell $(CC) -fno-stack-protector -E -x c /dev/null >/dev/null 2>&1 && echo -fno-stack-protector)

G_LDFLAGS += -m $(shell $(LD) -V | grep elf_i386 2>/dev/null)

$(CC) $(G_CFLAGS) -fno-pic -nostdinc -I. -c code16.S

$(LD) $(G_LDFLAGS) -N -e start -Ttext 0x7C00 -o bootblock.o code16.o

$(OBJCOPY) -S -O binary -j .text bootblock.o bootblock.bin

The first rdmsr triggers a VM exit as expected. The guest state and VMCS at that moment are as follows:

VMCS fields.

0x0000003F = control_VMX_pin_based

0xA501E1F2 = control_VMX_cpu_based

0x00000082 = control_VMX_proc2_based

0x00000000 = control_exception_bitmap

0x00000000 = control_pagefault_errorcode_mask

0xFFFFFFFF = control_pagefault_errorcode_match

0x00000000 = control_CR3_target_count

0x00036FFB = control_VM_exit_controls

0x000011FB = control_VM_entry_controls

0x00000000 = control_VM_entry_interruption_information

0x00000000 = control_VM_entry_exception_errorcode

0x00000000 = control_VM_entry_instruction_length

0xFFFFFFFFFFFFFFF7 = control_CR0_mask

0xFFFFFFFFFFFFF871 = control_CR4_mask

0x0000000060000010 = control_CR0_shadow

0x0000000000000000 = control_CR4_shadow

0x0000000000000000 = control_CR3_target0

0x00000000B7934000 = control_CR3_target1

0x0000000000000000 = control_CR3_target2

0x0000000000000000 = control_CR3_target3

Guest state:

CR0=0000000000000031 CR3=0000000000000000 CR4=0000000000002050

RSP=0000000000007BFA SYSENTER_ESP=0000000000000000

RIP=0000000000007C2E SYSENTER_EIP=0000000000000000

DR7=0000000000000400 SYSENTER_CS=00000000 RFLAGS=0000000000000006

ES=0000 [ base=0000000000000000 limit=0000FFFF rights=00000093 ]

CS=0000 [ base=0000000000000000 limit=0000FFFF rights=0000009B ]

SS=0000 [ base=0000000000000000 limit=0000FFFF rights=00000093 ]

DS=0000 [ base=0000000000000000 limit=0000FFFF rights=00000093 ]

FS=0000 [ base=0000000000000000 limit=0000FFFF rights=00000093 ]

GS=0000 [ base=0000000000000000 limit=0000FFFF rights=00000093 ]

LDTR=0000 [ base=0000000000000000 limit=0000FFFF rights=00000082 ]

TR=0000 [ base=0000000000000000 limit=0000FFFF rights=0000008B ]

GDTR [ base=0000000000007C3C limit=00000017 ]

IDTR [ base=0000000000000000 limit=0000FFFF ]

EAX=60000011 ECX=00000000 ESI=00000000 ESP=00007BFA

EBX=00000000 EDX=00000000 EDI=00000000 EBP=00000000

The cpuinfo of the Linux VM that hosts my hypervisor is as follows:

processor : 1

vendor_id : GenuineIntel

cpu family : 6

model : 63

model name : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz

stepping : 2

microcode : 0x3c

cpu MHz : 2397.291

cache size : 15360 KB

physical id : 2

siblings : 1

core id : 0

cpu cores : 1

apicid : 2

initial apicid : 2

fpu : yes

fpu_exception : yes

cpuid level : 15

wp : yes

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm tpr_shadow vnmi ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid xsaveopt arat

bugs :

bogomips : 4801.89

clflush size : 64

cache_alignment : 64

address sizes : 43 bits physical, 48 bits virtual

power management:

I don't know what I missed.

Please help.

Thanks,

dariusd · ‎04-20-2018

This is a bit above my level of knowledge of nested virtualization and HV in general, but I'll take a quick look anyway just to see if I can get things started. Apologies if my questions are off the mark.

1. Does your code successfully switch to protected mode when running in your own hypervisor on "bare metal" (i.e. without running it inside VMware Workstation/ESXi/Fusion)? (Or has this not been tried?)

2. Does your code successfully switch to protected mode when running directly in the VMware hypervisor (i.e. modify it to do something visible on screen from inside protected mode, then build bootblock.bin and attach it as a floppy image directly to a VMware virtual machine)? (Or has this not been tried?)

3. What happens after the ljmp when executed in the nested scenario? (Hang? Crash? Incorrect execution? ... in your hypervisor or in the inner guest?)

4. Is there some part of the guest state at the first rdmsr which you know to be incorrect? There's a lot of information there, so if you are aware of a problem or inconsistency, it would be best to point it out to help us along, since we haven't been looking at your code for as many hours as you have.

I'm still trying to figure out how the CR0 mask, CR0 shadow and guest CR0 combine to make a value consistent with the guest having just set PE in CR0... It looks like the vmx guest will be running in protected mode (because PE is set in guest CR0) but would not actually see PE set in CR0 if the guest were to read it back (since the corresponding bit is set in the CR0 mask and clear in the CR0 shadow). That is all if I'm understanding correctly, which is highly questionable right now... This is my first foray into vmx programming.
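
To make that combination concrete, here is a minimal Python model of what a guest read of CR0 returns, based on my understanding of the SDM: bits set in the CR0 mask are host-owned and come from the read shadow, while the remaining bits come from the real guest CR0. The function name guest_visible_cr0 is only for illustration. The inputs are the values from the dump in the first post.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
CR0_PE = 0x1


def guest_visible_cr0(guest_cr0, cr0_mask, cr0_shadow):
    """Model the value a guest 'mov %cr0, reg' would return."""
    host_owned = cr0_shadow & cr0_mask            # read from the shadow
    guest_owned = guest_cr0 & ~cr0_mask & MASK64  # read from the real guest CR0
    return host_owned | guest_owned


# Values taken from the VMCS and guest state dump in the first post.
seen = guest_visible_cr0(
    guest_cr0=0x31,
    cr0_mask=0xFFFFFFFFFFFFFFF7,
    cr0_shadow=0x60000010,
)

print(hex(seen))            # 0x60000010
print(bool(seen & CR0_PE))  # False: the guest does not see PE set
```

This is also consistent with EAX=60000011 in the dump: the guest read 0x60000010 and then ORed in CR0_PE.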

Thanks,

--

Darius

simitel · ‎04-20-2018

Darius,

Thank you very much for your reply.

You really gave me an important hint about where the failure comes from.

And you are right: CR0_shadow and CR0_mask need to be changed to make it work.

I changed CR0_mask to 0xFFFFFFFFFFFFFFF0, which means that bit 0 (PE), like the rest of the low four bits, is owned by the guest, so the guest can set it as it wishes.
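
As a quick check of which CR0 bits each mask leaves to the guest, here is a small Python sketch. A bit that is 0 in the mask is guest-owned. The helper name guest_owned_bits is made up for this example; the bit positions are the architectural CR0 flag positions.

```python
# Architectural CR0 flag positions.
CR0_BITS = {
    0: "PE", 1: "MP", 2: "EM", 3: "TS", 4: "ET", 5: "NE",
    16: "WP", 18: "AM", 29: "NW", 30: "CD", 31: "PG",
}


def guest_owned_bits(cr0_mask):
    """List the named CR0 flags whose mask bit is 0 (guest-owned)."""
    return [
        name
        for bit, name in sorted(CR0_BITS.items())
        if not (cr0_mask >> bit) & 1
    ]


print(guest_owned_bits(0xFFFFFFFFFFFFFFF7))  # ['TS']
print(guest_owned_bits(0xFFFFFFFFFFFFFFF0))  # ['PE', 'MP', 'EM', 'TS']
```

So the original mask left only TS to the guest, and the new mask hands over PE, MP, EM, and TS.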

With this change, the ljmp to start32 really works, which is amazing.

But I have one more question.

With the rdmsr VM exit in start32, I checked the guest state and found that CR0 is 0x30, not 0x31. Is that expected?

I am reading the SDM and googling CR0_mask/CR0_shadow for more details.

Thanks,

-Thai

dariusd · ‎04-21-2018

"With the rdmsr VM exit in start32, I checked the guest state and found that CR0 is 0x30, not 0x31. Is that expected?"

Interesting... I don't think it is expected. Can you provide updated dumps of the VMCS and guest state from the rdmsr instructions, both from before the ljmp and after the ljmp?

Also, so that I can be a little bit lazy, could you attach your bootblock.bin to a forum post so that I can examine the resulting binary?

Thanks,

--

Darius

simitel · ‎04-21-2018

Hi Darius,

I found a mistake in my hypervisor: guest_CR0 was not being updated correctly.

After fixing that, guest_CR0 is now 0x31.

I also added a check of the PE bit (CR0 bit 0) to the guest code, as follows:

    7c1d:  0f 01 16 5c 7c        lgdtw  0x7c5c          <=== Load GDT
    7c22:  0f 20 c0              mov    %cr0,%eax
    7c25:  66 83 c8 01           or     $0x1,%eax
    7c29:  0f 22 c0              mov    %eax,%cr0       <=== set bit0_PE to CR0
    7c2c:  0f 20 c0              mov    %cr0,%eax       <=== read CR0 back to AX
    7c2f:  a8 01                 test   $0x1,%al        <=== check if bit0_PE is '1'
    7c31:  75 02                 jne    7c35 <go_pe>    <=== if '1', go to 'ljmp'
    7c33:  0f 30                 wrmsr                  <=== else, VM_EXIT by wrmsr. In fact, it did NOT happen.

00007c35 <go_pe>:
    7c35:  66 ea 3d 7c 00 00 08  ljmpl  $0x8,$0x7c3d
    7c3c:  00

00007c3d <start32>:
    7c3d:  0f 32                 rdmsr

00007c3f <spin>:
    7c3f:  f4                    hlt
    7c40:  eb fd                 jmp    7c3f <spin>
    7c42:  0f a2                 cpuid

00007c44 <gdt>:
        ...
    7c4c:  ff                    (bad)
    7c4d:  ff 00                 incw   (%bx,%si)
    7c4f:  00 00                 add    %al,(%bx,%si)
    7c51:  9a cf 00 ff ff        lcall  $0xffff,$0xcf
    7c56:  00 00                 add    %al,(%bx,%si)
    7c58:  00 92 cf 00           add    %dl,0xcf(%bp,%si)

00007c5c <gdtdesc>:
    7c5c:  17                    pop    %ss
    7c5d:  00 44 7c              add    %al,0x7c(%si)
        ...

VMX Execution Controls

0x0000003F = control_VMX_pin_based

0xA501E1F2 = control_VMX_cpu_based

0x00000082 = control_VMX_proc2_based

0x00000000 = control_exception_bitmap

0x00000000 = control_pagefault_errorcode_mask

0xFFFFFFFF = control_pagefault_errorcode_match

0x00000000 = control_CR3_target_count

0x00036FFB = control_VM_exit_controls

0x000011FB = control_VM_entry_controls

0x00000000 = control_VM_entry_interruption_information

0x00000000 = control_VM_entry_exception_errorcode

0x00000000 = control_VM_entry_instruction_length

0xFFFFFFFFFFFFFFF0 = control_CR0_mask

0xFFFFFFFFFFFFF871 = control_CR4_mask

0x0000000060000010 = control_CR0_shadow

0x0000000000000000 = control_CR4_shadow

0x0000000000000000 = control_CR3_target0

0x00000000B31C0000 = control_CR3_target1

0x0000000000000000 = control_CR3_target2

0x0000000000000000 = control_CR3_target3

As you can see, the lowest 4 bits of CR0_mask are '0', so the guest can read and write them as it expects.

I am checking why the following configuration did NOT work:

0xFFFFFFFFFFFFFFF7 = control_CR0_mask

0x0000000060000010 = control_CR0_shadow

It means that bit0_PE is owned by the host, and when the guest reads CR0, that bit comes from CR0_shadow, where bit0_PE is '0'.

When the guest sets bit0_PE, a VM exit happens, and control_CR0_shadow is changed to 0x0000000060000011.

So the guest should be put into protected mode and should be able to ljmp to start32.
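
The exit condition itself can be modelled in a few lines of Python. This is a sketch of my reading of the SDM: a MOV to CR0 causes a VM exit when the value being written differs from the read shadow in any bit that is set in the mask. The function name mov_to_cr0_exits is hypothetical.

```python
def mov_to_cr0_exits(new_value, cr0_mask, cr0_shadow):
    """Return True if a guest 'mov reg, %cr0' would cause a VM exit."""
    return ((new_value ^ cr0_shadow) & cr0_mask) != 0


shadow = 0x60000010
wanted = shadow | 0x1  # the guest ORs in CR0_PE and writes the result back

print(mov_to_cr0_exits(wanted, 0xFFFFFFFFFFFFFFF7, shadow))  # True: PE is host-owned
print(mov_to_cr0_exits(wanted, 0xFFFFFFFFFFFFFFF0, shadow))  # False: PE is guest-owned
```

Note that the exit by itself changes neither the guest CR0 field nor the shadow. Whatever the guest ends up seeing depends on what the exit handler writes back into the VMCS.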

I am still checking it.

Thanks,

dariusd · ‎04-21-2018

From my reading of the Intel docs (which might well be incorrect!), control_CR0_mask and control_CR0_shadow are only used when the guest explicitly reads from CR0, and the guest CR0 is what is actually set into the CPU's CR0 while it is running guest code.

So I think that if PE is set in control_CR0_mask (i.e. CR0.PE is owned by the host), your VM exit handler will need to correspondingly set/clear PE in guest CR0 (so that the running guest is actually put into protected mode) as well as in CR0_shadow (so that the guest can *see* that it is in protected mode when it reads CR0).
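
Here is a rough Python model of that handler logic, just to show the two updates side by side. It is a sketch under assumptions, not real hypervisor code: the Vmcs class and the function names are invented for illustration, the starting guest CR0 of 0x30 is assumed from the value reported earlier in the thread, and a real handler would use VMREAD/VMWRITE and also advance the guest RIP past the MOV instruction.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
CR0_PE = 0x1


class Vmcs:
    """Hypothetical stand-in for the three CR0-related VMCS fields."""

    def __init__(self, guest_cr0, cr0_mask, cr0_shadow):
        self.guest_cr0 = guest_cr0
        self.cr0_mask = cr0_mask
        self.cr0_shadow = cr0_shadow


def guest_visible_cr0(vmcs):
    """Value the guest reads: shadow for host-owned bits, guest CR0 otherwise."""
    return (vmcs.cr0_shadow & vmcs.cr0_mask) | (
        vmcs.guest_cr0 & ~vmcs.cr0_mask & MASK64
    )


def handle_cr0_write_exit(vmcs, wanted):
    """Apply the PE bit the guest tried to write to both VMCS fields."""
    pe = wanted & CR0_PE
    # Shadow: so that the guest *sees* PE when it reads CR0 back.
    vmcs.cr0_shadow = (vmcs.cr0_shadow & ~CR0_PE & MASK64) | pe
    # Guest CR0: so that the guest actually *runs* in protected mode.
    vmcs.guest_cr0 = (vmcs.guest_cr0 & ~CR0_PE & MASK64) | pe


vmcs = Vmcs(guest_cr0=0x30, cr0_mask=0xFFFFFFFFFFFFFFF7, cr0_shadow=0x60000010)
handle_cr0_write_exit(vmcs, wanted=0x60000011)

print(hex(vmcs.guest_cr0))           # 0x31
print(hex(vmcs.cr0_shadow))          # 0x60000011
print(hex(guest_visible_cr0(vmcs)))  # 0x60000011
```

Updating only the shadow would leave the guest believing it is in protected mode while still running with PE clear, and updating only guest CR0 would give the opposite mismatch.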

At least I think that is how it works...

Cheers,

--

Darius

simitel · ‎04-22-2018

Yep, it is working now, and I can move on with my VMX study.

Thank you for your help.

Thai
