I am running a Linux VM (kernel 4.4.0) in VMware.
I have developed a VMX-based hypervisor.
It loads and runs real-mode guest code fine.
I have now changed the guest so that it switches from real mode to protected mode, but the switch fails.
Here is the guest code:
#define SEG_KCODE 1 // kernel code
#define SEG_KDATA 2 // kernel data+stack
#define SEG_KCPU 3 // kernel per-cpu data
#define SEG_UCODE 4 // user code
#define SEG_UDATA 5 // user data+stack
#define SEG_TSS 6 // this process's task state
#define CR0_PE 0x00000001 // Protection Enable
#define SEG_NULLASM \
.word 0, 0; \
.byte 0, 0, 0, 0
// The 0xC0 means the limit is in 4096-byte units
// and (for executable segments) 32-bit mode.
#define SEG_ASM(type,base,lim) \
.word (((lim) >> 12) & 0xffff), ((base) & 0xffff); \
.byte (((base) >> 16) & 0xff), (0x90 | (type)), \
(0xC0 | (((lim) >> 28) & 0xf)), (((base) >> 24) & 0xff)
#define STA_X 0x8 // Executable segment
#define STA_E 0x4 // Expand down (non-executable segments)
#define STA_C 0x4 // Conforming code segment (executable only)
#define STA_W 0x2 // Writeable (non-executable segments)
#define STA_R 0x2 // Readable (executable segments)
#define STA_A 0x1 // Accessed
# Start the first CPU: switch to 32-bit protected mode, jump into C.
.code16
.global code16, code16_end
code16:
xor %ecx, %ecx
mov %cr3, %eax
mov %eax, %cr3
seta20.1:
inb $0x64,%al # Wait for not busy
testb $0x2,%al
jnz seta20.1
movb $0xd1,%al # 0xd1 -> port 0x64
outb %al,$0x64
seta20.2:
inb $0x64,%al # Wait for not busy
testb $0x2,%al
jnz seta20.2
movb $0xdf,%al # 0xdf -> port 0x60
outb %al,$0x60
wrmsr
lgdt gdtdesc
movl %cr0, %eax
orl $CR0_PE, %eax
movl %eax, %cr0
rdmsr # <====== first VM exit is triggered here
//PAGEBREAK!
# Complete transition to 32-bit protected mode by using long jmp
# to reload %cs and %eip. The segment descriptors are set up with no
# translation, so that the mapping is still the identity mapping.
ljmp $(SEG_KCODE<<3), $start32
.code32 # Tell assembler to generate 32-bit code now.
start32:
cid:
rdmsr
cpuid
# Bootstrap GDT
.p2align 2 # force 4 byte alignment
gdt:
SEG_NULLASM # NULL seg
SEG_ASM(STA_X|STA_R, 0x0, 0xffffffff) # code seg
SEG_ASM(STA_W, 0x0, 0xffffffff) # data seg
gdtdesc:
.word (gdtdesc - gdt - 1) # sizeof(gdt) - 1
.long gdt
code16_end:
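As a cross-check of the SEG_ASM macro above, here is a minimal C sketch that packs a descriptor using the same field layout. The helper name seg_desc is hypothetical and is not part of the guest code; it only mirrors the macro's arithmetic so the resulting bytes can be compared against a dump of the GDT.

```c
#include <stdint.h>

/* Hypothetical helper (not in the original guest code): packs an 8-byte
 * segment descriptor with the same layout as the SEG_ASM macro. */
void seg_desc(uint8_t out[8], uint8_t type, uint32_t base, uint32_t lim)
{
    uint16_t lim_lo = (lim >> 12) & 0xffff;   /* limit bits 15:0, in 4096-byte units */

    out[0] = lim_lo & 0xff;                   /* .word limit, little-endian */
    out[1] = lim_lo >> 8;
    out[2] = base & 0xff;                     /* .word base 15:0, little-endian */
    out[3] = (base >> 8) & 0xff;
    out[4] = (base >> 16) & 0xff;             /* base 23:16 */
    out[5] = 0x90 | type;                     /* present, code/data descriptor, DPL 0 */
    out[6] = 0xC0 | ((lim >> 28) & 0xf);      /* 4096-byte granularity, 32-bit, limit 19:16 */
    out[7] = (base >> 24) & 0xff;             /* base 31:24 */
}
```

Calling seg_desc(d, STA_X|STA_R, 0x0, 0xffffffff) with STA_X = 0x8 and STA_R = 0x2 should give the bytes ff ff 00 00 00 9a cf 00 for the code segment, and STA_W = 0x2 should give ff ff 00 00 00 92 cf 00 for the data segment.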
The code is built on Linux as follows:
G_CFLAGS = -fno-pic -static -fno-builtin -fno-strict-aliasing -Wall -MD -ggdb -m32 -Werror -fno-omit-frame-pointer
G_CFLAGS += $(shell $(CC) -fno-stack-protector -E -x c /dev/null >/dev/null 2>&1 && echo -fno-stack-protector)
G_LDFLAGS += -m $(shell $(LD) -V | grep elf_i386 2>/dev/null)
$(CC) $(G_CFLAGS) -fno-pic -nostdinc -I. -c code16.S
$(LD) $(G_LDFLAGS) -N -e start -Ttext 0x7C00 -o bootblock.o code16.o
$(OBJCOPY) -S -O binary -j .text bootblock.o bootblock.bin
The first rdmsr triggers a VM exit as expected. The VMCS fields and guest state at that moment are as follows:
VMCS fields.
0x0000003F = control_VMX_pin_based
0xA501E1F2 = control_VMX_cpu_based
0x00000082 = control_VMX_proc2_based
0x00000000 = control_exception_bitmap
0x00000000 = control_pagefault_errorcode_mask
0xFFFFFFFF = control_pagefault_errorcode_match
0x00000000 = control_CR3_target_count
0x00036FFB = control_VM_exit_controls
0x000011FB = control_VM_entry_controls
0x00000000 = control_VM_entry_interruption_information
0x00000000 = control_VM_entry_exception_errorcode
0x00000000 = control_VM_entry_instruction_length
0xFFFFFFFFFFFFFFF7 = control_CR0_mask
0xFFFFFFFFFFFFF871 = control_CR4_mask
0x0000000060000010 = control_CR0_shadow
0x0000000000000000 = control_CR4_shadow
0x0000000000000000 = control_CR3_target0
0x00000000B7934000 = control_CR3_target1
0x0000000000000000 = control_CR3_target2
0x0000000000000000 = control_CR3_target3
Guest state:
CR0=0000000000000031 CR3=0000000000000000 CR4=0000000000002050
RSP=0000000000007BFA SYSENTER_ESP=0000000000000000
RIP=0000000000007C2E SYSENTER_EIP=0000000000000000
DR7=0000000000000400 SYSENTER_CS=00000000 RFLAGS=0000000000000006
ES=0000 [ base=0000000000000000 limit=0000FFFF rights=00000093 ]
CS=0000 [ base=0000000000000000 limit=0000FFFF rights=0000009B ]
SS=0000 [ base=0000000000000000 limit=0000FFFF rights=00000093 ]
DS=0000 [ base=0000000000000000 limit=0000FFFF rights=00000093 ]
FS=0000 [ base=0000000000000000 limit=0000FFFF rights=00000093 ]
GS=0000 [ base=0000000000000000 limit=0000FFFF rights=00000093 ]
LDTR=0000 [ base=0000000000000000 limit=0000FFFF rights=00000082 ]
TR=0000 [ base=0000000000000000 limit=0000FFFF rights=0000008B ]
GDTR [ base=0000000000007C3C limit=00000017 ]
IDTR [ base=0000000000000000 limit=0000FFFF ]
EAX=60000011 ECX=00000000 ESI=00000000 ESP=00007BFA
EBX=00000000 EDX=00000000 EDI=00000000 EBP=00000000
The cpuinfo of the Linux VM host is as follows:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
stepping : 2
microcode : 0x3c
cpu MHz : 2397.291
cache size : 15360 KB
physical id : 2
siblings : 1
core id : 0
cpu cores : 1
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 15
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm tpr_shadow vnmi ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid xsaveopt arat
bugs :
bogomips : 4801.89
clflush size : 64
cache_alignment : 64
address sizes : 43 bits physical, 48 bits virtual
power management:
I don't know what I missed.
Any help would be appreciated.
Thanks,
This is a bit above my level of knowledge of nested virtualization and HV in general, but I'll take a quick look anyway just to see if I can get things started. Apologies if my questions are off the mark.
1. Does your code successfully switch to protected mode when running in your own hypervisor on "bare metal" (i.e. without running it inside VMware Workstation/ESXi/Fusion)? (Or has this not been tried?)
2. Does your code successfully switch to protected mode when running directly in the VMware hypervisor (i.e. modify it to do something visible on screen from inside protected mode, then build bootblock.bin and attach it as a floppy image directly to a VMware virtual machine)? (Or has this not been tried?)
3. What happens after the ljmp when executed in the nested scenario? (Hang? Crash? Incorrect execution? ... in your hypervisor or in the inner guest?)
4. Is there some part of the guest state at the first rdmsr which you know to be incorrect? There's a lot of information there, so if you are aware of a problem or inconsistency, it would be best to point it out to help us along, since we haven't been looking at your code for as many hours as you have.
I'm still trying to figure out how the CR0 mask, CR0 shadow and guest CR0 combine to make a value consistent with the guest having just set PE in CR0... It looks like the vmx guest will be running in protected mode (because PE is set in guest CR0) but would not actually see PE set in CR0 if the guest were to read it back (since the corresponding bit is set in the CR0 mask and clear in the CR0 shadow). That is all if I'm understanding correctly, which is highly questionable right now... This is my first foray into vmx programming.
Thanks,
--
Darius
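The way the CR0 mask, CR0 shadow and guest CR0 combine on a guest read can be written as a small C sketch. For each bit set in the mask (host-owned), the guest sees the shadow bit; for each clear bit (guest-owned), it sees the real guest CR0 bit. The function name is made up for illustration; the values in the usage note come from the dump above.

```c
#include <stdint.h>

/* Value a VMX guest observes when it reads CR0 (e.g. mov %cr0, %eax).
 * Host-owned bits (mask = 1) come from the read shadow,
 * guest-owned bits (mask = 0) come from the guest CR0 field. */
uint64_t cr0_as_read_by_guest(uint64_t guest_cr0, uint64_t mask, uint64_t shadow)
{
    return (guest_cr0 & ~mask) | (shadow & mask);
}
```

With guest CR0 = 0x31, mask = 0xFFFFFFFFFFFFFFF7 and shadow = 0x60000010, the guest reads back 0x60000010, so PE appears clear even though the guest is running with PE set.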
Darius,
Thank you very much for your reply.
You gave me an important hint about where the failure comes from.
You are right: CR0_shadow and CR0_mask need to be changed to make it work.
I changed CR0_mask to 0xFFFFFFFFFFFFFFF0, which means bit 0 (PE) is owned by the guest, so the guest can set it as it wishes.
With this change, the ljmp to start32 really works, which is amazing.
But I have one more question.
At the rdmsr VM exit in start32, I checked the guest state and found that CR0 is 0x30, not 0x31. Is that expected?
I am reading the SDM and googling CR0_mask/CR0_shadow for more details.
Thanks,
-Thai
With rdmsr VM exit in start32, i checked the guest state, and found the CR0 is 0x30, not 0x31, is it expected?
Interesting... I don't think it is expected. Can you provide updated dumps of the VMCS and guest state from the rdmsr instructions, both from before the ljmp and after the ljmp?
Also, so that I can be a little bit lazy, could you attach your bootblock.bin to a forum post so that I can examine the resulting binary?
Thanks,
--
Darius
Hi Darius,
I found a mistake in my hypervisor: guest_CR0 was not being updated correctly.
After fixing it, guest_CR0 is now 0x31.
I also added a check of the PE bit (bit 0) to the guest code, as follows:
7c1d: 0f 01 16 5c 7c lgdtw 0x7c5c <=== Load GDT,
7c22: 0f 20 c0 mov %cr0,%eax
7c25: 66 83 c8 01 or $0x1,%eax
7c29: 0f 22 c0 mov %eax,%cr0 <=== set bit0_PE to CR0
7c2c: 0f 20 c0 mov %cr0,%eax <=== read CR0 back to AX
7c2f: a8 01 test $0x1,%al <=== check if bit0_PE is '1'.
7c31: 75 02 jne 7c35 <go_pe> <=== if '1', go to 'ljmp',
7c33: 0f 30 wrmsr <=== else, VM_EXIT by wrmsr. In fact, it did NOT happen.
00007c35 <go_pe>:
7c35: 66 ea 3d 7c 00 00 08 ljmpl $0x8,$0x7c3d
7c3c: 00
00007c3d <start32>:
7c3d: 0f 32 rdmsr
00007c3f <spin>:
7c3f: f4 hlt
7c40: eb fd jmp 7c3f <spin>
7c42: 0f a2 cpuid
00007c44 <gdt>:
...
7c4c: ff (bad)
7c4d: ff 00 incw (%bx,%si)
7c4f: 00 00 add %al,(%bx,%si)
7c51: 9a cf 00 ff ff lcall $0xffff,$0xcf
7c56: 00 00 add %al,(%bx,%si)
7c58: 00 92 cf 00 add %dl,0xcf(%bp,%si)
00007c5c <gdtdesc>:
7c5c: 17 pop %ss
7c5d: 00 44 7c add %al,0x7c(%si)
...
VMX Execution Controls
0x0000003F = control_VMX_pin_based
0xA501E1F2 = control_VMX_cpu_based
0x00000082 = control_VMX_proc2_based
0x00000000 = control_exception_bitmap
0x00000000 = control_pagefault_errorcode_mask
0xFFFFFFFF = control_pagefault_errorcode_match
0x00000000 = control_CR3_target_count
0x00036FFB = control_VM_exit_controls
0x000011FB = control_VM_entry_controls
0x00000000 = control_VM_entry_interruption_information
0x00000000 = control_VM_entry_exception_errorcode
0x00000000 = control_VM_entry_instruction_length
0xFFFFFFFFFFFFFFF0 = control_CR0_mask
0xFFFFFFFFFFFFF871 = control_CR4_mask
0x0000000060000010 = control_CR0_shadow
0x0000000000000000 = control_CR4_shadow
0x0000000000000000 = control_CR3_target0
0x00000000B31C0000 = control_CR3_target1
0x0000000000000000 = control_CR3_target2
0x0000000000000000 = control_CR3_target3
As you can see, the lowest 4 bits of CR0_mask are 0, so the guest can read and write those bits directly.
I am now checking why the following configuration did NOT work:
0xFFFFFFFFFFFFFFF7 = control_CR0_mask
and 0x0000000060000010 = control_CR0_shadow.
With these values, bit 0 (PE) is owned by the host, and when the guest reads CR0, bit 0 comes from CR0_shadow, where it is 0.
When the guest sets bit 0 (PE), a VM exit happens, and control_CR0_shadow is changed to 0x0000000060000011.
So the guest should then be in protected mode and able to ljmp to start32.
I am still looking into it.
Thanks,
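Whether a guest write to CR0 causes a VM exit under the two mask settings discussed above can be checked with a short C sketch. It assumes the rule from the Intel SDM that MOV to CR0 exits when the written value differs from the read shadow in any bit that is set in the mask. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if "mov <new_val>, %cr0" in the guest would cause a VM exit:
 * some host-owned bit (mask = 1) differs between the written value
 * and the CR0 read shadow. */
bool mov_to_cr0_exits(uint64_t new_val, uint64_t mask, uint64_t shadow)
{
    return ((new_val ^ shadow) & mask) != 0;
}
```

For the write of 0x60000011 with shadow 0x60000010, this gives a VM exit when the mask is 0xFFFFFFFFFFFFFFF7 (PE is host-owned) and no VM exit when the mask is 0xFFFFFFFFFFFFFFF0 (PE is guest-owned).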
From my reading of the Intel docs (which might well be incorrect!), control_CR0_mask and control_CR0_shadow are only used when the guest explicitly accesses CR0 (to build the value returned by a read and to decide whether a write causes a VM exit), and the guest CR0 is what is actually set into the CPU's CR0 while it is running guest code.
So I think that if PE is set in control_CR0_mask (i.e. CR0.PE is owned by the host), your VM exit handler will need to correspondingly set/clear PE in guest CR0 (so that the running guest is actually put into protected mode) as well as in CR0_shadow (so that the guest can *see* that it is in protected mode when it reads CR0).
At least I think that is how it works...
Cheers,
--
Darius
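The handler behaviour described above could look roughly like the following C sketch. It is only a model: the struct stands in for the VMCS fields, which a real hypervisor would access with VMREAD/VMWRITE, and the names cr0_fields, emulate_mov_to_cr0, forced_on and forced_off are all hypothetical. The idea of forcing some bits on or off in the real guest CR0 is an assumption based on the dump in this thread, where guest CR0 is 0x31 while the shadow is 0x60000010.

```c
#include <stdint.h>

/* Stand-in for the three CR0-related VMCS fields (assumption: a real
 * hypervisor would read and write these with VMREAD/VMWRITE). */
struct cr0_fields {
    uint64_t guest_cr0;  /* value the CPU actually runs the guest with */
    uint64_t shadow;     /* value the guest sees for host-owned bits */
    uint64_t mask;       /* 1 = host-owned bit, 0 = guest-owned bit */
};

/* Emulate a guest "mov val, %cr0" that caused a VM exit.
 * forced_on / forced_off are bits the hypervisor keeps set / clear in
 * the real guest CR0 regardless of what the guest asked for. */
void emulate_mov_to_cr0(struct cr0_fields *f, uint64_t val,
                        uint64_t forced_on, uint64_t forced_off)
{
    /* The guest must read back what it wrote, so record it in the shadow. */
    f->shadow = val;

    /* The change must also take effect, so update the real guest CR0. */
    f->guest_cr0 = (val | forced_on) & ~forced_off;

    /* A real handler would also advance the guest RIP past the instruction. */
}
```

Starting from guest_cr0 = 0x30 and shadow = 0x60000010, emulating a write of 0x60000011 with forced_on = 0x20 (NE) and forced_off = 0x60000000 (CD and NW) leaves guest_cr0 = 0x31 and shadow = 0x60000011, so the guest both runs in protected mode and sees PE set when it reads CR0.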
Yep, it is working now, and I can move on with my VMX study.
Thank you for your help.
Thai