Hello all
First, to be precise: I am using VMware Workstation 6.0.3, but this issue also
arises on vmware-ws 6.5b.
I am seeing strange behavior when debugging a subtle piece of code that does
the following:
- switch back from protected mode to real mode
- call some BIOS API
- switch back from real mode to protected mode
- continue the original program
When stepping through this code in VMware, it works very well and never crashes. However, when
executing this code without stepping, VMware crashes with a stack fault, without
calling the BIOS routine properly.
My questions are as follows:
- Is there any way to find out where the code crashes using the VMware coredump file (or another debug option)?
- Is it possible that debugging in VMware changes the access rights on memory (just as debugging a userland application
with gdb usually does)? More generally, where can I get more information about the VMware gdb stub?
- Is there any other known issue (VMware or otherwise) with executing such code that would explain the difference
in results between debugging in VMware and just letting it run?
Thanks
JFV
Hello J.F.V, please let me address your questions:
> Is there any way to find out where the code crashes using the VMware coredump file (or another debug option)?
If you get a stack fault (which is technically a triple fault), then yes. If you are using a beta build, or a release build with "debug=true", then adding the following
option to the config file will make your VM suspend at the moment of the triple fault:
monitor.suspend_on_triplefault=true
Then, if you resume the VM while passing one more option on the command line, the debugger will be activated before the instruction is executed, right at the point of the triple fault:
.../vmware -s monitor_control.enable_guestdebugonstart=true MyVM.vmx
You should be able to attach a debugger and take a look at memory, registers, etc. If this fails for any reason, take a look at the vmware.log file after the VM gets suspended on a triple fault; there should be a register dump at the end of the log. It may help you find the location of the code that caused the triple fault.
Both options are undocumented and unsupported, but I'll try to help if they do not work.
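As a rough aid for the vmware.log suggestion above, here is a minimal Python sketch that pulls the register values logged after the last "Triple fault." line. The sample text is only modelled on the "timestamp: vcpu-0| message" layout of vmware.log; the function name and the assumption that register lines start with "VCPU:" are illustrative, and a real log may be laid out differently.

```python
import re

# Sample text modelled on vmware.log lines; a real log may differ.
SAMPLE_LOG = """\
Jun 05 15:22:54.362: vcpu-0| Triple fault.
Jun 05 15:22:54.365: vcpu-0| Beginning monitor coredump
Jun 05 15:22:55.366: vcpu-0| VCPU: eflags=0x210246 rip=0x17b
Jun 05 15:22:55.366: vcpu-0| VCPU: ES=0x1000 CS=0x1000 SS=0x1000 DS=0x0 FS=0x1000 GS=0xf000
Jun 05 15:22:55.367: vcpu-0| VCPU: cr0=0x10 cr2=0x0 cr3=0x0 cr4=0x0
"""


def registers_after_triple_fault(log_text):
    """Return {name: value} from 'VCPU:' lines after the last 'Triple fault.' line."""
    lines = log_text.splitlines()
    start = None
    for i, line in enumerate(lines):
        if line.endswith("| Triple fault."):
            start = i  # keep the last occurrence
    if start is None:
        return {}
    regs = {}
    for line in lines[start + 1:]:
        # Split off the "timestamp: vcpu-0" prefix and keep the message part.
        _, sep, msg = line.partition("| ")
        if sep and msg.startswith("VCPU:"):
            for name, value in re.findall(r"(\w+)=(0x[0-9a-fA-F]+)", msg):
                regs[name.lower()] = int(value, 16)
    return regs


regs = registers_after_triple_fault(SAMPLE_LOG)
print({name: hex(value) for name, value in regs.items()})
```

With a real log, the text would be read from the vmware.log file in the VM directory and passed to the same function.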
> Is it possible that debugging in VMware changes the access rights on memory (just as debugging a userland application with gdb usually does)? More generally, where can I get more information about the VMware gdb stub?
The debug stub does not change Guest-visible memory protection. If you use "normal" breakpoints, the stub will put "int3" in the Guest memory. With debugStub.hideBreakpoints=1, it is practically invisible. One exception, and the one that frequently masks triple faults, is this: when a breakpoint hits (whether hidden or not), or when you single-step with the stub, we flush the hardware TLB. If you happen to have a stale mapping in the TLB, that could explain why you get stack faults without the stub but not with it. Just to be sure, you may want to try flushing the TLB after all page table modifications in your kernel.
We do not share that much information about the internals of the debug stub. If you have access to our Academics Program, or if your employer is part of our Community Source, then you can take a look at the sources and contact me and other developers. Otherwise, the best way to find out is to be hired by VMware.
> Is there any other known issue (VMware or otherwise) with executing such code that would explain the difference in results between debugging in VMware and just letting it run?
I think I addressed this question above: stale TLB mappings in the Guest kernel would be my best guess. I hope this helps.
Sincerely,
Vyacheslav (Slava) Malyugin
Thanks for your time Slava,
On vm-ws 6.0.3, I put debug=true and monitor.suspend_on_triplefault=true
in my VM configuration, then powered off and restarted VMware via its
graphical interface.
Before the bug happens, I suspend the VM, then run on the command line:
vmware -s monitor_control.enable_guestdebugonstart=true NAMEOFVM.vmx
then I resume the VM, but the stack fault dialog still appears, and while
it is possible to attach the debugger after this, it does not show anything except
a timeout after around 1 minute. Also, the vmware.log file does not show
anything interesting except:
Jun 03 14:40:43.537: vcpu-0| Triple fault.
Jun 03 14:40:43.537: vcpu-0| Msg_Hint: msg.monitorEvent.cpl0SS (sent)
Where am I going wrong ?
JFV
I also tried to execute:
vmware -s monitor_control.enable_guestdebugonstart=true NAMEOFVM.vmx
after the crash, but it gave the same result.
Sorry, I have been unclear. With monitor.suspend_on_triplefault=true the VM should suspend instead of producing a stack fault. Do you observe this?
Sincerely,
Vyacheslav (Slava) Malyugin
No, I don't; the VM keeps showing the dialog indicating that a stack fault happened.
My VM configuration file contains:
#!/usr/bin/vmware
config.version = "8"
virtualHW.version = "6"
scsi0.present = "TRUE"
memsize = "32"
MemAllowAutoScaleDown = "FALSE"
ide0:0.present = "TRUE"
ide0:0.fileName = "MEMTEST.vmdk"
ide1:0.present = "TRUE"
ide1:0.autodetect = "TRUE"
ide1:0.deviceType = "cdrom-raw"
floppy0.autodetect = "TRUE"
ethernet0.present = "TRUE"
ethernet0.wakeOnPcktRcv = "FALSE"
sound.present = "TRUE"
sound.fileName = "-1"
sound.autodetect = "TRUE"
svga.autodetect = "TRUE"
pciBridge0.present = "TRUE"
displayName = "MEMTEST"
guestOS = "other"
nvram = "MEMTEST.nvram"
deploymentPlatform = "windows"
virtualHW.productCompatibility = "hosted"
RemoteDisplay.vnc.port = "0"
tools.upgrade.policy = "useGlobal"
floppy0.fileName = "/dev/fd0"
extendedConfigFile = "MEMTEST.vmxf"
ethernet0.addressType = "generated"
uuid.location = "56 4d d1 4d 35 5c d9 48-f6 4e 2b 21 63 66 61 d1"
uuid.bios = "56 4d d1 4d 35 5c d9 48-f6 4e 2b 21 63 66 61 d1"
ide0:0.redo = ""
pciBridge0.pciSlotNumber = "17"
scsi0.pciSlotNumber = "16"
ethernet0.pciSlotNumber = "32"
sound.pciSlotNumber = "33"
ethernet0.generatedAddress = "00:0c:29:66:61:d1"
ethernet0.generatedAddressOffset = "0"
debug=true
monitor.suspend_on_triplefault=true
debugStub.listen.guest32=true
debugStub.listen.guest32.remote=true
debugStub.hideBreakpoints=true
tools.remindInstall = "TRUE"
replay.logging = "FALSE"
ethernet0.connectionType = "hostonly"
checkpoint.vmState = ""
-
It seems correct (i.e. it follows what you said), but maybe some options are incompatible?
Note that my version is 6.0.3 -unregistered- (trial version)
Thanks
-JFV
Another hint:
The code only works when -stepped- (i.e. breakpoint at the beginning, stepping
through the whole code, then continuing until the next hit of the breakpoint, and doing the same
when the breakpoint is reached again).
On the other hand, putting a breakpoint at the beginning of the code, then
immediately continuing until the next hit of the breakpoint, without stepping through the code,
gives a triple fault.
As for your explanation regarding flushing the TLB, my code is always entered with
cr3 = NULL. Just to check, I added a cr3 flush at the entry and exit of
my code, and it changed nothing, which is not so surprising.
I am now trying to find which cache is desynchronized when the code runs freely,
and correctly synchronized when the code is stepped through in VMware...
-JFV
Hi J.F.V. I double-checked the sources for 6.0.3, and the options should work. I guess something is wrong with my assumption about the fault that your VM is getting. Is it possible that you added the options to the .vmx file, and then restored from a snapshot? Snapshot restore overwrites the values in the configuration file. Do you see all the options in vmware.log after the crash? Could you paste a portion of the log, a few dozen lines before and after the message informing about the "stack fault"? Thank you.
Sincerely,
Vyacheslav (Slava) Malyugin
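To rule out the snapshot-restore case mentioned above, a small script can confirm that the options are still present in the .vmx file. The following Python sketch is only an illustration: the helper names are hypothetical, it treats option names as case-insensitive (an assumption), and it reads nothing but the plain key = "value" syntax used in .vmx files.

```python
def parse_vmx(text):
    """Parse 'key = "value"' lines of a .vmx file into a dict (keys lower-cased)."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, the "#!" header / comments, and anything without "=".
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        options[key.strip().lower()] = value.strip().strip('"')
    return options


# Options discussed in this thread; lower-casing keys is an assumption of this sketch.
REQUIRED = {
    "debug": "true",
    "monitor.suspend_on_triplefault": "true",
}


def missing_options(text, required=REQUIRED):
    """Return the sorted names of required options that are absent or have another value."""
    options = parse_vmx(text)
    return sorted(
        name for name, wanted in required.items()
        if options.get(name, "").lower() != wanted
    )


SAMPLE_VMX = '''\
#!/usr/bin/vmware
config.version = "8"
debug=true
monitor.suspend_on_triplefault=true
debugStub.hideBreakpoints=true
'''

print(missing_options(SAMPLE_VMX))
print(missing_options('config.version = "8"\ndebug=true\n'))
```

Running the check once after editing the file and once more after restoring a snapshot would show whether the restore dropped the options.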
Hey Slava, thanks for your time !
I finally managed to enable debug mode: not only did I have to put it in the configuration
file, but I also had to enable it in the properties of the virtual machine as "full debug".
I used the procedure you described and I could indeed get the state of the code
at the moment of the crash. It crashes exactly at this point:
<here switch from pmode to rmode>
sti
mov %eax, const1
<---- here
mov %ebx, const2
Given that the two mov instructions are totally harmless, I suspect an interrupt
triggered just after STI is executed. I think the two MOVs are already prefetched and the first
may even have been executed already (the Intel reference manual specifies that the STI instruction has
such a delay, to allow STI to be executed just before a RET).
As far as I know, when interrupts are disabled using CLI, they are never "stacked" and
triggered after the next STI. If that is right, the problem remains unexplained.
Now the $1K question: why is such an interrupt never triggered when stepping through the code in
VMware, and always triggered otherwise?
Added note: when resuming the VM with debug=full after the stack fault has occurred,
the guest kernel can continue as if nothing had happened, i.e. the interrupt
does not appear to be fatal.
-JFV
As you requested, here is the extended log information when the triple fault happens:
Jun 05 15:22:50.368: vmx| CDROM-LIN: AIOCallbackSGIO: Unexpected errno: Unknown error 4294967295 (-1)
Jun 05 15:22:50.421: vcpu-0| VLANCE: Ignoring LANCE_EXTINT OUT of 0x0
Jun 05 15:22:50.655: vcpu-0| VMSAMPLE32: cs=0x9693, eip=0x4e98
Jun 05 15:22:52.659: vcpu-0| VMSAMPLE32: cs=0x0, eip=0x9ce3
Jun 05 15:22:53.677: vcpu-0| VMSAMPLE32: cs=0x10, eip=0x4427
Jun 05 15:22:54.362: vcpu-0| NOT_TESTED vmcore/vmm32/cpu/segment.c:1262
Jun 05 15:22:54.362: vcpu-0| NOT_TESTED vmcore/vmm32/cpu/segment.c:395
Jun 05 15:22:54.362: vcpu-0| NOT_TESTED vmcore/vmm32/cpu/segment.c:1262
Jun 05 15:22:54.362: vcpu-0| NOT_TESTED vmcore/vmm32/cpu/segment.c:1262
Jun 05 15:22:54.362: vcpu-0| Triple fault.
Jun 05 15:22:54.362: vcpu-0| Writing monitor corefile "/home/jfv/vmware/MEMTEST/vmware-core.gz"
Jun 05 15:22:54.365: vcpu-0| Beginning monitor coredump
Jun 05 15:22:54.726: vcpu-0| End monitor coredump
Jun 05 15:22:54.726: vcpu-0| Beginning extended monitor coredump
Jun 05 15:22:54.726: vcpu-0| Writing anonymous pages at pos: 401000
Jun 05 15:22:55.366: vcpu-0| VCPU: eflags=0x210246 rip=0x17b
Jun 05 15:22:55.366: vcpu-0| VCPU: RAX=0xfffff000 RBX=0x0 RCX=0x1a880 RDX=0x617
Jun 05 15:22:55.366: vcpu-0| VCPU: RDI=0x3e RSI=0x1ca00 RBP=0x1ca00 RSP=0x10617
Jun 05 15:22:55.366: vcpu-0| VCPU: R8=0x0 R9=0x0 R10=0x0 R11=0x0
Jun 05 15:22:55.366: vcpu-0| VCPU: R12=0x0 R13=0x0 R14=0x0 R15=0x0
Jun 05 15:22:55.366: vcpu-0| VCPU: ES=0x1000 CS=0x1000 SS=0x1000 DS=0x0 FS=0x1000 GS=0xf000
Jun 05 15:22:55.367: vcpu-0| VCPU: code=2 stack=2 task=4
Jun 05 15:22:55.367: vcpu-0| VCPU: cr0=0x10 cr2=0x0 cr3=0x0 cr4=0x0
Note that the "sti" instruction is located at address 1000:0177
-JFV
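For reference, the addresses in the dump above can be related with the usual real-mode rule, linear = segment * 16 + offset. The short Python sketch below uses only values from the log (CS=0x1000, rip=0x17b) and the stated STI location (1000:0177); the function name is just for illustration.

```python
def real_mode_linear(segment, offset):
    """Linear address of segment:offset in real mode (segment * 16 + offset)."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    return (segment << 4) + offset


CS, RIP = 0x1000, 0x17B   # from the VCPU lines in the log
STI_OFFSET = 0x177        # "sti" is located at 1000:0177

fault = real_mode_linear(CS, RIP)
sti = real_mode_linear(CS, STI_OFFSET)

print(f"rip -> {fault:#x}")               # 0x1017b
print(f"sti -> {sti:#x}")                 # 0x10177
print(f"bytes past sti: {fault - sti}")   # 4
```

So the reported rip lies 4 bytes after the start of the STI instruction, inside the short instruction sequence quoted earlier.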
Hi J.F.V. I agree: the crash appears to be related to the STI. You are in real mode, so there are not too many things that could cause a triple fault. Perhaps the IDT is pointing nowhere, or SS is not set up properly? I probably will not be able to help you narrow down the problem, but it seems that you are right in suspecting that an IRQ arrived at an unexpected time.
The reason the problem does not happen when you single-step is that we suspend interrupts while single-stepping. Otherwise you'd end up in the timer handler after every step. You may want to use Record/Replay to record the crash, and then step through it in Replay mode; you should be able to see the crash.
Sincerely,
Vyacheslav (Slava) Malyugin
I was trying to test this feature by enabling
debug=true
monitor.suspend_on_triplefault=true
in the .vmx file for the VM.
A manually induced kernel crash still prints the stack trace to the console
and doesn't suspend the VM.
What we would like to do is:
(1) have the guest VM suspend on a kernel fault
(2) connect to this VM through kgdb after resuming the guest
Today it looks like there can be only one gdbserver listener on the VM management
network, at vmnet gateway:8864 for a 64-bit guest.
Will the above sequence of triple fault suspend, resume, connect through kgdb
still work if there are multiple guest VMs started
on the same VM management network on the same hypervisor with the following lines
in the .vmx file?
debugStub.listen.guest64 ="TRUE"
debugStub.listen.guest64.remote ="TRUE"
debugStub.hidebreakpoints = "TRUE"
monitor.debugOnStartGuest64 = "TRUE"
Thanks,
kvc