After upgrading from vSphere 5.0 to 5.5, I am seeing strange behavior. A few VMs hang at 65% when using vMotion. All of them show really high CPU used while sitting idle.
If I log into the guest OS, there is no activity at all, I mean 0.0%us. The proxies are CentOS 5.2 Linux; the Jasper VM is Windows. Jasper has nothing but default Windows processes running and is completely idle.
7:07:29pm up 1 day 3:44, 572 worlds, 10 VMs, 16 vCPUs; CPU load average: 0.73, 0.72, 0.73
PCPU USED(%): 59 70 54 57 51 53 73 71 AVG: 61
PCPU UTIL(%): 68 81 64 70 56 61 82 82 AVG: 71
ID GID NAME NWLD %USED %RUN %SYS %WAIT %VMWAIT %RDY %IDLE %OVRLP %CSTP %MLMTD %SWPWT
512777 512777 cl-ivrproxy 8 141.94 139.26 1.16 646.57 0.34 14.05 187.08 0.62 0.00 0.00 0.00
66472 66472 Jasper (windows 7 140.34 138.41 0.04 554.68 0.11 6.78 197.28 0.37 0.00 0.00 0.00
438541 438541 cl-proxy 7 139.26 137.58 0.48 550.05 0.52 12.28 91.20 0.53 0.00 0.00 0.00
I have another host with a few more VMs behaving the same way. The rest of the VMs on the hosts are behaving fine.
The Windows VM has two vCPUs and is using the multiprocessor HAL.
The two Linux proxies are single vCPU, except I changed one to two vCPUs to see if it made a difference. No difference.
Any thoughts?
Also, these same VMs give this error when trying to access console through vCenter:
"Unable to connect to the MKS: Error connecting to /bin/vmx process."
"Unable to connect to the MKS: Error connecting to /bin/vmx process.
Now coming to the issue at hand: on the host where you are able to see the VMs in esxtop, can you press "e" and type in the GID of one VM, say the Jasper, and let us know if you are able to see the vmx process running with it?
Wow, that little trick is helpful.
Looks like mks is the culprit. I can only assume that the svga is also related to the virtual console.
Any advice on how to set it straight?
ID GID NAME NWLD %USED %RUN %SYS %WAIT %VMWAIT %RDY %IDLE %OVRLP %CSTP %MLMTD %SWPWT
67008 66472 vmx 1 0.17 0.11 0.06 99.82 - 0.09 0.00 0.00 0.00 0.00 0.00
67012 66472 vmast.67011 1 0.01 0.01 0.00 100.00 - 0.00 0.00 0.00 0.00 0.00 0.00
67014 66472 vmx-vthread-5:J 1 0.00 0.00 0.00 100.00 - 0.00 0.00 0.00 0.00 0.00 0.00
67206 66472 vmx-mks:Jasper 1 75.09 74.50 0.00 19.12 - 6.40 0.00 0.27 0.00 0.00 0.00
67207 66472 vmx-svga:Jasper 1 59.06 57.57 0.00 27.76 - 14.69 0.00 0.21 0.00 0.00 0.00
67208 66472 vmx-vcpu-0:Jasp 1 0.87 0.86 0.00 98.14 0.09 1.01 98.05 0.01 0.00 0.00 0.00
67209 66472 vmx-vcpu-1:Jasp 1 0.75 0.74 0.00 98.49 0.09 0.79 98.39 0.01 0.00 0.00 0.00
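In case it helps anyone else reading along, the per-world stats can also be filtered mechanically instead of eyeballed. A throwaway sketch, using a trimmed copy of the output above (just ID, GID, NAME, NWLD, %USED), that flags any world using more than an arbitrary 50% CPU:

```shell
#!/bin/sh
# Sketch: flag worlds with high %USED from an esxtop-style listing.
# The data below is a trimmed copy of the expanded view pasted above.
cat > /tmp/worlds.txt <<'EOF'
67008 66472 vmx 1 0.17
67012 66472 vmast.67011 1 0.01
67206 66472 vmx-mks:Jasper 1 75.09
67207 66472 vmx-svga:Jasper 1 59.06
67208 66472 vmx-vcpu-0:Jasp 1 0.87
EOF

# Column 5 is %USED; print any world above the (arbitrary) 50% threshold.
awk '$5 > 50 {print $3, $5}' /tmp/worlds.txt
# prints the vmx-mks:Jasper and vmx-svga:Jasper lines
```

Against the sample above it singles out exactly the mks and svga worlds, matching what the expanded view shows.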
Looks like mks is the culprit. I can only assume that the svga is also related to the virtual console.
Close enough. mks is just a process; we need to find out what called it and why the owner reported that it was unable to connect to the MKS. svga is again a process that handles the mks thread.
Any advice on how to set it straight?
Are you opening the console from the source or the destination? Having asked that, I have also seen such issues when:
1. I open the console of a VM
2. Migrate the VM to another host
3. The remote console gets stuck for a considerable time and returns the error.
This is because the remote console session opened against the source host no longer exists once the VM has moved to the destination.
The other workaround, if you are unable to get any remote consoles, is to restart the VM.
More information:
- restarting the VM results in the same state.
- migrating with VM powered on results in hung vMotion at 65%
- migrating with VM powered off works.
- starting VM on another host results in same state
- removing the VM from inventory and re-adding it through the datastore browser results in the same state (I have to run "esxcli vm process kill" to unlock it before I can re-add it).
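For reference, the kill step in that last bullet can be scripted from the host shell. This is a deliberate dry-run sketch that only echoes the commands rather than executing them (the world ID is a placeholder you would take from the list output; "--type" escalates through soft, hard, and force):

```shell
#!/bin/sh
# Dry-run sketch of unlocking a stuck VM from the ESXi host shell.
# WORLD_ID is a placeholder; get the real one from "esxcli vm process list".
WORLD_ID=123456

# Echo instead of executing, so this is safe to read/run anywhere.
run() { echo "WOULD RUN: $*"; }

run esxcli vm process list
run esxcli vm process kill --type=soft --world-id="$WORLD_ID"
```

To actually execute on a host, drop the `run` wrapper; soft first, and only escalate to hard/force if the world refuses to die.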
I had the thought that if I moved them around until I found the original host, it might fix it. Didn't seem to work that way.
The only thing I found that resolved this strange state was to remove the VM from inventory, create a new VM, copy the VMDK files over to the new VM's directory, and add the disks to the new VM's configuration.
I would rather have a more graceful solution, and I am really curious about how this could happen.
I have 6 VMs stuck in this strange state. It all started towards the end of my upgrade from 5.0 to 5.5. Everything seemed fine. I brought up the last of my 4 hosts, and tried to vMotion some VMs around to organize my machines. These 6 all stuck at 65% and never recovered.
The only thing we can do is check vmkernel.log to find out what is hogging during the 65% stage: either the VM has too many snapshots or there is heavy I/O. Raise it with VMware tech support if you have a valid support contract.
Created support request yesterday before creating this discussion.
Still waiting for VMware to contact me.
Thanks for your help.
Root cause turned out to be entries in the VMX files for these machines that vSphere 5.5 did not like.
I believe the template that was used to create these handful of machines started off life as a Lab Manager VM a long time ago.
As such, it had many more vmx options than a typical VM, some of which were mks parameters.
vSphere 5.0 did not seem to mind, but 5.5 has problems with them.
Removing these seemed to fix the problem.
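For anyone who hits the same thing, here is a minimal sketch of how one might strip the leftover mks entries out of a vmx file. The file name and the specific mks keys below are made-up examples, not the actual entries from my machines; take a backup and do this with the VM powered off:

```shell
#!/bin/sh
# Sketch: remove stale "mks.*" lines from a vmx file.
# The vmx content here is a fabricated example for illustration only.
cat > /tmp/example.vmx <<'EOF'
displayName = "Jasper"
mks.enable3d = "FALSE"
guestOS = "windows7-64"
mks.someOldSetting = "TRUE"
EOF

cp /tmp/example.vmx /tmp/example.vmx.bak       # always keep a backup
grep -v '^mks\.' /tmp/example.vmx.bak > /tmp/example.vmx

cat /tmp/example.vmx                           # mks.* lines are gone
```

After editing, the host has to re-read the config; `vim-cmd vmsvc/reload <vmid>` (vmid from `vim-cmd vmsvc/getallvms`) does that without removing the VM from inventory.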
Your post saved the day. I just replaced 8 hosts with 5.5, many running Linux VMs. Before I started here, Lab Manager was apparently used, and we were experiencing the same issues you described and were at a loss. Just wanted to bump with a thank you. Solved our issues.
Dear All,
I compared the vmx files of the CPU-high-load VMs and the VM template, and they are almost the same. Can you give me some keywords for the extra/unwanted entries in the vmx file?
Thank you very much!
This is the vmx file content of the CPU-high-load VM:
.encoding = "UTF-8"
config.version = "8"
virtualHW.version = "8"
pciBridge0.present = "TRUE"
pciBridge4.present = "TRUE"
pciBridge4.virtualDev = "pcieRootPort"
pciBridge4.functions = "8"
pciBridge5.present = "TRUE"
pciBridge5.virtualDev = "pcieRootPort"
pciBridge5.functions = "8"
pciBridge6.present = "TRUE"
pciBridge6.virtualDev = "pcieRootPort"
pciBridge6.functions = "8"
pciBridge7.present = "TRUE"
pciBridge7.virtualDev = "pcieRootPort"
pciBridge7.functions = "8"
vmci0.present = "TRUE"
hpet0.present = "TRUE"
nvram = "HO-SRV-FLE-02.nvram"
virtualHW.productCompatibility = "hosted"
powerType.powerOff = "soft"
powerType.powerOn = "hard"
powerType.suspend = "hard"
powerType.reset = "soft"
displayName = "HO-SRV-FLE-02"
extendedConfigFile = "HO-SRV-FLE-02.vmxf"
numvcpus = "4"
cpuid.coresPerSocket = "2"
scsi0.present = "TRUE"
scsi0.sharedBus = "none"
scsi0.virtualDev = "lsisas1068"
memsize = "3072"
scsi0:0.present = "TRUE"
scsi0:0.fileName = "HO-SRV-FLE-02.vmdk"
scsi0:0.deviceType = "scsi-hardDisk"
ide1:0.present = "TRUE"
ide1:0.deviceType = "atapi-cdrom"
ide1:0.startConnected = "FALSE"
ethernet0.present = "TRUE"
ethernet0.virtualDev = "e1000"
ethernet0.networkName = "Subnet 5"
ethernet0.addressType = "generated"
svga.vramSize = "8388608"
guestOS = "windows7-64"
uuid.location = "56 4d 2f cc e1 58 a1 7f-5e 50 93 e4 b9 9b 91 be"
uuid.bios = "56 4d 29 cb 3e 0d db 37-da 42 74 ab 0b d6 fd f6"
vc.uuid = "52 bb 2c a1 bd b7 98 30-45 1a 8e 16 cb e3 cd 92"
tools.upgrade.policy = "manual"
ethernet0.generatedAddress = "00:0c:29:d6:fd:f6"
vmci0.id = "198639094"
tools.syncTime = "FALSE"
annotation = "for file checking"
cleanShutdown = "FALSE"
replay.supported = "FALSE"
unity.wasCapable = "TRUE"
sched.swap.derivedName = "/vmfs/volumes/4e987720-8c12d07e-c8d6-782bcb4f76fe/HO-SRV-ANZ-02/HO-SRV-ANZ-02-be4d538a.vswp"
replay.filename = ""
scsi0:0.redo = ""
pciBridge0.pciSlotNumber = "17"
pciBridge4.pciSlotNumber = "21"
pciBridge5.pciSlotNumber = "22"
pciBridge6.pciSlotNumber = "23"
pciBridge7.pciSlotNumber = "24"
scsi0.pciSlotNumber = "160"
ethernet0.pciSlotNumber = "32"
vmci0.pciSlotNumber = "33"
scsi0.sasWWID = "50 05 05 6b 3e 0d db 30"
ethernet0.generatedAddressOffset = "0"
hostCPUID.0 = "0000000b756e65476c65746e49656e69"
hostCPUID.1 = "000206c220200800029ee3ffbfebfbff"
hostCPUID.80000001 = "0000000000000000000000012c100800"
guestCPUID.0 = "0000000b756e65476c65746e49656e69"
guestCPUID.1 = "000206c200020800829822031fabfbff"
guestCPUID.80000001 = "00000000000000000000000128100800"
userCPUID.0 = "0000000b756e65476c65746e49656e69"
userCPUID.1 = "000206c220200800029822031fabfbff"
userCPUID.80000001 = "00000000000000000000000128100800"
evcCompatibilityMode = "FALSE"
vmotion.checkpointFBSize = "8388608"
ide1:0.clientDevice = "TRUE"
floppy0.present = "FALSE"
softPowerOff = "FALSE"
toolsInstallManager.lastInstallError = "0"
toolsInstallManager.updateCounter = "2"
tools.remindInstall = "FALSE"
sched.cpu.affinity = "4,5,6,7"
sched.mem.affinity = "all"
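Not the original poster, but since the thread names mks parameters as the trigger, one way to scan a vmx for suspect prefixes might be something like the sketch below. The prefix list is my guess, not an authoritative set: "mks." comes from this thread, while "unity." and "replay." are included only because they appear in your pasted file and look like Workstation/Lab-Manager-era leftovers rather than core ESXi settings. Note your file has no mks. lines at all, so your issue may be something different:

```shell
#!/bin/sh
# Sketch: scan a vmx for entries with suspect (guessed) prefixes.
# The file content is a short excerpt modelled on the vmx pasted above.
cat > /tmp/check.vmx <<'EOF'
displayName = "HO-SRV-FLE-02"
replay.supported = "FALSE"
unity.wasCapable = "TRUE"
svga.vramSize = "8388608"
EOF

# List any lines starting with a prefix from the guess list.
grep -E '^(mks|unity|replay)\.' /tmp/check.vmx || echo "no suspect entries"
```

On that excerpt it reports the replay.* and unity.* lines; whether removing them is safe or relevant is something I would confirm with VMware support first.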