This is my first post, and I would like to get a general consensus and a possible root cause for a problem I encountered with one of our production VMs.

The VM in question became totally inaccessible. When I opened the console, it stated: "error connecting to /vmfs/volumes... VMX is not started". This runs on ESX 3.5 (we are busy with the upgrade phase to vSphere 4). I tried restarting the management services with service mgmt-vmware restart / stop / start etc., but the service hung. I also tried restarting vpxa; that stopped and started fine. With the management services stuck, the host was inaccessible to vCenter, so we couldn't issue commands to the VM or the host. Luckily the host and the other VMs stayed up, and after recovering the host it became accessible to vCenter again.

When I looked at the problem VM, its VMX, swap, log and other config files were all gone; only the VMDK was left. I figured OK, I can rebuild it with its current disk config, assuming all locks had been released. That was not the case. I was able to recreate the VM, yes, but there was still a lock on the VMDK - the physical host thought the VM was still up. After logging a call with VMware support, they assisted me in removing the lock without affecting the other VMs, and I was able to power the problem VM back on.

Now, after my whole story, my question is: WHAT WOULD MAKE A VM's VMX GO CORRUPT? Because at the end of the day, that would be the root cause.
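For reference, the lock-hunting part went roughly like this. The log excerpt below is a fabricated sample just to show the shape of the output (it is not from my actual host); on ESX 3.5 you would run vmkfstools -D against the real VMDK and then read /var/log/vmkernel:

```shell
# On the real host you would first dump the lock info for the disk, e.g.:
#   vmkfstools -D /vmfs/volumes/<datastore>/<vm>/<vm>-flat.vmdk
# On ESX 3.x that writes the lock details into /var/log/vmkernel.
# Sample log lines below are made up for illustration:
cat > /tmp/vmkernel.sample <<'EOF'
Apr 10 02:11:03 esx1 vmkernel: 2:03:44.123 cpu2:1034)FS3: Lock [type 10c00001 offset 52076544 v 11, hb offset 4116480
Apr 10 02:11:03 esx1 vmkernel: gen 322, mode 1, owner 49a3f2bc-63cf-xxxx-xxxx-001a64b1c2d3 mtime 5436]
EOF

# The "owner" field ends in the MAC address of the host holding the lock;
# pull it out and compare it against each host's service console NIC.
grep -o "owner [^ ]*" /tmp/vmkernel.sample
```

Once you know which host's MAC it is, support can walk you through releasing the lock safely from that host, without disturbing the other VMs on the volume.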
I would appreciate any insight into this.
thanks again guys
Interesting question rubberduck, but this will be difficult to answer without access to logs etc.
If you logged a call with VMware and they resolved the issue, you should also insist on them reviewing the logs and doing a root cause analysis for you.
Locked VMX files are quite unusual - and can result from many things. Is it possible that someone was accessing / editing the VMX while the VM was running (possibly via the datastore browser, WinSCP, or an SSH session)?
Also, do you know if the storage presented to this ESX host was also presented to another host (one not registered in the same vCenter)? Is it possible that someone tried to register the VMX on another ESX host?
Did you have any backup software running that may have affected your config?
Or lastly, did you perhaps delete snapshot files from the datastore without cleaning them up / committing them properly (or, for that matter, any of the other files that make up the VM)?
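On the snapshot point: before touching anything by hand, it is worth listing any delta files still sitting in the VM's directory, because any remaining *-delta.vmdk means the snapshot chain was never committed, and deleting files at that point will break (and leave locks on) the disk. A rough sketch, using a throwaway demo directory in place of a real /vmfs/volumes path:

```shell
# Demo stand-in for /vmfs/volumes/<datastore>/<vm> (made-up file names for illustration)
VMDIR=/tmp/demo-vm
mkdir -p "$VMDIR"
touch "$VMDIR/app01.vmdk" \
      "$VMDIR/app01-000001.vmdk" \
      "$VMDIR/app01-000001-delta.vmdk"

# Any *-delta.vmdk hit means an uncommitted snapshot chain - commit it through
# the snapshot manager (or vmware-cmd ... removesnapshots on ESX 3.x) before
# deleting anything from the directory.
find "$VMDIR" -name '*-delta.vmdk'
```

If that find comes back empty, the chain is clean and the base VMDK is the only disk file the VM actually depends on.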
I have seen something similar happen when an admin accidentally deleted a VM from one host while the VM was running on another host. Both hosts were in the same VMware cluster, but the person was working directly from the command line. To this day we don't know exactly what happened or what sequence of events led to the problem, but he was able to delete not only the VMX file but one of the two VMDKs as well.
We just restored from backup and upgraded all hosts to the latest ESX build. There was no in-depth investigation.