Today we ran into some trouble with our 3-node vSphere 4.0 cluster. Due to a network failure, one node became isolated. Because of this isolation, the VMs on this host were stopped. So far, so good. But only some of the VMs were restarted on the remaining two hosts. Here are the logs from the host that tried to restart the VM:
2009-12-02 10:20:40.880 F62E9B90 info 'TaskManager' Task Created : haTask-ha-folder-vm-vim.Folder.registerVm-1853759376
2009-12-02 10:20:40.880 F62E9B90 info 'ha-folder-vm'] Register called: [/vmfs/volumes/484d002e-6b5cea72-25b4-001e0bd1b6ca/PLESK-02/PLESK-02.vmx
2009-12-02 10:20:40.885 F62E9B90 info 'VMFileChecker' Config rules file '/etc/vmware/configrules' loaded and parsed successfully.
2009-12-02 10:20:40.886 F62E9B90 warning 'Vmsvc' RegisterVm file check error: IO error
2009-12-02 10:20:40.888 F62E9B90 info 'App' AdapterServer caught exception: vim.fault.NotFound
2009-12-02 10:20:40.888 F62E9B90 info 'TaskManager' Task Completed : haTask-ha-folder-vm-vim.Folder.registerVm-1853759376 Status error
2009-12-02 10:20:40.888 F62E9B90 info 'Vmomi' Activation N5Vmomi10ActivationE:0x5b538db8 : Invoke done registerVm on vim.Folder:ha-folder-vm
2009-12-02 10:20:40.888 F62E9B90 verbose 'Vmomi' Arg path:
2009-12-02 10:20:40.888 F62E9B90 verbose 'Vmomi' Arg name:
2009-12-02 10:20:40.888 F62E9B90 verbose 'Vmomi' Arg asTemplate:
2009-12-02 10:20:40.888 F62E9B90 verbose 'Vmomi' Arg pool:
2009-12-02 10:20:40.888 F62E9B90 verbose 'Vmomi' Arg host:
2009-12-02 10:20:40.888 F62E9B90 info 'Vmomi' Throw vim.fault.NotFound
2009-12-02 10:20:40.888 F62E9B90 info 'Vmomi' Result:
dynamicType = <unset>,
faultCause = (vmodl.MethodFault) null,
msg = "",
The host can't find the VM configuration, and that is true, because it is looking in the wrong place. This VM (and all the others that couldn't be restarted) had been moved to another storage system with Storage vMotion two weeks ago. But it looks like none of the other hosts in the cluster noticed that change.
After we had the isolated host back in the cluster, we were able to start the affected VMs manually. This time the correct path was used. Here is the log (from the same host as the first log):
2009-12-02 11:07:54.263 F62E9B90 info 'TaskManager' Task Created : haTask-ha-folder-vm-vim.Folder.registerVm-1853761779
2009-12-02 11:07:54.263 F62E9B90 info 'ha-folder-vm'] Register called: [/vmfs/volumes/4a69621c-5a16699a-4427-001e0bd1b6ca/PLESK-02/PLESK-02.vmx
2009-12-02 11:07:54.290 F62E9B90 info 'VMFileChecker' Config rules file '/etc/vmware/configrules' loaded and parsed successfully.
2009-12-02 11:07:54.291 F62E9B90 info 'VMFileChecker' VM config file '/vmfs/volumes/4a69621c-5a16699a-4427-001e0bd1b6ca/PLESK-02/PLESK-02.vmx' already belongs to uid 0. Returning.
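The difference between the two attempts is the volume ID in the .vmx path. To check a longer hostd log for the same symptom, something like the following rough Python sketch might help. It only looks for the "Register called:" lines shown above and groups the volume IDs seen per .vmx file. The function names and the regular expression are my own and not part of any VMware tool, and paths containing spaces would not be matched.

```python
import re

# Matches hostd lines like the ones quoted above. The pattern is an assumption
# based only on those two samples; other builds may format lines differently.
REGISTER_RE = re.compile(
    r"^\[?(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+).*?"
    r"Register called:.*?"
    r"(?P<path>/vmfs/volumes/(?P<volume>[^/\s]+)/\S+?\.vmx)",
    re.MULTILINE,
)


def register_attempts(log_text):
    """Return (timestamp, volume_id, vmx_path) for every register call found."""
    return [(m["ts"], m["volume"], m["path"])
            for m in REGISTER_RE.finditer(log_text)]


def volumes_by_vmx(attempts):
    """Map each .vmx file name to the distinct volume IDs it was registered from.

    More than one volume ID for the same file hints that some host still
    used an old datastore path.
    """
    seen = {}
    for _ts, volume, path in attempts:
        name = path.rsplit("/", 1)[-1]
        ids = seen.setdefault(name, [])
        if volume not in ids:
            ids.append(volume)
    return seen


# The two register lines from the logs in this post.
SAMPLE = (
    "2009-12-02 10:20:40.880 F62E9B90 info 'ha-folder-vm'] Register called: "
    "[/vmfs/volumes/484d002e-6b5cea72-25b4-001e0bd1b6ca/PLESK-02/PLESK-02.vmx\n"
    "2009-12-02 11:07:54.263 F62E9B90 info 'ha-folder-vm'] Register called: "
    "[/vmfs/volumes/4a69621c-5a16699a-4427-001e0bd1b6ca/PLESK-02/PLESK-02.vmx\n"
)

if __name__ == "__main__":
    for vmx, volumes in volumes_by_vmx(register_attempts(SAMPLE)).items():
        if len(volumes) > 1:
            print(vmx, "was registered from several volumes:", ", ".join(volumes))
```

This only reports what the log already contains; it does not tell you which path is the current one, so the result still has to be compared with the VM's actual location.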
Has anyone else experienced this behaviour? More importantly, how can we avoid it? We have several VMs that have been moved with Storage vMotion, and we would like them to restart when a host isolation occurs.
Any help or hints are appreciated.
This is a bug in vSphere 4.0 that has been fixed in 4.1 (which will be released soon, I think). To work around the problem in 4.0, you can try suspending and resuming the VMs after they have been moved with Storage vMotion.
I've been looking through the release notes of Update 1, but I am not able to find this issue.
I'm going to try the workaround of reconfiguring HA on the other hosts.