I am unable to enable the HA agent on, or vMotion any running VMs from, an ESXi 3.5 (build 123629) host in an HA/DRS/EVC cluster. The funny thing is that HA and vMotion were working fine on all three hosts in the cluster, but recently stopped working on one of them. No changes were made.
Two of the hosts are HP DL380 G5s and one is a DL360 G5. Two of the servers are running ESXi 3.5 123629 and one is running ESX 3.5.0 123630. All three hosts have the same DNS info. I am able to ping and resolve all three ESX hosts by name.
I have disabled and re-enabled HA on the cluster, but one of the hosts still fails with the error "An error occurred during configuration of the HA agent of the host. - cmd remove failed."
I tried to vMotion from the host with the HA error and it stops at 10% before timing out. I can cold migrate the VMs from that same host to one of the two other hosts.
I am hoping that I don't have to power down all of the VMs on the host with the issue and then reboot the host. I am not even sure that will help. Any ideas? THANKS
vMotion will fail at 10% when the vmkernel network cannot reach the target host.
Try using vmkping -D from the consoles of the involved hosts to see if the vmkernel network is working.
DNS resolution must work with both the FQDN and the short name from each host to the other.
Also check your /var/log/vmware/vmkwarning log.
vmkping -D showed successful pings.
I do not see a vmkwarning.log file located in the /var/log/vmware/ directory. Does ESXi 3.5 normally have that log file in that location?
Any other ideas? Thanks.
Having no vmkwarning log is a good sign. That is the correct location for logs on ESXi.
Check VLAN assignments for the vMotion network if you use them.
Check your vmkernel gateway on the vMotion network. If you have multiple vmkernel nets you may need to add a vmkernel route for the vMotion vmkernel net.
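To see what the host currently has configured, something like the following could be run from the console of each host and the output compared. This is only a rough sketch: it assumes the esxcfg-vmknic and esxcfg-route tools are on the path and accept -l to list (on ESXi 3.5 that means the unsupported console), and the show_vmk_config function name is made up for illustration.

```shell
# Hypothetical helper: list the vmkernel NICs and vmkernel routes on this host.
show_vmk_config() {
  for tool in esxcfg-vmknic esxcfg-route; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "== $tool -l =="
      "$tool" -l
    else
      # Not on an ESX/ESXi console, or the tool is not on the PATH
      echo "$tool not found on this system"
    fi
  done
}

show_vmk_config
```

If the vMotion vmkernel port on the problem host shows a different subnet, netmask, or gateway than the other two hosts, that is the first thing to correct.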
After closer inspection of the network adapters on the ESX host that has the issue, I noticed that the vnic that is used for vMotion has a blank observed IP address range. The other vnics on the other two ESX hosts have normal observed IP address ranges. I am wondering if perhaps the physical NIC is bad or if the switch port is bad. I looked at both of them the other day and both had active link lights.
I may try another switch port and see what happens. We use a 5 port switch for vMotion. We plug all three ESX hosts into that switch. Like I said earlier, vMotion and HA were working fine for several months and no changes were made.
Does a blank observed IP address range for a vnic normally indicate a hardware issue such as a bad NIC? Thanks.
That's normal to see. vmkernel IPs are not part of the VM networks and are not enumerated there. Only the VM networks would be part of the list.
Are there any other vmkernel nets, such as iSCSI?
Do a vmkping from one host to the other on the vMotion network. For example:
server1 vmotion ip = 10.10.10.1
server2 vmotion ip = 10.10.10.2
If it responds, then the network is not the issue.
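A minimal sketch of that check, using the example addresses above. The check_vmotion_peer function and the PING_CMD variable are made up for illustration, and the OK/FAIL line assumes vmkping exits non-zero when no replies come back; if that does not hold on your build, go by the ping output itself.

```shell
# Hypothetical helper: ping a peer's vMotion address through the vmkernel stack.
# PING_CMD defaults to vmkping; it is overridable only so the script can be
# dry-run on a machine that is not an ESX host.
check_vmotion_peer() {
  peer="$1"
  if ${PING_CMD:-vmkping} "$peer"; then
    echo "OK $peer"
  else
    echo "FAIL $peer"
  fi
}

# On server1, test server2's vMotion address.
# On server2, change the address to 10.10.10.1 to test the other direction.
check_vmotion_peer 10.10.10.2
```

Run it in both directions, since a bad NIC or switch port can pass traffic one way only.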