Re: HA agent disabled on ESX 4 host

xlcor · ‎08-10-2010

I have two ESX 4 hypervisors in my cluster of six that spontaneously dropped connectivity overnight (still not sure what happened). When I try to view their respective Summary tabs in the vCenter 4 client, I get the message "HA agent disabled on <host> in cluster <cluster> in <datacenter>. Cannot synchronize host <host>. Operation timed out." I can't interact with the affected ESX servers at all (I tried to put one in maintenance mode and then reboot it, but the options are greyed out). There aren't even any alarms listed in the Alarms tab in vCenter! The odd thing is that the two VMs running on one of the ESX servers are pingable and live! I can't see them through the console, though, and they are listed as Disconnected under Hosts and Clusters in the Inventory. I can't Edit Settings either, as that option is greyed out too. Does anyone have a clue what happened or how I can get my ESX servers back online? Thanks in advance.

a_p_ · ‎08-10-2010

The odd thing is that the two VMs running on one of the ESX servers are pingable and live!

It's just the host agent that cannot communicate with the vCenter Server. The VMs are not affected by this.

Take a look at the log files on the hosts under "/var/log". They should show you what happened and when it happened.

André

mittim12 · ‎08-10-2010

I would verify that the service console still has network connectivity. If it does, you can use PuTTY to access the server and run service mgmt-vmware restart, which may fix your vCenter-to-ESX host connectivity. If that doesn't help, then you're going to have to review the logs in the path that was specified by the other user so we can try to determine what the issue is.

If the service console doesn't have network connectivity, then at least we know what the problem is and how to fix it.
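The check-then-restart sequence described above could be sketched as a small shell function, run as root from the service console. This is only an illustration: the function name restart_mgmt_if_reachable is hypothetical, and the gateway address is passed in by the caller rather than assumed.

```shell
# Hypothetical helper: restart the ESX management agents only if the
# service console can reach the given gateway address.
# Usage: restart_mgmt_if_reachable <gateway-ip>
restart_mgmt_if_reachable() {
    gw="$1"
    if ping -c 3 "$gw" >/dev/null 2>&1; then
        echo "service console can reach $gw; restarting management agents"
        service mgmt-vmware restart
    else
        # No connectivity: restarting the agents would not help yet.
        echo "service console cannot reach $gw; fix networking first" >&2
        return 1
    fi
}
```

The restart only touches the management agents, so it is attempted only once basic connectivity is confirmed.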

If you found this or any other post helpful please consider the use of the Helpful/Correct buttons to award points

xlcor · ‎08-10-2010

this is the set of messages that repeats in the vmkernel and vmkwarning log files from last night and today (there's nothing in the vmksummary.txt file):

Aug 10 09:35:51 vmott2 vmkernel: 8:19:14:41.291 cpu12:4228)WARNING: NMP: nmp_DeviceStartLoop: NMP Device "naa.600a0b8000744aaf000003164c44a639" is blocked. Not starting I/O from device.

Aug 10 09:35:51 vmott2 vmkernel: 8:19:14:41.291 cpu2:4121)WARNING: ScsiDeviceIO: 2715: READ CAPACITY on device "naa.600a0b8000744aaf000003164c44a639" from Plugin "NMP" failed. Timeout

Aug 10 09:35:51 vmott2 vmkernel: 8:19:14:41.291 cpu2:4121)WARNING: Fil3: 1930: Failed to reserve volume f530 28 1 4c44edbc e6235910 1fe623f9 7f307d13 0 0 0 0 0 0 0

Aug 10 09:35:52 vmott2 vmkernel: 8:19:14:42.209 cpu10:4247)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world restore device "naa.600a0b8000744aaf000003164c44a639" - no more commands to retry

Aug 10 09:35:52 vmott2 vmkernel: 8:19:14:42.209 cpu10:4247)WARNING: NMP: nmp_IssueCommandToDevice: I/O could not be issued to device "naa.600a0b8000744aaf000003164c44a639" due to Not found

Aug 10 09:35:52 vmott2 vmkernel: 8:19:14:42.209 cpu10:4247)WARNING: NMP: nmp_DeviceRetryCommand: Device "naa.600a0b8000744aaf000003164c44a639": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.

Aug 10 09:35:52 vmott2 vmkernel: 8:19:14:42.209 cpu10:4247)WARNING: NMP: nmp_DeviceStartLoop: NMP Device "naa.600a0b8000744aaf000003164c44a639" is blocked. Not starting I/O from device.

Aug 10 09:35:53 vmott2 vmkernel: 8:19:14:43.217 cpu8:4247)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.600a0b8000744aaf000003164c44a639" - issuing command 0x4100020c5780

Aug 10 09:35:53 vmott2 vmkernel: 8:19:14:43.217 cpu8:4247)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.600a0b8000744aaf000003164c44a639" - failed to issue command due to Not found (APD), try again...

Aug 10 09:35:53 vmott2 vmkernel: 8:19:14:43.217 cpu8:4247)WARNING: NMP: nmp_DeviceAttemptFailover: Logical device "naa.600a0b8000744aaf000003164c44a639": awaiting fast path state update...

It seems to be referencing an issue with several CPUs on the hypervisor. does this have any bearing on the current situation? any ideas? : )
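For what it's worth, the cpuN:NNNN) field in each vmkernel line appears to identify the CPU and world that logged the message rather than a faulty CPU, while every NMP warning above names the same device ID. A rough sketch for counting device mentions in such a log follows. The function name count_warnings_by_device is hypothetical, and the log path is left to the caller.

```shell
# Count device ID mentions on WARNING lines of a vmkernel-style log,
# most frequently mentioned device first.
# Usage: count_warnings_by_device <logfile>
count_warnings_by_device() {
    grep 'WARNING' "$1" \
        | grep -o 'naa\.[0-9a-f]*' \
        | sort | uniq -c | sort -rn
}
```

If a single device ID dominates the output, the storage path to that LUN is a more likely place to look than the host's CPUs.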

xlcor · ‎08-10-2010

the service console is definitely network available, as I can SSH to the server and was able to gather the log file info that I posted. what was that service mgmt-vmware restart process that you mentioned? is that a CLI command?

mittim12 · ‎08-10-2010

That command is a service console command. It will not have any effect on your running VMs.

Troy_Clavell · ‎08-10-2010

Forgive me for not going through the entire thread, but there are a few things I would check. First, ensure there is proper name resolution to and from each ESX host as well as vCenter. Second, you may try restarting the management agents on the host that will not configure HA. From the service console, issue:

service mgmt-vmware restart

Finally, check your /etc/sysconfig/network settings.
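The name resolution check suggested above could be sketched like this from the service console. It assumes getent is available there, and the function name check_resolution is hypothetical.

```shell
# Report whether each given hostname resolves from this machine.
# Returns non-zero if any name fails to resolve.
# Usage: check_resolution <name> [<name> ...]
check_resolution() {
    rc=0
    for name in "$@"; do
        if addr=$(getent hosts "$name"); then
            # getent prints "<address> <names...>"; keep the address only.
            echo "OK   $name -> ${addr%% *}"
        else
            echo "FAIL $name does not resolve"
            rc=1
        fi
    done
    return $rc
}
```

It would be called with the names of the other ESX hosts and the vCenter Server, for example check_resolution vmott2.lmg.lan vcenter.lmg.lan, where the vCenter name is made up for illustration. The same lookup should also be repeated from the vCenter side with a tool such as nslookup.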

xlcor · ‎08-10-2010

so I run the command exactly as listed: service mgmt-vmware restart. correct?

Troy_Clavell · ‎08-10-2010

the command must be run with root privileges, but yes

[root@ ~]# service mgmt-vmware restart
Stopping VMware ESX Management services:
   VMware ESX Host Agent Watchdog                          [  OK  ]
   VMware ESX Host Agent                                   [  OK  ]
Starting VMware ESX Management services:
   VMware ESX Host Agent (background)                      [  OK  ]
   Availability report startup (background)                [  OK  ]
[root@ ~]#

a_p_ · ‎08-10-2010

Just saw I gave you the wrong location for the host agent logs. The vpxa logs are located in "/var/log/vmware/vpx/".

André
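A small sketch for pulling the most recent entries out of that directory follows. The vpxa*.log file name pattern and the function name tail_latest_vpxa are assumptions, so adjust them to whatever files are actually present.

```shell
# Print the last lines of the most recently modified vpxa log.
# Usage: tail_latest_vpxa [<logdir>] [<lines>]
tail_latest_vpxa() {
    dir="${1:-/var/log/vmware/vpx}"
    lines="${2:-50}"
    # ls -t sorts newest first; the file name pattern is an assumption.
    latest=$(ls -t "$dir"/vpxa*.log 2>/dev/null | head -n 1)
    if [ -z "$latest" ]; then
        echo "no vpxa*.log files found in $dir" >&2
        return 1
    fi
    echo "== $latest =="
    tail -n "$lines" "$latest"
}
```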

xlcor · ‎08-10-2010

after running service mgmt-vmware restart I've been stuck at this step for the last 10 minutes or so with no progression (looks hung):

Stopping VMware ESX Management services:

VMware ESX Host Agent Watchdog

VMware ESX Host Agent

is that normal that the services would take so long to shut down, let alone restart?

Here's the contents of my /etc/sysconfig/network file:

NETWORKING=yes

HOSTNAME=vmott2.lmg.lan

GATEWAY=192.168.110.1

GATEWAYDEV=vswif0

IPV6_AUTOCONF=no

NETWORKING_IPV6=no

it looks good to my eye. are there any glaring items not listed?

mittim12 · ‎08-10-2010

That is not normal, but I have seen it before. I never got a resolution, as I ended up leaving it for the night and it had completely restarted by the next day.

Troy_Clavell · ‎08-10-2010

hostd may be hung or in somewhat of a crashed state. Let it try to restart. I think if you can get this to restart you should be able to configure HA.

....or if you can, and have other hosts in the cluster, vMotion the guests to the remaining hosts and restart this ESX Host, which will fix the hostd issue.
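While the restart appears hung, one way to see whether hostd is still around is to look for its process from a second SSH session. The sketch below assumes the process command line contains the string hostd, which is not confirmed anywhere in this thread. The function name hostd_processes is hypothetical.

```shell
# List running processes whose command line mentions hostd.
# The [h] bracket keeps grep from matching its own command line.
# Returns non-zero if nothing matches.
hostd_processes() {
    ps -ef | grep '[h]ostd'
}
```

If this keeps listing the same PID long after the stop step began, that would be consistent with hostd being hung rather than slowly shutting down.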

xlcor · ‎08-10-2010

ok thanks! I looked at the vpxa logs and these are the lines that repeat over and over: