xlcor
Contributor

HA agent disabled on ESX 4 host

Two of the six ESX 4 hypervisors in my cluster spontaneously dropped connectivity overnight (still not sure what happened). When I try to view their respective Summary tabs in the vCenter 4 client, I get the message "HA agent disabled on <host> in cluster <cluster> in <datacenter>. Cannot synchronize host <host>. Operation timed out." I can't interact with the affected ESX server at all: I tried to put it in maintenance mode and then reboot it, but the options are greyed out. There aren't even any Alarms listed in the tab in vCenter!

The odd thing is that the two VMs running on one of the ESX servers are pingable and live! I can't see them through the console, though, and they are listed as Disconnected in the Hosts and Clusters section of the Inventory. I can't Edit Settings either; that option is greyed out.

Does anyone have a clue what happened or how I can get my ESX servers back online? Thanks in advance.

34 Replies
a_p_
Leadership

The odd thing is that the two VMs running on one of the ESX servers are pingable and live!

It's just the host agent that cannot communicate with the vCenter Server. The VMs are not affected by this.

Take a look at the log files on the hosts "/var/logs". They should actually show you what happened and when it happened.

André

mittim12
Immortal

I would verify that the service console still has network connectivity. If it does, you can use PuTTY to access the server and run service mgmt-vmware restart, which may fix your vCenter-to-ESX host connectivity. If that doesn't help, then you're going to have to review the logs in the path specified by the other user so we can try to determine what the issue is.

If the service console doesn't have network connectivity, then at least we know what the problem is and how to fix it.

If you found this or any other post helpful please consider the use of the Helpful/Correct buttons to award points

xlcor
Contributor

This is the repeated set of messages from the vmkernel and vmkwarning log files from last night and today (there's nothing in the vmksummary.txt file):

Aug 10 09:35:51 vmott2 vmkernel: 8:19:14:41.291 cpu12:4228)WARNING: NMP: nmp_DeviceStartLoop: NMP Device "naa.600a0b8000744aaf000003164c44a639" is blocked. Not starting I/O from device.

Aug 10 09:35:51 vmott2 vmkernel: 8:19:14:41.291 cpu2:4121)WARNING: ScsiDeviceIO: 2715: READ CAPACITY on device "naa.600a0b8000744aaf000003164c44a639" from Plugin "NMP" failed. Timeout

Aug 10 09:35:51 vmott2 vmkernel: 8:19:14:41.291 cpu2:4121)WARNING: Fil3: 1930: Failed to reserve volume f530 28 1 4c44edbc e6235910 1fe623f9 7f307d13 0 0 0 0 0 0 0

Aug 10 09:35:52 vmott2 vmkernel: 8:19:14:42.209 cpu10:4247)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world restore device "naa.600a0b8000744aaf000003164c44a639" - no more commands to retry

Aug 10 09:35:52 vmott2 vmkernel: 8:19:14:42.209 cpu10:4247)WARNING: NMP: nmp_IssueCommandToDevice: I/O could not be issued to device "naa.600a0b8000744aaf000003164c44a639" due to Not found

Aug 10 09:35:52 vmott2 vmkernel: 8:19:14:42.209 cpu10:4247)WARNING: NMP: nmp_DeviceRetryCommand: Device "naa.600a0b8000744aaf000003164c44a639": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.

Aug 10 09:35:52 vmott2 vmkernel: 8:19:14:42.209 cpu10:4247)WARNING: NMP: nmp_DeviceStartLoop: NMP Device "naa.600a0b8000744aaf000003164c44a639" is blocked. Not starting I/O from device.

Aug 10 09:35:53 vmott2 vmkernel: 8:19:14:43.217 cpu8:4247)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.600a0b8000744aaf000003164c44a639" - issuing command 0x4100020c5780

Aug 10 09:35:53 vmott2 vmkernel: 8:19:14:43.217 cpu8:4247)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.600a0b8000744aaf000003164c44a639" - failed to issue command due to Not found (APD), try again...

Aug 10 09:35:53 vmott2 vmkernel: 8:19:14:43.217 cpu8:4247)WARNING: NMP: nmp_DeviceAttemptFailover: Logical device "naa.600a0b8000744aaf000003164c44a639": awaiting fast path state update...

It seems to be referencing an issue with several CPUs on the hypervisor. Does this have any bearing on the current situation? Any ideas? : )
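
A note on reading these entries: as far as I can tell, the cpuN:NNNN) prefix only identifies the CPU and world that logged the message, not a CPU fault. Every NMP warning above names the same storage device, and one of them carries the "(APD)" tag, which normally stands for "all paths down". A minimal sketch for tallying warnings per device, assuming the log is at /var/log/vmkernel (the path and the function name are my assumptions, not something from this thread):

```shell
# Count vmkernel WARNING lines per NAA device ID.
# Reads log text on stdin; prints "<count> <device>" per device, busiest first.
count_warnings_by_device() {
    grep 'WARNING' \
        | grep -o 'naa\.[0-9a-f]*' \
        | sort \
        | uniq -c \
        | sort -rn \
        | awk '{ print $1, $2 }'
}

# Usage on the host (path assumed; adjust to where your vmkernel log lives):
# count_warnings_by_device < /var/log/vmkernel
```

If a single device accounts for nearly all of the warnings, the storage path to that LUN is probably a better place to start looking than the CPUs.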

xlcor
Contributor

The service console definitely has network connectivity, as I can SSH to the server and was able to gather the log file info that I posted. What was that service mgmt-vmware restart process that you mentioned? Is that a CLI command?

mittim12
Immortal

That command is a service console command. It will not have any effect on your running VMs.

Troy_Clavell
Immortal

Forgive me for not going through the entire thread, but there are a few things I would check. First, ensure there is proper name resolution to and from each ESX host as well as vCenter. Second, you may try restarting the management agents on the host that will not configure HA. From the service console, issue:

service mgmt-vmware restart

Finally, check your /etc/sysconfig/network settings.
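
The name resolution check mentioned above could be sketched roughly like this. It assumes getent is available on the machine you run it from; the function name and the vCenter hostname in the usage line are hypothetical, and it only covers forward lookups from one side, so you would run it on each ESX host and on the vCenter server.

```shell
# Report whether each given hostname resolves on this machine.
# Prints "OK <name> <address>" or "FAIL <name>"; returns non-zero if any fail.
check_resolution() {
    rc=0
    for name in "$@"; do
        # Take the first address the resolver returns, if any.
        addr=$(getent hosts "$name" | awk 'NR == 1 { print $1 }')
        if [ -n "$addr" ]; then
            echo "OK $name $addr"
        else
            echo "FAIL $name"
            rc=1
        fi
    done
    return $rc
}

# Usage (second hostname is a placeholder; substitute your own vCenter name):
# check_resolution vmott2.lmg.lan vcenter.example.lan
```

A FAIL line for either the host or vCenter would point at DNS or /etc/hosts rather than at the management agents.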

xlcor
Contributor

So I run the command exactly as listed: service mgmt-vmware restart. Correct?

Troy_Clavell
Immortal

The command must be run with root privileges, but yes.

[root@ ~]# service mgmt-vmware restart
Stopping VMware ESX Management services:
   VMware ESX Host Agent Watchdog                          [  OK  ]
   VMware ESX Host Agent                                   [  OK  ]
Starting VMware ESX Management services:
   VMware ESX Host Agent (background)                      [  OK  ]
   Availability report startup (background)                [  OK  ]
[root@ ~]#

a_p_
Leadership

Just saw I gave you the wrong location for the host agent logs. The vpxa logs are located in "/var/log/vmware/vpx/".

André

xlcor
Contributor

After running service mgmt-vmware restart, I've been stuck at this step for the last 10 minutes or so with no progress (it looks hung):

Stopping VMware ESX Management services:

VMware ESX Host Agent Watchdog

VMware ESX Host Agent

Is it normal for the services to take so long to shut down, let alone restart?

Here's the contents of my /etc/sysconfig/network file:

NETWORKING=yes
HOSTNAME=vmott2.lmg.lan
GATEWAY=192.168.110.1
GATEWAYDEV=vswif0
IPV6_AUTOCONF=no
NETWORKING_IPV6=no

It looks good to my eye. Are there any glaring items missing?

mittim12
Immortal

That is not normal, but I have seen it before. I never got a resolution, as I ended up leaving it for the night and it had completely restarted by the next day.

Troy_Clavell
Immortal

hostd may be hung or in a somewhat crashed state. Let it try to restart. I think if you can get it to restart, you should be able to configure HA.

Or, if you have other hosts in the cluster and are able to, vMotion the guests to the remaining hosts and restart this ESX host, which will fix the hostd issue.

xlcor
Contributor

OK, thanks! I looked at the vpxa logs, and these are the lines that repeat over and over:

did not find a VM with ID 7 in the vmList

VM with vmid = 7 not found

did not find a VM with ID 7 in the vmList

VM with vmid = 7 not found

did not find a VM with ID 7 in the vmList

VM with vmid = 7 not found

Monitoring AAM health: vpxdDasStateOnLastInvocation(running) currentVpxdDasState(running) forceRunOfListNodes(0) isDasEnabled(0) skipOperation(1)

did not find a VM with ID 7 in the vmList

VM with vmid = 7 not found

did not find a VM with ID 7 in the vmList

VM with vmid = 7 not found

Increment master gen. no to (9556): Event:VpxaHalEvent::CheckQueuedEvents

Monitoring AAM health: vpxdDasStateOnLastInvocation(running) currentVpxdDasState(running) forceRunOfListNodes(0) isDasEnabled(0) skipOperation(1)

I'm not sure which VM has a vmid of 7, though. How do I locate it?
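
One way to look up a vmid, offered as a sketch: on ESX 4 the service console command vmware-vim-cmd vmsvc/getallvms lists registered VMs with the Vmid in the first column, and a small filter can pick out one row. The helper name find_vmid is made up for illustration. The command talks to hostd, so it may not answer while hostd is hung, and since the log says vmid 7 is "not found", the VM may no longer be registered at all.

```shell
# Print the row for one Vmid from `vmware-vim-cmd vmsvc/getallvms`-style
# output supplied on stdin (first column is assumed to be the numeric Vmid).
find_vmid() {
    awk -v id="$1" '$1 == id { print }'
}

# Usage on the host (needs a responsive hostd):
# vmware-vim-cmd vmsvc/getallvms | find_vmid 7
```

No output means no registered VM currently has that ID.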

DSTAVERT
Immortal

With HA problems, I usually remove the affected host from the cluster, restart the services, check name resolution, and then re-add the host to the cluster. I don't know whether this would be appropriate in this situation or not.

-- David -- VMware Communities Moderator

xlcor
Contributor

It sounds like a great plan, BUT I have that one pesky VM running on that ESX server that's live, business critical, and listed in vCenter as Disconnected (I can neither connect to it via Console nor view anything other than its Summary stats). The Migrate option is greyed out, and I can't even power it off (mission critical).

mittim12
Immortal

I might have read the original post wrong, but I felt it was more than an HA issue since some of the options were greyed out. That is why I led with the management service restart. If I did read it wrong, I probably just made everything more complicated :)

Troy_Clavell
Immortal

I think it's a waiting game in hopes that hostd recovers. Otherwise, in my opinion, you'll have to take downtime and reboot the ESX host.

DSTAVERT
Immortal

Removing a host from the cluster shouldn't affect the running VMs. I would certainly wait until a less critical time.

Try connecting the vSphere client directly to the host with the problem. If you can connect you at least have a little control.

xlcor
Contributor

The restart command actually worked on one of the ESX servers (I guess I was just being impatient and didn't leave it long enough). Now I can get to the Connect dialogue box for that ESX server in vCenter, but when I enter my login credentials in the Authorization section of the Add Host Wizard, vCenter's login attempt to the ESX server times out. Communication issue? I can ping it, though...
