I have two ESX 4 hypervisors in my cluster of 6 that spontaneously dropped connectivity overnight (still not sure what happened). When I try to view their respective Summary tabs in the vCenter 4 client interface I get the message "HA agent disabled on <host> in cluster <cluster> in <datacenter>. Cannot synchronize host <host>. Operation timed out." I can't interface with that ESX server at all (I tried to put it in maintenance mode and then reboot it, but the options are greyed out). There aren't even any Alarms listed in the tab in vCenter! The odd thing is that the two VMs running on one of the ESX servers are pingable and live (although I can't see them through the console). They are listed as Disconnected in the Hosts and Clusters section of the Inventory, and I can't Edit Settings at all because that option is greyed out. Does anyone have a clue what happened or how I can get my ESX servers back online? Thanks in advance.
The odd thing is that the two VMs running on one of the ESX servers are pingable and live!
It's just the host agent that cannot communicate with the vCenter Server. The VMs are not affected by this.
Take a look at the log files on the hosts under "/var/log". They should show you what happened and when it happened.
André
I would verify that the service console still has network connectivity. If it does, you can use PuTTY to access the server and run service mgmt-vmware restart, which may fix your vCenter to ESX host connectivity. If that doesn't help, then you're going to have to review the logs in the path that was specified by the other user so we can try to determine what the issue is.
If the service console doesn't have network connectivity, then at least we know what the problem is and how to fix it.
If you found this or any other post helpful please consider the use of the Helpful/Correct buttons to award points
This is the repeated message set from the vmkernel and vmkwarning log files from last night and today (there's nothing in the vmksummary.txt file):
Aug 10 09:35:51 vmott2 vmkernel: 8:19:14:41.291 cpu12:4228)WARNING: NMP: nmp_DeviceStartLoop: NMP Device "naa.600a0b8000744aaf000003164c44a639" is blocked. Not starting I/O from device.
Aug 10 09:35:51 vmott2 vmkernel: 8:19:14:41.291 cpu2:4121)WARNING: ScsiDeviceIO: 2715: READ CAPACITY on device "naa.600a0b8000744aaf000003164c44a639" from Plugin "NMP" failed. Timeout
Aug 10 09:35:51 vmott2 vmkernel: 8:19:14:41.291 cpu2:4121)WARNING: Fil3: 1930: Failed to reserve volume f530 28 1 4c44edbc e6235910 1fe623f9 7f307d13 0 0 0 0 0 0 0
Aug 10 09:35:52 vmott2 vmkernel: 8:19:14:42.209 cpu10:4247)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world restore device "naa.600a0b8000744aaf000003164c44a639" - no more commands to retry
Aug 10 09:35:52 vmott2 vmkernel: 8:19:14:42.209 cpu10:4247)WARNING: NMP: nmp_IssueCommandToDevice: I/O could not be issued to device "naa.600a0b8000744aaf000003164c44a639" due to Not found
Aug 10 09:35:52 vmott2 vmkernel: 8:19:14:42.209 cpu10:4247)WARNING: NMP: nmp_DeviceRetryCommand: Device "naa.600a0b8000744aaf000003164c44a639": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
Aug 10 09:35:52 vmott2 vmkernel: 8:19:14:42.209 cpu10:4247)WARNING: NMP: nmp_DeviceStartLoop: NMP Device "naa.600a0b8000744aaf000003164c44a639" is blocked. Not starting I/O from device.
Aug 10 09:35:53 vmott2 vmkernel: 8:19:14:43.217 cpu8:4247)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.600a0b8000744aaf000003164c44a639" - issuing command 0x4100020c5780
Aug 10 09:35:53 vmott2 vmkernel: 8:19:14:43.217 cpu8:4247)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.600a0b8000744aaf000003164c44a639" - failed to issue command due to Not found (APD), try again...
Aug 10 09:35:53 vmott2 vmkernel: 8:19:14:43.217 cpu8:4247)WARNING: NMP: nmp_DeviceAttemptFailover: Logical device "naa.600a0b8000744aaf000003164c44a639": awaiting fast path state update...
It seems to be referencing an issue with several CPUs on the hypervisor. Does this have any bearing on the current situation? Any ideas? : )
The service console is definitely available on the network, as I can SSH to the server and was able to gather the log file info that I posted. What was that service mgmt-vmware restart process that you mentioned? Is that a CLI command?
That command is a service console command. It will not have any effect on your running VMs.
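A note on reading those lines: as far as I know, the cpuN:NNNN prefix only records which CPU and world logged the message, so the element that actually repeats is the storage device identifier (naa.600a0b8000744aaf000003164c44a639). A minimal sketch for tallying warnings per device, assuming GNU grep is available in the service console; count_naa_warnings is a hypothetical helper name and the log path in the usage line is an assumption:

```shell
# Count occurrences of each naa.* device ID on vmkernel WARNING lines.
# Reads log text on stdin and prints "<count> <device>" per device,
# most frequent first. Helper name is hypothetical, not a VMware tool.
count_naa_warnings() {
    grep 'WARNING' \
        | grep -o 'naa\.[0-9a-f]*' \
        | sort \
        | uniq -c \
        | sort -rn \
        | awk '{print $1, $2}'
}

# Usage (path is an assumption; adjust to where your vmkernel log lives):
#   count_naa_warnings < /var/log/vmkernel
```

If a single device dominates the tally, that points at the storage path to that LUN rather than at the CPUs.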
Forgive me for not going through the entire thread, but here are a few things I would check. First, ensure there is proper name resolution to and from each ESX Host as well as vCenter. Second, you may try restarting the management agents on the host that will not configure HA. From the service console, issue:
service mgmt-vmware restart
Finally, check your /etc/sysconfig/network settings.
So I run the command exactly as listed: service mgmt-vmware restart. Correct?
The command must be run with root privileges, but yes.
[root@ ~]# service mgmt-vmware restart
Stopping VMware ESX Management services:
VMware ESX Host Agent Watchdog [ OK ]
VMware ESX Host Agent [ OK ]
Starting VMware ESX Management services:
VMware ESX Host Agent (background) [ OK ]
Availability report startup (background) [ OK ]
[root@ ~]#
Just saw I gave you the wrong location for the host agent logs. The vpxa logs are located in "/var/log/vmware/vpx/".
André
After running service mgmt-vmware restart I've been stuck at this step for the last 10 minutes or so with no progress (it looks hung):
Stopping VMware ESX Management services:
VMware ESX Host Agent Watchdog
VMware ESX Host Agent
Is it normal for the services to take so long to shut down, let alone restart?
Here's the contents of my /etc/sysconfig/network file:
NETWORKING=yes
HOSTNAME=vmott2.lmg.lan
GATEWAY=192.168.110.1
GATEWAYDEV=vswif0
IPV6_AUTOCONF=no
NETWORKING_IPV6=no
It looks good to my eye. Are there any glaring items missing?
That is not normal, but I have seen it before. I never got a resolution, as I ended up leaving it for the night and it had completely restarted by the next day.
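One way to sanity-check a file like this is a small script that reports empty or absent keys. This is only a sketch: check_network_file is a hypothetical helper name, and the set of keys it treats as expected (NETWORKING, HOSTNAME, GATEWAY) is an assumption for illustration, not a list taken from VMware documentation.

```shell
# Report which expected keys are missing or empty in a sysconfig-style file.
# The key list below is an assumption for illustration only.
check_network_file() {
    file=$1
    missing=""
    for key in NETWORKING HOSTNAME GATEWAY; do
        # Value after "KEY=" on the first matching line; empty if absent.
        # "^GATEWAY=" does not match "GATEWAYDEV=...", so keys stay distinct.
        value=$(sed -n "s/^${key}=//p" "$file" | head -n 1)
        if [ -z "$value" ]; then
            missing="$missing $key"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "ok"
}

# Usage:
#   check_network_file /etc/sysconfig/network
```

A passing check only says the keys are present; it does not confirm that the hostname resolves or that the gateway is reachable.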
hostd may be hung or in somewhat of a crashed state. Let it try to restart. I think if you can get this to restart you should be able to configure HA.
....or if you can, and have other hosts in the cluster, vMotion the guests to the remaining hosts and restart this ESX Host, which will fix the hostd issue.
ok thanks! I looked at the vpxa logs and these are the lines that repeat over and over:
did not find a VM with ID 7 in the vmList
did not find a VM with ID 7 in the vmList
did not find a VM with ID 7 in the vmList
Monitoring AAM health: vpxdDasStateOnLastInvocation(running) currentVpxdDasState(running) forceRunOfListNodes(0) isDasEnabled(0) skipOperation(1)
did not find a VM with ID 7 in the vmList
did not find a VM with ID 7 in the vmList
Increment master gen. no to (9556): Event:VpxaHalEvent::CheckQueuedEvents
Monitoring AAM health: vpxdDasStateOnLastInvocation(running) currentVpxdDasState(running) forceRunOfListNodes(0) isDasEnabled(0) skipOperation(1)
I'm not sure which VM has vmid of 7 though... how do I locate that?
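One possible way to look up a VM by ID, sketched under several assumptions: that the ID in the vpxa log corresponds to the Vmid column that hostd reports, that hostd is responding again, and that the ESX 4 service console command is vmware-vim-cmd vmsvc/getallvms (vim-cmd on ESXi). find_vm_by_id is a hypothetical helper name that filters that listing read from stdin.

```shell
# Print the row whose first column (Vmid) equals the given ID.
# Expects "getallvms"-style output on stdin: a header line, then one row
# per VM starting with its numeric Vmid. Helper name is hypothetical.
find_vm_by_id() {
    awk -v id="$1" 'NR > 1 && $1 == id'
}

# Usage (requires a responsive hostd; command name is an assumption for ESX 4):
#   vmware-vim-cmd vmsvc/getallvms | find_vm_by_id 7
```

If nothing is printed, no registered VM currently has that ID, which would be consistent with the "did not find a VM" message.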
With HA problems I usually remove the affected host from the cluster, restart the services, check name resolution and then re-add the host to the cluster. I don't know whether this would be appropriate in this situation or not.
It sounds like a great plan, BUT I have that one pesky VM running on that ESX server that's live, business critical, and listed in vCenter as Disconnected (I can neither connect to it via Console nor view anything other than its Summary stats). The Migrate option is greyed out, and I can't even power it off because it's mission critical.
I might have read the original post wrong, but I felt it was more than an HA issue since some of the options were greyed out. That is why I led with the management service restart. If I did read it wrong, I probably just made everything more complicated.
I think it's a waiting game in hopes that hostd recovers. Otherwise, in my opinion, you'll have to schedule downtime and reboot the ESX Host.
Removing a host from the cluster shouldn't affect the running VMs. I would certainly wait until a less critical time.
Try connecting the vSphere client directly to the host with the problem. If you can connect you at least have a little control.
The restart command actually worked on one of the ESX servers (I guess I was just being impatient and didn't leave it long enough). Now I can get to the Connect dialog box for that ESX server in vCenter, but when I enter my login credentials in the Authorization section of the Add Host Wizard, vCenter's login attempt to the ESX server times out. Communication issue? I can ping it though...