VMware Cloud Community
Chamon
Commander

HA Agent error

One of the servers in the VI3 cluster had an HA agent error. I chose the ‘Reconfigure HA’ option and now HA on that server is hung. The last event was ‘uninstalling HA’. I tried to SSH into the box by server name and it would not connect, then tried by IP and was able to access it. Connecting with the VI Client does not work by either server name or IP. I can ping the ESX server by name, and DNS resolves the IP. It seems to me that a process or service is hung. I was thinking of removing the server from the cluster, but there has to be a better way. Any ideas?

12 Replies
VirtualNoitall
Virtuoso

Hello,

If you are up to date on patching, you can try service mgmt-vmware restart. If you are not up to date on patching, double-check that automatic startup and shutdown of virtual machines is not enabled; if it is, disable it before you issue the command.

You could also try temporarily removing HA from the cluster and adding it back.

Many HA issues come down to name resolution. Make sure you can ping all hosts in your ESX cluster, from each node and from VirtualCenter, by fully qualified domain name.
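As a rough aid for the name-resolution check, here is a minimal Python 3 sketch that reports which names fail to resolve. It only covers DNS lookup, not the ping itself, and it assumes you run it from a machine that uses the same DNS servers as the cluster. The host names in the usage note are made-up placeholders.

```python
import socket


def check_resolution(names, resolve=socket.gethostbyname):
    """Map each name to its resolved IPv4 address, or None if the lookup fails."""
    results = {}
    for name in names:
        try:
            results[name] = resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results


def unresolved(results):
    """Return the names that failed to resolve, sorted alphabetically."""
    return sorted(name for name, ip in results.items() if ip is None)
```

For example, unresolved(check_resolution(["esx01.example.local", "esx02.example.local"])) would list any of those hypothetical FQDNs that the local resolver cannot look up. Run it from each node and from the VirtualCenter machine, since each may see DNS differently.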

masaki
Virtuoso

Each host must be able to ping itself. Also try pinging the other hosts in the same HA cluster by short name (the output of hostname -s).

Mind the case! It's case-sensitive.
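To spot case problems quickly, something like the following Python 3 sketch can compare the names you actually type (or have in DNS and hosts files) against the names the hosts are configured with. The function name and the sample names are hypothetical, for illustration only.

```python
def case_mismatches(configured, in_use):
    """Return (name_in_use, configured_name) pairs that differ only by case."""
    # Index the configured names by their lowercase form.
    by_lower = {name.lower(): name for name in configured}
    mismatches = []
    for name in in_use:
        canonical = by_lower.get(name.lower())
        if canonical is not None and canonical != name:
            mismatches.append((name, canonical))
    return mismatches
```

Calling case_mismatches(["esx01.example.local"], ["ESX01.example.local"]) returns [("ESX01.example.local", "esx01.example.local")], which flags a name that would match if case were ignored but does not match exactly.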

Tried to SSH into the box by Server name and it would not connect.

Correct case?

Tried to use VIclient... Are ports 902 and 903 open? Did you log in as root?
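A quick way to test the ports from the machine running the VI Client is a plain TCP connect. This is only a sketch in Python 3: it checks TCP reachability of 902 and 903 and says nothing about UDP traffic or about whether the service behind the port is healthy. The host name in the usage note is a placeholder.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or name not resolved
        return False


def check_ports(host, ports=(902, 903), timeout=3.0):
    """Return a dict mapping each port to True (open) or False (not reachable)."""
    return {port: port_open(host, port, timeout) for port in ports}
```

For example, check_ports("esx01.example.local") returns a dict keyed by 902 and 903. Trying it once by name and once by IP also shows whether the failure is in name resolution or in the connection itself.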

Sometimes the best thing to do is exactly that: remove the host from the cluster and re-add it.

Is your default gateway pingable? It must be.

If not, you could set a das.isolationaddress.

Chamon
Commander

What in DNS would change all of a sudden? The host can no longer be reached by VirtualCenter. The VMs are still running and functioning as normal. I am going to try the following unless someone thinks that this is a bad idea.

restart the following two services:

vmware-vpxa

mgmt-vmware (the VMs are not set to auto reboot)

Then, as a last step, try restarting the vpxa service on the VCenter machine.
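The two host-side restarts could be scripted roughly as below. This is a sketch, not a tested procedure: it assumes Python 3 is available wherever you run it, that you are root on the ESX service console, and it keeps the order listed above. It defaults to a dry run that only prints the commands.

```python
import subprocess

# Service names as listed in this thread, in the order given above.
SERVICES = ("vmware-vpxa", "mgmt-vmware")


def restart_commands(services=SERVICES):
    """Build one 'service <name> restart' argument list per service."""
    return [["service", name, "restart"] for name in services]


def run_all(commands, dry_run=True):
    """Print each command and run it only when dry_run is False.

    Returns a list of (command, exit_code) pairs; exit_code is None
    for commands that were not executed.
    """
    results = []
    for cmd in commands:
        print(" ".join(cmd))
        code = None
        if not dry_run:
            code = subprocess.call(cmd)
        results.append((cmd, code))
    return results
```

Calling run_all(restart_commands()) just prints the two commands so you can review them; pass dry_run=False to actually execute them. Check first that automatic startup and shutdown of VMs is disabled, as mentioned earlier in the thread.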

Thanks for the help!

masaki
Virtuoso

HA requires DNS, whether you agree or not.

If HA is up and your host is unreachable, your VMs will be started on another host.

So I guess the other hosts can still reach this host.

You can restart the agents, but if you have network problems that will not solve them.

Chamon
Commander

Yes I do agree that DNS is vital.

Now we are deeper in the hole. SSH, the VI Client, and the Web Client no longer work for access; none of them can connect. I can still ping the host, though. Alt-F1 gets to the console, but it shows the following error and will not let us reach a login prompt. This is the error:

‘Ext2-fs warning max mount count reached running e2fsck is recommended’

I think the shell may have crashed and the host is waiting for the VMs to be shut down before completing the reboot process. Any ideas short of rebooting the ESX host? The VMs must stay on for now, and it will be difficult to obtain permission to shut them down even for a few minutes. Due to an exorbitant amount of red tape, it may take a week to get the proper permission. Thanks for all of the help.

masaki
Virtuoso

To log on, press Alt-F2 to switch to another terminal.

The ext2-fs message is a known one (not a problem). I'll search and let you know.

masaki
Virtuoso

OK, look at this:

http://www.vmware.com/community/thread.jspa?messageID=648669&#648669

You can change the max mount count, but you should also understand why the warning is appearing.

Maybe a disk is being rebuilt or is damaged?
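To see how far past the limit a filesystem is, you can compare the "Mount count" and "Maximum mount count" fields that tune2fs -l prints for the device. Here is a small Python 3 sketch that parses such output. The SAMPLE text is an illustrative fragment, not output captured from this host, and the helper names are made up.

```python
# Illustrative fragment of `tune2fs -l <device>` output (not from this host).
SAMPLE = """\
Filesystem state:         clean
Mount count:              31
Maximum mount count:      30
"""


def parse_mount_counts(tune2fs_output):
    """Return (mount_count, max_mount_count) parsed from `tune2fs -l` text."""
    fields = {}
    for line in tune2fs_output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a "key: value" shape
            fields[key.strip()] = value.strip()
    return int(fields["Mount count"]), int(fields["Maximum mount count"])


def max_reached(mount_count, max_count):
    """True when the count has hit the limit; a limit of 0 or -1 disables the check."""
    return max_count > 0 and mount_count >= max_count
```

With the sample above, parse_mount_counts(SAMPLE) gives (31, 30) and max_reached(31, 30) is True, which matches the situation where the warning is printed. The limit itself can be changed with tune2fs -c, but as said above, it is worth knowing why the count got there first.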

Chamon
Commander

I will give it a try and let you know.

Chamon
Commander

Alt-F2 gets me to another terminal and prompts for a user. I type root, then nothing: it does not prompt for the password, and anything typed after that does not show up in the terminal. Any other ideas would be great. We are preparing the paperwork to reboot the ESX host, but it will take a week or so for the approval. Nice, huh.

masaki
Virtuoso

Can you see anything strange on the disks' LEDs?

Are they blinking faster and faster?

Do you have a SAN or local storage?

If it's a SAN, check the SAN management software for disk errors and let me know.

Chamon
Commander

We were finally able to get approval to shut down the VMs. The Host was rebooted and once it came up it functioned properly. It did take about 10 minutes to finally come back up.

Once it was up and HA had finished rebuilding, all VMs that had been on it had been started on other hosts. The problem is that they were all set to stay powered on in the event of an isolation. Some of the VMs had been retired and were not supposed to be powered on. Did this happen because HA had to reconfigure, the default is to restart the VMs, and the host knew that it had been isolated? Any ideas on how to prevent this from happening in the future?

masaki
Virtuoso

If I understand your question, the answer is yes: you can choose each VM's behaviour on a fault.

Under the HA options you can choose, for each VM, whether to power it on after a failure or not (leave it isolated).
