I don't trust HA

mtsm · ‎02-06-2009

Hi guys, this isn't the first time I've had a bad experience with HA. Today I had some network instability and HA simply went crazy: it began starting VMs on another host (although the host that was supposed to be down wasn't). Then a big mess began. After the network instability stopped, I had the same VMs registered on more than one host, some of them showing as powered on on different hosts. A BIG, BIG mess. I had to find out where they were really powered on and unregister the rest by hand.

I tried restarting hostd and vpxa by hand, but VMware's VM inventory was still wrong. It seems ESX doesn't really check whether the VM is running on the host and trusts only those local XML databases, which causes a big mess. This isn't the first time I've had a problem like this. Is anyone aware of this problem?

I am disabling HA on all hosts now. I think it's better to try to fix a host than to clean up the mess that VMware HA makes.
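For anyone cleaning up after a similar incident, the "same VM registered on more than one host" check can be sketched offline. This is a minimal sketch, not a VMware tool: it assumes you have already collected the registered .vmx paths per host yourself (for example by hand from each host's `vmware-cmd -l` output), and the host names, paths and the helper `find_duplicate_registrations` are all made up for illustration.

```python
from collections import defaultdict


def find_duplicate_registrations(inventory):
    """Return {vmx_path: sorted host list} for VMs registered on 2+ hosts.

    `inventory` maps a host name to the list of .vmx paths registered on it.
    """
    hosts_by_vm = defaultdict(set)
    for host, vmx_paths in inventory.items():
        for path in vmx_paths:
            hosts_by_vm[path].add(host)
    return {
        path: sorted(hosts)
        for path, hosts in hosts_by_vm.items()
        if len(hosts) > 1
    }


# Hypothetical data, gathered manually from each host.
inventory = {
    "esx01": [
        "/vmfs/volumes/lun1/web01/web01.vmx",
        "/vmfs/volumes/lun1/db01/db01.vmx",
    ],
    "esx02": ["/vmfs/volumes/lun1/web01/web01.vmx"],
    "esx03": ["/vmfs/volumes/lun2/app01/app01.vmx"],
}

duplicates = find_duplicate_registrations(inventory)
for path, hosts in duplicates.items():
    print(path, "is registered on", ", ".join(hosts))
```

The output only tells you where a VM is registered, not where it is actually powered on; that still has to be checked on each host before unregistering anything.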

Troy_Clavell · ‎02-06-2009

I think one thing that may help is to either create another service console, extend the isolation response or use your vMotion NIC as an isolation NIC. This may help with some of the issues you are seeing.

Duncan blogged about some great HA advanced options, if you'd like to take a look:

http://www.yellow-bricks.com/ha-advanced-options/

weinstein5 · ‎02-06-2009

I agree with Troy's suggestions. I would also take a look at the isolation response of your VMs. Perhaps set the cluster default to "leave powered on"; that way, if you encounter network instability, the VMs will remain running on their original hosts, and all you will get are errors in the event logs of VC and the hosts saying the VM could not be started.

wazoo9000 · ‎02-06-2009

Another suggestion is to create a totally private network and additional service console ports for the beacon probing to use. This would keep production network instability from causing havoc in your VM environment. That is interesting, though: it was my understanding that HA was supposed to check the locks on the VMFS volumes as well. Are you using iSCSI for storage?

Rumple · ‎02-06-2009

The other option is just to set the entire cluster to "leave powered on". With that setting, nothing happens to the VMs in an isolation situation; it's only on a host failure that the VMs get powered back on.
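The difference between isolation and host failure described above can be written down as a small decision table. This is a simplified model of the behavior as discussed in this thread, not VMware's actual logic; the state names, setting names and the function `ha_action` are invented for illustration.

```python
def ha_action(host_state, isolation_response):
    """Return what happens to one host's VMs in this simplified model.

    host_state: "healthy", "isolated" (host is up but has lost its
        heartbeat network) or "failed" (host is really down).
    isolation_response: "leave_powered_on" or "power_off".
    """
    if host_state == "failed":
        # A dead host holds no locks, so other hosts can restart its VMs.
        return "restart VMs on another host"
    if host_state == "isolated":
        if isolation_response == "leave_powered_on":
            return "VMs keep running on the original host"
        return "power off VMs so another host can restart them"
    return "nothing"


for state in ("healthy", "isolated", "failed"):
    print(state, "->", ha_action(state, "leave_powered_on"))
```

With "leave powered on", a network glitch that makes a healthy host look isolated leaves the VMs where they are, which is the point of the suggestion.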

Also make sure you don't have the VM heartbeat configured, as that will make a mess too.

Overall, I manage several different clusters of ESX servers and haven't had a problem with HA yet.

PS: if you are using NFS on a NetApp, make sure you've enabled file locking as per the latest best practice guide (previously it said to disable it).

TomHowarth · ‎02-07-2009

Do you by any chance have the HA heartbeat on the same network as your production LAN?

Tom Howarth VCP / VCAP / vExpert
VMware Communities User Moderator
Blog: http://www.planetvm.net
Contributing author on VMware vSphere and Virtual Infrastructure Security: Securing ESX and the Virtual Environment
Contributing author on VCP VMware Certified Professional on VSphere 4 Study Guide: Exam VCP-410

mtsm · ‎02-07-2009

No, I am using a SAN.

mtsm · ‎02-07-2009

Tom,

In this particular cluster (the one where I had the HA problem) I am using a separate VLAN for both service console and VMotion/VMkernel.

On my other cluster I have an EtherChannel for the service console (on the same LAN as production) and a closed VLAN for VMotion/VMkernel. That EtherChannel trunks both the service console VLAN and the closed VMkernel VLAN.

fyi · ‎02-07-2009

I am absolutely in agreement with you. I have been having this headache for months now! I never got any useful solution from tech support. I think HA is causing excessive reservations on the SAN side, thereby causing I/O failures. My request to the experts here: is there any HA log I could watch to check exactly what HA is doing and why? Then I would be able to troubleshoot better. The vmkernel logs and vpxa.log haven't provided much info as to why HA causes all the SCSI reservations, duplicate registrations, etc. Any help is greatly appreciated.

mtsm · ‎02-07-2009

I am also having some SCSI reservations. I am disabling HA on all clusters until I have some solid information about HA. I'm going to try modifying some parameters as suggested.

williambishop · ‎02-07-2009

That would be my guess, as it seems to be the root of a lot of people's issues out there. Most every environment you see this issue in has done it.

--"Non Temetis Messor."

mtsm · ‎02-07-2009

William, has done what? SCSI reservations?

williambishop · ‎02-07-2009

The issues with your HA when you had a network glitch, and other issues you have, I imagine (do you also have slow presentations from the VC and/or dropped hosts and services?). Do you have SCSI reservations? I saw no sign of that, but the behavior you described can be repeated pretty much verbatim in a lot of shops out there that don't isolate their HA/VC traffic from the LANs that the guests use. The two should not be combined.

mtsm · ‎02-07-2009

No, in my case both VMkernel and service console use a separate VLAN for their traffic; the VMs run in another VLAN.

kathirkk23 · ‎02-09-2009

Hi,

I am using separate switches for VMkernel, HA and VMs, and it's working fine for me; there is no issue with my HA.

It was tested and works fine. But HA is not 100% available (online): it restarts all the VMs on another ESX server.

So the name HA (High Availability) is wrong: since it's not 100% available, how can we accept it as HA?

Regards Kathir

williambishop · ‎02-09-2009

Why are you saying it's HA when it's most likely a configuration problem? In shops where it is set up correctly, it works and keeps the guests available. I know of a lot of instances, not just in my own environment, where it works and has provided HA. If something doesn't work, my first assumption is my setup, which has turned out to be the case every time so far.

Argyle · ‎02-09-2009

mtsm, we had the same issue. The solution is to change the default VM isolation response to "leave powered on" as mentioned in previous posts. You can change this on cluster level in VC (verify the setting for each VM though).
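The "verify the setting for each VM" step can be sketched as a simple check of per-VM overrides against the cluster default. This is a rough sketch under stated assumptions: the settings are assumed to have been written down by hand from the cluster configuration in VC, and the setting strings, VM names and the helper `vms_not_left_powered_on` are hypothetical, not part of any VMware API.

```python
def vms_not_left_powered_on(cluster_default, per_vm_overrides, vm_names):
    """List (vm, effective_response) pairs that would not stay powered on.

    A VM without an entry in `per_vm_overrides` inherits `cluster_default`.
    """
    risky = []
    for name in vm_names:
        effective = per_vm_overrides.get(name, cluster_default)
        if effective != "leave_powered_on":
            risky.append((name, effective))
    return risky


# Hypothetical values noted down from the cluster settings.
cluster_default = "leave_powered_on"
per_vm_overrides = {"db01": "power_off"}
vm_names = ["web01", "db01", "app01"]

for name, response in vms_not_left_powered_on(
    cluster_default, per_vm_overrides, vm_names
):
    print(name, "still uses isolation response:", response)
```

Any VM this flags would still be powered off on isolation even after the cluster default has been changed.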

The main reason is that network problems from spanning tree issues or human error can cause havoc in your VMware environment (and physical environment). It could be a junior network admin or anyone else who makes a mistake in the network environment. If there is a risk of that in your environment, setting the isolation response to "leave powered on" is highly recommended, especially with many VMs. It doesn't help to put the service console etc. on a separate network; someone could mess that network up too.

We found that human error was more likely than a hardware network problem to trigger the isolation of a host, say both NICs failing. And even if two NICs did fail, the incident would still be limited to a single ESX host, not the entire VMware farm, as is the case with spanning tree problems or network configuration errors, where all your VMs start to VMotion all over the place or, even worse, all ESX servers think they are isolated and shut down all VMs.

The main reason we use HA is for hardware redundancy and quick restart if the entire server fails. If the ESX server crashes and shuts down, it will also release the LUN-level locks on the VMs, and other ESX servers can restart them.

I also agree with williambishop. It's not an issue with VMware HA itself, but how you decide to configure it (and how your network is handled).

And in response to Kathir: the purpose of VMware HA is not to give you 100% uptime; it's to minimize the downtime in case of, say, a hardware failure on an ESX host. It's about quick detection and restart, thus giving you higher availability for your VMs than you would get on a single server.

TomHowarth · ‎02-09-2009

I agree with you, William. In all cases other than the documented U2 issue (patch now released, and not present in U3), I have found HA issues to be configuration-based, caused either by incorrect placement of the HA heartbeat (on an inappropriate network) or by DRS configuration issues.

mtsm · ‎02-09-2009

William,

What would be wrong with my environment? I am using a separate VLAN (VLAN 32) for all my VMkernel and service console traffic, and another VLAN (VLAN 33) for all my guests, and still my HA is not 100%.

They are not on different switches. Does that make any difference at all?

Why does HA make such a big mess?

TomHowarth · ‎02-09-2009

kathirkk23 wrote:
"Hi,
I am using separate switches for VMkernel, HA and VMs, and it's working fine for me; there is no issue with my HA.
It was tested and works fine. But HA is not 100% available (online): it restarts all the VMs on another ESX server.
So the name HA (High Availability) is wrong: since it's not 100% available, how can we accept it as HA?"

Because it is High Availability, not Fault Tolerance. All HA does is monitor for a host failure and then restart the guests on another host. If you need 100% uptime, then buy a Tandem or use a clustered service.
