Bob_Jenkins
Contributor

VMs flicking between ESX hosts and generating HA errors

Dear All

This might seem an obvious one, but I would really appreciate some confirmation of my thoughts.

We have an ESX 3.5 U3 environment with 16 hosts being managed by VirtualCenter. Each physical host has two NICs for each of the three vSwitches (Service Console, VMkernel and VM network). The two NICs of each vSwitch are fed by two different Cisco 6509s for redundancy purposes. Last week, for a separate reason, one of the Ciscos was shut down, meaning that none of the three vSwitches had network redundancy in place.

Yesterday it was noted that VMs were flicking between hosts - i.e. a VM would be visible on ESX06 for 5 secs, then ESX11 for 5, then 06 again, etc. The event log shows that HA is attempting to swap the VM from physical host x to physical host y repeatedly. It is still possible to RDP to the VMs, however there is no possibility of managing other aspects of the VM via VC (e.g. VMotion), as, I guess, VC never knows exactly where it should be!

My initial feeling is that HA is looking at the Service Console vSwitches of all of the physical hosts, determining that there is a lack of redundancy, and is therefore trying to move VMs from one host to another, then getting stuck in a loop. However, I thought that HA was only invoked (i.e. actually moving a VM from one host to another) if a particular host actually winked out and fell off the network completely - not just if it lost some redundancy...

I would appreciate your thoughts. And by the way, I'm not sure when the networks team will be able to get the second Cisco up and running again, so I would like to discount the possibility that this problem might be caused by something else.

Many thanks in advance for your help.

weinstein5
Immortal

HA only kicks in if an ESX server fails, or if you are doing VM monitoring and it determines there is a loss of the VM heartbeat that is provided by VMware Tools. HA does not move VMs but restarts them, so you will see an outage in accessing the VMs.

- HA will not activate if there is no Service Console redundancy; it will just throw a warning. Is it a single VM that is doing this, or all VMs in the cluster? I would disable per-VM monitoring, or at least reduce the heartbeat monitoring sensitivity. Could it be VMotion (e.g. DRS) rather than HA?

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful

Bob_Jenkins
Contributor

Thanks. I have been speaking to HP and they reckon that this is perhaps a known issue around the Isolation Response characteristics of HA and that since the Service Console vSwitch doesn't have redundancy on the hosts, it is still trying to use HA to move the VMs to a host with a redundant SC...

depping
Leadership

HA will not move guests to a host with redundancy. It will give an error at the cluster level, but that is it. Failover will only occur when, for some reason, your host appears to be isolated. You might want to take a look at your log files; they will most definitely reveal the truth.
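For reference, these are roughly the places to look on an ESX 3.5 host when checking what HA and the management agents were doing. This is a sketch: exact paths can vary by build (the AAM directory name in particular), so treat them as starting points to verify on your own hosts.

```shell
# Host management agent (hostd) log - VM registrations, power ops, VC sync:
less /var/log/vmware/hostd.log

# vCenter agent (vpxa) log - what VirtualCenter asked this host to do:
less /var/log/vmware/vpx/vpxa.log

# HA agent (Legato AAM) logs - directory name may differ per build:
ls /opt/LGTOaam512/log/

# Quick filter for failover/isolation-related messages:
grep -i "failover\|isolat" /var/log/vmware/hostd.log | tail -20
```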

Duncan

VMware Communities User Moderator

If you find this information useful, please award points for "correct" or "helpful".

jbrauer
Contributor

Are you using NFS? If so, do you have the NFS lock option set?

I had this problem on a two-node cluster that was using NFS. I had a server initiate an HA event and it ended very badly: both hosts were running the VMs. I did not have the NFS lock option set. In the end, I had 8 VMs get corrupted.

I called VMware support and they told me to get off NFS as it was not their preferred solution. Luckily I was able to isolate the problem for a couple of weeks until our SAN vendor was onsite; I talked to them and they told me about the NFS lock option. I think the NFS lock option was new in Update 2, or at least that is what the SAN vendor told me, because we contracted them to help us set up NFS on this cluster since it was our first NFS deployment.

depping
Leadership

The NFS lock option has been around for a while, and it used to be a best practice to disable locking. This is probably what caused your issues. That best practice was changed a couple of months ago.
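For anyone wanting to check this on their hosts: a minimal sketch from the service console, assuming the advanced option is named NFS.LockDisable on your ESX 3.x build (verify the option name with esxcfg-advcfg before changing anything).

```shell
# Check the current value:
# 0 = NFS locking enabled (the current recommendation)
# 1 = locking disabled (the old best practice mentioned above)
esxcfg-advcfg -g /NFS/LockDisable

# Re-enable NFS locking if it was disabled:
esxcfg-advcfg -s 0 /NFS/LockDisable
```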

Duncan

VMware Communities User Moderator

If you find this information useful, please award points for "correct" or "helpful".

Bob_Jenkins
Contributor

Hi - no, we're not using NFS. To clarify, the VMs APPEAR to go from one host to another (5 secs here, 5 there), but actual HA failover does not occur. We were running perfectly (and there were no configuration changes) until the second Cisco was taken offline - then the problems started to occur. The VMware guy from HP still reckons that this is a potential known issue with U3 and that it occurs when the Service Console's redundancy is suddenly lost. The workaround is to disable/re-enable HA (or to re-establish the failed link)...

weinstein5
Immortal

I think I might know what is going on. I am assuming it is from the physical switches that you are seeing the VMs 'flick', and I assume you have redundant physical NICs on your vSwitches. If this is the case, what is happening is that the vSwitches are detecting a network failure and failing over between the physical NICs. Make sure you have set the speed and duplex on your physical NICs.
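A quick sketch of how to check this from the ESX 3.x service console (vmnic0 below is just an example; match the vmnic names to your own vSwitch uplinks, and make sure any forced setting matches the Cisco port config on the other end):

```shell
# List physical NICs with their current speed/duplex and link state:
esxcfg-nics -l

# Show which vmnics back which vSwitches:
esxcfg-vswitch -l

# Force a NIC to 1000/full instead of auto-negotiation:
esxcfg-nics -s 1000 -d full vmnic0
```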

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful

_bC_
Contributor

Due to the design of my network (long story) I also get this problem...

When the line(s) between two different datacenters break, some machines always start to jump/flicker between two ESX servers.

I haven't done too much research on it, but when it happens (once a year or so) I just restart the VirtualCenter Server service and everything is back to normal again.

So I guess the problem is in the VC server somewhere, and it is triggered by a network communication error. I don't even believe the VMs are moved at all, just rolled back in the final stage.

But since it doesn't happen very often, and I know it is mainly caused by a poor design on my side, I decided to ignore it...

But if anyone has solved this I would be very happy.

I hope, after every update I make, that it will be fixed, but this behaviour has always been there for me...

// bC

Troy_Clavell
Immortal
_bC_
Contributor

I've been reading it, and I have the symptoms.

Since this is due to a network error and not a true host error, the solution is not a perfect match...

Because it always happens in the following way:

In DC1 I am running VirtualCenter in a VM, on an ESX server which is part of a cluster.

In DC2 I am running some VMs on other ESX servers which are members of the same cluster.

HA and DRS are enabled, and each DC has its own SAN on which its ESX servers' LUNs are located.

When the line breaks, the ESX servers' "primary" LUNs are not affected, so the ESX servers are still capable of executing "their" VMs.

But since there has been a communication failure, VC initiates a VMotion because it thinks the host is down (it cannot tell the difference between a network error and a host error; and since the management network should be redundant, this network error should never happen in the first place...).

My other ESX servers in DC1 are, however, capable of responding to the VC requests, but they do not have access to the LUNs located on the SAN in DC2 due to the network error...

And the ESX servers in DC2 are more or less unaware of the problem, since they "only" lost their mappings to the DC1 LUNs.

When communication between DC1 and DC2 is re-established, the VMotion tries to finish the migration.

And then I have exactly the same situation as two three-year-old children fighting over the same toy (but in my case it is two ESX servers fighting over executing the same VM...).

And by restarting the VirtualCenter Server service the "fight" stops...

// bC

depping
Leadership

Sorry, but that just doesn't make sense:

  1. HA doesn't use VMotion to move VMs to a new host, though vCenter does when a VM is started

  2. When HA kicks in there's always downtime (the VM is powered off and powered on again on a different host)

  3. vCenter doesn't have any part in starting the VMs; that's the failover coordinator, which is one of the primary ESX hosts

  4. Hosts from a datacenter should be bound to one cluster, and all hosts should have access to all LUNs

Duncan

VMware Communities User Moderator

If you find this information useful, please award points for "correct" or "helpful".

Pmarsha1
Enthusiast

I have also seen this problem with NFS datastores, BUT with locking off.

In my case the only fix was to reboot the host which the VM was seen to be flicking to, not the one it was actually hosted on. It looks like a VMware bug to me, not network related, given that all hosts can see the NFS datastores and locking is off.

_bC_
Contributor

Nope, but still, I can only speak for my environment: the flickering stops when the VirtualCenter Server service is restarted.

The issue must be HA related in some way, because when I have known communication maintenance coming up, I just stop HA in advance and this never happens.

But if HA is on and a communication error occurs, then sometimes, not always, some machines start to flicker, and a VCSS restart solves it.

I forgot to write that I am using iSCSI over the same communication line; that's why the LUNs disappear... (one part of the "poor design" I mentioned earlier)

// bC

depping
Leadership

To be completely honest, it sounds like you will most definitely need to redesign your environment.

And in a way it's HA related indeed.

Duncan

VMware Communities User Moderator

If you find this information useful, please award points for "correct" or "helpful".

_bC_
Contributor

The world is full of compromises, and I think I can live with this compromise. (It could also be related to the fact that mankind is lazy by nature...)

The setup is part of our disaster recovery plan, and for us the pros outweigh the cons of the current setup (even though it is "poor design" from a pure technical point of view)...

The easiest way to fix it otherwise would be the common design "one DC = one cluster" and just replicate SAN-to-SAN, but then there would be more configuration to get everything up and running again in case of DR.

I currently have a mix of everything: replication with Double-Take for all machines, and SAN-to-SAN replication for larger file shares (raw LUN mappings), with the VM replica configured as cold standby on the other SAN (I don't really trust Double-Take's failover)...

With this setup I am also more flexible with backups from a network-load perspective.

So, bottom line: the flickering issue is the only true con of the setup; all the other cons are "easily" manageable.

// bC

Bob_Jenkins
Contributor

Dear All

Many thanks for the multiple detailed, helpful replies. I spent literally six hours on the phone yesterday with our VMware support guy at HP and, after a tonne of troubleshooting, finally got it resolved. Just to clarify: the "flicking" or alternating-VM issue was, in our case at least, definitely not related to NFS. Here are the steps we took to resolve it.

Steps performed:

1. Disabled / re-enabled HA for the cluster, leaving half an hour between disabling and re-enabling.

2. Tried to perform the steps as per the KB article from the previous posting, but failed with an error message.

3. The error message referred to the VMware Perl library not being available (this is a red herring - I SHOULD have been able to elegantly unregister the ghost VMs using the instructions in the KB article).

4. Tried to remove the virtual machines from the inventory, but the option was greyed out.

5. Tried to power off the virtual machines, but they showed as powered on even though the VMs were shut down.

6. Logged in to the ESX hosts through PuTTY and restarted the services (service mgmt-vmware restart).

7. The VMs seemed to be stabilized and were not moving around... for now(!)
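For anyone hitting the same thing, the command-line side of the cleanup above looks roughly like this on an ESX 3.x host over SSH. The .vmx path is an example only; take the real paths from the listing first.

```shell
# List the VMs registered on this host:
vmware-cmd -l

# Check the (possibly stale) power state of a suspect VM:
vmware-cmd /vmfs/volumes/datastore1/myvm/myvm.vmx getstate

# Unregister a ghost VM that VC cannot remove:
vmware-cmd -s unregister /vmfs/volumes/datastore1/myvm/myvm.vmx

# Restart the management agent so VC gets a clean view of the host:
service mgmt-vmware restart
```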

_bC_
Contributor

Great that it worked out for you.

I forgot to write that if you have this issue again, you could also try connecting to the ESX servers one by one with the Infrastructure Client (and not through VirtualCenter). Sometimes I have noticed that if the problem is initiated by a network error, the same machine is powered on on two different ESX servers and it is therefore not possible to unregister it; but once it is powered off, it can be unregistered. (The machine usually does not have any disks and won't boot up, but it is nevertheless powered on from the ESX point of view and therefore blocking.) And since this should never happen, the only way to see it visually is with the Infrastructure Client connected directly to the ESX server...

// bC
