AllBlack
Expert

Is this behaviour normal with Master isolation?

Hi,

I am testing HA in a vSphere 5.1 cluster and I want to know if the behaviour seen when a Master is isolated is normal.

I have 2 nodes in my dev setup. When I isolate my slave (Host B), the VM stays powered on as expected (isolation response: Leave Powered On).
Host B shows as not responding and the VM shows as disconnected, so this works as expected.

When I isolate my master (Host A) from the management network, the election process takes place. Host A shows as not responding, as we'd expect, but my VM appears as powered off on Host B.
The event log for the VM tells me that:

- the VM is powered off on Host B

- Host B cannot open the VMX file

- vSphere HA unsuccessfully failed over this VM

During all this time my VM is accessible and functional. As soon as Host A is no longer isolated, my VM appears as powered on there again.
So everything seems to work as it should, but the messages in vCenter say otherwise. Is this normal?

When I simulate a failed host, everything works as expected, regardless of whether it is the Master or a Slave.

Please consider marking my answer as "helpful" or "correct"
11 Replies
weinstein5
Immortal

Host B does not know that Host A is isolated. It assumes Host A has failed and tries to restart the VM, but the VM is still running on Host A, so its files are locked.

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful
AllBlack
Expert

So are you saying that a slave can never tell that a master is isolated, and therefore this behaviour is normal?

cheers

Please consider marking my answer as "helpful" or "correct"
weinstein5
Immortal

Yes, it is normal. The slave does not know that the host is isolated; it just sees that the master is gone and assumes it is down, so it tries to restart the VMs that were running on the master. If you set the isolation response to Shut Down, the VM would be restarted on the slave as soon as the locks on its files were released.
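You can model the lock behaviour with an ordinary advisory file lock - a rough Python analogy only (the path and helper name are made up; ESXi's real VM file locking is more involved):

import fcntl

def try_power_on(vmx_path):
    # Returns True if we can take the exclusive lock, i.e. no other
    # host is running this VM.
    f = open(vmx_path, "a")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)  # non-blocking attempt
        return True
    except OSError:
        f.close()
        return False  # "cannot open VMX file": another host holds the lock

# Host A (isolated, VM left powered on) holds the lock...
host_a = open("/tmp/demo.vmx", "a")
fcntl.flock(host_a, fcntl.LOCK_EX | fcntl.LOCK_NB)

print(try_power_on("/tmp/demo.vmx"))  # False - the restart attempt fails
host_a.close()                        # isolation response releases the lock
print(try_power_on("/tmp/demo.vmx"))  # True - now the restart succeeds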

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful
depping
Leadership

AllBlack wrote:

So are you saying that a slave can never tell that a master is isolated, and therefore this behaviour is normal?

cheers

Not sure I agree this is normal. The master should normally be able to tell that the host is isolated. In this environment, are you running storage on the same network links as the management network?

AllBlack
Expert

Thanks,

No, they use different links. The management network sits on a vSS using 1 Gbps uplinks. All other networks use a vDS with 10 Gbps uplinks. All networking is separated by VLANs.

The behaviour, so far, appears to be only cosmetic, but if it is not normal I'd like to fix it. There are no issues (on master or slave) when I change the isolation response to Shut Down.

Please consider marking my answer as "helpful" or "correct"
depping
Leadership

I will point one of the developers to this thread. Would you have log files / dumps available?

AllBlack
Expert

Thanks Duncan,

Yes, I have all of this available. I have opened an SR (#13370413009) and will upload the files today.
I shall post some screenshots here later too.

Cheers

Please consider marking my answer as "helpful" or "correct"
AllBlack
Expert

Support is telling me that this is normal behaviour and offers the following explanation:

"As it is the Master which reports to vCenter, until a new Master has been established, the view in vCenter will not be accurate when it is isolated.  (I believe vCenter checks the status of new Master/Slave elections every 2 minutes by default.)   That said, when the Master is not isolated and still communicating with  VC, it will give a more accurate view of the underlying VMs since it can still communicate with the Slave hosts over the storage heartbeats if available"

Cheers

Please consider marking my answer as "helpful" or "correct"
AllBlack
Expert

I have done some more testing and have some questions relating to the above. Please excuse my ignorance.

Someone recommended enabling management traffic on another vmkernel port as well. It is my understanding that all this does is give you redundancy for the HA heartbeats. I have enabled this on my storage vmkernel port; a scripted equivalent is sketched below.
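For anyone wanting to script that step, this is roughly the pyVmomi equivalent of ticking the Management traffic box on a vmkernel port (hostnames, credentials and the vmk device are placeholders; verify against your own environment):

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; keep cert checks in prod
si = SmartConnect(host="vcenter.example.local",
                  user="administrator@vsphere.local",
                  pwd="secret", sslContext=ctx)

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True)
host = next(h for h in view.view if h.name == "esx-a.example.local")
view.Destroy()

# Tag a second vmkernel adapter (vmk1 here) for management traffic,
# giving the HA heartbeats a redundant path:
host.configManager.virtualNicManager.SelectVnicForNicType("management", "vmk1")

Disconnect(si)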

When I disconnect my master's management network, an election appears to happen. The other host remains a slave, but no master is available: vCenter cannot find an HA master (until I reconnect the network to the management switch, at which point the host is still master). Would there need to be an election at all, given that there is a redundant management path? And if there is an election, why does the other host not become master?

What I do see now, though, is that the original master host and the VM appear as disconnected (as I assumed would always be the case, hence the original question).

Please consider marking my answer as "helpful" or "correct"
Accepted Solution
kfarkas
VMware Employee

The behavior you are seeing is expected.

Let's start with your recent tests. When you disconnect the existing master from the management network, the other FDM is still able to communicate with the master, and so it re-connects to it using the 2nd network. The election you observe is a result of this process: the slave FDM lost access to the master, dropped into the election state, got an "Am master" message from a master, and connected to it. An FDM master sends out "Am master" election messages on all its management networks every second, and a slave will connect to the master using any network from which it received this message.
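Roughly, in Python-flavoured pseudocode (an illustration only, not our actual FDM source; the names and structure are simplified):

import time

ANNOUNCE_INTERVAL = 1.0  # a master announces itself once per second

class FdmNode:
    def __init__(self, networks):
        self.networks = networks   # all management networks on this host
        self.state = "slave"
        self.master = None

    def on_lost_master(self):
        # Heartbeats from the master stopped on the current network:
        # drop into the election state and listen on every network.
        self.state = "election"
        self.master = None

    def on_am_master(self, master_addr, network):
        # Any "Am master" heard on any management network ends the
        # election; the slave reconnects over that network.
        if self.state == "election" and network in self.networks:
            self.master = (master_addr, network)
            self.state = "slave"

def master_announce_loop(send, networks):
    # A master broadcasts "Am master" on all its management networks.
    while True:
        for net in networks:
            send(net, "Am master")
        time.sleep(ANNOUNCE_INTERVAL)

In your test, the slave's on_lost_master fired when you pulled the management network, and the first "Am master" it heard arrived over the second (storage) network, which is why the original host stayed master.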

The reason VC reports no master, as you hint at, is that VC cannot communicate with the master. VC knows there is a master because the other FDM would have told it who the master is. I'll file a PR for us to improve the config issue text.

Regarding your original posting, I think the difference in behavior you observed is due to a problem that we have fixed in the 5.5 release. When you isolated the master, a new master election occurred. There is a race (that we closed) between the new master learning that the VMs on the old host are powered on and the master workflow for restarting VMs. If the restart workflow executed too fast, the new master would attempt to restart VMs that it later found out were running on the isolated host.
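A simplified illustration of that ordering problem (again, not the real code; the names are made up):

known_powered_on = set()   # filled in from the heartbeat datastores

def heartbeat_scan():
    # Runs after the election; learns the VM still runs on the isolated host.
    known_powered_on.add("vm1")

def restart_workflow(vm):
    if vm in known_powered_on:
        return "skip restart: VM is still running on the isolated host"
    return "attempt restart (fails: cannot open the locked VMX file)"

# Pre-5.5 ordering - the restart workflow ran before the scan finished:
print(restart_workflow("vm1"))   # -> the failed-failover events you saw
heartbeat_scan()
print(restart_workflow("vm1"))   # with the race closed, this path is taken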


Finally, a clarification of the following statement from support:

"As it is the Master which reports to vCenter, until a new Master has been established, the view in vCenter will not be accurate when it is isolated.  (I believe vCenter checks the status of new Master/Slave elections every 2 minutes by default.)  


VC actually checks for a master every 10s by default. The 2-minute value is how long VC tries to connect to a master before it reports, via an event/config issue, that it can't find one.
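In sketch form (assumed structure, not the actual VC implementation; try_connect_to_master is a made-up stand-in):

import time

CHECK_INTERVAL = 10   # seconds between master checks
REPORT_AFTER = 120    # raise the config issue after 2 minutes of failures

def monitor_master(try_connect_to_master):
    deadline = time.monotonic() + REPORT_AFTER
    while True:
        if try_connect_to_master():
            deadline = time.monotonic() + REPORT_AFTER  # reset on success
        elif time.monotonic() >= deadline:
            print("config issue: vCenter cannot find an HA master")
            deadline = time.monotonic() + REPORT_AFTER  # don't re-report every 10s
        time.sleep(CHECK_INTERVAL)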

AllBlack
Expert

Thanks for clearing this up. I appreciate that efforts are being made to improve the config issue text.

Please consider marking my answer as "helpful" or "correct"