VMware Cloud Community
eableson
Contributor

HA not working

Here's an odd one that I'm hoping someone can shed some light on. I'm building a two-node branch-office style appliance with two ESXi 4.1 Update 1 (Installable) servers based on a fresh install.

Each server has a local VM that uses vmDirectPath to manage a dedicated SAS HBA using SANSymphony-V (previously SANMelody) from Datacore to provide replicated iSCSI storage between the two servers.
Storage networking between the boxes is via direct cable on vmnic2 and 3, using one for ESXi traffic and the other dedicated to replication traffic between the SSY VMs.
configuration.png
Vmnic0 and 1 are on a single vSwitch carrying management and VM traffic.
The setup is a closed box using pfSense as the firewall, running in a VM, which does the mapping to internal resources. vCenter is also running in a VM.
HA is configured and showed no errors during activation. The only non-standard setting is the host isolation response, which is set to Leave running in case the pfSense VM dies or its host ESX goes down and it needs to be restarted on the other server.
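For reference, here's a rough sketch of what that cluster HA setup would look like if driven through the vSphere API instead of the VI Client - I'm using the pyvmomi Python bindings purely as an illustration, and the vCenter address, credentials and cluster name below are placeholders, not the ones in this appliance.

# Illustrative sketch only: enable HA on a cluster with the isolation
# response set to "leave powered on". All connection details and the
# cluster name "HA" are placeholders.
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim
import ssl

ctx = ssl._create_unverified_context()    # closed appliance, self-signed cert
si = SmartConnect(host="vcenter.local", user="administrator",
                  pwd="password", sslContext=ctx)
content = si.RetrieveContent()

# Find the cluster object by name.
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "HA")
view.Destroy()

das = vim.cluster.DasConfigInfo(
    enabled=True,
    defaultVmSettings=vim.cluster.DasVmSettings(
        restartPriority="medium",
        isolationResponse="none"))         # "none" == leave VMs powered on
spec = vim.cluster.ConfigSpecEx(dasConfig=das)
cluster.ReconfigureComputeResource_Task(spec, modify=True)

Disconnect(si)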
One of my standard tests is to ensure that HA is running correctly by doing a hard stop on one of the servers and observing the reactions.
The answer (from a practical standpoint) is: nothing. No VMs are restarted on the second server. I did observe one message on the second server regarding purging configuration from microcode or microkernel (the VI Client is a non-English install), but other than that, there is no visible reaction for a regular user.
I've been digging around in the aam logs and have noticed a few things that seem odd, particularly that the backbone service gets shut down. However, the aam service does clearly indicate that it has detected that the other server died, but at that point it goes no further.
Info from /var/log/vmware/aam/backbone/2_ftbb.log_bak:
Backbone info Fri Apr  8 15:36:21 2011
Initial path to site 1 using ip=172.16.16.101 pxi=0x80de234 ints=1
Backbone info Fri Apr  8 15:36:21 2011
Setting my incarnation number to 3
Backbone info Fri Apr  8 15:36:21 2011
learned my correct incarnation: I am site 2/3
Backbone info Fri Apr  8 15:36:21 2011
FD: new weight: 2, old 1, quorum 0; new size 2 old 1 (check when size > 50)
Backbone info Fri Apr  8 15:36:21 2011
site view 1/2: 1/3 2/3
Backbone info Fri Apr  8 15:39:33 2011
Received shutdown request for site 1/3
Backbone info Fri Apr  8 15:39:33 2011
Site 1/3 now marked as dead
Backbone info Fri Apr  8 15:39:33 2011
FD: new weight: 1, old 2, quorum 1; new size 1 old 2 (check when size > 50)
Backbone info Fri Apr  8 15:39:33 2011
site view 2/3: 2/3
Backbone info Fri Apr  8 15:39:33 2011
you are dead: 1/3
Backbone info Fri Apr  8 15:39:33 2011
you are dead: 1/3
Backbone info Fri Apr  8 15:39:46 2011
Shutting down: termination of <backbonesrv> detected)

Locally, it shows the storage as being up correctly, with half of the paths broken since the other server is no longer running. I've also tested by manually shutting down the first server's iSCSI server to ensure that the storage failover works correctly.
Does anyone have any ideas why HA would simply not work?
Oh, and all IP addresses are in the private ranges, so it's not that known bug with addresses in the 6-9.0.0.0 ranges.
eableson
Contributor

Well - I finally found some additional log entries that seem pertinent. In the ESX02 server logs, shown via the VI Client, I have a number of entries stating:

Failover unsuccessful for Machine X, on server esx02, in cluster HA, in datacenter. Reason: the operation is not allowed in the current state.

However, there is no additional detail concerning what exactly it thinks is wrong with its state. The only thing I can think of is the loss of network connectivity on the two links that are directly connected to the other ESX server. But the server can still access the local storage target, so that shouldn't be an issue.
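For anyone chasing the same message, one way to pull the full event text (with timestamps) out of vCenter without scrolling through the VI Client is to query the cluster's events through the API. A rough pyvmomi sketch, with placeholder connection details and cluster name:

# Illustrative sketch only: dump recent events for the "HA" cluster so the
# full "Failover unsuccessful ..." entries can be read and grepped.
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim
import ssl

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.local", user="administrator",
                  pwd="password", sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "HA")
view.Destroy()

# Filter to the cluster and everything under it (hosts and VMs).
spec = vim.event.EventFilterSpec(
    entity=vim.event.EventFilterSpec.ByEntity(entity=cluster, recursion="all"))
for ev in content.eventManager.QueryEvents(spec):
    print(ev.createdTime, ev.fullFormattedMessage)

Disconnect(si)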

What's strange is that initial tests on an earlier version did in fact trigger the HA restart correctly with exactly the same hardware configuration.

/var/log/vmware attached as a zip file with all of the appropriate logs.

eableson
Contributor

A quick update to clarify what happened and how to avoid it.

As it turns out, a das.isolationaddress must be external to the ESX host. In my appliance, there is one VM that acts as the firewall/router for the virtual machines and the ESX servers, and its address was thus the one used by default. However, it turns out that you can't fool the ESX out of going into isolation mode by using its own address (even though you will see pings and vmkpings passing without a problem). On top of that, if your das.isolationaddress is on a VM that is running on the ESX server, the host will also go into isolation mode if an HA event is triggered.

Standing back, this is actually a logical approach, since if the IP address is local, the ESX cannot guarantee or have confidence that the physical network is truly available when it's pinging addresses that are local to itself, whether on the Management interface or a VM. But what's annoying is that it's not documented - I'm hoping to see a KB article come along some time.

We got around this by reconfiguring the switches to present their admin IPs on the same VLAN as the ESX servers and using those addresses for das.isolationaddress.
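For completeness, here's roughly what that change looks like when pushed through the vSphere API rather than the advanced options dialog in the VI Client - again a pyvmomi sketch with placeholder names and addresses, not the actual ones in this appliance:

# Illustrative sketch only: point HA at external isolation addresses (the
# switch admin IPs) and stop it from using the default gateway, which in
# this design is a VM. Addresses and cluster name are placeholders.
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim
import ssl

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.local", user="administrator",
                  pwd="password", sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "HA")
view.Destroy()

das = vim.cluster.DasConfigInfo(
    enabled=True,
    option=[
        vim.option.OptionValue(key="das.isolationaddress1", value="172.16.16.253"),
        vim.option.OptionValue(key="das.isolationaddress2", value="172.16.16.254"),
        vim.option.OptionValue(key="das.usedefaultisolationaddress", value="false"),
    ])
cluster.ReconfigureComputeResource_Task(
    vim.cluster.ConfigSpecEx(dasConfig=das), modify=True)

Disconnect(si)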

What makes this difficult to identify is that the initial HA configuration will work exactly as expected and will generate no errors, so none of the usual HA issues show up. The initial client configuration works perfectly, but the isolation state is _only_ evaluated when there is an HA event. I would have assumed that the isolation problem would have been identified immediately upon activating HA, but this is not the case, which means it's not the same code path for the initial HA client configuration and for when an HA event occurs.
