Here's an odd one that I'm hoping someone can shed some light on. I'm building a two-node, branch-office-style appliance with 2 ESXi 4.1 Update 1 servers (Installable) based on a fresh install.
Well - I finally found some additional log entries that seem pertinent. In the ESX02 server logs, shown via the VI Client, I have a number of entries stating:
Failover unsuccessful for Machine X, on server esx02, in cluster HA, in datacenter. Reason: the operation is not allowed in the current state.
However, there is no additional detail concerning what exactly it thinks is wrong with its state. The only thing I can think of is the loss of network connectivity on the two links that are directly connected to the other ESX server. But the server can still access the local storage target, so that shouldn't be an issue.
What's strange is that initial tests on an earlier version did in fact trigger the HA restart correctly with exactly the same hardware configuration.
/var/log/vmware attached as a zip file with all of the appropriate logs.
A quick update to clarify what happened and how to avoid it.
As it turns out, a das.isolationaddress must be external to the ESX host. In my appliance, one VM acts as the firewall/router for both the virtual machines and the ESX servers, so its address was the one used by default. However, it turns out that you can't fool ESX into thinking it isn't isolated by pointing it at its own address (even though you will see pings and vmkpings passing without a problem). On top of that, if your das.isolationaddress belongs to a VM running on the ESX server itself, the host will still go into isolation mode when an HA event is triggered.
Standing back, this is actually a logical approach: if the IP address is local, the ESX host cannot have any confidence that the physical network is truly available, since it's pinging addresses local to itself, whether a management interface or a VM. But what's annoying is that it's not documented - I'm hoping to see a KB come along some time.
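The rule above can be turned into a quick pre-flight check: before committing a das.isolationaddress, verify it is not one of the host's own addresses. A minimal sketch (the IPs are hypothetical stand-ins for the management vmkernel interface and the local firewall VM, not values from my setup):

```python
import ipaddress

def is_valid_isolation_address(candidate, host_local_ips):
    """Return False if the candidate isolation address is one of the
    host's own IPs (management interface or a VM running on this host),
    since HA cannot use a host-local address to judge real isolation."""
    cand = ipaddress.ip_address(candidate)
    return all(cand != ipaddress.ip_address(ip) for ip in host_local_ips)

# Hypothetical addressing: mgmt vmk interface plus the local firewall VM.
host_ips = ["192.168.10.11", "192.168.10.1"]
print(is_valid_isolation_address("192.168.10.1", host_ips))    # False - firewall VM is local
print(is_valid_isolation_address("192.168.10.254", host_ips))  # True - external address
```

A check like this catches the misconfiguration up front, rather than discovering it only when an HA event fires.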
We got around this by reconfiguring the switches to present their admin IP on the same VLAN as the ESX servers and using them for das.isolationaddress.
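For the switch-admin-IP workaround to be useful, the switch address has to sit on the same subnet as the ESX management interfaces, so the isolation ping actually traverses the physical network HA is meant to test. A small sketch of that check, with hypothetical addressing:

```python
import ipaddress

def same_subnet(candidate, mgmt_network):
    """Check that a candidate das.isolationaddress (e.g. a switch admin
    IP) is on the same subnet as the ESX management network."""
    return ipaddress.ip_address(candidate) in ipaddress.ip_network(mgmt_network)

# Hypothetical: management VLAN 192.168.10.0/24, switch admin IPs
# re-presented on that VLAN per the workaround above.
print(same_subnet("192.168.10.2", "192.168.10.0/24"))  # True - usable
print(same_subnet("10.0.0.2", "192.168.10.0/24"))      # False - wrong VLAN
```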
What makes this difficult to identify is that the initial HA configuration will work exactly as expected and generate no errors, so none of the usual HA issues show up. The initial client configuration works perfectly, but the isolation state is _only_ triggered when there is an HA event. I would have assumed that the isolation problem would be identified immediately upon activating HA, but this is not the case - which means the initial HA client configuration and an actual HA event do not exercise the same code path.