shcrgeek
Contributor

Host failover due to storage failure -- does this exist?

Howdy, VM'rs.

We have an EMC DMX 3-20 Fibre Channel SAN and 3 ESX 3.5 U3 hosts in a cluster.

Two of these hosts have only one HBA; the third has two.

In our testing, we've found that on the two-HBA system the host will see the failure of one HBA and switch to the other. This is seamless. My one regret is that it won't show any alert or alarm indicating a path failure.

When we completely detached that running host from the SAN, the host alerted on a storage failure -- but did not try to migrate the guests running on it.

I've looked the documentation over, and searched the KB, but no joy. No mention, at least not with the terms I used.

So.. did I miss something trivial, or am I looking for something that doesn't exist yet?

Failover on total NIC failure does work, and it's pretty slick.

Thanks in advance for any replies.

Texiwill
Leadership

Hello,

In order for a 'migration' to occur via DRS, the storage must be connected -- VMotion requires both the source and destination hosts to see the VM's disks.

You need to instead look at implementing VM-level failover using VMware HA. If the host fails, THEN the VM would fail over to another host. This is not a migration but a restart of the VM on the new host.


Best regards,
Edward L. Haletky
VMware Communities User Moderator, VMware vExpert 2009
====
Author of the book 'VMware ESX Server in the Enterprise: Planning and Securing Virtualization Servers', Copyright 2008 Pearson Education.
Blue Gears and SearchVMware Pro Blogs -- Top Virtualization Security Links -- Virtualization Security Round Table Podcast

shcrgeek
Contributor

Hi Texiwill.

Thank you for your reply.

We do use HA. It worked last month when, for some unknown reason, one of the hosts had some kind of CPU freeze.

That host was running 3.5.0, and after being brought back up it was rebuilt with 3.5.0 U3 and rejoined to the cluster. All appeared well.

Today I tested the NIC failover as part of my ongoing redundancy testing.. and unlike my initial test months ago, this time the NIC test failed as well.

I now have an HA problem.. and I honestly can't understand why: we have 3 hosts, all configured for HA, and two of them have ~40% RAM free and ~70% CPU free.

The other host is basically empty but for 3 test VMs -- this is the machine I'm testing with, the one with the two HBAs and the two NICs.

With all that said, the thing now tells me, despite all that goodness, that I have zero failover capacity. Why? The setup is exactly as it was before, only now I have two 3.5.0 U3 hosts instead of one.

I'm stumped. I haven't removed and recreated the cluster yet, but.. why did I lose all failover? Makes no sense. I even turned on "Allow virtual machines to be powered on even if they violate availability constraints".

My VC is 2.5.0, build 119598.

So now I'm thinking -- perhaps it is this freshly-discovered lack of failover that prevented my SAN test from going well.. but why? Why does it insist that it has no failover capacity?

Any help will be appreciated.

Texiwill
Leadership

Hello,

I have had this issue as well and just had to remove a host from the cluster and then bring it back in -- or, even better, disable VMware HA and re-enable it. I would investigate the log files in /var/log/vmware/aam for issues related to failover and what is happening. It may be a simple issue; for example, I currently have auth issues which prevent HA from working.
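Something like this quick sweep will surface them.. a rough sketch only -- the file names under /var/log/vmware/aam vary by build, and the patterns are just my guesses at what is worth flagging:

import glob
import os
import re

# Rough sketch: sweep the AAM (HA agent) logs for failover-related noise.
# The directory and the patterns are assumptions; adjust for your build.
pattern = re.compile(r'error|fail|denied|auth', re.IGNORECASE)

for path in sorted(glob.glob('/var/log/vmware/aam/*')):
    if not os.path.isfile(path):   # skip subdirectories
        continue
    fh = open(path)
    num = 0
    for line in fh:
        num += 1
        if pattern.search(line):
            print('%s:%d: %s' % (path, num, line.rstrip()))
    fh.close()

(The service console's Python is an old 2.x, so the sketch sticks to the basics.)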


Best regards,
Edward L. Haletky
francisbandi
Enthusiast

HA failover calculations are based on the least-capable server configuration.

In this HA cluster, if one ESX host has less memory and CPU horsepower, the failover capacity will be based on that host.

Likewise, the VM with the largest RAM and CPU allocation is used for the calculation, to come up with the number of VMs that can be hosted in a failover scenario.

I believe these are enforced after VC 2.5 U2.
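Roughly, the arithmetic looks like this.. a simplified sketch of the conservative slot math as I understand it (the real admission control in VC is more involved, and the numbers are illustrative):

# Simplified, illustrative sketch of HA's conservative slot arithmetic.
# Per-host capacity as (total MHz, total MB RAM): three identical hosts,
# each 2 x quad-core @ 3 GHz with 32 GB RAM.
hosts = [(24000, 32768), (24000, 32768), (24000, 32768)]

# The slot size is driven by the largest CPU/RAM reservation among the
# powered-on VMs; with no reservations HA falls back to small defaults.
slot_mhz, slot_mb = 256, 256        # no reservations anywhere
# slot_mhz, slot_mb = 2000, 16384   # one forgotten 16 GB reservation

slots = [min(mhz // slot_mhz, mb // slot_mb) for mhz, mb in hosts]

# Tolerating one host failure means assuming the biggest host dies.
remaining = sum(slots) - max(slots)
print('slots per host: %s' % slots)
print('slots surviving a worst-case host failure: %d' % remaining)

Run it with the commented-out reservation line instead and the per-host slot count collapses from 93 to 2 -- which is how a single forgotten reservation can wipe out your failover capacity.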

Texiwill
Leadership

Hello,

I would also ensure you have VC v2.5 U3. It will help.


Best regards,
Edward L. Haletky
shcrgeek
Contributor

Hi francisbandi.

All 3 hosts, while slightly different in physical makeup, are of the same family and have identical specs.. they're 2 Dell 1950s (1U) and 1 Dell 2950 (2U).

Each has 2 quad-core CPUs at 3 GHz and 32 GB of RAM. They're monsters, every single one of them.

I'll look into upgrading the VC. Isn't build 11958 the almost-latest? I haven't dropped in the one that came out the other day yet.

shcrgeek
Contributor

I tried the remove / re-add trick already.. and the disable / re-enable HA trick..

Waah, this worked before the stupid CPU freeze and upgrade.. what the hell did I break..

I'll deal with it after lunch. >.<

Thanks, folks. I'll keep trying.

Oh, and just in case -- I don't use reservations. I see that come up a lot in discussions of this very problem..

francisbandi
Enthusiast

You may want to try this.. we had luck with it once..

- Create a new cluster and enable HA and DRS with 1-host failover.

- Disconnect each host from the old cluster (it is safe to disconnect) and join it to the new cluster.

There is a chance that your cluster is corrupted and not calculating the failover capacity properly.

We did this successfully on one of the clusters that showed this HA error.

shcrgeek
Contributor

OK, so I fixed the broken HA thing.. there were a few guests that, without my knowledge, had reservations.

Removed those, and I'm back to having HA capacity -- and tested it.

But -- if I unplug both HBAs from the test system, there's no failover of any kind. The guests stay on the storage-less host, and they ping, but nothing else.. I can't get to them, can't log into them by console or RDP.

Worst of all, if I attempt to migrate them via VC away from the system with failed storage, the moves leave the machines in a state where they can't be powered on. Through some process that I don't understand, they later become available for power-on, but when they do, they complain that their identifier has changed, and so I create a new ID.
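As I understand it, the "identifier" it complains about is the UUID stored in each guest's .vmx file, in lines something like these (values made up):

uuid.location = "56 4d 12 34 56 78 9a bc-de f0 12 34 56 78 9a bc"
uuid.bios = "56 4d 12 34 56 78 9a bc-de f0 12 34 56 78 9a bc"

When the files turn up under a new path, ESX notices the mismatch at power-on and asks whether to keep the UUID or create a new one. Picking "create" rewrites those lines, which would explain why the guests come back thinking they're new machines; answering "keep" should preserve their identity.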

What's going on, folks? Am I going about this the wrong way? Am I expecting something from VMware that it can't yet deliver? Or is something misconfigured in my system?

This kind of failover works with the network: if I unplug both network connections, the cluster eventually sees it, and within minutes the guests are all running again. Yet this isn't working with storage.. it sees the failures -- it doesn't alert me, but it sees them. You can see it in Path Management: when an HBA's gone, its paths show up as "DEAD" with a red circle next to them.

So if it senses the outage, why not move the machines away?
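In the meantime I may just poll for the dead paths myself.. a rough sketch, run from the service console, assuming 'esxcfg-mpath -l' marks failed paths with the word 'dead' (verify the output format on your own build first):

import os

# Rough sketch: list the storage paths and complain about anything dead.
listing = os.popen('esxcfg-mpath -l').read()
dead = [ln for ln in listing.splitlines() if 'dead' in ln.lower()]

if dead:
    for ln in dead:
        print('DEAD PATH: %s' % ln.strip())
    # hook real alerting in here (email, SNMP trap, etc.)
else:
    print('all paths healthy')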

Summary:

1. Fixed HA and tested it. It works, but only if the host loses its network.

2. Tried unplugging both HBAs in the test system. This took 15 minutes to be noticed by VC, and then it did absolutely nothing.

Again, thanks, and any help will be appreciated.
