VMware Cloud Community
Dryv
Enthusiast

HBA Failure Outcome

Hi Chaps,

I have:

- An ESXi cluster made up of HP blade servers

- Each blade has a single dual-port HBA

- One port goes to SAN 1 and the other goes to SAN 2

If one port on the single HBA fails... no issues

However, what is the expected outcome if the actual HBA fails? I'm struggling to understand how ESXi would handle this...

Can I test this within vCenter? Is there a way for me to disconnect the HBA from the host within vCenter?

Thanks

Dryv

martinriley
Hot Shot

Hi there,

So am I correct in saying that your host losing the HBA will lose access to all its datastores in this instance?

If so, your host will essentially be looking at an All Paths Down (APD) condition: the I/O for all VMs on that host will time out and fail, and the VMs will be halted.

If you have an HA-enabled cluster and all admission control policies are met, these VMs will be restarted on other hosts. If not, you will need to power the VMs down before they can be restarted elsewhere (or restore access to the datastores for your host).

In terms of testing from vCenter, I'm not sure how best to do this. As far as I'm aware, vCenter will usually do all it can to prevent an APD, as this obviously isn't a desirable state! If physically removing the cables from the HBA or switch isn't an option (pulling cables is my preferred way of DR testing anything!), another good way to do this remotely is to shut down the ports on the switch or switches your HBA is connected to.

Hope this helps, good luck

vM

-----------------------

VCAP-DCD / VCAP-DCA / VCP-CLOUD / VCP-DT / VCP5 / VCP4

-----------------------

vMustard.com

Dryv
Enthusiast

Hey... thanks for the response.

Yep, the host losing the HBA (both ports on the single HBA) will no longer be able to see its datastores.

I can disable the ports on the switch, no issues... I can even pull cables... But because I'm running on blades, all blades share the same uplinks to the SAN switch, so if I disable all the SAN switch ports or unplug the cables, I disconnect all hosts from the SAN. I thought I might be able to disable the HBA for a given host from within vCenter or the blade's iLO, as I just want to test what happens if one blade's HBA fails.

Do I need to set anything up to deal with an APD condition outside of configuring my cluster for HA?

...running ESXi 5.5 u2.

Thanks again

Dryv

martinriley
Hot Shot

Ah yes, understood!

Disabling the paths from that host should produce the same result, in that the host will not be able to connect to the datastores. This post has a neat (albeit dangerously powerful) script that may help.

You don't need to configure anything other than standard HA, everything else is built in.

Good luck!

Dryv
Enthusiast

Fantastic... thank you so much.

My fear was that isolating the SAN would just leave the VM on the host in limbo, as the host networking is fine and the host is therefore still taking part in the network heartbeats HA uses to determine it's not dead. I thought storage heartbeats were not even looked at until network heartbeats fail, when HA determines what action to take... still learning...

... Now time to test!

martinriley
Hot Shot

No problem. It'd be interesting to see your test results if you fancy posting back; always nice to see the theory turn into practice :)

Incidentally, it's not storage heartbeats at play here as such; as you mention, those are for host isolation detection. ESXi uses other wiles to determine storage failure depending on the type of outage, e.g. complete loss of an array (APD) or loss of a LUN (PDL - essentially the array is able to 'tell' the hypervisor that a LUN is offline and may not come back).

VMware have actually enhanced this functionality in version 6 with a feature called VM Component Protection (VMCP), which makes it more like the 'Host Isolation Response' configuration: it allows you to dictate what will happen to a host's VMs when the various storage states are detected. This might be useful to you and could be worth checking out.

Thanks

vM

-----------------------

VCAP-DCD / VCAP-DCA / VCP-CLOUD / VCP-DT / VCP5 / VCP4

-----------------------

vMustard.com

Dryv
Enthusiast

Thanks bud. Yep, I'll post back... currently exploring the blade Virtual Connect to identify whether I can pull the host's HBA card from there to simulate this.

Good news that there are other factors involved, rather than just HA playing its part, in determining SAN visibility to the host!

Thanks for the script... but scripts scare me! I don't feel I'm experienced enough.

Dryv
Enthusiast

Okay... small update...

I tried to disable the blade HBA from within the Blade Center, but there was no such option. What I was able to do, however, was unassign the SAN from each port. At present one port goes to SAN1 and the other to SAN2. By unassigning the SAN from each port, the blade in question lost its paths to the storage... I could see that in vCenter... However, the VM remained on the host; it didn't move. I was also able to SSH to the VM and log in... strange, given there are no paths left to the disk?

I might have to use the script you sent me a link for... If you don't mind, I might need some guidance on what tools I need to even get the script running in the first place... All I have installed is the VI Client.

It's all a test environment right now, so no issues if I break something... not bothered at all.

Thanks for the time so far!

martinriley
Hot Shot

That's interesting. There's no way the VM would have been up and running with no storage, so either it's not on the storage you were expecting or it's still managing to get to the storage somehow!

If you can enable SSH on the host and SSH in, run the following command to list your VMFS datastores and the storage devices backing them:


esxcli storage vmfs extent list


Then grab the 'Device Name' (e.g. naa.60002ac000000000000001c5000abcd5f) of the storage your VM is on and run:


esxcli storage nmp device list -d <Device Name>

Here <Device Name> is the Device Name you grabbed in the first command. This will show you the working paths to the device, which might give you some clues as to what's going on.
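If you end up checking this repeatedly during testing, the 'Working Paths' line can be pulled out of saved command output with a few lines of Python. This is only a minimal sketch: the sample text below is a trimmed, hypothetical example of what `esxcli storage nmp device list -d <Device Name>` prints (the device name, display name and vmhba numbers are made up), and the function name `working_paths` is my own, not part of any VMware tool.

```python
# Hypothetical, trimmed sample of `esxcli storage nmp device list -d <Device Name>`
# output. The device name and adapter names are placeholders for illustration.
SAMPLE_OUTPUT = """\
naa.60000000000000000000000000000001
   Device Display Name: Example Fibre Channel Disk
   Storage Array Type: VMW_SATP_ALUA
   Path Selection Policy: VMW_PSP_RR
   Working Paths: vmhba1:C0:T0:L1, vmhba2:C0:T0:L1
   Is USB: false
"""


def working_paths(nmp_device_output):
    """Return the path names listed on the 'Working Paths:' line, if any."""
    for line in nmp_device_output.splitlines():
        # Split on the first colon only, since path names contain colons too.
        key, sep, value = line.strip().partition(":")
        if sep and key == "Working Paths":
            return [p.strip() for p in value.split(",") if p.strip()]
    return []  # no 'Working Paths' line found


if __name__ == "__main__":
    print(working_paths(SAMPLE_OUTPUT))
    # prints ['vmhba1:C0:T0:L1', 'vmhba2:C0:T0:L1']
```

An empty list for the device your VM lives on would suggest the host really has lost every path to it; anything else means the VM can still reach its disk through the listed paths.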

As for the script, you'll need PowerCLI to run it, which means you'll need a Windows machine with connectivity to the vCenter. If you're running Windows vCenter then it may already be installed there (though you might have to add the snap-in); otherwise you can download PowerCLI and install it.

Dryv
Enthusiast

I'll check again tomorrow... but so far I have only presented 2 LUNs to the cluster... I haven't presented any others, as I'm currently running through basic verification tests...

... It definitely stayed on the host and didn't move off... let me try to reproduce and get some screenshots in...

Appreciate the responses. 

Dryv
Enthusiast

I'm going to try something slightly different this time to simulate this HBA failure scenario. Instead of unpresenting the SAN from the blade center management interface as before, I thought I would run through these steps:

To enable or disable a path for your storage in the vSphere Client:

  1. Select the ESXi/ESX host you want to modify and click the Configuration tab.
  2. Click Storage.
  3. Select a datastore or mapped LUN.
  4. Click Properties.
  5. In the Properties dialog, select the desired extent, if necessary.
  6. Click Extent Device > Manage Paths and obtain the paths in the Manage Path dialog.
  7. Right-click the desired path and click Disable or Enable. If the currently active path is disabled, it forces a path failover.

And disable every path to every presented LUN... I only have 2 LUNs, so it won't take too long... let me know if you think this would adequately achieve what I'm trying to simulate.
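For reference, the same path-by-path disable can also be done from an SSH session with `esxcli storage core path set --state=off --path=<path>`. Below is a rough Python sketch that only builds those command lines from saved `esxcli storage core path list` output, so they can be reviewed before anything is run. The sample text is a trimmed, hypothetical listing; the adapter names (vmhba0, vmhba1, vmhba2), device names and the helper names `parse_paths` and `disable_path_commands` are assumptions for illustration, so check which vmhba numbers your dual-port HBA actually shows up as.

```python
# Hypothetical, trimmed sample of `esxcli storage core path list` output.
# Record headers, device names and adapter names are placeholders.
SAMPLE_PATH_LIST = """\
example-path-uid-1
   Runtime Name: vmhba1:C0:T0:L1
   Device: naa.60000000000000000000000000000001
   Adapter: vmhba1
   State: active

example-path-uid-2
   Runtime Name: vmhba2:C0:T0:L1
   Device: naa.60000000000000000000000000000001
   Adapter: vmhba2
   State: active

example-path-uid-3
   Runtime Name: vmhba0:C0:T0:L0
   Device: mpx.vmhba0:C0:T0:L0
   Adapter: vmhba0
   State: active
"""


def parse_paths(path_list_output):
    """Split the listing into one dict of 'Key: value' fields per path."""
    records, current = [], None
    for line in path_list_output.splitlines():
        if not line.strip():
            continue  # blank line between records
        if not line[0].isspace():
            current = {}  # an unindented line starts a new path record
            records.append(current)
            continue
        key, sep, value = line.strip().partition(":")
        if sep and current is not None:
            current[key] = value.strip()
    return records


def disable_path_commands(path_list_output, hba_adapters):
    """Build one 'path set --state=off' command per path on the given adapters."""
    return [
        "esxcli storage core path set --state=off --path=" + rec["Runtime Name"]
        for rec in parse_paths(path_list_output)
        if rec.get("Adapter") in hba_adapters and "Runtime Name" in rec
    ]


if __name__ == "__main__":
    # Assumption: the two HBA ports appear as vmhba1 and vmhba2 on this host.
    for command in disable_path_commands(SAMPLE_PATH_LIST, {"vmhba1", "vmhba2"}):
        print(command)
```

Nothing here touches the host; it only prints text. Running the same commands with `--state=active` should bring the paths back after the test.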
