I just have a quick question...
Right now, we have 2 ESX 3.0.1 hosts with Dual HBAs. In July, we are adding 2 more hosts. Is it really worth it to buy dual HBAs for my hosts when I have that many? Isn't HA around for host failures?
In the past, we have been buying our HBAs from our SAN vendor. They are a little bit more expensive than buying from some random vendor off of pricewatch, but they give lifetime replacement, so it's not that bad of a deal. We have standardized on QLogic cards if that matters.
If I switched the hosts to have only 1 HBA, I would probably keep a spare card lying around just in case one died... but is it really worth it to buy dual HBAs for my hosts, eat up ports on my fibre switch, etc etc waiting around for the rare case that a card is going to die?
What are your experiences?
Well, that's a question that only you can answer. You have to weigh the risk of a failure - and the associated downtime - against the cost of the HBA and supporting infrastructure. If you feel that an outage costing your company $100,000 could have been avoided by spending $3,000 on infrastructure, then you'll buy the extra HBA and fibre port.
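To make that tradeoff concrete, here's a back-of-the-envelope sketch. All the numbers (failure rate, outage cost, hardware price) are hypothetical placeholders - plug in your own:

```python
# Expected-loss comparison: cost of a hard outage weighted by the chance
# of a single-HBA failure, vs. the up-front cost of redundancy.
# All figures below are illustrative assumptions, not real data.

annual_failure_rate = 0.02   # assumed 2% chance per year the lone HBA/path fails
outage_cost = 100_000        # assumed cost of one hard outage
redundancy_cost = 3_000      # extra HBA plus a fibre switch port

expected_annual_loss = annual_failure_rate * outage_cost
server_life_years = 5        # amortize over an assumed server lifetime

if redundancy_cost < expected_annual_loss * server_life_years:
    print("Second HBA pays for itself over the server's life")
else:
    print("Hard to justify on cost alone")
```

With these made-up numbers the expected loss over five years ($10,000) dwarfs the $3,000 of redundancy, but the whole calculation hinges on how you value downtime - which is exactly the sticking point for a university.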
Thanks for the response, Ken
We're a university, so we're a bit different from the private sector in that sense. We can't really equate downtime to $$$. Also, our main administrative system cannot be virtualized, as it is AIX.
I mainly asked this question to find out what others out there are doing.
It seems to me that memory is more likely to fail than anything else, and there's not much you can do about that.
I think it's a good idea. If I am paying $20,000-plus for a server that will take the place of 12+ servers, paying an extra $1,200 for an additional HBA is good insurance. I know if you use HA you're covered if an HBA or path to the SAN fails, but it does mean a hard crash of all your VMs and possible data loss and corruption. If you have the money, I would say do it; if money is tight, then spend it on something more worthwhile like additional memory for the server. Depending on your SAN environment, the chances of a failure can be very small. I recently had a new 585 take a dump with MCE errors and a PSOD; HP came out and replaced the memory and processor board. You are correct that the likelihood of something else failing is greater. I just like to cover all my bases and use as much redundancy as possible.
My 2 cents...
There is one other thing I hadn't completely thought through...
We're booting our ESX hosts from the SAN. So, since the current version of ESX cannot handle multipathing for the boot volume, the whole box is going to go down anyway if the HBA fails.
I guess that's a downside to booting from the SAN!
Oh... this was a very heated discussion for us. We just bought a bunch of new 385g2 and 585g2 servers to replace our aging 380g2's (current environment).
For my 'sandbox' ESX server I spec'ed a single dual-port HBA because it is just that ... a test/sandbox, and it is a 385 with limited slots (but more than a 380).
For the production environment I spec'ed two HBAs per server. When the order came in, I had a single dual-port HBA for the 585's. The manager who changed the order stated that in his umpteen years with the company he had never seen an HBA failure cause downtime. ...and since the dual-port HBAs were cheaper than two single-port cards and accomplished the "same thing," he made the executive decision!
I'm sure when we have the first crash it will be more along the lines of why didn't "I" spec the hardware right.
...went to lunch with our SAN guy not long after the order came in, and he was chuckling about it. We've had HBA failures before, and the only reason the manager didn't really know about it was because the second HBA failed over the way it was designed to!!
So to my point: I'm a firm believer in two HBAs in a production/critical box!
Eric
Ha, that's funny: you have 2 distinct and different paths all the way to the SAN, but a fibre card that is a single point of failure....
Make a diagram for the guy... showing three rivers. Two of them have two bridges and one has one. Then ask him what happens to traffic if bridge A, B, etc. goes down.
Some people see a dual-path architecture as a means to provide redundancy at the SAN/switch level rather than at the box level (i.e., I am concerned if a switch fails vs. I am concerned if an HBA fails). It really boils down to your own design/requirements, but dual-port HBAs have a place in some circumstances......
Massimo.
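To put some rough numbers behind the two-bridges argument: with two truly independent paths, the chance of losing both at once is the product of the individual failure probabilities. A minimal sketch (the per-path availability figure is made up for illustration):

```python
# Availability of one path (HBA -> switch -> controller) vs. two
# independent paths. The 99.9% figure is an illustrative assumption.

path_availability = 0.999

single_path_downtime = 1 - path_availability        # fraction of time one path is down
dual_path_downtime = single_path_downtime ** 2      # both independent paths down at once

print(f"single path unavailable: {single_path_downtime:.4%}")
print(f"dual paths unavailable:  {dual_path_downtime:.6%}")
```

The caveat, as the posts above point out, is that the paths must actually be independent end to end - a single dual-port HBA shares one card and one PCI slot, so its two ports do not multiply out this way.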
In answer to mwheeler1982's question: you do always want two paths out of the server. Granted, there is an additional cost associated with this, and maybe, depending on your SLAs or environment, you can skimp on it.
As for one dual-port card vs. two single-port cards: yes, two is better than one, but in some environments this can be difficult to do. The typical 2U server has two (DL320) or three (DL380, Dell 2950) expansion slots, which leaves you balancing between network and storage connections. If you have only two slots, you have no choice but to go with a dual-port card - unless you run VMotion, the Service Console, and your VMs over the two on-board NICs; do-able... but tight. With three slots, if you use single-port cards you may be forced to use a quad-port NIC if you need many network connections. This is of course a limit of small but powerful servers.
I know one of the big server makers is coming out with a 2U, dual quad-core server with 128 GB of RAM and four on-board network cards that they are targeting at the virtualization market - no doubt we'll find some other limitation.
In our environment we've made it a standard part of our server order to put two HBAs in each host. It was a long argument with management for the extra $$$, but in the end it has paid off for us.
Besides the fact that hardware failures do occur, there is also human error to deal with. Misconfigurations, improper zoning, poor cabling, accidents, etc.
In our case, the most common failure we have had was on the FC switch itself. We have had 6 cases that I can remember where one of the four "quads" on a blade went out, taking down all four ports on that quad. To get it fixed we had to replace the blade. Before going dual-connected, that meant downtime, plus extra time and effort when replacing the failed blade. Now that the hosts are dual-connected, we can do maintenance whenever we need to without downtime.
I would recommend dual HBAs in every host, unless the VMs hosted there can be down for extended periods of time on short notice.
Make sure you document it now; in a year, when everyone (even him) has forgotten he did this, the peckerhead will deny he ever did such a thing....
CYA.
With fully redundant dual paths, maintenance becomes less of a problem. If you need to upgrade firmware or work on one switch in one fabric, there should be no downtime needed for your ESX hosts. We have redundant switches, controllers, HBAs, etc., and we, too, are a university. Most of the redundancy is required for our hospital and financial applications, but the same needs apply to virtualization since we currently have 200 VMs operating.
Bill S.
Keep with the dual. Do you have dual switches as well? Without them, it doesn't really matter. Personally, I have seen more switch failures than individual card failures.
We only install ESX with 2 x 1-port HBAs. We use dual-port HBAs in our dev servers.
Like others have said, it is not the HBA failure that you are insuring against, but rather the paths to the SAN. Think of the case where the SAN team needs to do maintenance on the fibre switches. Typically, the switches are redundant and they will make changes to one at a time. If you are not dual-pathed, either you will go down hard or you will need to take your ESX server down before the change.
-MattG