I have a client who is looking to deploy an IBM HS21 BladeCenter with the ESX 3.0.2 hosts booting from SAN. The blades would use iSCSI HBAs and boot from a NetApp SAN. Can anyone provide any recommendations, suggestions, or things to watch out for? I believe the HBA is a Qlogic; the Server Configuration Guide (p.113) states that the HBA has to be a Qlogic QLA 4010. Is this the only HBA that can be used? Any experiences with booting from iSCSI SAN?
Thank you in advance.
Double-check the requirement for an HBA that uses the QLA 4010 driver: that was for ESX 3.0. ESX 3.0.1 and 3.0.2 ship with the QLA 4022 driver (http://www.vmware.com/pdf/vi3_io_guide.pdf, page 2)!
Other drivers may "work", but I would absolutely stick with an HBA that uses the QLA 4022 driver, as this is what's supported.
Here are some links that might help you:
If you found this information useful, please consider awarding points for "Correct" or "Helpful". Thanks!!!
You should be OK. You'll have to configure the card to boot from a particular LUN. IBM (unlike HP) supports a mezzanine version of the Qlogic QLA 4050/4052, so this should work great.
Although, with 3i warming up in the wings, that's a lot of money to be throwing at specialized hardware that could be replaced in a few weeks by a $5 USB stick... Granted, the card will still work well for your iSCSI shared VMFS storage, but unless you're dealing with racks and racks full of blades it seems like overkill to put ESX on the SAN. We've been running 30 HP blades since May (not all ESX) and have had zero hardware failures that I'm aware of.
We recently migrated 4 IBM HS20 and 2 IBM HS21 blades running VMware ESX from a Fiber Channel Boot-from-SAN to an iSCSI Boot-from-SAN configuration. We used the IBM 32R1923 iSCSI HBA which uses the qla4022 driver. It's basically the IBM blade server equivalent of the Qlogic QLA4052C.
Setting up Boot-from-SAN with these adapters was easy, but it was not without a few issues, mainly involving failover of the COS (Console OS) in the event of a complete path failure.

One of the issues was pretty simple: the default KeepAliveTimeout (KATO) for the boot volumes was 30 seconds, as that's what the HBA is set to out of the box. The problem with this is that the failover time can basically be calculated as n * 2 + 5, where n is the KATO value. (The ql4xportdownretrycount driver parameter, which defaults to 14, sets this value for dynamically discovered volumes -- volumes discovered by the settings you use in ESX itself -- but the boot LUN has to be configured in the HBA BIOS, since it obviously has to exist before ESX is loaded.) With the default, 30 * 2 + 5 = 65 seconds, so it will take at least that long for the boot LUN to fail over. However, the COS also has a timeout of its own: it will set the /dev/sda device offline if it can't complete a write within ~60 seconds. Since the two timeouts are so close together, it can happen that the console OS times out the device before the failover completes. Once this happens the COS becomes effectively unusable; the VMs will survive, but since the COS is pretty much unusable you can't VMotion them off or anything. The easy fix here is to simply use the iscli utility to set the KATO to 14 globally for both ports, and also individually for the boot LUN targets (for target 0, set the TGT_KATO value).
The above is enough for a normal Qlogic HBA; unfortunately, the IBM 32R1923 has another issue which impacts failover and causes it to fail. It appears that IBM defaults the "TaskManagementTimeout" value to a very high number (2560), whereas normal Qlogic 4052 HBAs use a more typical value (10). Getting the IBM adapters changed to 10 was painful. It was very easy to change the value globally for each port using the iscli utility; however, any statically bound volumes keep the value that was set when they were bound. Since boot volumes have to be bound before you can actually run the iscli utility, the boot volume retained the 2560 value for its targets. The iscli utility has an option to change the individual target values for the parameter, but attempting to change them kept giving the error "HBA parameter value 10 invalid for TGT_TaskManagementTimeout". To correct this issue, I had to set the TaskManagementTimeout value at the HBA level for both ports, reboot the system, use the HBA BIOS setup to clear all persistent bindings, and finally reconfigure the boot LUN. This finally allowed the boot LUN to pick up the TaskManagementTimeout value of 10.
After getting the KATO and TaskManagementTimeout values configured properly for the boot LUNs, we've found that iSCSI boot from SAN works great. We can time a failover with a stopwatch and get almost exactly 33 seconds every time, just like our old FC setup, and the COS survives every time.
It sounds like a lot of work, and I guess it is, but once you figure it out it's actually pretty easy: about 5 minutes a blade in our case. Making sure that COS failover worked was hugely important for us because, in a blade center environment, a switch module failure could potentially kill an entire path for every blade. You wouldn't want this scenario to take down the COS on all of your ESX servers: even though the VMs survive, you'd still have to eventually reboot every ESX server, and since you can't VMotion the VMs once the COS fails, it would still mean downtime for the VMs eventually.
I'm also trying a boot-from-SAN iSCSI setup: IBM blades with the QLA4052C, Nortel 2/3 switches, and a FAS2050.
I only have one path configured to the filer. I have installed ESX and the attach kit.
1) ESX hangs on the 4022 driver for 5-10 minutes on startup (the install routine did it also).
2) ESX fails to boot about half the time. It stops at a maintenance prompt and wants me to put in a USB memory card to get the files.
Usually I can reboot and it comes up fine.
Any ideas why this would happen?
Good day. Not sure if you resolved this yet.
In my boot-from-iSCSI-SAN environment we had similar issues:
1) Disable/unplug one iSCSI NIC if you have two paths; the ESX 3.5 documentation says the installer only wants one path.
2) If using jumbo frames, set your switch to a value of 9022 or 9008 to give some buffer over 9000 bytes.
3) Turn off Spanning Tree on your switch ports, or use the 'PortFast' setting instead.
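On the ESX side, enabling jumbo frames for the vSwitch carrying iSCSI traffic looks roughly like the following. This is a sketch only; the vSwitch name is an assumption for illustration, and the physical switch ports must also be configured for the larger frame size (point 2 above):

```shell
# Assumed vSwitch name (vSwitch1) -- substitute your own.
# Set the vSwitch MTU to 9000 for jumbo frames; the physical switch
# should allow slightly more (e.g. 9022) as headroom, per point 2.
esxcfg-vswitch -m 9000 vSwitch1

# List vSwitches to verify the MTU change took effect.
esxcfg-vswitch -l
```

Note that the Spanning Tree / PortFast change in point 3 is made on the physical switch itself, not on the ESX host.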