We have been running some tests in our environment with less than ideal results. Our environment consists of 6 vSphere 4 hosts in one cluster connected to about 10 VMFS volumes. Each host has a dedicated vSwitch with two pNICs, 3 vmkernel ports per NIC, and each NIC is connected to a different physical switch. We have set the path policy to Round Robin, so each volume shows 6 active paths. On the EqualLogic side, we have 3 members in a group, one pool for that group, and the controllers are connected to / balanced across those same two physical switches (Nexus 5020). This was based on the guide from the EqualLogic site. http://www.equallogic.com/resourcecenter/assetview.aspx?id=8453
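For reference, this is roughly how we verified the Round Robin setup from the service console. A sketch for ESX 4.x syntax; `naa.xxxx` is a placeholder for one of your EqualLogic volume device IDs, not a real identifier from our environment:

```shell
# List all devices with their current path selection policy
esxcli nmp device list

# Set Round Robin (VMW_PSP_RR) on a specific device (naa.xxxx is a placeholder)
esxcli nmp device setpolicy --device naa.xxxx --psp VMW_PSP_RR

# Show the paths for that device; with the setup above each volume
# should report six active paths
esxcli nmp path list --device naa.xxxx
```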
Our issue arises when we try to test failover / availability by simulating a switch outage. The EqualLogic group sees numerous timeouts / login errors, but it still has NICs active on the active controller, so there is no controller failover. The hosts lose one of the two NICs on the storage vSwitch, but even with one NIC up and online, all VMFS volumes disappear from every host. When we bring the switch back online, all the VMFS volumes come back without our taking any corrective measures, and the VMs appear as if nothing has happened. When we try to use the VMs, we see performance degradation until we rescan / clean up the storage communication, and then everything returns to normal.
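In case it helps anyone reproducing this, the rescan / cleanup step I mentioned can be done from the service console as well. A sketch, assuming `vmhba33` is the software iSCSI adapter on your host; substitute your own adapter name:

```shell
# Rescan a single HBA for new or changed LUNs (vmhba33 is a placeholder)
esxcfg-rescan vmhba33

# Refresh the host's view of VMFS volumes after the rescan
vmkfstools -V
```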
We're at a loss on where to begin. We have followed the recommendations from EqualLogic, and of course, being the VMware guy, I'm pointing at the storage. The hosts still show active paths when this happens, and the errors are login timeouts, which is the other reason I think it would be good to start on the storage side. Anyone out there with similar configurations who may have run into something like this?
s1xth-
Thanks for that confirmation, I agree with your analysis.
mikeddib-
I was told in pre-sales that upgrading minor firmware releases involved no downtime and could be done on the fly. The release notes for 4.3.4 indicate that it should be done during a maintenance window (which I would of course do anyway) and involves a reboot. Did you reboot your EQL box after the update? I only have one, so I hate the idea of having to shut down all of my VMs before doing this. Just curious.
dfollis- No problem....don't worry, the fix is on the way...:-) Basically it's just log noise.
dfollis- In regards to updating the firmware on your array: you don't need to shut the VMs down to do this. EQL has the process so fast now that the controllers fail over in at most 12 seconds with a large update. In a small incremental update (e.g., 4.3.2 to 4.3.4) the failover is 4 seconds. I updated last night with running VMs and no issues. I wait for a lull in I/O activity, then I update. If you are using the iSCSI initiator inside the guest OS and using the HIT kit, the timeout values are already set high enough, so you shouldn't see any disconnects.
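For anyone running a guest iSCSI initiator without the HIT kit, it's worth checking the guest's disk timeout yourself before an update. A sketch for a Linux guest; `sdb` is a placeholder for your iSCSI disk, and the 60-second value is my assumption of a safe figure that comfortably covers the failover times above, not a documented EQL recommendation:

```shell
# Check the current SCSI command timeout (in seconds) for the iSCSI disk
cat /sys/block/sdb/device/timeout

# Raise it so a short controller failover doesn't surface as guest I/O errors
echo 60 > /sys/block/sdb/device/timeout
```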
Hey there, has anyone watching this thread actually installed the patch from yesterday?
I think we are going to roll this into our development cluster and see what behavior we can test but didn't know if anyone had any experience with this already and had any words of advice. I will definitely post back once we have done the install on our side.
Check out this thread (there are multiple threads referencing this problem):
http://communities.vmware.com/message/1507163#1507163