VMware Cloud Community
mikeddib
Enthusiast

Problem with EqualLogic / MPIO / Failover

We have been running some tests in our environment with less than ideal results. Our environment consists of 6 vSphere 4 hosts in one cluster connected to about 10 VMFS volumes. Each host has a dedicated vSwitch with two pNICs, 3 vmkernel ports per NIC, and each NIC is connected to a different physical switch. We have set the path policy to Round Robin, so each volume has 6 active paths listed. On the EqualLogic side, we have 3 members in a group, one pool for that group, and the controllers are connected to and balanced across those same two physical switches (Nexus 5020). This setup is based on the guide from the EqualLogic site: http://www.equallogic.com/resourcecenter/assetview.aspx?id=8453

Our issue appears when we test failover / availability by simulating a switch outage. The EqualLogic group sees numerous timeouts / login errors, but still has NICs active on the active controller, so there is no controller failover. The hosts lose one of the two NICs on the storage vSwitch, but even with one NIC up and online, all VMFS volumes disappear from every host. When we bring the switch back online, all the VMFS volumes come back without us taking any corrective measures, and the VMs appear as if nothing has happened. When we try to use the VMs, we see performance degradation until we rescan / clean up the storage communication, and then everything returns to normal.
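
For illustration, the path arithmetic behind this test can be sketched as follows. This is a simplified model under stated assumptions, not output from the hosts: the NIC, vmkernel, and switch names are hypothetical, and it only counts which paths should remain when one switch is down (2 pNICs x 3 vmkernel ports = 6 paths, 3 of which hang off each switch).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    pnic: str    # physical NIC carrying the path (hypothetical name)
    vmk: str     # vmkernel port bound to that NIC (hypothetical name)
    switch: str  # physical switch the NIC is cabled to (hypothetical name)


def build_paths(pnic_to_switch, vmk_per_nic):
    """Create one path per vmkernel port on each physical NIC."""
    paths = []
    for pnic, switch in pnic_to_switch.items():
        for i in range(vmk_per_nic):
            paths.append(Path(pnic, f"{pnic}-vmk{i}", switch))
    return paths


def surviving_paths(paths, failed_switch):
    """Return the paths that do not traverse the failed switch."""
    return [p for p in paths if p.switch != failed_switch]


# Layout from the post: two pNICs, each on a different switch, 3 vmkernel ports per NIC.
pnic_to_switch = {"vmnic_a": "switch_1", "vmnic_b": "switch_2"}
paths = build_paths(pnic_to_switch, vmk_per_nic=3)
remaining = surviving_paths(paths, failed_switch="switch_1")

print(f"total paths: {len(paths)}, expected after one switch fails: {len(remaining)}")
```

Under this model half of the paths should stay usable during a single-switch outage, which is why losing every VMFS volume is unexpected.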

We're at a loss as to where to begin. We have followed the recommendations from EqualLogic and, of course, being the VMware guy, I'm pointing at the storage. The hosts still show active paths when this happens, and the errors are login timeouts, which is the other reason I think it would be good to start on the storage side. Is anyone out there running a similar configuration who has run into something similar?

24 Replies
dfollis
Enthusiast

s1xth-

Thanks for that confirmation, I agree with your analysis.

mikeddib-

I was told in pre-sales that upgrading between minor firmware releases involved no downtime and could be done on the fly. The release notes for 4.3.4 indicate that it should be done during a maintenance window, which I would of course do anyway, and that it involves a reboot. Did you reboot your EQL box after the update? I only have one, so I hate the idea of having to shut down all of my VMs before doing this. Just curious.

s1xth
VMware Employee

dfollis- No problem... don't worry, the fix is on the way. :-) Basically it's just log noise.

http://www.virtualizationimpact.com http://www.handsonvirtualization.com Twitter: @jfranconi
s1xth
VMware Employee

dfollis- Regarding updating the firmware on your array: you don't need to shut the VMs down to do this. EQL has made the process so fast now that the controllers fail over in 12 seconds at most with a large update. With a small incremental update (e.g. 4.3.2 to 4.3.4) the failover takes 4 seconds. I updated last night with no issues and with VMs running. I wait for a lull in I/O activity and then I update. If you are using the iSCSI initiator inside the guest OS with the HIT kit, the timeout values are already set high enough, so you shouldn't see any disconnects.
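
The reasoning here (failover window shorter than the I/O timeout means no disconnect) can be sketched as a quick check. This is only an illustration: the 4 and 12 second figures are the ones quoted above, while `ASSUMED_IO_TIMEOUT_SECONDS` is a placeholder assumption, not a value taken from the HIT kit or from this thread. Substitute whatever timeout your guests are actually configured with.

```python
def survives_failover(failover_seconds, io_timeout_seconds):
    """True if the controller failover completes before the I/O timeout expires."""
    return failover_seconds < io_timeout_seconds


# Failover durations as reported in the post above.
reported_failovers = {
    "small incremental update": 4,
    "large update": 12,
}

# ASSUMPTION: placeholder timeout for illustration only; check your own guests.
ASSUMED_IO_TIMEOUT_SECONDS = 60

for label, seconds in reported_failovers.items():
    ok = survives_failover(seconds, ASSUMED_IO_TIMEOUT_SECONDS)
    verdict = "within timeout" if ok else "exceeds timeout"
    print(f"{label}: {seconds}s failover, {verdict}")
```

If the configured timeout were shorter than the failover window, the same check would flag the update as one to schedule with the VMs quiesced.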

mikeddib
Enthusiast

Hey there, has anyone watching this thread actually installed the patch from yesterday?

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=101949...

I think we are going to roll this into our development cluster and see what behavior we can test, but I didn't know if anyone had any experience with this already and had any words of advice. I will definitely post back once we have done the install on our side.

s1xth
VMware Employee

Check out this thread (there are multiple threads referencing this problem):

http://communities.vmware.com/message/1507163#1507163
