We've upgraded to ESX 4 and implemented round-robin MPIO to our EQL boxes (we didn't use round robin under 3.5). However, I'm seeing 3-4 entries per day in the EQL log that indicate a dropped connection. See the logs below for the EQL and vCenter views of the event.
EQL Log Entry
INFO 10/06/09 23:50:32 EQL-Array-1
iSCSI session to target '192.168.2.240:3260, iqn.2001-05.com.equallogic:0-8a0906-bc6459001-cf60002a3a648493-vm-exchange' from initiator '192.168.2.111:58281, iqn.1998-01.com.vmware:esxborga-2b57cd4e' was closed.
iSCSI initiator connection failure.
Connection was closed by peer.
Lost path redundancy to storage device naa.6090a018005964bc9384643a2a0060cf.
Path vmhba34:C1:T3:L0 is down. Affected datastores: "VM_Exchange".
6/10/2009 11:54:47 PM
I'm aware that the EQL box will shuffle connections from time to time, but those appear in the logs as follows (although vCenter will still display a Lost path redundancy event):
INFO 10/06/09 23:54:47 EQL-Array-1
iSCSI session to target '192.168.2.245:3260, iqn.2001-05.com.equallogic:0-8a0906-bc6459001-cf60002a3a648493-vm-exchange' from initiator '192.168.2.126:59880, iqn.1998-01.com.vmware:esxborgb-6d1c1540' was closed.
Load balancing request was received on the array.
Should we be concerned, or is it normal operation for the ESX iSCSI initiator to drop and re-establish connections?
We are running what I believe is the latest firmware, v6.1.1.016. I reset all the port stats today so I have a clean slate going forward.
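For reference, here is roughly how we did the port binding and round-robin setup on ESX 4 from the service console. The vmk/vmhba names are from our hosts and the device ID is from the log above; treat this as a sketch of the documented esxcli workflow rather than an exact transcript of our commands:

```shell
# Bind each iSCSI vmkernel port to the software iSCSI HBA
# (vmk1/vmk2 and vmhba34 are our names; yours may differ)
esxcli swiscsi nic add -n vmk1 -d vmhba34
esxcli swiscsi nic add -n vmk2 -d vmhba34

# Switch the EQL volume to the Round Robin path selection policy
esxcli nmp device setpolicy --device naa.6090a018005964bc9384643a2a0060cf --psp VMW_PSP_RR

# Verify the policy and list the active paths
esxcli nmp device list --device naa.6090a018005964bc9384643a2a0060cf
esxcfg-mpath -l
```

A rescan of the software iSCSI HBA is needed after the bindings are added for the extra paths to show up.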
From: SteveH15547 <firstname.lastname@example.org>
Date: 12/28/2009 10:21 AM
I was just searching for this error online and found this thread...great stuff guys.
My situation is as follows:
2xPE1950 with vSphere
1xEnterasys N7 switch (will be adding another switch later on)
I've followed the documents for setting up the EQL with vSphere, but I have noticed some of these disconnections happening, especially after testing taking part of the network down; it took a couple of hours for the messages to stop and everything to settle down. In vSphere everything generally seemed OK, but I wanted to look into the messages.
If I find any more information on this over the next few weeks, I'll post back here.
We are still waiting for a patch from VMware. Hopefully it will be in January's patch release schedule.
I hope no one minds a quick side question somewhat related to MPIO. I noticed you can only enable vMotion on one of the six paths; is this true, or am I missing something? It seems to me you should be able to enable vMotion on all paths so there is redundancy if one path fails, but this doesn't appear to be the case. Has anyone else noticed this? Thanks.
From: s1xth <email@example.com>
Date: 12/30/2009 12:12 PM
You shouldn't be using your iSCSI network for vMotion. You need a separate network or VLAN for vMotion. Many here recommend a dedicated pNIC, which is also what I recommend, on its own layer 2 VLAN.
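If it helps, on classic ESX a dedicated vMotion port group and vmkernel interface can be created from the service console along these lines. The vSwitch name, VLAN ID, and addresses below are made-up examples; vMotion itself is then ticked on for that vmkernel port in the vSphere Client:

```shell
# Add a VMotion port group on the vSwitch carrying the spare pNIC
# (vSwitch1 and VLAN 20 are illustrative values)
esxcfg-vswitch -A VMotion vSwitch1
esxcfg-vswitch -v 20 -p VMotion vSwitch1

# Create a vmkernel interface on that port group
esxcfg-vmknic -a -i 10.0.20.11 -n 255.255.255.0 VMotion
```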
Wow, I didn't realize. I guess I have some reading to do. I have a separate VLAN just for iSCSI, separate from our production network (soon to be separate switches). I have two physical NICs for the console on that VLAN; maybe I will use one of those for vMotion. I have six NICs: 2 for console, 2 for production, 2 for iSCSI, with console and iSCSI on the same VLAN for isolation from the production network.
From: s1xth <firstname.lastname@example.org>
Date: 12/30/2009 02:43 PM
I would assume that by console you mean the service console. In the cluster of 3 servers that I run, I also have 6 GbE NIC ports (onboard dual plus add-in quad). Until I added iSCSI, 2 NIC ports (one on each NIC) were for the service console and vMotion, with the remaining 4 for production networks (in one vSwitch with multiple VLANs). At that point I moved 2 ports (one on each NIC) from the production networks to iSCSI, since I consider iSCSI traffic more important than production traffic: a failure on the storage network is more likely to lead to data loss and corruption.
Since you have 6 NIC ports, I'd think that this configuration would be best. If you had fewer, it would be acceptable to separate iSCSI from the production networks by VLAN, provided the production network is throttled and with the understanding that it is not a recommended configuration (if not an unsupported one). I'm pretty sure this is documented; I will try to find it and post.
vsp_40_iscsi_san_cfg.pdf p30 ("Configuring iSCSI Initiators and Storage", "Setting Up Software iSCSI Initiators", "Networking Configuration for Software iSCSI Storage"):
"VMware recommends that you designate a separate network adapter entirely for iSCSI." I believe the recommendation is also that vMotion be separated from iSCSI and production networks but is commonly combined with the service console.
Yeah, what I recommend is dedicating two physical NICs to your service console vSwitch (e.g. vmnic0 and vmnic1). On the NIC teaming tab, set the service console port to use vmnic0 as the active adapter and make vmnic1 a standby. Then make a vMotion port and use vmnic1 as active and vmnic0 as standby. This way your service console traffic will always be on one adapter and vMotion traffic on another, yet you still have redundancy in case of a path failure.
I prefer this setup because a vMotion can max out the bandwidth of the physical network card, and you still want to be able to reach your ESX host via the service console port. So if you use only one physical NIC for both SC and vMotion traffic, in my opinion you're taking an unnecessary risk.
Never thought to do that, but it makes even more sense from a predictability standpoint while still offering the redundancy. Since I only have two virtual ports used on that vSwitch, it probably is already separating them, but that behavior is less predictable than the manual assignment. Thanks for the idea.
Thanks for the info. I was doing some reading today on my day off, and I think I will combine vMotion and the service console across the 2 onboard NICs. I have two dual-port Intel NICs I split between production and iSCSI on separate VLANs.
From: grcumm <email@example.com>
Date: 12/30/2009 09:03 PM
Regarding the dropping issue...I am only in test currently, but want to go live ASAP. What are the implications of this drop in connections I am seeing every few days?
Will I risk losing data? Will it actually cut connections and disconnect clients/services when this happens?
I want to go live with this, but if there are chances of data loss, then obviously I will have to delay until this patch arrives.
Thanks for everyone's comments.
Are you using a 3:1 iSCSI configuration (i.e. three vmkernel ports per physical NIC) or a 1:1? If you are using a 3:1 setup, I would switch to 1:1, as you won't see any drops in the connections. If you decide to keep it at 3:1, you should not have any data problems, as the array will always have other connections to use. I went the safe route and just removed 2 of the vmk ports until a patch is released; re-adding the ports is a simple process anyway.
Hopefully we see a patch in this month's January patch release.
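Dropping back to 1:1 as a workaround just means unbinding the extra vmkernel ports from the software iSCSI HBA. A sketch, assuming the example vmk/vmhba names used earlier in the thread:

```shell
# List the vmkernel ports currently bound to the software iSCSI HBA
esxcli swiscsi nic list -d vmhba34

# Unbind the extra ports, leaving one vmk per physical NIC
# (vmk2/vmk3 and vmhba34 are example names)
esxcli swiscsi nic remove -n vmk2 -d vmhba34
esxcli swiscsi nic remove -n vmk3 -d vmhba34
```

Rescan the HBA afterwards so the stale paths age out.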
I logged a call with vmware and was told yesterday:
"At the moment there is no confirmed release date for a fix to this issue."
I am still trying to see what this really means. They did ask if I would be interested in testing a possible fix, waiting to hear about on that also.
p.s. unrelated but is there a way to get alerted on the new patches? I added my email somewhere on the vmware site to get alerts but it doesn't seem to be working.
That stinks. I think this is a more serious issue with the MPIO than originally expected. Really disappointing. The only good side to this problem is that I can still use MPIO in a 1:1 configuration and not have drops. Still, there is no reason why it shouldn't work the way they intended.
I also signed up to be notified about patch releases via email and also did not receive anything. I follow a few VM guys on Twitter and saw them post the patch releases this morning.
Hmmm...I don't want to scare you, but I am using the 1:1 scenario and seeing drops. Not very often, but every couple of days.
I'm having to hold back on going live with this now, and the client is obviously getting restless.
I REALLY hope vmware get this sorted soon.
I haven't seen any drops in my 1:1. They definitely need to get this fixed soon. I still have tickets open with EqualLogic on this, and they are being told the same thing: there is a patch "in the works" but no release date or time frame.
Well, I just talked to the VMware support rep in charge of our case, and he confirmed that in our 1:1 scenario we do indeed have the same issue. He confirmed it affects vSphere only, so reinstalling 3.5 is a possible "workaround". Other than that, he said at least one client is testing a possible fix, but they have no idea when a fix could be available; it could be weeks or months.
Not really the news I/we were looking for.