We've updated to ESX 4 and have implemented round robin MPIO to our EQL boxes (we didn't use round robin under 3.5); however, I'm seeing 3-4 entries per day in the EQL log that indicate a dropped connection. See the logs below for the EQL and vCenter views of the event.
EQL Log Entry
INFO 10/06/09 23:50:32 EQL-Array-1
iSCSI session to target '192.168.2.240:3260, iqn.2001-05.com.equallogic:0-8a0906-bc6459001-cf60002a3a648493-vm-exchange' from initiator '192.168.2.111:58281, iqn.1998-01.com.vmware:esxborga-2b57cd4e' was closed.
iSCSI initiator connection failure.
Connection was closed by peer.
Lost path redundancy to storage device naa.6090a018005964bc9384643a2a0060cf.
Path vmhba34:C1:T3:L0 is down. Affected datastores: "VM_Exchange".
6/10/2009 11:54:47 PM
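When one of these events fires, you can check the path state from the CLI. A rough sketch, assuming the vSphere 4 `esxcli nmp` namespace (the same one used later in this thread); the naa device ID is the one from the log entry above, so substitute your own.

```shell
# Sketch: inspect path state for the affected device after a drop
# (vSphere 4 esxcli namespace; the naa ID is taken from the log above).
esxcli nmp path list -d naa.6090a018005964bc9384643a2a0060cf

# Confirm the path selection policy currently set on the same device:
esxcli nmp device list -d naa.6090a018005964bc9384643a2a0060cf
```

If a path shows as dead here while the EQL side only logged a load-balancing move, that is worth capturing for the support case.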
I'm aware that the EQL box will shuffle connections from time to time, but those events appear in the logs as follows (although vCenter will still display a lost path redundancy event):
INFO 10/06/09 23:54:47 EQL-Array-1
iSCSI session to target '192.168.2.245:3260, iqn.2001-05.com.equallogic:0-8a0906-bc6459001-cf60002a3a648493-vm-exchange' from initiator '192.168.2.126:59880, iqn.1998-01.com.vmware:esxborgb-6d1c1540' was closed.
Load balancing request was received on the array.
Should we be concerned or is it now normal operations for the ESX iscsi initiator to drop and re-establish connections?
Excellent... let's keep the thread updated with what EQL says to us all. Seems that we are all having the same issues and have done all the usual troubleshooting. I just opened a case via the EQL customer portal site; probably won't hear anything until tomorrow.
I will also ask about a beta of the MPIO plugin; it would be great if that would come out soon, it seems like it is taking forever. *Not saying that would fix the problem, though.
I really pushed this EQL box to upper management, EQL better fix this...quick.
Not sure if this is related, but for others in this thread, there is a post here on the storage forums regarding [Change the value of DefaultTimeToWait parameters.|t-243091]
According to Andy in that post, there is a patch coming out to fix this problem. There are a couple of EQL users posting in that thread as well with a very similar scenario. The two may be related... or not. Just wanted to point it out.
Yeah, I just got off the phone with the EqualLogic support folks. They indicated that this was a known problem with the SW iSCSI initiator in ESX 4, and that I should contact VMware support with the reference number PR484220 to be placed on a list to be notified when a patch is released. I will give them a call this afternoon to verify... He also suggested that I review the attached document just to be sure I did not miss anything in my config, which I will also do this afternoon.
Hope this helps!
I just noticed that everyone is mentioning ESX as their version; is anyone running ESXi 4? That is what I am running and still seeing the issue, so I am assuming this patch will take care of both versions, ESX and ESXi.
Sent from my iPhone
On Dec 10, 2009, at 1:55 PM, tawatson <email@example.com> wrote:
Just for reference, how many VMkernel ports do you guys have configured for your iSCSI traffic?
I am currently using the 3:1 setup as described in the document above: two pNICs with 3 VMkernel ports assigned to each NIC for a total of six paths, and with RR enabled that gives me six active I/O paths.
What is everyone else doing? 1:1 or 3:1 or other?
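For anyone comparing notes, here is roughly how the binding is wired up on vSphere 4, to the best of my understanding. This is a sketch only: vmk1/vmk2 and vmhba33 are placeholders for your own VMkernel ports and software iSCSI adapter.

```shell
# Sketch: bind VMkernel ports to the software iSCSI initiator (vSphere 4).
# vmk1/vmk2 and vmhba33 are placeholders -- check your vmknics with
# esxcfg-vmknic -l and your sw-iSCSI adapter name in the vSphere Client.
esxcli swiscsi nic add -n vmk1 -d vmhba33
esxcli swiscsi nic add -n vmk2 -d vmhba33

# Verify the bindings (each bound vmknic gives you one path per target):
esxcli swiscsi nic list -d vmhba33
```

A 3:1 setup just repeats the `nic add` step for three vmknics per physical NIC.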
My setup is exactly that: 6 paths, 3:1 on two NICs. One server uses jumbo frames, one regular. I used one port on two separate dual NICs in case one fails.
From: s1xth <firstname.lastname@example.org>
Date: 12/10/2009 03:56 PM
John... same thinking here... 3:1, separate cards.
This is definitely not an EQL issue; it looks more to be VMware. I called Dell (I bought my VMware licenses through Dell when I bought my server) and they are pushing the case right to VMware with high priority. I can't believe this hasn't been fixed yet!!! iSCSI is the most used protocol and MPIO is being used in all newer deployments. Can't believe this wasn't fixed in U1 either. Really not happy about it.
On Dec 10, 2009, at 8:07 PM, johnz333 <email@example.com> wrote:
I'm running ESXi and see this bug. I too contacted EQL, and they referenced the same VMware bug. I opened a case with VMware and am on the waiting list to be notified when a patch is available.
Same here, we bought ours through Dell. I was surprised too, as iSCSI is the meat of why virtualization is so awesome. I have to say that MPIO is fairly new with vSphere 4, but that is no excuse.
From: s1xth <firstname.lastname@example.org>
Date: 12/10/2009 08:27 PM
Agreed. I would love to know VMware's thinking behind this. I am hoping this 'patch' will be included in the next round of updates, hopefully this month. I just can't believe, with how big iSCSI is and, like you said, shared storage being what makes virtualization work, that this isn't a top priority. Let alone the fact that VMware has been pushing their 'totally rewritten' iSCSI initiator.
Let's all cross our fingers we see something this month. I don't want to push my deployment back any further; there is no way I can sign off saying that everything is solid and working when I have connections dropping.
It's ridiculous this patch was not included with Patch 1 a few weeks ago.
I brought this issue up with VMware not long after vSphere went gold; VMware claimed it must be an issue with my storage, or the network switches, or the config. Quite ironic that it turns out to be a VMware issue after all.
As you say, let's hope this is fixed during December. Hopefully it won't be long before EqualLogic also releases the MPIO module for vSphere.
I wanted to run something else by you guys as well. I came across a document explaining that with round robin, ESX sends 1000 or so commands down each path before switching to the next. Some commands are short and therefore do not benefit from this. The document included a suggestion from a Dell EQL tech that the optimum setting for each path was 3, and you can tweak this in the ESX advanced settings on the initiator. I would be reluctant to do it while we are dropping paths, but it sounds logical.
The command is this: esxcli --server
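For reference, the full form of that command on vSphere 4 looks roughly like the sketch below. The host name and the naa device ID are placeholders; substitute your own, and note the IOPS value of 3 is just the Dell tech's suggestion mentioned above.

```shell
# Sketch: set the round robin IOPS limit for one volume (vSphere 4 esxcli).
# "esxhost" and the naa ID are placeholders -- use your own host and device.
esxcli --server esxhost nmp roundrobin setconfig \
    --device naa.6090a018005964bc9384643a2a0060cf \
    --iops 3 --type iops
```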
Date: 12/11/2009 09:08 AM
I recommend that anyone with questions about the IOPS setting mentioned in that multivendor blog post read this thread.
The gist of the thread is that 3 is probably too low for the IOPs setting, and that 300 might be better, but that it also depends on your storage system.
We are running ESX 4 / Dell PE R710 / Intel quad port / PS5000s / Cisco 3750 / jumbo frames with the latest firmware, and generally we don't have any iSCSI connection problems. We are using round robin, and my iSCSI port group to physical NIC ratio is 1:1, not 3:1, as I had vSphere up and running before that TR doc came out of EqualLogic.
I have set a test volume to 300 IOPS with no detrimental effects. I haven't done any Iometer testing and cannot tell you whether it increased or decreased performance (yet), but the setting has not caused connectivity issues to that volume.
I have from time to time had the "connection closed by peer" error message, and an EqualLogic tech informed me that was the system load balancing. It does not seem to actually cause a disconnection, and I've noticed it rarely happens now. I run Virtu-Al's daily PowerShell script report and most days it is clear of errors. If you are not using this script, I highly recommend it. [Virtu-Al daily report v2 PS script|http://www.virtu-al.net/2009/08/18/powercli-daily-report-v2/]
And for those who want further info about SAN storage for VMware, there is a great post on the subject by Chad Sakac: VMware I/O queues, “micro-bursting”, and multipathing.
Interesting RobVM. Thanks for the post.
I just made the change to my setup and set the IOPS to 300 like you have yours. Is there a way to view this setting to confirm the change was made successfully? Am I correct in saying that the default IOPS value is 1000?
I think the thing that is throwing us all off is that VMware is now saying that this IS indeed a bug in the SW initiator, that it is causing these drops, and that they shouldn't be happening.
I also want to add that I believe we have all been told the same thing: that this is the EQL balancing the connections. This is not true, as the connection drop IS being registered on the VMware side as a drop, and paths are being lost as a result.
Yes, the default is 1000 as far as I know.
To check your setting for a particular volume, use the getconfig command instead of setconfig:
esxcli --server esxhostname nmp roundrobin getconfig -d naa.xxxxx
To list all of your devices:
esxcli nmp device list
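Combining the two commands above, a small loop can apply the IOPS setting to every EqualLogic volume at once. This is a sketch under a couple of assumptions: that your EQL device IDs all begin with naa.6090a (as in the logs earlier in this thread), and that you want the 300 IOPS value being tried here rather than the default 1000.

```shell
#!/bin/sh
# Sketch: apply iops=300 to every EqualLogic volume on this host.
# Assumes EQL device IDs start with naa.6090a, as in the logs above;
# adjust the pattern for your arrays before running.
for dev in $(esxcli nmp device list | grep -o 'naa\.6090a[0-9a-f]*' | sort -u); do
    echo "Setting iops=300 on $dev"
    esxcli nmp roundrobin setconfig --device "$dev" --iops 300 --type iops
done
```

The `sort -u` is there because a device ID can appear more than once in the device list output.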
Thanks... those commands work. I will monitor to see if this helps at all, but I am assuming the connections will still drop.
I have much the same setup as RobVM, except with two PS4000s and two Dell PowerConnect 5424 switches with 4 gig ports set up in a LAG between them; 1:1 iSCSI port to physical NIC, using three physical NIC ports, jumbo frames, etc. I see the same drops everyone else is seeing, but EqualLogic support told me it was part of the load balancing that the PS4000s do when I contacted them a couple of months ago while first setting them up. I've not seen any performance issues or any other issues so far, but our VMs are not that demanding.