We've updated to ESX 4 and have implemented round-robin MPIO to our EQL boxes (we didn't use round robin under 3.5); however, I'm seeing 3-4 entries per day in the EQL log that indicate a dropped connection. See the logs below for the EQL and vCenter views of the event.
EQL Log Entry
INFO 10/06/09 23:50:32 EQL-Array-1
iSCSI session to target '192.168.2.240:3260, iqn.2001-05.com.equallogic:0-8a0906-bc6459001-cf60002a3a648493-vm-exchange' from initiator '192.168.2.111:58281, iqn.1998-01.com.vmware:esxborga-2b57cd4e' was closed.
iSCSI initiator connection failure.
Connection was closed by peer.
vCenter Log Entry
Lost path redundancy to storage device naa.6090a018005964bc9384643a2a0060cf.
Path vmhba34:C1:T3:L0 is down. Affected datastores: "VM_Exchange".
6/10/2009 11:54:47 PM
I'm aware that the EQL box will shuffle connections from time to time, but those show up in the logs as follows (although vCenter will still display a Lost path redundancy event):
INFO 10/06/09 23:54:47 EQL-Array-1
iSCSI session to target '192.168.2.245:3260, iqn.2001-05.com.equallogic:0-8a0906-bc6459001-cf60002a3a648493-vm-exchange' from initiator '192.168.2.126:59880, iqn.1998-01.com.vmware:esxborgb-6d1c1540' was closed.
Load balancing request was received on the array.
Should we be concerned, or is it normal operation for the ESX iSCSI initiator to drop and re-establish connections?
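In case it helps anyone else watching their logs, here's a rough way to separate the benign load-balancing closes from real failures. This is just a sketch: it assumes the log lines look like the entries quoted above, with the reason on the line after the close message, and the reason strings are taken verbatim from those entries.

```python
# Sketch: classify EQL session-close events by the reason line that
# follows them. Reason strings are copied from the log entries above.

BENIGN = "Load balancing request was received on the array."
FAILURE = "iSCSI initiator connection failure."

def classify_closes(log_lines):
    """Count session closes per reason; anything unrecognized is 'other'."""
    counts = {"load_balance": 0, "failure": 0, "other": 0}
    for i, line in enumerate(log_lines):
        if "was closed" not in line:
            continue
        # Assumption: the reason appears on the line after the close message.
        reason = log_lines[i + 1] if i + 1 < len(log_lines) else ""
        if BENIGN in reason:
            counts["load_balance"] += 1
        elif FAILURE in reason:
            counts["failure"] += 1
        else:
            counts["other"] += 1
    return counts

sample = [
    "iSCSI session to target '192.168.2.240:3260, ...' from initiator '...' was closed.",
    "iSCSI initiator connection failure.",
    "iSCSI session to target '192.168.2.245:3260, ...' from initiator '...' was closed.",
    "Load balancing request was received on the array.",
]
print(classify_closes(sample))  # {'load_balance': 1, 'failure': 1, 'other': 0}
```

If the "failure" count climbs while "load_balance" stays steady, that's the pattern we've been seeing.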
I agree; even in my testing without Jumbo Frames the drops still occur, so this is not part of the problem. Jumbo Frames are fully supported across the storage stack, from the host (running ESXi 4) to the switches to the EQL hardware. There is NO problem running Jumbo Frames with the MS initiator and using MPIO with the HIT Kit.
This IS a VMware problem. Period.
I haven't enabled Jumbo Frames and normally see at least one host dropping at least one path early every morning. I've got 5 active EqualLogic LUNs and 3 ESX 4 servers with the standard 6 vmks per host (for a total of 90 paths) and haven't noticed anything other than the usual path drops.
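For anyone wondering where that path total comes from, it's just straight multiplication across my setup (numbers above):

```python
# Path count = LUNs x hosts x vmkernel ports per host (from my setup above).
luns, hosts, vmks_per_host = 5, 3, 6
total_paths = luns * hosts * vmks_per_host
print(total_paths)  # 90
```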
Correct me if I'm wrong on this as this is my impression and understanding from what I've read about Jumbo frames and iSCSI:
I believe that Jumbo Frames are primarily relevant for squeezing out the last few percent of throughput (as stated before by DwayneL; about 2% per that calculation) and slightly relevant for reducing the CPU overhead of the TCP and iSCSI processing (including checksums/digests). Savings on the iSCSI data digests should be zero (or near zero), as the same amount of data must still be processed.
There is a possibility that they also help by reducing the potential delay incurred by these calculations and (probably least of all) by reducing overhead and delays on the network switch. The changes in these last two, I believe, would be so small as to be indistinguishable from statistical noise, especially considering that the delay of an HDD head seek should be several orders of magnitude larger.
Overall, the difference in everything except possibly throughput should be negligible, except in memory-to-memory (or memory-like, i.e. SSD) or large sequential transfers, where delays like a head seek can be negated.
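Here's the back-of-the-envelope behind that ~2% figure. This is a deliberate simplification: it only counts the TCP/IPv4 headers per frame and ignores Ethernet framing overhead, iSCSI PDU headers, digests, and everything else.

```python
# Payload efficiency = (MTU - TCP/IP headers) / MTU.
# Simplification: ignores Ethernet framing, iSCSI PDU headers, digests.
HEADERS = 40  # 20 bytes IPv4 + 20 bytes TCP, no options

def efficiency(mtu, headers=HEADERS):
    return (mtu - headers) / mtu

std = efficiency(1500)    # standard frames
jumbo = efficiency(9000)  # jumbo frames
gain = jumbo / std - 1    # relative throughput gain

print(f"standard: {std:.3f}, jumbo: {jumbo:.3f}, gain: {gain:.1%}")
# standard: 0.973, jumbo: 0.996, gain: 2.3%
```

So on raw payload efficiency alone, jumbo frames buy you a hair over 2%, which lines up with DwayneL's number.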
You are correct about Jumbo Frames. A good example of where I've seen them help is on larger-scale Exchange servers, say 5,000 active users. When doing simulations at that level, even a high-end server was running out of CPU cycles. Enabling Jumbo Frames freed up enough CPU to pass Jetstress as a first step, then LoadGen after that, to better verify that the configuration (server/network/SAN) would support the expected load.
Also in the mix is the OS itself. In my direct experience (not lab tested), I've seen more improvement with Jumbo Frames enabled on Linux than on Windows, for example.
What I suggest is that, unless the configuration is known to work well with Jumbo Frames, you leave them off initially. Generate a baseline, then enable them and observe the results. Jumbo Frames and standard frames can co-exist on the same SAN without a problem, since it's handled on a per-session basis.
Has anyone heard of, or got, any updates on this recently?
It seems from this thread that people are running this in production, and although they see drops occurring, no one has reported any "all paths down" situations. Would that be a fair assessment of the situation now?
Some update from VMware would be nice for sure.
I'll be attending the Dell/EqualLogic User Conference in Dallas, TX in 2 weeks. I'll be making it a point to speak with the EQL engineers about this issue, since VMware is now saying it's an 'EqualLogic' problem. I'll let you all know what I hear!
Update! Thought I would share this information with you guys: from what I am being told, the fix for this problem will be released in Patch 5, NOT Update 5, which has been passed on incorrectly by Dell/VMware. We should hopefully see this patch included in the next patch release cycle or the one following.
I mentioned the EQL MPIO module in one session already and got a response "we are working on it". I am going to be in an upcoming MPIO dedicated session soon and I will mention more!!
Between e-mail messages on Wednesday and Saturday of last week, I did get an update from VMware Support regarding my case.
1) "unfortunately the Equallogic documentation has an error. There is a limitation in ESX4.0 that causes disconnects under certain iSCSI configurations, including the one they chose. As per that limitation, and our guide (http://www.vmware.com/pdf/vsphere4/r40_u1/vsp_40_u1_iscsi_san_cfg.pdf page 34), you may only use 1-1 mapping of vmkernel ports to physical nics." "This limitation will not be present in the next version of ESX (4.1)."
Oops. That really causes some issues for Dell/EqualLogic customers when Dell/EqualLogic recommends something that VMware specifically says not to do. For reference, vsp_40_iscsi_san_cfg.pdf (pp. 30, 32) appears to be the same documentation, but unfortunately it seems to merely allude to the fact that you're only supposed to do 1:1 (vmk:pNIC) rather than stating it explicitly.
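For anyone trying to check whether their setup matches the 1:1 mapping the guide describes, the binding procedure looks roughly like this (a sketch of the documented ESX 4 software iSCSI port-binding commands; the adapter and port names here, vmhba33, vmk1/vmk2, vmnic1/vmnic2, are examples, so substitute your own):

```shell
# One vmkernel port per physical NIC: each vmk's port group is
# overridden to a single active uplink in its NIC teaming policy,
# then bound to the software iSCSI adapter.
esxcli swiscsi nic add -n vmk1 -d vmhba33
esxcli swiscsi nic add -n vmk2 -d vmhba33

# Verify which vmks are bound:
esxcli swiscsi nic list -d vmhba33
```

If you have more vmks bound than physical NICs (the Dell/EQL 4-8 vmk layout), that's the configuration VMware support is saying hits the limitation.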
2) "In addition, there is currently a bug open due to a second issue with connectivity with the 100E series from Equallogic. To use multipathing with vmkernel ports on the same subnet, you will need to be on Patch P03 (the latest out), or go to a single vmkernel port and disable multipathing."
I'm uncertain if this applies to the newer PS models or not but I'm waiting to hear back.
Thanks for the news on Patch 5.
I received some pretty 'solid' information regarding the patch release date. At this time, it is tentatively scheduled for the end of March. I don't want to give the exact date specified as I am under NDA. This MAY change, but at this time the patch is ready to go and just didn't make the latest round of patches released a few days ago.
Finally... after months of anticipation and countless posts discussing workarounds, the most recent Patch 5 from VMware for vSphere 4 has been released!! The very first fix mentioned is related to our exact problem: multiple vmks used for the sw iSCSI adapter connecting to the same subnet. There is no mention of EqualLogic or any vendor names that this affects (as I expected). This patch release just came out late this afternoon, April 1st. I haven't applied this patch yet to my environment, but I will try it on a host in an HA cluster over the holiday weekend.
Has anyone else tried the patch yet? I will post any results if I see any difference!! Let's hope the drops are gone for good.
Just installed the update on 3 hosts in my HA cluster... now to watch my email for the drops that will never come again. I will post back Monday with the results.
From: s1xth <firstname.lastname@example.org>
Date: 04/01/2010 07:13 PM
S1xth, I'm pretty sure I speak for most out here: thank you for the notification of the update. I thought I had subscribed in the last 1-2 weeks, but either I didn't or VMware hasn't sent a notification.
I thought I heard from someone that the issue was within the software iSCSI initiator, and that it was possible for customers with just one vmkernel port assigned to iSCSI to experience what we've been seeing for months. It supposedly got worse as you went from 1 vmk, to 2 vmks on independent switches, to 2 vmks on one switch, to the 4-8 vmks on one switch that the Dell/EQL document recommended in the first place; most of those seeing the issue had multiple vmks assigned. I also heard that the procedure Dell/EQL recommended was reached after working with VMware, and that having 4-8 vmks on a single switch would become the recommended practice once this patch was released.
Can anyone out there speak on this rumor?
Also, considering how widespread this issue is and the severity of its impact, I'm surprised it's not an Update 2. Maybe that is coming soon? Thoughts, comments?
grcumm - Glad I could be of some help! That patch alert email never worked for me; I was alerted to the release by a Twitter feed.
The issue itself should be corrected with this patch. EQL supports both methods of configuring the vmks on vSwitches: either a single vSwitch or multiple vSwitches. The main benefit of having a single vSwitch is slightly less memory overhead, along with a better-looking networking GUI (less clutter).
I just updated one of my hosts in a cluster this evening. I will be monitoring to see if this issue is resolved, and hopefully we can close this forum post out for good!
So far so good. Went all weekend through backups and a whole day of normal production activity without a single connection dropped! Looks good so far.
From: s1xth <email@example.com>
Date: 04/04/2010 08:03 PM
After about a week of normal use, absolutely no issues noticed.
It looks like it's fixed and the Dell/EqualLogic document will probably be the recommended direction, assuming this patch is installed.
I agree. I have had the patch installed on production servers for over 1 week now without a single issue. This appears to be a solid fix.
From: grcumm <firstname.lastname@example.org>
Date: 04/15/2010 09:40 AM
I've been running the patch on a single ESX host for the past week, and it looks like it's done the trick. No disconnects at all. I will be deploying to our remaining ESX hosts at the weekend.
Thanks everyone (especially s1xth) for all the updates and commentary on this, it's been most helpful (and reassuring) to know that there are others in the same boat.
I am just glad that everyone is experiencing the same results and that the problem has been resolved. It took a while, but glad that it is fixed!