VMware Cloud Community
jackyareva
Contributor

vSphere 4.0 drops connections to EqualLogic PS6000 storage (Load balancing request was received on the array)

Recently, our three vSphere 4.0 hosts have been having connection issues with our EqualLogic storage.

We see the following error in the vSphere 4.0 host event log:

Lost connectivity to storage device naa.6090a058d00264f380acb401000000cd. Path vmhba33:C0:T0:L0 is down. Affected datastores: "sz_iscsi_iso".

error

11/8/2009 12:23:46 PM

I can also find the corresponding message in the EqualLogic storage event view:

Level:INFO

Time: 11/8/09 12:23:48PM

Member:PS02

Subsystem:MgmtExec

Event ID: 7.2.15

ISCSI session to target '192.168.100.108:3260, iqn.2001-05.com.equallogic:0-8a0906-f36402d05-cd00000001b4ac80-sz-iscsi-iso' from initiator '192.168.100.15:59492, iqn.1998-01.com.vmware:CNSHZ01VS02-657d4151' was closed

Load balancing request was received on the array

This is not limited to one ESX host; all three hosts hit it about 3-5 times every day. (We have not enabled MPIO on our ESX hosts.)

Can anyone give us more information or a solution for this issue?

Thanks in advance.

joemazur
Contributor

Not sure; I'm getting something similar here. In my case, I have 4 ESX hosts using VMware Round Robin to load balance paths to a PS6000 group. I have 4 physical NICs per ESX host dedicated to the iSCSI network / VMkernels (segregated from VM traffic), set up per the VMware / Dell docs (exactly). No performance issues, but a bunch of disconnects / reconnects in the SAN Group Manager event log (it also seems to happen after hours, maybe something timing out?). I also have jumbo frames configured and had to do a bunch of reconfiguring to get it to work. I'm using Dell PC 6248 switches between ESX and the PS6000, with all VMkernels / vmks and vSwitches set to 9000 MTU, and it kept disconnecting. It finally negotiated after I set the ports on the physical 6248 switches to 8000 MTU (although it still reports as standard frame length in Group Manager). No drops after that. But I'm wondering if I should forget about jumbo frames and go back to 1500 MTU on everything.

I am not getting the load-balancing request error you are seeing. Could this be related to the 'load balancing' feature on the EqualLogic? It is enabled by default (Group Configuration / Advanced). Perhaps check this in conjunction with your vSphere initiator settings. It sounds like you don't use MPIO?

jackyareva
Contributor

Dear Joemazur,

First, thanks a lot for your feedback.

Yes, I agree with you. It seems the load-balancing feature in EqualLogic has some conflict with the VMware hosts. I have already opened a ticket with EqualLogic and am waiting for their feedback and a solution.

Once the issue is solved, I will post the solution here.

shychen
Contributor

Good on you, man. I had exactly the same issue. I finally sorted it out by increasing the MTU to 9216 instead of 9000 on the Netgear switch. Now everything seems to be working fine for me.

jackyareva
Contributor

Dear Shychen,

Thanks very much for the good news. One question, though: did you change the MTU only on the switch, or on both the switch and the VMware server?

joemazur
Contributor

You will need consistent MTU settings from point A to point B. Starting from the VMkernels, follow the VMware doc vsp_40_iscsi_san_cfg.pdf, section 'Configuring iSCSI Initiators and Storage': the vmks, vmnics and vSwitch must all be set to the jumbo frame MTU value. Then the physical switch ports between your ESX host and your SAN ports must be set to the correct MTU. Dell also has a document with the same steps detailed; on Dell's EqualLogic site, find Configuring_VMware_vSphere_Software_iSCSI-with_Dell_EqualLogic_PS_Series_Storage.pdf. It has more detail on the CLI configuration required on the ESX host, which must be done from the command line; it cannot be configured through the GUI.
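For reference, the ESX-side CLI steps from those docs boil down to something like the sketch below (the vSwitch/portgroup names, IP address and MTU value are placeholders for your own setup; note that a VMkernel port has to be created with the jumbo MTU, so an existing vmk may need to be removed and re-added):

  # set the jumbo MTU on the iSCSI vSwitch
  esxcfg-vswitch -m 9000 vSwitch1
  # create the iSCSI port group and a VMkernel port with the jumbo MTU
  esxcfg-vswitch -A iSCSI vSwitch1
  esxcfg-vmknic -a -i 192.168.100.15 -n 255.255.255.0 -m 9000 iSCSI
  # verify the MTU on the vSwitch and the vmk ports
  esxcfg-vswitch -l
  esxcfg-vmknic -l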

I say 'jumbo frame MTU value' because I wasn't able to get it to work using MTU 9000 either. I tried every possible suggested value at the physical switch ports, starting at 9216. Nothing would negotiate with the PS6000 (which is set at 9000 by default). Finally, I was able to connect and stay connected by setting my physical switch ports to 8000 MTU (vSwitches and all other iSCSI initiator settings still set to 9000). In the PS6000 event log, this connection is identified as using 'standard-sized frames'. When it was set to 9000, it called it jumbo frames (but did not stay connected).

I hope this is useful information for all here.

joemazur
Contributor

Reading my own post back, I should say 'somewhat consistent' MTU settings, since I was able to get it to work with MTU 8000 on the physical switch ports but 9000 everywhere else. I mean 'consistent' in that all points of iSCSI connectivity must have some MTU setting close to 'jumbo' size, rather than the standard 1500. At least that is my experience. I haven't actually tried to set my vSwitch / vmk / vmnics to MTU 8000 to match the switch.

shychen
Contributor

I should share my story here. Initially, I set up the vSwitch and VMkernel port MTU as 9000, the Netgear switch MTU as 9000, and the LeftHand storage MTU as 9000. Then the connection kept going up and down. I ran vmkping -s 9000 to the LeftHand nodes and couldn't get a response. If I ran vmkping -s 8000 or even -s 8500, I got replies without a problem. Finally, I found that if you vmkping -s 9000, ESX actually sends 9008 bytes. However, the Netgear switch MTU was 9000, and I think this caused the problem. So once I changed the MTU on the Netgear switch to 9216, everything worked perfectly for me.
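A quick way to check whether the full jumbo payload really makes it end to end is to subtract the usual 20-byte IP header and 8-byte ICMP header from the MTU and ping with don't-fragment set (the array IP below is just the one from the original post):

  # 9000 - 20 (IP header) - 8 (ICMP header) = 8972 bytes of payload; -d = don't fragment
  vmkping -d -s 8972 192.168.100.108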

I am not sure what exactly happened in your situation. Hopefully my story helps.

P.S. Please double-check your SAN network topology as well. I had another case caused by a stupid network topology issue: I had made a routing loop between the LAN and the SAN. That can cause similar issues.

joemazur
Contributor

Yes, I seem to recall that I got MTU 9008 from vmkping also and then hurried to set this on my physical switch ports, convinced that I had the answer, only to find that even this didn't work.

Regarding topology, that's not an issue for me. I have iSCSI isolated from LAN traffic on dedicated switches and connect to the PS6000 via a dedicated management port on the LAN. This does reduce potential I/O by 25% (4 controller ports on the PS6000), but I haven't seen any issues.

jackyareva
Contributor

Dear joemazur,

Thanks very much for providing such useful information about this issue.

In fact, I can't currently change the MTU value on the switch, because we only finished our VMware and storage virtualization project in October. The project is supported by Dell (we purchased the hardware and software from Dell, including the implementation fee). After finding this issue, I reported it to Dell; at first they guessed it might be caused by the storage, so they escalated the case to EqualLogic support. Today EqualLogic gave us feedback that they didn't find anything unusual on the storage side.

So that suggests the issue probably comes from the switch, but I still need to wait for their solution. They have a standard configuration document, and they recommend not changing any configuration without their advice.

I have also passed your useful information on to Dell; I hope it helps them find the root cause.

Once I get the solution from Dell, I will post it here along with the root cause, to share with all of you.

jackyareva
Contributor

Today, following Dell support's proposal, we tried to change the MTU value on the switch, but we couldn't find a way to change it either in the web interface or from the CLI. It seems the PowerConnect 5424 doesn't provide an MTU setting for IPv4.

We can change the MTU value on the ESX hosts, but I don't know whether changing it only on the ESX side will fix the issue.
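For what it's worth, if I remember the 54xx CLI correctly, it doesn't expose a numeric MTU setting at all; jumbo frames are a single global toggle that only takes effect after a reload. Something along these lines, but please verify against the 5424 CLI guide before trying it:

  console# configure
  console(config)# port jumbo-frame
  console(config)# exit
  console# copy running-config startup-config
  console# reload    ! jumbo frames only take effect after the switch restarts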

candal02
Contributor

Hi All,

I just ran into this problem last week, when I first set up MPIO between my ESX hosts, my Dell 5448 switches, and my Dell EqualLogic. The hosts kept losing connection to the EqualLogic. After a while I noticed this primarily occurred whenever an ESX host was restarted. After much frustration, I narrowed the problem down to my 2 (redundant) Dell 5448 switches rebooting themselves for some reason. So I checked whether there were any firmware upgrades for these switches, and to my surprise, the latest firmware update apparently resolves the exact problem I'm having. Here is a quote from the website...

"This maintainence release addresses the following issue

- When iSCSI traffic is passing through the switch, deleting the active iSCSI target port causes the switch to reboot. This will happen when large number of iSCSI sessions are in place."

Of course whenever I reboot an ESX host, this causes the deletion of an active iSCSI target port, thereby causing the switch to reboot. I haven't yet upgraded the firmware on my switches (will be doing this Monday morning), but I will let you know if this resolves the situation.

For those of you running Dell 5448 switches with iSCSI MPIO, I suggest you consider upgrading to this firmware. For those of you running other switches, I know this doesn't help, but you might want to check whether there are any firmware updates for your switches that might resolve your problem.

ericsl
Enthusiast

Jackyareva,

You didn't indicate what type of switch you have. According to the previous poster, they found a fix specific to their Dell switch.

However, I would like to know whether this is a host-related issue or something specific to certain hardware. For instance, my environment has Foundry switches. Would I also have to set the MTU higher than 9000? Just on the ports the hosts are connected to?

In another post that is somewhat related (getting MPIO to work with EqualLogic), Andre notes that the switch MTU size needs to be set higher than the host's, but does not elaborate on why...

http://communities.vmware.com/message/1427619

Eric

jackyareva
Contributor

Hello,

Thanks a lot for sharing your experience.

I just checked the status of our 5424 switches; they have been running for several days, so my issue should not be caused by a switch reboot. Following your suggestion, I searched for firmware for the 5424 and found a new firmware release from October 2009 that adds some enhancements and fixes the switch-reboot bug.

Anyway, I will schedule a switch firmware upgrade in order to fix the reboot bug.

As for my issue, Dell support is still following up on the case. If there is any news, I will post it here to share with all of you.

jackyareva
Contributor

Dear ericsl,

Our switches are Dell PowerConnect 5424s; we have two of them. We did try to change the MTU value according to Andre's suggestion, but we couldn't find where to set it either in the web interface or the CLI (we couldn't find a parameter to change the MTU for IPv4, only for IPv6, and we use IPv4).

Regarding the firmware upgrade, I did find the updated firmware for the 5424, but it only fixes the switch-reboot issue. Our switches have been running for several days without rebooting.

Additional information: we have two Dell EqualLogic arrays in one group, and they are different models, one PS6000XV (15K SAS, 450 GB) and one PS6000X (10K SAS, 600 GB). I don't know whether having different arrays within one group could cause the issue.

candal02
Contributor

OK, so my problem is solved. I updated the firmware on both of our 5448 (iSCSI network) switches and rebuilt the config on each of them from scratch. The new config is really quite simple:

  • Enabled jumbo frames

  • Disabled iSCSI optimization (this is only necessary on switches that carry both iSCSI and LAN traffic at the same time... I dedicated these switches to iSCSI).

  • Disabled flow control (even though enabling flow control is recommended for iSCSI, I read some conflicting reports about whether it actually improves throughput, and since I'm not worried about throughput at the moment, I didn't take the chance. In the near future, I will enable flow control and test throughput and reliability on our network).

  • Enabled RSTP

  • Enabled a link aggregation group (4 ports in total) between the 2 switches.

After doing this, all is well. I think the key, though, was upgrading the firmware. (A rough CLI sketch of this config is below.)
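For anyone trying to reproduce this, a rough sketch of what that config looks like from the 54xx CLI follows. The commands are from memory and the port numbers are placeholders, so check them against the PowerConnect 54xx CLI guide and the EqualLogic configuration document before using them:

  console# configure
  console(config)# port jumbo-frame                    ! jumbo frames (takes effect after a reload)
  console(config)# no iscsi enable                     ! disable iSCSI optimization
  console(config)# spanning-tree mode rstp             ! rapid spanning tree
  console(config)# interface range ethernet g(45-48)   ! placeholder ports for the inter-switch LAG
  console(config-if)# channel-group 1 mode on
  console(config-if)# exit
  console(config)# exit
  console# copy running-config startup-config
  console# reload

Flow control is simply left at its disabled setting here, per the list above.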

s1xth
VMware Employee

I have the exact same setup as you, except with a PS4000 and two 24-port switches. No issues here and everything seems to be working very well. Glad to hear your issues are resolved.

http://www.virtualizationimpact.com http://www.handsonvirtualization.com Twitter: @jfranconi
dwilliam62
Enthusiast

I'd like to clear up a couple of issues here:

1.) Connection Load Balance. That's a normal informational message; it's not an error. The Dell/EQL array uses an iSCSI feature called async logout to ask an initiator to log out a specific iSCSI session, then go immediately to the discovery address and log back in. This feature allows the array to balance the session to another available Ethernet port. After that logout message you should have seen a login connection message within a few seconds at most, typically less than a second. So this operation doesn't interrupt ongoing I/O.

2.) Jumbo Frames. The Dell PowerConnect 54xx/62xx series switches need to have their MTU size set to 9216. When the switch is set to 9000 it doesn't account for a small amount of framing overhead, typically 14 bytes. On the array you see "9000" for an MTU size, but it knows to allow the extra bits. Other switches have this same issue, so it's a good general practice to set the MTU to the maximum allowable value on any switch.
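For what it's worth, the usual arithmetic behind those numbers (assuming an untagged frame) is:

  9000 bytes of IP payload (the MTU)
  +  14 bytes Ethernet header
  +   4 bytes FCS
  = 9018 bytes on the wire (9022 with an 802.1Q VLAN tag)

which is why switch vendors usually just let you allow the maximum frame size, 9216.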

3.) 54xx iSCSI optimization. The optimization uses a small table to keep track of all the iSCSI connections. The Connection Load Balance feature causes the table to use an entry every time a connection gets balanced, eventually overflowing the table. The latest firmware increases the table size, but it's still theoretically possible to hit the issue. Turning off iSCSI optimization prevents that possibility.

4.) Flowcontrol. Flow control is an important feature when dealing with iSCSI SANs. It comes into play when the host asks the switch to stop sending data for a short period, to allow the host to catch up. If you disable flow control on the switch, when the server asks the switch to stop, the switch will ignore it and keep sending data to the server. If that happens, the server will drop those packets and they'll have to be resent. If it keeps happening, the symptoms are dropped connections ("6 second timeouts" in the Dell GUI) and reduced performance.
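On the 54xx family, enabling flow control on the iSCSI ports should look roughly like this (from memory; the interface-range syntax can differ between firmware versions, so treat it as a sketch and the port range as a placeholder):

  console# configure
  console(config)# interface range ethernet g(1-24)    ! the ports facing the hosts and the array
  console(config-if)# flowcontrol on
  console(config-if)# exit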

Regards,

-don

s1xth
VMware Employee

1.) Connection Load Balance. That's a normal informational message; it's not an error. The Dell/EQL array uses an iSCSI feature called async logout to ask an initiator to log out a specific iSCSI session, then go immediately to the discovery address and log back in. This feature allows the array to balance the session to another available Ethernet port. After that logout message you should have seen a login connection message within a few seconds at most, typically less than a second. So this operation doesn't interrupt ongoing I/O.

We are aware of this feature. This is NOT the EQL load balancing. These are being logged as errors on the EQL side, and it has been confirmed by MANY EQL techs, level 3 and higher, that this is NOT load balancing. Although it does not affect I/O, it should not happen. People using the MS initiator with MPIO have no issues; the same array with VMware MPIO drops connections. I have found that backing down to a 1:1 setup instead of a 3:1 setup stops these messages and drops.

2.) Jumbo Frames. The Dell PowerConnect 54xx/62xx series switches need to have their MTU size set to 9216. When the switch is set to 9000 it doesn't account for a small amount of framing overhead, typically 14 bytes. On the array you see "9000" for an MTU size, but it knows to allow the extra bits. Other switches have this same issue, so it's a good general practice to set the MTU to the maximum allowable value on any switch.

The 5424 switch cannot have its MTU size changed as far as I am aware; there is no command on the 5-series switches to change the MTU size that I know of. On a 6-series switch you can change the MTU, as shown in the EQL configuration guide for the 6-series switches, but not on the 5-series. If you know of a way to configure the MTU size on a 5-series, please let me know; I would like to test it.

3.) 54xx iSCSI optimization. The optimization uses a small table to keep track of all the iSCSI connections. The Connection Load Balance feature causes the table to use an entry every time a connection gets balanced, eventually overflowing the table. The latest firmware increases the table size, but it's still theoretically possible to hit the issue. Turning off iSCSI optimization prevents that possibility.

iSCSI optimization is disabled per the EQL configuration document.

4.) Flowcontrol. Flow control is an important feature when dealing with iSCSI SANs. It comes into play when the host asks the switch to stop sending data for a short period, to allow the host to catch up. If you disable flow control on the switch, when the server asks the switch to stop, the switch will ignore it and keep sending data to the server. If that happens, the server will drop those packets and they'll have to be resent. If it keeps happening, the symptoms are dropped connections ("6 second timeouts" in the Dell GUI) and reduced performance.

Flowcontrol is enabled on my switches (5424's).


http://www.virtualizationimpact.com http://www.handsonvirtualization.com Twitter: @jfranconi
dwilliam62
Enthusiast

I didn't want anyone reading this thread to take the message below as an error or an indication of a problem.

From the original posting I replied to:

=========================================================================================

I can also find the corresponding message in the EqualLogic storage event view:

Level:INFO

Time: 11/8/09 12:23:48PM

Member:PS02

Subsystem:MgmtExec

Event ID: 7.2.15

ISCSI session to target '192.168.100.108:3260, iqn.2001-05.com.equallogic:0-8a0906-f36402d05-cd00000001b4ac80-sz-iscsi-iso' from initiator '192.168.100.15:59492, iqn.1998-01.com.vmware:CNSHZ01VS02-657d4151' was closed

Load balancing request was received on the array

=================================================================================================

That is a CLB event, not an error.

There have been connection drops with the ESX SW iSCSI initiator in 4.0 when more than one GbE interface is configured on a single vSwitch hosting the VMkernel ports, with or without RR being enabled. VMware reference number: PR484220. In /var/log/vmkiscsid.log you see "no-op" failures and the connection gets terminated from the VMware side. We are waiting on a patch from VMware, but their support said that using one vSwitch with one VMkernel port and one GbE NIC assigned to it will also prevent the connection drops. You just need to repeat that for as many GbE interfaces as you want to use for iSCSI; a rough sketch of that layout is below.
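In case it helps anyone set that up, here's a sketch of the one-NIC-per-vSwitch layout on ESX 4.0 (the vSwitch, portgroup, vmnic, vmk and IP names are placeholders; the vmhba number is the one from the original post):

  # one vSwitch per physical NIC, with a single VMkernel port on each
  esxcfg-vswitch -a vSwitch2
  esxcfg-vswitch -L vmnic2 vSwitch2
  esxcfg-vswitch -A iSCSI1 vSwitch2
  esxcfg-vmknic -a -i 192.168.100.21 -n 255.255.255.0 iSCSI1
  # bind the new VMkernel port to the software iSCSI HBA
  esxcli swiscsi nic add -n vmk1 -d vmhba33
  # repeat with vmnic3 / vSwitch3 / iSCSI2 / vmk2 for each additional GbE interface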

Those I've seen running a 2:1 ratio of VMkernel ports to physical GbE ports on one vSwitch will still see the error, but it doesn't cause any disruption, as the other connections handle the traffic. A little while later the connection is restored when that VMkernel port logs back in to the array.

FYI: I haven't converted my ESX servers to that configuration yet, so I can't say how well it works.

Re: 54xx & Jumbo Frames. You are correct, sorry for the mixup.

-don
