VMware Cloud Community
sergio_514
Contributor

iSCSI random disconnections... where to look first?

Hi all,

I'm experiencing a very unpleasant issue...

I have three IBM x3650 M3 hosts in a cluster, each with two NICs (BMC 7905) dedicated to iSCSI. On the other side are an IBM DS3524 (our old SAN) and a NetApp 2240 (our new SAN).

I get constant warning messages about iSCSI performance and, from time to time, a very bad disconnection (the last one was this Saturday) that almost brings down the affected host and its VMs. The disconnection doesn't always occur on the same host.

We were already experiencing this issue before we upgraded our switches to Nexus 5k, so I don't think the network is the problem.


All 3 hosts are running VMware vSphere 5.0 with the latest patches. Using the NetApp vCenter add-on, I set up all the parameters according to the recommended values.

VMware support says "It's the network"

Cisco says "The network is fine. Stable. No issues"

NetApp says "I can see that the connection dropped from the other side...not here"

The IBM SAN (where we still have some LUNs) says "Connection dropped unexpectedly". We don't have support on the IBM SAN.

The iSCSI vmkernel ports are configured according to "best practices" (individual IPs, one active and one unused NIC, no failover).

My next step is to upgrade the firmware on the boxes.

Any clue where I should start looking? Is there any "special" extra setting to configure?

Thanks!!

/var/log # vmware -vl
VMware ESXi 5.0.0 build-3086167
VMware ESXi 5.0.0 Update 3

# ethtool -i vmnic5
driver: igb
version: 2.1.11.1
firmware-version: 3.18-0
bus-info: 0000:15:00.1

~ # esxcli network nic get -n vmnic1
   Advertised Auto Negotiation: true
   Advertised Link Modes: 10baseT/Half, 10baseT/Full, 100baseT/Half, 100baseT/Full, 1000baseT/Full
   Auto Negotiation: true
   Cable Type: Twisted Pair
   Current Message Level: -1
   Driver Info:
         Bus Info: 0000:0b:00.1
         Driver: bnx2
         Firmware Version: bc 6.2.0 NCSI 2.0.11
         Version: 2.0.15g.v50.11-5vmw
   Link Detected: true
   Link Status: Up
   Name: vmnic1
   PHYAddress: 1
   Pause Autonegotiate: true
   Pause RX: false
   Pause TX: false
   Supported Ports: TP
   Supports Auto Negotiation: true
   Supports Pause: true
   Supports Wakeon: true
   Transceiver: internal
   Wakeon: MagicPacket(tm)
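
For anyone who wants to compare NIC settings across several hosts, `esxcli network nic get` output like the above can be turned into key/value pairs and diffed. This is only a minimal Python sketch: it assumes the command output has already been captured as text, and the function name `parse_nic_info` and the trimmed sample string are illustrative, not part of any VMware tool.

```python
def parse_nic_info(text):
    """Parse 'Key: value' lines from `esxcli network nic get` output.

    Entries nested under a header such as 'Driver Info:' are stored with a
    'Driver Info/' prefix so they cannot clash with top-level keys.
    """
    info = {}
    section = None  # (header name, header indent) while inside a nested block
    for raw in text.splitlines():
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip())
        key, _, value = raw.strip().partition(":")
        value = value.strip()
        if not value:
            # A line with no value opens a nested section.
            section = (key, indent)
            continue
        if section and indent > section[1]:
            key = section[0] + "/" + key
        else:
            section = None
        info[key] = value
    return info


# Trimmed copy of the vmnic1 output posted above.
SAMPLE = """
   Auto Negotiation: true
   Driver Info:
         Bus Info: 0000:0b:00.1
         Driver: bnx2
         Firmware Version: bc 6.2.0 NCSI 2.0.11
         Version: 2.0.15g.v50.11-5vmw
   Link Status: Up
   Name: vmnic1
   Pause RX: false
   Pause TX: false
"""

nic = parse_nic_info(SAMPLE)
print(nic["Name"], nic["Driver Info/Driver"], nic["Driver Info/Firmware Version"])
print("flow control RX/TX:", nic["Pause RX"], nic["Pause TX"])
```

Running the same parser over every iSCSI vmnic on every host makes it easier to spot a NIC whose driver, firmware or flow-control settings differ from the rest.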

11 Replies
Nick_Andreev
Expert

I would look closer at your switch configuration. Can you confirm that there are no port channels or anything along these lines configured on the Nexus switches?

---
If you found my answers helpful please consider marking them as helpful or correct.
VCIX-DCV, VCIX-NV, VCAP-CMA | vExpert '16, '17, '18
Blog: http://niktips.wordpress.com | Twitter: @nick_andreev_au
sergio_514
Contributor

Hi Nick,

Thank you for being interested!

All iSCSI ports are access ports with autonegotiation on. There is no routing involved: all the ESXi hosts and the SANs reside on the same IP network.

There are 2 port channels on these switches, one peer-link between the switches (trunking everything) and the other to our core L3 switch (trunking everything too).

I have set up iSCSI according to a best-practices doc:

- One vmk port per interface, each with its own IP. On the NIC Teaming tab, "Override switch failover order" is checked, with one NIC selected as "Active Adapter" and the other set as "Unused Adapter".

- On the storage adapter, I'm using the iSCSI Software Adapter, and on its Network Configuration tab I have both vmk ports selected. Both are green and port group policy compliant.

- Path policies on the targets: RR on the NetApp and MRU on the IBM (according to the IBM doc, RR is not supported).
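
The teaming rule in the list above (each iSCSI vmk port group has exactly one active uplink and the other NIC is unused) can be written down as a small check. This is a Python sketch only: the port group names, the vmnic assignments and the dict layout are made up for illustration, and nothing here queries a real host. Flagging a shared active uplink is an assumption based on the "one vmk port per interface" layout described above.

```python
def binding_problems(portgroups):
    """Return a list of readable problems with an iSCSI port-binding layout.

    `portgroups` maps a port group name to a dict holding 'active',
    'standby' and 'unused' lists of uplink names.
    """
    problems = []
    seen_active = {}  # uplink name -> first port group that uses it as active
    for name, teaming in portgroups.items():
        active = teaming.get("active", [])
        standby = teaming.get("standby", [])
        if len(active) != 1:
            problems.append(
                f"{name}: expected exactly 1 active uplink, found {len(active)}"
            )
        if standby:
            problems.append(f"{name}: standby uplinks present: {standby}")
        for nic in active:
            if nic in seen_active:
                problems.append(
                    f"{name}: {nic} is already active for {seen_active[nic]}"
                )
            else:
                seen_active[nic] = name
    return problems


# Hypothetical layouts: one that follows the rule and one that breaks it twice.
good = {
    "iSCSI-A": {"active": ["vmnic1"], "standby": [], "unused": ["vmnic5"]},
    "iSCSI-B": {"active": ["vmnic5"], "standby": [], "unused": ["vmnic1"]},
}
bad = {
    "iSCSI-A": {"active": ["vmnic1"], "standby": ["vmnic5"], "unused": []},
    "iSCSI-B": {"active": ["vmnic1"], "standby": [], "unused": ["vmnic5"]},
}

print(binding_problems(good))
for problem in binding_problems(bad):
    print(problem)
```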


All paths are showing Active, Active (I/O) or Standby.

I read about a bug where iSCSI traffic tried to go out through an unused NIC and was discarded, but that bug should have been fixed. As a workaround, it was suggested to explicitly set the NIC Teaming Failback option to No on each vmk. Still, the behaviour looks just like that bug: at some point, traffic tries to go out but never makes it.

We don't have any L2 issues, but when this problem happens, we "fix" it by disconnecting one of the iSCSI NICs on each ESXi host. That triggers something that redirects all the traffic through the other NIC (the link that is still up), and everything reconnects.

ThompsG
Virtuoso

Hi,

I'm assuming here that the two NICs you are using are dedicated to iSCSI traffic and that they are in the same vSwitch. If so, then try this: separate the NICs reserved for iSCSI into their own vSwitches so you don't need to have a NIC set as unused.

This would give you two vSwitches, each with one active NIC and a VMkernel port for iSCSI use. From here, continue to configure iSCSI as before.

This may prevent the situation you are facing.

Kind regards.

sergio_514
Contributor

I can do that, but it will increase the complexity on the targets and routes to the NAS access.

According to VMware, independent vSwitches for iSCSI vmk ports should be configured with different IP networks. So I would have to split everything between two IP subnets (still using the same L2).

VMware KB: Considerations for using software iSCSI port binding in ESX/ESXi


In my case, my target has several IPs, and I'm using port binding....
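
One way to sanity-check the layout discussed here is to test whether every vmk address and every target portal address falls inside a single IP subnet, which is the situation the port binding KB is concerned with. A rough Python sketch follows; all addresses are invented examples, not the ones from this environment, and `shared_subnet` is a hypothetical helper name.

```python
import ipaddress


def shared_subnet(vmk_interfaces, target_ips):
    """Return the common subnet if all vmk and target IPs share one, else None.

    `vmk_interfaces` are strings in 'address/prefix' form;
    `target_ips` are plain IP address strings.
    """
    networks = {ipaddress.ip_interface(v).network for v in vmk_interfaces}
    if len(networks) != 1:
        return None  # the vmk ports themselves are on different subnets
    net = next(iter(networks))
    if all(ipaddress.ip_address(t) in net for t in target_ips):
        return net
    return None


# Invented example addresses.
vmks = ["192.168.50.11/24", "192.168.50.12/24"]
split_vmks = ["192.168.50.11/24", "192.168.51.11/24"]
targets = ["192.168.50.101", "192.168.50.102"]

print(shared_subnet(vmks, targets))        # single subnet: port binding layout
print(shared_subnet(split_vmks, targets))  # split subnets: the two-vSwitch layout
```

If the function returns a network, the setup matches the single-subnet, port-bound design; if it returns `None`, the setup looks like the split-subnet design that separate vSwitches would require.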


A couple of weeks ago I set up a Splunk server to collect logs, and after checking them, it appears that only the connection to the IBM SAN is dropping. I checked our FW version and we are a couple of subversions behind the latest one. Need to learn how to upgrade the FW now.



sergio_514
Contributor

Is there any special config for the vSwitch I could be missing??

ThompsG
Virtuoso

Hi there,

Sorry about that. I was pretty sure it used to be VMware's recommendation to separate the iSCSI NICs over vSwitches, but this now says otherwise: vSphere 5.5 Documentation Center

There, as you mentioned, VMware recommends that port binding only be used if the network adapters reside in the same virtual switch. Apologies for the false information; I certainly don't want to advocate additional complexity, especially around your storage connectivity.

Kind regards.

Nick_Andreev
Expert

I checked our FW version and we are a couple of subversions behind the latest one. Need to learn how to upgrade the FW now.

Check on the VMware compatibility list which array firmware revisions are supported for your storage array.

sergio_514
Contributor

Please, no need to apologize!

In one of IBM's redbooks, an example is provided using two NICs on the VMware host, each nic using a different VLAN to the storage.

sergio_514
Contributor

It looks like the issue was triggered by some path "misunderstanding" between the IBM DS3524 SAN and the host. Apparently, when VMware has trouble connecting to one SAN over iSCSI, it causes a big performance hit that also impacts the other SANs: the host becomes unresponsive, VMs lose their storage connections, etc.

So last week I upgraded the IBM SAN firmware from 7.38 to the latest (8.20) and, to enforce a path selection, I selected ALUA as the "host type". Using an esxcli storage command, I verified on the host side that ALUA was also showing as the policy for the LUNs.
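
The post doesn't say which esxcli command was used. One command that reports the Storage Array Type (SATP) and Path Selection Policy per device is `esxcli storage nmp device list`. Below is a small Python sketch that summarises that kind of output. The device IDs and the sample text are invented for illustration and trimmed to the fields used, and the helper names are hypothetical.

```python
def satp_by_device(text):
    """Map each device ID to its SATP and PSP from nmp device list output."""
    wanted = ("Storage Array Type", "Path Selection Policy")
    devices = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented lines carry the device identifier.
            current = line.strip()
            devices[current] = {}
            continue
        key, _, value = line.strip().partition(":")
        if current is not None and key in wanted:
            devices[current][key] = value.strip()
    return devices


def non_alua(devices):
    """Return the sorted device IDs whose SATP is not VMW_SATP_ALUA."""
    return sorted(
        dev for dev, fields in devices.items()
        if fields.get("Storage Array Type") != "VMW_SATP_ALUA"
    )


# Invented sample, not taken from the hosts in this thread.
SAMPLE = """\
naa.0000000000000001
   Device Display Name: Example iSCSI Disk (naa.0000000000000001)
   Storage Array Type: VMW_SATP_ALUA
   Path Selection Policy: VMW_PSP_MRU
naa.0000000000000002
   Device Display Name: Example iSCSI Disk (naa.0000000000000002)
   Storage Array Type: VMW_SATP_ALUA
   Path Selection Policy: VMW_PSP_RR
naa.0000000000000003
   Device Display Name: Example iSCSI Disk (naa.0000000000000003)
   Storage Array Type: VMW_SATP_LSI
   Path Selection Policy: VMW_PSP_MRU
"""

devices = satp_by_device(SAMPLE)
print("devices not claimed by ALUA:", non_alua(devices))
```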

Since then I've performed several datastore and host migration operations with no issues (so far).

hussainbte
Expert

It seems that these kinds of issues are generally reported when more than one iSCSI array, with a mix of active/standby and active/active configurations, connects through a single software iSCSI adapter.

I don't think it is mentioned anywhere that such a configuration is not supported by VMware, but understanding this issue properly requires a complete picture of the iSCSI sessions and of vmkernel behaviour when connecting to different targets. If you still have a support case open with VMware, I suggest you ask for more info on this.

If you found my answers useful please consider marking them as Correct OR Helpful Regards, Hussain https://virtualcubes.wordpress.com/
Nick_Andreev
Expert

Glad the issue is resolved now.
