spookyneo1
Contributor

Heavy impact on iSCSI performances when changing iSCSI switches

Hi,

We're upgrading our iSCSI switches to 10Gbps and I can't get the new ones to work properly. I'm looking for tips and advice.

This is the setup we currently have, which is working:

  • DELL EMC ME4024 (10Gbps) SAN
  • 2x Cisco 2960X (1Gbps interfaces)
  • 2x DELL R730 using Broadcom 5719 and 5720 NICs (both 1Gbps interfaces)
    • Hosts are running ESXi 7.0 U2c

We are replacing the Ciscos with DELL EMC S4128T-ON 10Gbps network switches. Both new switches have the latest firmware. I've configured the new switches, and when I migrate my hosts and SAN to them, iSCSI performance (access to datastores) takes a big hit. Switching the iSCSI connections back to the Ciscos fixed everything immediately.

  • CRC errors slowly increase on the new switches on the hosts' interfaces.
  • A rescan of all HBAs takes about 10 minutes, while on the Ciscos it is a matter of < 1 minute.
  • Starting up a VM (not booting the OS, only powering it on in VMware) takes about 2-3 minutes to reach 100%.
  • The OS in a VM takes a long time to boot up.
  • ESXi hosts log multiple performance-degraded warnings against the iSCSI datastores.
  • If a host tries to boot while connected to the new switches, it takes about 10 minutes to start, while it usually takes < 5 minutes.
    • Even after it has booted, many services do not run properly, including the host's web UI.

So this leads me to believe that with the new switches in place, access to the datastores through iSCSI isn't working properly. Of course, seeing CRC errors, I first thought about changing the cables. I'm only getting CRCs at the host level, not on the SAN ports. So I swapped the cables for known good working cables, and that didn't fix it. I tried other interfaces on the switches, with no luck. But it happens on both S4128T-ONs for both hosts, so 4 interfaces...I don't think I'm that unlucky here.
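For anyone wanting to watch those counters themselves, the per-NIC error stats can be read on each host with esxcli; a minimal sketch (vmnic2 is a placeholder for your iSCSI uplink):

```shell
# Full per-NIC counters; look for the "Receive CRC errors" line
esxcli network nic stats get -n vmnic2

# Watch just the CRC counter grow over time
while true; do
  esxcli network nic stats get -n vmnic2 | grep -i crc
  sleep 10
done
```

A counter that keeps climbing only while traffic flows through the new switches points at the switch side rather than the cabling.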

Here are some notes I took over the past days :

  • Jumbo frames are enabled and confirmed working with vmkping:
    • Switch interfaces and VLAN MTU are set to 9216
    • vmkernel ports and vSwitches are using MTU 9000
    • Jumbo frames are enabled on the SAN
    • vmkping between hosts and SAN works (vmkping -I vmkX x.x.x.x -d -s 8972)
  • Spanning-tree disabled on the switches
  • All cables are CAT6.
  • I have confirmed with DELL EMC Support that the switches are properly configured according to their recommendations.
  • Delayed ACK is disabled, even though it is not mandatory for the ME4024 SAN.
  • In case of a duplex/speed mismatch between the hosts and the new switches, I forced 1000/Full at both the host level and the switch level. The link was working, but the iSCSI latency was still there.
  • Updated the Broadcom firmware (July 2021) from DELL's website.
  • Disabled the energy-efficient/power-saving settings on each Broadcom NIC in the BIOS for maximum performance.
  • The Broadcoms are using driver ntg3 4.1.5.0-0vmw.702.0.0.17867351 in VMware.
    • Confirmed certified for 7.0 U2c in the VMware HCL
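The jumbo-frame validation above can be repeated on each host with something like the following (vmk1 and the target IPs are placeholders for your own iSCSI vmkernel port and SAN portal addresses):

```shell
# Confirm each vmkernel port really carries MTU 9000
esxcli network ip interface list

# 8972-byte payload + 28 bytes of ICMP/IP headers = 9000; -d sets don't-fragment,
# so the ping fails if any hop cannot pass a full 9000-byte frame
for ip in 10.0.0.10 10.0.0.11; do
  vmkping -I vmk1 -d -s 8972 "$ip"
done
```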

 

I am running out of ideas. It has to be a configuration I am missing somewhere, or an incompatibility between the Broadcoms (firmware, VMware driver, etc.) and the new DELL switches.

Any ideas?

Thanks !

Jay-AWC
Contributor

I have a very similar setup, also replacing 1Gb iSCSI switches with 10Gb S4128Ts, and I'm having the same performance issues. After testing and working with Dell support, I still cannot improve performance and have moved back to the old 1Gb switches for now. Did you ever resolve this issue .... please ....

spookyneo1
Contributor

Unfortunately, the issue is still outstanding.

Turns out, after a couple of months of troubleshooting with DELL and VMware (mostly with DELL), the S4128T switches still have issues with some specific 1Gb NICs when running with an MTU higher than the default. Of course, our iSCSI traffic runs with an MTU of ~9000. I say "still have some issues" because the S4128T had connectivity issues with 1Gb NICs in the past that were fixed in OS10 10.5.3.0. Now it seems that DELL acknowledges there is still an issue, but with higher MTUs. However, since it is a very specific case and I was the only client with this issue so far, they don't know if they are going to fix it. For now, we're back running the iSCSI network on the Cisco 1Gb switches, and the S4128Ts are sitting in the rack doing nothing.

We're waiting for new servers that should arrive in a few months and will have 10Gb NICs, so new OS10 release or not, our S4128T switches should go into production in a few months with 10Gb NICs.

Jay-AWC
Contributor

Thanks so much for the reply. That sounds very similar to my issue, with Dell R640 servers, Broadcom 5719 and 5720 NICs, and S4128T switches. I have also moved back to the 1Gb iSCSI switches, in my case N2024s. So I guess I will need to upgrade our hosts' NICs to 10Gb. Thanks for the info, it's much appreciated.

spookyneo1
Contributor

Did you try installing the latest OS10 code on the switches? The latest code is 10.5.3.2. As I said, 10.5.3.0 has a fix for 1Gb NICs, and maybe that will fix your issue. We may (or may not) have the same root cause. 🙂

Jay-AWC
Contributor

I haven't checked, as Dell ProDeploy & Tech Support said they are on the latest version, but I will check and confirm with Dell. Thanks!

Tech320
Contributor

Hello,

Sorry to bring up an old thread, but I was wondering if you've deployed the Dell switches with servers that have 10Gb NICs? I'm getting ready to deploy S4128T-ON switches for iSCSI, which is when I found this thread. My R640s have 10Gb NICs, so it sounds like I won't have any issues. However, I wanted to see if that solved your performance issues.

Thanks.

spookyneo1
Contributor

Hi,

We have not deployed yet, no. We received the new R640 servers with 10Gb NICs a few days ago, and they should be deployed in April/May. So far, we're still running on the Ciscos, and we will deploy on the DELL switches once the new R640s are put into production.

 

I doubt you'll have any issues with 10Gb NICs. The issue/bug that I experienced was with old Broadcom 1Gb NICs while using jumbo frames.

Jay-AWC
Contributor

Hi,

The switch firmware update and NIC updates still did not resolve the issue, so we went back to using the Dell N2024 1Gb switches. I have ordered new R650XS servers with 10Gb NICs; these should hopefully be installed in June. I think you should be OK if you have 10Gb NICs and 10Gb switching. The issue for us, by the looks of it, was having 10Gb switches and 1Gb NICs in the hosts; for some reason Dell could not work out whether it was the switches or the hosts that could not control the traffic properly from 10Gb down to 1Gb. I will update once the new servers are installed.

Tech320
Contributor

Thanks for both replies. Makes me feel a little better about deploying these into production.

briless
Contributor

I had a similar issue with the same switches. I disabled flow control on the iSCSI switch ports, and all the problems instantly went away.

This is the command I ran on my switch ports: flowcontrol transmit off
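For anyone else landing here, on OS10 that is applied per interface; a minimal sketch (the interface name is an example, and whether to also turn off receive-side flow control is a site-specific choice):

```
OS10# configure terminal
OS10(config)# interface ethernet 1/1/1
OS10(conf-if-eth1/1/1)# flowcontrol transmit off
OS10(conf-if-eth1/1/1)# flowcontrol receive off
OS10(conf-if-eth1/1/1)# end
```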

Tech320
Contributor

Perfect, thanks for the info. Sadly, I haven't been able to deploy my switches yet, but this is good to know once I do.

spookyneo1
Contributor

FYI, our new servers have been deployed with 10Gb NICs. The DELL 10Gb switches are working flawlessly with the 10Gb NICs.

Good to know about flow control. I believe I tried disabling it a few months ago when messing around, but I can't recall.
