VMware Cloud Community
MattPetersGW
Contributor

Software iSCSI failover lockup/freeze

Hi all,

We have a really simple setup: 3 hosts, two switches, and storage:

3x HPE ProLiant DL360 Gen10

2x Cisco Nexus n9k

1x HPE/Nimble HF40

vSphere 6.7 u3

Each host has a dedicated iSCSI standard switch with two uplink ports assigned and two port groups, each with its own vmkernel (vmk) port. In one port group the first uplink is active and the second is unused, and the other port group reverses this. These are cross-connected to the Nexus switches. iSCSI discovery is to the Nimble group IP.
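For reference, the layout above can be sketched as esxcli commands. This is a hedged sketch of the standard software-iSCSI port-binding setup, not our exact commands: the port group, vmnic, vmk, and adapter names (iSCSI-A/iSCSI-B, vmnic4/vmnic5, vmk1/vmk2, vmhba64) and the group IP are placeholders.

```shell
# One active uplink per port group; the other uplink is unused
# (an uplink listed under neither --active-uplinks nor --standby-uplinks
# is treated as unused):
esxcli network vswitch standard portgroup policy failover set \
    --portgroup-name=iSCSI-A --active-uplinks=vmnic4
esxcli network vswitch standard portgroup policy failover set \
    --portgroup-name=iSCSI-B --active-uplinks=vmnic5

# Bind each vmkernel port to the software iSCSI adapter:
esxcli iscsi networkportal add --adapter=vmhba64 --nic=vmk1
esxcli iscsi networkportal add --adapter=vmhba64 --nic=vmk2

# Dynamic discovery against the Nimble group IP (placeholder address):
esxcli iscsi adapter discovery sendtarget add --adapter=vmhba64 \
    --address=192.168.100.10:3260
```

These commands only run on an ESXi host, so treat them as a configuration reference rather than a runnable script.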

The switches have a stack link (apologies, not much of a network guy so actual detail might be lacking) and dedicated iSCSI VLAN which the hosts and storage are connected to. Everything is configured jumbo frames host-switch-storage and is running 10Gb over DAC cabling.

Storage is a basic setup, the iSCSI side is dedicated to data flow. vCenter is on a standard VMFS datastore and everything else is vVOL. VASA integration is enabled from storage. Nimble Connection Service and Path Selection Plug-in are installed on the hosts (latest version of these, 7.0) and the datastore/vVOLs are using Nimble_PSP_Directed for path selection policy.

We need this to be highly available, and for the most part it is. The problem comes when testing iSCSI path failure, whether by switch failure, host NIC fault or storage NIC fault. If connected to the vCenter web client, it becomes unresponsive for 30-45 seconds (selecting another item loads nothing; the blue circle spins in the top right of the screen). Similar happens with Windows guests: if connected through remote console or RDP, we lose access for 30-45 seconds. Ping continues to work. Predictably, vSphere logs a path degradation error.
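When reproducing the failure, it can help to watch path and session state from the host itself to see what actually takes 30-45 seconds to recover. A hedged sketch of the inspection commands I'd run (the adapter name vmhba64 is a placeholder):

```shell
# List all paths and their state (active/dead) per device:
esxcli storage core path list

# Confirm which path selection policy each device is claimed by
# (should show the Nimble PSP for the Nimble volumes):
esxcli storage nmp device list

# Watch iSCSI session state while a link is pulled:
esxcli iscsi session list --adapter=vmhba64

# Inspect the iSCSI adapter timeout parameters (NoopOut, Recovery, Login),
# which govern how quickly a dead path is declared:
esxcli iscsi adapter param get --adapter=vmhba64
```

Again, these only run on an ESXi host, so they're a reference rather than a runnable script.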

Now, according to "Path Failover and Virtual Machines" this is expected behaviour, but "Array-Based Failover with iSCSI" says reconnection happens quickly. I'm under the impression we're using array-based failover?

The ideal scenario is that we don't lose access to machines for longer than a few seconds, not tens of seconds. VMware, Nimble and Cisco support are all scratching their heads over this.

3 Replies
daphnissov
Immortal

Have you done any testing to measure and compare the failover recovery time for VMs that are hosted on VMFS datastores rather than vVols? Curious to know if this extended failover time is specific to one or both types of storage.

MattPetersGW
Contributor

Good question, something I will check in a couple of days on site.

MattPetersGW
Contributor

I've done a bit more testing. There's no difference in recovery time between a standard datastore and the vVOL storage.

I have found that, when connected to the web interfaces of vCenter (standard datastore), vROps (vVOL), or via RDP to a Windows server, they all 'pause' or sit unresponsive when a storage link goes down. However, if I connect directly to the ESXi hosts and use remote console, all of the machines continue to respond. I tested two Windows servers on two different hosts, running ping and watching the clock on both. No dropped packets, and the clocks kept ticking.

I'm getting really confused, as it looks as though the machines stay operational behind the scenes, but TCP-based protocols (HTTP/S, RDP) stop responding. Even ping keeps working from outside this environment inbound.
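To put a number on the pause rather than watching clocks, timestamped pings can be post-processed to report reply gaps. A minimal sketch, assuming a Linux workstation with iputils ping (the `-D` flag prints epoch timestamps); `find_gaps` and the target address are my own placeholders:

```shell
#!/bin/sh
# find_gaps: reads iputils ping output lines like
#   [1699999999.123] 64 bytes from 10.0.0.50: icmp_seq=1 ...
# and prints any gap between consecutive replies longer than 1 second.
find_gaps() {
  awk -F'[][]' '$2 != "" {
    if (prev != "" && $2 - prev > 1)
      printf "gap of %.1f s ending at epoch %s\n", $2 - prev, $2
    prev = $2
  }'
}

# Usage against a test VM while pulling a link (placeholder address):
# ping -D -i 0.2 10.0.0.50 | find_gaps
```

Running this against a guest during a link pull should show whether the stall is really 30-45 seconds and exactly when it ends.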
