VMware Cloud Community
Asteroza
Contributor

ESXi 6.7 software iSCSI balanced writes but unbalanced reads?

I have an unusual problem that is fairly hard to chase down, and I am wondering if it may be due to a change in ESXi behavior.

Environment

Have a series of 2-NIC ESXi servers that have been in use since ESXi 5.5 and were upgraded all the way to ESXi 6.7 (free license)

Have a single managed switch (HP ProCurve/Aruba 2350) with HP trunks to both NICs on each server

Have 4 iSCSI subnets with no routing

iSCSI storage is provided by a NexentaStor box with 4 NICs (no link aggregation to the switch)

Each NIC has one IP on each iSCSI subnet via VLANs, for a total of 16 IPs presented

The NexentaStor iSCSI server presents a VAAI-compatible LUN with 4 targets, each individual target having 4 IPs from the same iSCSI subnet

ESXi uses a single vSwitch with both pNICs teamed, matching the switch-side trunk

Under the vSwitch there are 4 iSCSI portgroups, one for each iSCSI VLAN; each iSCSI portgroup has a single vmknic

Each iSCSI portgroup overrides the teaming policy to use only one pNIC, with the other set to unused (alternating which pNIC is active and which is unused per portgroup)

iSCSI port binding is used on the iSCSI vmks (all of them are available for port binding)

Round robin path selection is used, with all 16 paths active
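For reference, this layout can be confirmed from the ESXi shell; the commands below are roughly what I use, with naa.xxxx as a placeholder for the actual NexentaStor LUN:

esxcli storage nmp device list
esxcli storage core path list -d naa.xxxx

The first shows the path selection policy and working paths per device, the second lists every individual path for the LUN along with its state.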

The previous behavior under ESXi 6.5 was mostly balanced reads and balanced writes. Outbound writes are evenly split because round robin path selection spreads them across the 4 vmks (and thus evenly across the 2 pNICs), but return traffic has to go through the switch trunk's load-balancing algorithm, so it is essentially blind luck which iSCSI session lands on which trunk port and therefore on which pNIC. Basic statistics says the odds of a reasonably even balance improve as the number of TCP sessions grows (think of the bell curve of possible distributions); for example, with only two sessions and a uniform two-port hash there is roughly a 50% chance both land on the same pNIC, while with 16 sessions the chance of every session landing on one pNIC is negligible. With only two targets with one IP each, the worst case is all traffic going to one pNIC, so the temporary solution was to increase the target IP count, since the odds of all 16 sessions ending up on one pNIC are low.

Note that while the distribution of traffic from the switch to ESXi depends on the switch's internal load-balancing algorithm, traffic from the NexentaStor to the switch was evenly balanced, suggesting that reads are being issued evenly across the targets. Since the iSCSI server has 4 NICs, it seemed better to balance traffic across all 4 of them, hence the need for targets/target IPs in multiples of 4.

After the ESXi 6.7 upgrade, though, something changed: reads now come almost exclusively from a single NexentaStor NIC, even though the switch still appears to distribute the read traffic evenly from the switch to the ESXi servers. Which NIC is read from almost exclusively can change after rebooting the ESXi server. The NexentaStor iSCSI server was not changed during the ESXi upgrade.
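One way to see which NexentaStor portal IPs the reads are actually riding on is to list the iSCSI sessions and their TCP connections from the ESXi shell (vmhba64 below is just a placeholder for the software iSCSI adapter name):

esxcli iscsi adapter list
esxcli iscsi session list --adapter=vmhba64
esxcli iscsi session connection list --adapter=vmhba64

If I read the output right, the RemoteAddress field of each connection should reveal whether all the busy connections end up on the same storage NIC.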

My guess is some odd behavior due to the large number of targets and target IPs, the target IP distribution, and configuration carried over from the upgrade. Each LUN has 4 targets, and each target has 4 IPs in the same subnet. Each NexentaStor NIC uses the same final octet in every subnet.

So

NIC 1 has addresses x.x.A.11 x.x.B.11 x.x.C.11 x.x.D.11

NIC 2 has addresses x.x.A.12 x.x.B.12 x.x.C.12 x.x.D.12

NIC 3 has addresses x.x.A.13 x.x.B.13 x.x.C.13 x.x.D.13

NIC 4 has addresses x.x.A.14 x.x.B.14 x.x.C.14 x.x.D.14

4 targets

target W (x.x.A.11-14)

target X (x.x.B.11-14)

target Y (x.x.C.11-14)

target Z (x.x.D.11-14)
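Since the iSCSI subnets are not routed, each vmk should only ever reach the four portal IPs in its own subnet; a quick sanity check from the ESXi shell looks something like this (the vmk numbers and addresses are placeholders following the layout above):

vmkping -I vmk1 x.x.A.11
vmkping -I vmk2 x.x.B.12
vmkping -I vmk3 x.x.C.13
vmkping -I vmk4 x.x.D.14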

There is also the somewhat changed guidance from VMware regarding iSCSI port binding for 6.7:

https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.vsphere.storage.doc/GUID-BC516B24-7EAB-4ADA...

Looking at the current iSCSI settings, the host client shows weird behavior.

Clicking the "Software iSCSI" button under Storage -> Adapters shows that it sort of thinks port binding is active, as it lists the 4 port bindings previously used under ESXi 6.5. However, if I select the iSCSI software adapter, the "Configure iSCSI" button becomes un-greyed, and clicking it shows the iSCSI configuration details but no port bindings (and no way to add any).
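Regardless of what the host client shows, the actual binding state can be checked from the ESXi shell (the adapter name is again a placeholder):

esxcli iscsi networkportal list --adapter=vmhba64

As far as I know, esxcli iscsi networkportal add / remove can also change the bindings directly if the UI refuses to.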

I confirmed via esxtop that the vmks are reading evenly, but I can't see which paths are being selected or the load per path. Based on what the switch is showing, though, it suggests the possibility that ESXi is reading from all 4 targets but only via the (first?) IP of each target, which ends up being the same iSCSI server NIC in every case.
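If I understand the CLI correctly, per-path I/O counters are available even without vCenter, which should show whether only one path per target is actually carrying the reads (the path UID is a placeholder):

esxcli storage core path stats get
esxcli storage core path stats get --path=<path UID>

The first dumps read/write counters for every path, the second for a single path.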

Since I am a free-license user, I don't have vCenter to keep an eye on things and get more detailed information. The host client is sorely lacking; you can't even see network activity per vmk (at least the old C# client showed that). While the host client won't even let you select a path selection policy, I did confirm via the CLI that it is still round robin and that it thinks all paths are active. Since the NexentaStor iSCSI server also feeds other ESXi servers of lesser versions, I would prefer not to make changes that only work well for ESXi 6.7 but not 6.5 or 5.5, if I can avoid it.
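For completeness, the confirmation from the shell was along these lines (naa.xxxx is a placeholder for the LUN):

esxcli storage nmp device list -d naa.xxxx
esxcli storage nmp psp roundrobin deviceconfig get -d naa.xxxx

If I read the deviceconfig output right, it also shows the round robin IOPS limit (1000 by default), meaning up to 1000 I/Os go down one path before switching, so short samples can look lopsided even when the policy is working.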

There was a mention somewhere that having more than 32 paths causes odd behavior, which can happen if I am using 2 or more LUNs from the iSCSI server (previously, the C# client under an early version of 6.5 showed a boatload of paths). The VMware guidance might also suggest that with 1:1 port binding on a 2-pNIC vSwitch you can only correctly add 2 vmks for port binding, and adding more than that might lead to undefined behavior?
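A quick way to check whether I'm anywhere near that 32-path mark across all LUNs is simply counting path entries from the shell:

esxcli storage core path list | grep -c "Runtime Name"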

So, does anybody have an idea what the hell is wrong? One possible solution might be adjusting the vmk count and redefining the iSCSI targets, but with only 2 pNICs on the host and 4 NICs on the storage server, what would be the correct/better target/target-IP mapping and vmk count?
