aenagy
Hot Shot

Multicast performance test failed on vSAN 6.2

I have set up a four-node vSAN 6.2 cluster (vCenter 6.0 build 3617395). The Multicast test fails, but the other two tests pass. The hosts (ESXi 6.0.0 Update 3 build-5050593) are nested (on ESXi 6.0.0 Update 3 build-5050593). Health shows all green except 'Hardware Compatibility', which is due to nesting. I have a dedicated VLAN, a dedicated subnet, and vmkernel interfaces used exclusively for vSAN traffic. I have also added static routes for the multicast address on each host[1] (a PowerCLI sketch follows the table below):

Network   Netmask         Gateway  Interface Source
-------   -------         -------  --------- ------
default   0.0.0.0         10.1.4.1 vmk0      MANUAL
10.1.4.0  255.255.255.0   0.0.0.0  vmk0      MANUAL
10.1.6.0  255.255.255.0   0.0.0.0  vmk1      MANUAL
10.1.7.0  255.255.255.0   0.0.0.0  vmk2      MANUAL
10.1.9.0  255.255.255.0   0.0.0.0  vmk4      MANUAL
224.2.3.4 255.255.255.255 10.1.9.0 vmk4      MANUAL
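I scripted the route changes from PowerCLI; roughly, a minimal sketch of the equivalent, assuming the vSAN subnet 10.1.9.0/24 on vmk4 as shown above, would be:

# Minimal sketch: add the multicast host route on every host in the cluster.
# Assumes the vSAN vmkernel subnet 10.1.9.0/24 on vmk4, per the routing table above.
foreach ($esx in Get-Cluster | Get-VMHost) {
    New-VMHostRoute -VMHost $esx -Destination 224.2.3.4 -PrefixLength 32 -Gateway 10.1.9.0 -Confirm:$false
}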

When I experiment with the switch (a single Cisco SG300 in L3 mode) by enabling and disabling IGMP snooping and the IGMP querier, I get varying results -- see below. I have already posted a question to the Cisco forum to see what, if any, configuration changes could or should be made. Looking at the results below, I see very high packet loss, which I assume is contributing to the yellow and red test statuses.

Any ideas or suggestions?

Thanks.

P.S. I read Why proactive multicast performance test fails when health check passes?, but it doesn't seem to apply here.

PowerCLI J:\Data\Projects\Programming\PowerShell\vSphere_networking\change_ESXi_static_routes> Write-Host "Start test with Snooping Enabled and Querier Enabled" ; $TestResult = Test-VsanNetworkPerformance -Cluster ( Get-Cluster ) -Verbose ; $TestResult.HostResult | Format-Table -AutoSize ; Write-Host "End test with Snooping Enabled and Querier Enabled"

Start test with Snooping Enabled and Querier Enabled

VERBOSE: 2017-08-10 13:40:02 Test-VsanNetworkPerformance Started execution

VERBOSE: 2017-08-10 13:40:46 Test-VsanNetworkPerformance Finished execution

BandwidthBytesPerSecond IsClientInMulticast Host                       JitterMilliseconds LossPercentage LossDatagrams SentDatagrams Status TotalBytes
----------------------- ------------------- ----                       ------------------ -------------- ------------- ------------- ------ ----------
              131072000                True topvcpesxi04.corp.ad.local                                                                       114022020
               34207775               False topvcpesxi03.corp.ad.local 0.0260000005364418 10             7819          77565         yellow  102526620
               26277255               False topvcpesxi02.corp.ad.local 0.0299999993294477 30             23970         77565         yellow   78784650
               30538787               False topvcpesxi01.corp.ad.local 0.0260000005364418 19             15265         77565         yellow   91581000

End test with Snooping Enabled and Querier Enabled

******************************************************************************************************************************************************

PowerCLI J:\Data\Projects\Programming\PowerShell\vSphere_networking\change_ESXi_static_routes>

PowerCLI J:\Data\Projects\Programming\PowerShell\vSphere_networking\change_ESXi_static_routes> Write-Host "Start test with Snooping Disabled and Querier Enabled" ; $TestResult = Test-VsanNetworkPerformance -Cluster ( Get-Cluster ) -Verbose ; $TestResult.HostResult | Format-Table -AutoSize ; Write-Host "End test with Snooping Disabled and Querier Enabled"

Start test with Snooping Disabled and Querier Enabled

VERBOSE: 2017-08-10 13:45:20 Test-VsanNetworkPerformance Started execution

VERBOSE: 2017-08-10 13:46:03 Test-VsanNetworkPerformance Finished execution

BandwidthBytesPerSecond IsClientInMulticast Host                       JitterMilliseconds LossPercentage LossDatagrams SentDatagrams Status TotalBytes
----------------------- ------------------- ----                       ------------------ -------------- ------------- ------------- ------ ----------
              131072000                True topvcpesxi04.corp.ad.local                                                                        96023340
               12020978               False topvcpesxi03.corp.ad.local 0.0430000014603138 62             40791         65321         red      36059100
               22177906               False topvcpesxi02.corp.ad.local 0.0560000017285347 30             20052         65321         yellow   66545430
               17085680               False topvcpesxi01.corp.ad.local 0.0500000007450581 46             30445         65321         red      51267720

End test with Snooping Disabled and Querier Enabled

******************************************************************************************************************************************************

PowerCLI J:\Data\Projects\Programming\PowerShell\vSphere_networking\change_ESXi_static_routes>

PowerCLI J:\Data\Projects\Programming\PowerShell\vSphere_networking\change_ESXi_static_routes> Write-Host "Start test with Snooping Disabled and Querier Disabled" ; $TestResult = Test-VsanNetworkPerformance -Cluster ( Get-Cluster ) -Verbose ; $TestResult.HostResult | Format-Table -AutoSize ; Write-Host "End test with Snooping Disabled and Querier Disabled"

Start test with Snooping Disabled and Querier Disabled

VERBOSE: 2017-08-10 13:46:37 Test-VsanNetworkPerformance Started execution

VERBOSE: 2017-08-10 13:47:26 Test-VsanNetworkPerformance Finished execution

BandwidthBytesPerSecond IsClientInMulticast Host                       JitterMilliseconds LossPercentage LossDatagrams SentDatagrams Status TotalBytes
----------------------- ------------------- ----                       ------------------ -------------- ------------- ------------- ------ ----------
              131072000                True topvcpesxi04.corp.ad.local                                                                       140358540
               26909635               False topvcpesxi03.corp.ad.local 0.131999999284744  42             40548         95481         yellow   80751510
               29760610               False topvcpesxi02.corp.ad.local 0.130999997258186  36             34722         95481         yellow   89315730
               27572945               False topvcpesxi01.corp.ad.local 0.131999999284744  41             39191         95481         yellow   82746300

End test with Snooping Disabled and Querier Disabled

******************************************************************************************************************************************************

PowerCLI J:\Data\Projects\Programming\PowerShell\vSphere_networking\change_ESXi_static_routes>

PowerCLI J:\Data\Projects\Programming\PowerShell\vSphere_networking\change_ESXi_static_routes> Write-Host "Start test with Snooping Enabled and Querier Disabled" ; $TestResult = Test-VsanNetworkPerformance -Cluster ( Get-Cluster ) -Verbose ; $TestResult.HostResult | Format-Table -AutoSize ; Write-Host "End test with Snooping Enabled and Querier Disabled"

Start test with Snooping Enabled and Querier Disabled

VERBOSE: 2017-08-10 13:48:14 Test-VsanNetworkPerformance Started execution

VERBOSE: 2017-08-10 13:48:57 Test-VsanNetworkPerformance Finished execution

BandwidthBytesPerSecond IsClientInMulticast Host                       JitterMilliseconds LossPercentage LossDatagrams SentDatagrams Status TotalBytes
----------------------- ------------------- ----                       ------------------ -------------- ------------- ------------- ------ ----------
              131072000                True topvcpesxi04.corp.ad.local                                                                        93794820
               15191524               False topvcpesxi03.corp.ad.local 0.0560000017285347 11             8994          77506         red     100712640
               14165799               False topvcpesxi02.corp.ad.local 0.0549999997019768 17             13617         77506         red      93916830
               14672569               False topvcpesxi01.corp.ad.local 0.0710000023245811 14             11334         77506         red      97272840

End test with Snooping Enabled and Querier Disabled

PowerCLI J:\Data\Projects\Programming\PowerShell\vSphere_networking\change_ESXi_static_routes>

[1] vSAN Multicast performance test fails (2135495) | VMware KB

... which led me to the section "Multicast performance test of Virtual SAN health check does not run on Virtual SAN network" ...

VMware Virtual SAN 6.1 Release Notes

TheBobkin
Champion

Hello aenagy,

It wouldn't be surprising to see Health check tests failing in a nested environment - they are not designed for this structure, so the tests' pass/fail thresholds are more likely to be crossed.

Is cluster formation and communication functional, and is the cluster stable?

If so, then multicast traffic is getting through and is adequate.

Check the cluster membership revision count in "esxcli vsan cluster get"; if it is not incrementing and the cluster is formed, then it is probably fine. Also check that you can create VMs with a Storage Policy applied, etc.
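If it is easier, the same info can be pulled for all hosts in one go from PowerCLI; a rough sketch using the esxcli wrapper (compare the membership revision and member count fields across hosts):

# Rough sketch: run 'esxcli vsan cluster get' against every host via PowerCLI.
foreach ($esx in Get-Cluster | Get-VMHost) {
    $esxcli = Get-EsxCli -VMHost $esx -V2   # -V2 needs PowerCLI 6.3 or later
    Write-Host $esx.Name
    $esxcli.vsan.cluster.get.Invoke()       # compare the membership fields across hosts
}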

I recall the other issue you referenced; I think that one was actually a broken Health test, so that KB walks through spoofing it so it can be 'all-green' again. I don't think your results are indicative of this, as you got some warning (yellow) results, not just failed (red) ones.

Bob

aenagy
Hot Shot

Bob:

It wouldn't be surprising to see Health check tests failing in a nested environment - they are not designed for this structure, so the tests' pass/fail thresholds are more likely to be crossed.

Given that there is hardly any load and it's such a small environment, I'm still a bit surprised. I checked esxtop on the physical host to see what was happening with the network traffic, and the ports for the nested hosts show a high %DRPRX. I am using a modified version of William Lam's nested ESXi appliance in conjunction with the MAC learning filter, but it doesn't seem to be working the way it should. NFS and iSCSI on the same pHost seem to work fine, and the vSAN seems OK as near as I can tell.
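In case anyone wants to check the same thing, here is a rough sketch of how the MAC-learning dvfilter attachment can be verified from PowerCLI. It assumes the nested ESXi VM names match the host names and the fling's documented per-vNIC settings (ethernetN.filter4.name = "dvfilter-maclearn"):

# Rough sketch: list the dvfilter advanced settings on the nested ESXi VMs.
# Assumes VM names match the nested host names ("topvcpesxi*") and the fling's
# documented per-vNIC settings (ethernetN.filter4.name = "dvfilter-maclearn").
Get-VM topvcpesxi* |
    Get-AdvancedSetting -Name 'ethernet*.filter4.*' |
    Select-Object Entity, Name, Value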

Is cluster formation and communication functional, and is the cluster stable?

vSAN health looks OK, except that "Data > Virtual SAN object health" shows a white "i" in a blue circle rather than a green/yellow/red icon.

Check the cluster membership revision count in "esxcli vsan cluster get"; if it is not incrementing and the cluster is formed, then it is probably fine. Also check that you can create VMs with a Storage Policy applied, etc.

Output from "esxcli vsan cluster get" looks OK (Health State: HEALTHY; Membership Entry Revision and Member Count have the same values on all four nodes). I have been able to clone a virtual machine twice with the default vSAN policy, then apply an FTT=1 RAID-5 policy to one clone and an FTT=1 RAID-1 policy to the other. The initial clone of a 16 GB VM took more than an hour, which, needless to say, is alarming considering the vSAN cluster is all-SSD. esxtop tells the story:

Nested ESXi03:

12:56:42am up 15:22, 546 worlds, 0 VMs, 0 vCPUs; CPU load average: 0.04, 0.04, 0.03
ADAPTR PATH                 NPTH   CMDS/s  READS/s WRITES/s MBREAD/s MBWRTN/s DAVG/cmd KAVG/cmd GAVG/cmd QAVG/cmd
vmhba0 -                       0     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
vmhba1 -                       4   113.55    73.81    39.74     0.29     2.48   394.55     4.21   398.75     6.28
vmhba32 -                       1     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
vmhba33 -                       6     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00

Nested ESXi04:

12:55:44am up 15:09, 544 worlds, 0 VMs, 0 vCPUs; CPU load average: 0.04, 0.03, 0.03
ADAPTR PATH                 NPTH   CMDS/s  READS/s WRITES/s MBREAD/s MBWRTN/s DAVG/cmd KAVG/cmd GAVG/cmd QAVG/cmd
vmhba0 -                       0     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
vmhba1 -                       4    79.35     0.00    79.35     0.00     0.31   328.38     0.01   328.39     0.00
vmhba32 -                       1     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
vmhba33 -                       6     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00

Obviously there is something seriously wrong. Even though this is a consumer-grade SSD (Samsung 850 in an HP ProLiant DL360 Gen8), I wouldn't expect anything this bad.

I recall the other issue you referenced; I think that one was actually a broken Health test, so that KB walks through spoofing it so it can be 'all-green' again. I don't think your results are indicative of this, as you got some warning (yellow) results, not just failed (red) ones.

Grasping at straws...

I might have been expecting too much of nested ESXi, but this really takes the cake. I did the HOL-1708-SDC-1 lab recently and don't remember it being this bad; mind you, the virtual machine was tiny (500 MB or so). That being said, I just want to make sure that I haven't missed anything obvious.

TheBobkin
Champion

Hello,

That is a very high DAVG, and I wouldn't expect much performance with that slow a response time.
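If you want to keep an eye on it without leaving esxtop running, something like this rough sketch pulls the nearest realtime metric from vCenter ('disk.deviceLatency.average' is in milliseconds and roughly corresponds to esxtop's DAVG):

# Rough sketch: sample recent per-device latency for each host in the cluster.
Get-Cluster | Get-VMHost |
    Get-Stat -Stat 'disk.deviceLatency.average' -Realtime -MaxSamples 6 |
    Sort-Object Entity, Timestamp |
    Format-Table Entity, Timestamp, Instance, Value -AutoSize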

I am unaware of whether this is expected with nested home labs (I just use HOL or physical labs), so maybe a home-labber can weigh in here.

Do you have separate SSDs, or separate partitions, backing the cache-tier and capacity-tier drives?

What driver and version are used for the local disk connection on the host?

Bob

aenagy
Hot Shot

That is a very high DAVG, and I wouldn't expect much performance with that slow a response time.

Yeah. This is worse than the spinning disks (3 x HP SAS 10k, RAID5) by orders of magnitude.

Do you have separate SSDs, or separate partitions, backing the cache-tier and capacity-tier drives?

All of the nested ESXi hosts and their VMDKs are provisioned on the same VMFS datastore on the Samsung 850 EVO.
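For completeness, a rough PowerCLI sketch to confirm that placement, assuming the nested ESXi VM names match the host names and the session is connected to whatever manages those VMs:

# Rough sketch: show which datastore path backs each nested host's VMDKs.
# "topvcpesxi*" as VM names is an assumption.
Get-VM topvcpesxi* | Get-HardDisk |
    Select-Object Parent, Filename, CapacityGB |
    Format-Table -AutoSize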

What driver and version are used for the local disk connection on the host?

I'm using HP's OEM ESXi:

[root@TOPELHhost01:~] esxcfg-scsidevs -a
vmhba1  hpvsa             link-n/a  sata.vmhba1                             (0000:00:1f.2) Intel Corporation HP Dynamic Smart Array B120i RAID controller
vmhba2  hpsa              link-n/a  sas.50014380278cf830                    (0000:03:00.0) Hewlett Packard Enterprise Smart Array P420
[root@TOPELHhost01:~] vmkchdev -l | grep -i -e vmhba2
0000:03:00.0 103c:323b 103c:3351 vmkernel vmhba2

[root@TOPELHhost01:~] vmkload_mod -s hpsa
vmkload_mod module information
input file: /usr/lib/vmware/vmkmod/hpsa
License: GPL
Version: Version 6.0.0.124-1OEM, Build: 2494585, Interface: 9.2 Built on: Nov 16 2016
Build Type: release
Required name-spaces:
  com.vmware.driverAPI#9.2.3.0
  com.vmware.vmkapi#v2_3_0_0
Parameters:
  heap_max: int
    Maximum attainable heap size for the driver.
  heap_initial: int
    Initial heap size allocated for the driver.
  reply_queues: int
    Specify desired number of reply queues. 1-16, default is 4, not to exceed number of online CPUs.
  hpsa_simple_mode: int
    Use 'simple mode' rather than 'performant mode'
  hpsa_allow_any: int
    Allow hpsa driver to access unknown HPE Smart Array hardware
[root@TOPELHhost01:~] vmware -v
VMware ESXi 6.0.0 build-5050593
[root@TOPELHhost01:~]
