I have set up a four-node vSAN 6.2 cluster (vCenter 6.0 build 3617395). The Multicast test fails, but the other two tests pass. The hosts (ESXi 6.0.0 Update 3 build-5050593) are nested (on ESXi 6.0.0 Update 3 build-5050593). Health shows all green except 'Hardware Compatibility', which is expected due to nesting. I have a dedicated VLAN, subnet, and VMkernel interfaces used exclusively for vSAN traffic. I have also added static routes for the multicast address on each host[1]:
Network    Netmask          Gateway   Interface  Source
---------  ---------------  --------  ---------  ------
default    0.0.0.0          10.1.4.1  vmk0       MANUAL
10.1.4.0   255.255.255.0    0.0.0.0   vmk0       MANUAL
10.1.6.0   255.255.255.0    0.0.0.0   vmk1       MANUAL
10.1.7.0   255.255.255.0    0.0.0.0   vmk2       MANUAL
10.1.9.0   255.255.255.0    0.0.0.0   vmk4       MANUAL
224.2.3.4  255.255.255.255  10.1.9.0  vmk4       MANUAL
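For reference, a route like the 224.2.3.4 entry above can be added and verified per host from the ESXi shell (a sketch; standard esxcli route syntax as in ESXi 6.x, with the gateway value from my table):

esxcli network ip route ipv4 add --network 224.2.3.4/32 --gateway 10.1.9.0
esxcli network ip route ipv4 list   # produces the table shown above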
When I experiment with the switch (a single Cisco SG300 in L3 mode) by enabling and disabling IGMP snooping and the IGMP querier I get varying results -- see below. I have already posted a question to the Cisco forum to see what, if any, configuration changes could or should be made. Looking at the results below, I see very high packet loss, which I assume is contributing to the test result being yellow or red.
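For reference, the snooping/querier toggles I'm flipping correspond roughly to these Sx300 CLI settings (a sketch only; I'm assuming VLAN 9 for the vSAN VLAN here -- substitute the real VLAN ID -- and command names should be verified against the Sx300 CLI reference for your firmware):

configure terminal
ip igmp snooping                 ! global IGMP snooping on/off
ip igmp snooping vlan 9          ! snooping on the vSAN VLAN
ip igmp snooping vlan 9 querier  ! querier on the vSAN VLAN
end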
Any ideas or suggestions?
Thanks.
P.S. I read "Why proactive multicast performance test fails when health check passes?", but it doesn't seem to apply.
PowerCLI J:\Data\Projects\Programming\PowerShell\vSphere_networking\change_ESXi_static_routes> Write-Host "Start test with Snooping Enabled and Querier Enabled" ; $TestResult = Test-VsanNetworkPerformance -Cluster ( Get-Cluster ) -Verbose ; $TestResult.HostResult | Format-Table -AutoSize ; Write-Host "End test with Snooping Enabled and Querier Enabled"
Start test with Snooping Enabled and Querier Enabled
VERBOSE: 2017-08-10 13:40:02 Test-VsanNetworkPerformance Started execution
VERBOSE: 2017-08-10 13:40:46 Test-VsanNetworkPerformance Finished execution
BandwidthBytesPerSecond IsClientInMulticast Host                       JitterMilliseconds LossPercentage LossDatagrams SentDatagrams Status TotalBytes
----------------------- ------------------- ----                       ------------------ -------------- ------------- ------------- ------ ----------
              131072000                True topvcpesxi04.corp.ad.local                                                                        114022020
               34207775               False topvcpesxi03.corp.ad.local 0.0260000005364418             10          7819         77565 yellow  102526620
               26277255               False topvcpesxi02.corp.ad.local 0.0299999993294477             30         23970         77565 yellow   78784650
               30538787               False topvcpesxi01.corp.ad.local 0.0260000005364418             19         15265         77565 yellow   91581000
End test with Snooping Enabled and Querier Enabled
******************************************************************************************************************************************************
PowerCLI J:\Data\Projects\Programming\PowerShell\vSphere_networking\change_ESXi_static_routes>
PowerCLI J:\Data\Projects\Programming\PowerShell\vSphere_networking\change_ESXi_static_routes> Write-Host "Start test with Snooping Disabled and Querier Enabled" ; $TestResult = Test-VsanNetworkPerformance -Cluster ( Get-Cluster ) -Verbose ; $TestResult.HostResult | Format-Table -AutoSize ; Write-Host "End test with Snooping Disabled and Querier Enabled"
Start test with Snooping Disabled and Querier Enabled
VERBOSE: 2017-08-10 13:45:20 Test-VsanNetworkPerformance Started execution
VERBOSE: 2017-08-10 13:46:03 Test-VsanNetworkPerformance Finished execution
BandwidthBytesPerSecond IsClientInMulticast Host                       JitterMilliseconds LossPercentage LossDatagrams SentDatagrams Status TotalBytes
----------------------- ------------------- ----                       ------------------ -------------- ------------- ------------- ------ ----------
              131072000                True topvcpesxi04.corp.ad.local                                                                         96023340
               12020978               False topvcpesxi03.corp.ad.local 0.0430000014603138             62         40791         65321 red      36059100
               22177906               False topvcpesxi02.corp.ad.local 0.0560000017285347             30         20052         65321 yellow   66545430
               17085680               False topvcpesxi01.corp.ad.local 0.0500000007450581             46         30445         65321 red      51267720
End test with Snooping Disabled and Querier Enabled
******************************************************************************************************************************************************
PowerCLI J:\Data\Projects\Programming\PowerShell\vSphere_networking\change_ESXi_static_routes>
PowerCLI J:\Data\Projects\Programming\PowerShell\vSphere_networking\change_ESXi_static_routes> Write-Host "Start test with Snooping Disabled and Querier Disabled" ; $TestResult = Test-VsanNetworkPerformance -Cluster ( Get-Cluster ) -Verbose ; $TestResult.HostResult | Format-Table -AutoSize ; Write-Host "End test with Snooping Disabled and Querier Disabled"
Start test with Snooping Disabled and Querier Disabled
VERBOSE: 2017-08-10 13:46:37 Test-VsanNetworkPerformance Started execution
VERBOSE: 2017-08-10 13:47:26 Test-VsanNetworkPerformance Finished execution
BandwidthBytesPerSecond IsClientInMulticast Host                       JitterMilliseconds LossPercentage LossDatagrams SentDatagrams Status TotalBytes
----------------------- ------------------- ----                       ------------------ -------------- ------------- ------------- ------ ----------
              131072000                True topvcpesxi04.corp.ad.local                                                                        140358540
               26909635               False topvcpesxi03.corp.ad.local  0.131999999284744             42         40548         95481 yellow   80751510
               29760610               False topvcpesxi02.corp.ad.local  0.130999997258186             36         34722         95481 yellow   89315730
               27572945               False topvcpesxi01.corp.ad.local  0.131999999284744             41         39191         95481 yellow   82746300
End test with Snooping Disabled and Querier Disabled
******************************************************************************************************************************************************
PowerCLI J:\Data\Projects\Programming\PowerShell\vSphere_networking\change_ESXi_static_routes>
PowerCLI J:\Data\Projects\Programming\PowerShell\vSphere_networking\change_ESXi_static_routes> Write-Host "Start test with Snooping Enabled and Querier Disabled" ; $TestResult = Test-VsanNetworkPerformance -Cluster ( Get-Cluster ) -Verbose ; $TestResult.HostResult | Format-Table -AutoSize ; Write-Host "End test with Snooping Enabled and Querier Disabled"
Start test with Snooping Enabled and Querier Disabled
VERBOSE: 2017-08-10 13:48:14 Test-VsanNetworkPerformance Started execution
VERBOSE: 2017-08-10 13:48:57 Test-VsanNetworkPerformance Finished execution
BandwidthBytesPerSecond IsClientInMulticast Host                       JitterMilliseconds LossPercentage LossDatagrams SentDatagrams Status TotalBytes
----------------------- ------------------- ----                       ------------------ -------------- ------------- ------------- ------ ----------
              131072000                True topvcpesxi04.corp.ad.local                                                                         93794820
               15191524               False topvcpesxi03.corp.ad.local 0.0560000017285347             11          8994         77506 red     100712640
               14165799               False topvcpesxi02.corp.ad.local 0.0549999997019768             17         13617         77506 red      93916830
               14672569               False topvcpesxi01.corp.ad.local 0.0710000023245811             14         11334         77506 red      97272840
End test with Snooping Enabled and Querier Disabled
PowerCLI J:\Data\Projects\Programming\PowerShell\vSphere_networking\change_ESXi_static_routes>
[1] vSAN Multicast performance test fails (2135495) | VMware KB
... which led me to the section "Multicast performance test of Virtual SAN health check does not run on Virtual SAN network" ...
Hello aenagy,
It wouldn't be surprising to see Health check tests failing in a nested environment - they are not designed for this structure, and thus the tests' 'pass' or 'fail' thresholds are more likely to be crossed.
Is cluster formation and communication functional and the cluster stable?
If so, then multicast traffic is getting through and is adequate.
Check the cluster membership revision count in #esxcli vsan cluster get; if it is not incrementing and the cluster is formed, then it is probably fine. Also check that you can create VMs with a Storage Policy applied, etc.
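For example (a sketch of what to compare; field names as I recall them from 6.x output, so verify against your build):

esxcli vsan cluster get
# Compare across all four hosts:
#   Local Node Health State               - should be HEALTHY
#   Sub-Cluster Member Count              - should equal the node count (4)
#   Sub-Cluster Membership Entry Revision - should match on every host and
#                                           not keep incrementing over time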
I recall the other issue you referenced; I think that one was actually a broken Health test, and that KB walks through the method of spoofing it so it can be 'all-green' again. I don't think the results of the tests you showed are indicative of this, as you got some warning (yellow) results, not just failed (red).
Bob
Bob:
It wouldn't be surprising to see Health check tests failing in a nested environment - they are not designed for this structure, and thus the tests' 'pass' or 'fail' thresholds are more likely to be crossed.
Given that there is hardly any load and it's such a small environment, I'm still a bit surprised. I checked esxtop on the physical host to see what was happening with the network traffic, and I see that the ports for the nested hosts have high %DRPRX. I am using a modified version of William Lam's nested ESXi appliance in conjunction with the MAC learning filter, but this doesn't seem to be working the way it should. NFS and iSCSI on the same pHost seem to work fine, and the vSAN seems OK as near as I can tell.
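For reference, the MAC learning filter is attached per William Lam's fling write-up via VMX entries on each nested host's vNICs, e.g. (filter slot 4 as in his post; one pair per ethernetN):

ethernet0.filter4.name = "dvfilter-maclearn"
ethernet0.filter4.onFailure = "failOpen"

and whether it attached can be confirmed on the physical host with:

summarize-dvfilter | grep -i maclearn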
Is cluster formation and communication functional and the cluster stable?
vSAN health looks OK other than "Data > Virtual SAN object health", which shows a white "i" in a blue circle rather than a green/yellow/red icon.
Check the cluster membership revision count in #esxcli vsan cluster get; if it is not incrementing and the cluster is formed, then it is probably fine. Also check that you can create VMs with a Storage Policy applied, etc.
Output from "esxcli vsan cluster get" looks OK (Health State: HEALTHY; Membership Entry Revision and Member Count have the same values on all four nodes). I have been able to clone a virtual machine twice with the default vSAN policy, and then apply an FTT=1 RAID-5 policy to one VM and an FTT=1 RAID-1 policy to the other. The initial clone of a 16 GB VM took more than an hour, which, needless to say, is alarming considering the vSAN cluster is all-SSD. ESXTOP tells the story:
Nested ESXi03:
12:56:42am up 15:22, 546 worlds, 0 VMs, 0 vCPUs; CPU load average: 0.04, 0.04, 0.03
ADAPTR  PATH NPTH  CMDS/s READS/s WRITES/s MBREAD/s MBWRTN/s DAVG/cmd KAVG/cmd GAVG/cmd QAVG/cmd
vmhba0     -    0    0.00    0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
vmhba1     -    4  113.55   73.81    39.74     0.29     2.48   394.55     4.21   398.75     6.28
vmhba32    -    1    0.00    0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
vmhba33    -    6    0.00    0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
Nested ESXi04:
12:55:44am up 15:09, 544 worlds, 0 VMs, 0 vCPUs; CPU load average: 0.04, 0.03, 0.03
ADAPTR  PATH NPTH  CMDS/s READS/s WRITES/s MBREAD/s MBWRTN/s DAVG/cmd KAVG/cmd GAVG/cmd QAVG/cmd
vmhba0     -    0    0.00    0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
vmhba1     -    4   79.35    0.00    79.35     0.00     0.31   328.38     0.01   328.39     0.00
vmhba32    -    1    0.00    0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
vmhba33    -    6    0.00    0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
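For anyone reproducing this, the view above is esxtop's disk adapter screen (latency columns in milliseconds):

esxtop   # then press 'd' for the disk adapter view
# DAVG/cmd = device (driver/array) latency, KAVG/cmd = time in the VMkernel,
# GAVG/cmd ~= DAVG + KAVG = the latency the guest actually sees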
Obviously there is something seriously wrong. Even though this is a consumer-grade SSD (Samsung 850 in an HP ProLiant DL360 Gen8), I wouldn't expect anything this bad.
I recall the other issue you referenced; I think that one was actually a broken Health test, and that KB walks through the method of spoofing it so it can be 'all-green' again. I don't think the results of the tests you showed are indicative of this, as you got some warning (yellow) results, not just failed (red).
Grasping at straws...
I might have been expecting too much out of nested ESXi, but this really takes the cake. I did the HOL-1708-SDC-1 lab recently and don't remember it being this bad. Mind you, the virtual machine was tiny (500 MB or so). That being said, I just want to make sure I haven't missed anything obvious.
Hello,
That is a very high DAVG, and I wouldn't expect much performance with that slow a response time.
I don't know whether this is expected with nested home labs (I just use HOL or physical labs), so maybe a home-labber can weigh in here.
Do you have separate SSDs (or partitions) backing the cache and capacity tier drives?
What driver and version are used for the local disk connection on the host?
Bob
That is a very high DAVG, and I wouldn't expect much performance with that slow a response time.
Yeah. This is worse than the spinning disks (3 x HP SAS 10k, RAID5) by orders of magnitude.
Do you have separate SSDs (or partitions) backing the cache and capacity tier drives?
All of the nested ESXi hosts and their VMDKs are provisioned on the same VMFS datastore on the Samsung 850 EVO.
What driver and version are used for the local disk connection on the host?
I'm using HP's OEM ESXi:
[root@TOPELHhost01:~] esxcfg-scsidevs -a
vmhba1 hpvsa link-n/a sata.vmhba1 (0000:00:1f.2) Intel Corporation HP Dynamic Smart Array B120i RAID controller
vmhba2 hpsa link-n/a sas.50014380278cf830 (0000:03:00.0) Hewlett Packard Enterprise Smart Array P420
[root@TOPELHhost01:~] vmkchdev -l | grep -i -e vmhba2
0000:03:00.0 103c:323b 103c:3351 vmkernel vmhba2
[root@TOPELHhost01:~] vmkload_mod -s hpsa
vmkload_mod module information
input file: /usr/lib/vmware/vmkmod/hpsa
License: GPL
Version: Version 6.0.0.124-1OEM, Build: 2494585, Interface: 9.2 Built on: Nov 16 2016
Build Type: release
Required name-spaces:
  com.vmware.driverAPI#9.2.3.0
  com.vmware.vmkapi#v2_3_0_0
Parameters:
  heap_max: int
    Maximum attainable heap size for the driver.
  heap_initial: int
    Initial heap size allocated for the driver.
  reply_queues: int
    Specify desired number of reply queues. 1-16, default is 4, not to exceed number of online CPUs.
  hpsa_simple_mode: int
    Use 'simple mode' rather than 'performant mode'
  hpsa_allow_any: int
    Allow hpsa driver to access unknown HPE Smart Array hardware
[root@TOPELHhost01:~] vmware -v
VMware ESXi 6.0.0 build-5050593
[root@TOPELHhost01:~]
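For completeness, the installed driver VIB version can also be checked directly (a sketch; on HPE images the hpsa driver VIB is typically named scsi-hpsa, so verify the name on your image):

esxcli software vib list | grep -i hpsa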