I have built an All-Flash vSAN at home, but performance is less than expected. I am seeing 35k IOPS when running a Proactive Test - Stress Test.
The hosts are:
3x Dell PE R720, 2x Xeon 2690, 192GB RAM
Brocade 1020 NICs connected to a Brocade ICX 7450 10GbE switch
2 disk groups in each host: 2x Intel S3700 400GB and 6x S3500 480GB
ESXi 6.5 U1 loaded from Dell ISO
Does this seem right to you guys? At work I built a similar 4-node cluster on Lenovo nodes and I see 125k IOPS using the same test. The only difference is my vSAN at home only has about 2TB free of 8TB, while the cluster at work has nothing on it. I understand I have VMs running at home, but not 90k IOPS worth of I/O!
Any advice or input is appreciated.
Can you please go to:
Cluster > Monitor > vSAN > Health
Do you have any RED or YELLOW statements? If so, can you please send me all of them?
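If you prefer the command line, a health summary can also be pulled from any host over SSH. A sketch, assuming ESXi 6.5 with the vSAN health service installed (the exact subcommands vary by build, so verify with esxcli vsan health first):

```
# Illustrative only - confirm availability on your build first.
# List the overall vSAN health check results:
esxcli vsan health cluster list
# Drill into a single check group, e.g. the network checks:
esxcli vsan health cluster get -t "Network"
```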
Then, what is the connection between your 3 hosts? You write 10GbE, but how many NICs are you using for the vSAN VMkernel?
To be clear, you can use one VMkernel adapter for vMotion, vSAN and Management, with all 3 services on the same adapter, or you can separate them into different VMkernel adapters.
In our environment we use it this way:
We have a total of 8 Uplinks on our vDS Switch.
I use 2 VMkernel adapters with the vSAN service enabled, at 2x 10GbE per host.
When you do this, running esxcli vsan cluster unicastagent list (via PuTTY on any of the 3 hosts) shows 2 IPs per host for the vSAN communication.
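To illustrate what that command returns (the addresses below are made up, and the exact columns depend on the vSAN build):

```
# Lists the vSAN unicast peers known to this host:
esxcli vsan cluster unicastagent list
# With 2 vSAN-tagged VMkernel adapters per host you should see
# two IP entries per remote node, e.g. (illustrative):
#   10.10.10.12:12321
#   10.10.20.12:12321
```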
Our performance was also very poor at first, because I had all 3 services on only one VMkernel adapter, and that didn't help!
Try that, and make sure you use 10GbE all the way and have at least 4 uplinks for a fair speed, so that you can separate VM Network, vMotion and vSAN.
And send me those Red and Yellow states.
While it is very nice to have a decent home setup such as that, are you comparing like-for-like hardware here?
3 to 4 node is a big enough difference to start.
Is the Lenovo cluster hooked to nearer $10k worth of switch and/or two switches and over how many links?
The LSI SAS 9207-8i has a *relatively* low queue depth compared to most newer controllers and is not supported for vSAN 6.5, so I can't say whether the driver/firmware holds up performance-wise (I never checked whether it was deemed unsupportable on 6.5 for performance/compatibility reasons or was EOL'd by LSI). I'm going to assume you have it correctly in pass-through mode for the disks.
Try comparing performance tests with different I/O profiles - potentially your home lab is only far behind in certain areas.
Make sure all drivers and firmware are up to scratch.
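A few commands that can help verify this over SSH (a sketch; the driver name mpt2sas and the uplink vmnic0 are example names - substitute your own):

```
# Which driver backs each storage adapter:
esxcli storage core adapter list
# Version of the loaded storage driver (mpt2sas is an example name):
vmkload_mod -s mpt2sas | grep -i version
# NIC driver and firmware versions for a given uplink:
esxcli network nic get -n vmnic0
```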
If you want to get deeper into comparing the performance of any cluster more granularly I would advise using vSAN Observer:
Or 3rd party alternative: (don't mind the URL, link is safe for work :smileygrin: )
Hope this helps
I wouldn't advise multiple vmknics for vSAN as a first choice; there are caveats to configuring it this way, and redundancy/availability can be configured better lower in the stack (NIC teaming):
"Virtual SAN does not support multiple VMkernel adapters on the same subnet. You can use multiple VMkernel adapters on different subnets, such as another VLAN or separate physical fabric. Providing availability by using several VMkernel adapters has configuration costs including vSphere and the network infrastructure. Network availability by teaming physical network adapters is easier to achieve with less setup."
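In practice that means one vSAN-tagged vmknic backed by a NIC team. A minimal sketch on a standard vSwitch (vSwitch1, vmk2, and vmnic2/vmnic3 are assumed names for illustration only):

```
# Active/standby uplink teaming at the vSwitch level:
esxcli network vswitch standard policy failover set -v vSwitch1 \
    --active-uplinks vmnic2 --standby-uplinks vmnic3
# Tag a single VMkernel adapter for vSAN traffic:
esxcli vsan network ip add -i vmk2
```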
I have Sexigraf running on the cluster at home. The servers at work do have a dedicated 10GbE link for vSAN, while my house does not, but I am not running those links anywhere near capacity, and vSAN is free to use them all when it needs to.
The switch at work is a Brocade VDX fabric switch, while mine at my house is a Brocade ICX. A little different, but I definitely don't have a D-Link at home. The ICX is an enterprise-grade switch that can be used at the distribution or access layer. It has plenty of capacity.
What firmware version are your LSI 9207-8i cards on? By the look of that screenshot, that is not FW19 but something newer. I had major issues with FW20 and went back to FW19; now all is good again. I had terrible performance and disks dropping out at random, especially under heavy load.
I don't know which donkey put FW20 on the HCL, but it should be removed.
About switches: I don't know your Brocade, but in general, switches with small per-port buffers will give issues when pushed hard with NFS, iSCSI, FCoE, vSAN, etc.
Storage over IP is very demanding, especially under prolonged load.
For anyone else seeing this problem: it was the lsiprovider VIB I had installed to manage the LSI controllers. I found another thread discussing it, removed the VIB, and the latencies are back under 1ms.
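For reference, the removal can be done like this (confirm the exact VIB name from the list output first; lsiprovider is the name as reported in this thread):

```
# Find the management provider VIB:
esxcli software vib list | grep -i lsi
# Remove it, then reboot the host:
esxcli software vib remove -n lsiprovider
```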
Can you please start your performance tests, and also start the Observer by logging into your vCenter Server RVC console (stop the Observer once done). If you need to isolate networking issues, we need to see whether you have any out-of-order (OOO) packets and re-transmits under the networking section. See "How to log into RVC or Ruby vSphere Console for vSAN?" for how to log into RVC; the commands to start Observer for vCSA and Windows vCenter are below. Adjust the max runtime in hours - the examples below run for one hour.
vCSA: vsan.observer . --run-webserver --force --generate-html-bundle /tmp --interval 30 --max-runtime 1
Windows vCenter: vsan.observer . --run-webserver --force --generate-html-bundle c:\temp --interval 30 --max-runtime 1
Please let me know if you can share the ESXi vm-support bundle and the Observer bundle; I can take a look at them offline.
Hareesh K G