Re: Poor All Flash vSAN Performance

wreedMH · ‎11-22-2017

Hello,

I have built an all-flash vSAN at home, but my performance is lower than expected. I am seeing 35k IOPS when running the Proactive Test - Stress Test.

The hosts are:

3x Dell PE R720 2x 2690 Xeon 192GB RAM

LSI 9207-8i

Brocade 1020 NICs connected to Brocade ICX 7450 10GbE

2 disk groups in each host: 2x Intel S3700 400GB (cache) and 6x S3500 480GB (capacity)

ESXi 6.5 U1 loaded from Dell ISO

Does this seem right to you guys? At work I built a similar 4 node cluster on Lenovo nodes and I see 125k IOPS using the same test. Only difference is my vSAN at home only has about 2TB free of 8TB, while the cluster at work has nothing on it. I understand I have VMs running at home, but not 90k worth of I/O!

Any advice or input is appreciated.

mprazeres183 · ‎11-22-2017

Hi wreedMH

Can you please go to:

Cluster - Monitor - vSAN - Health

Do you have any red or yellow warnings? If so, can you please send me all of them?

Then, what is the connection you are using between your 3 hosts? You write 10GbE, but how many NICs are you using for the vSAN VMkernel adapter?

To be clear, you can run vMotion, vSAN and Management as 3 services on the same VMkernel adapter, or separate them onto different VMkernel adapters.

In our environment we use it this way:

We have a total of 8 Uplinks on our vDS Switch.

I use 2 VMkernel adapters with the vSAN service enabled, on 2x 10GbE per host.

With this setup, if you SSH (PuTTY) into one of the 3 hosts and run esxcli vsan cluster unicastagent list, you get 2 IPs per host for the vSAN communication.

Our connection was also very poor, but that was because I ran all 3 services on only 1 VMkernel adapter, and that didn't help!

Try that, and make sure you use 10GbE all the way and have at least 4 uplinks for a fair speed, so that you can separate VM Network, vMotion and vSAN.

And send me those Red and Yellow states.

Best regards,
Marco

Check my blog, and if my answer resolved the issue, please provide feedback. Marco Frias - VMware is my World www.vmtn.blog

TheBobkin · ‎11-22-2017

Hello wreedMH,

While it is very nice to have a decent home set-up such as that, are you comparing like-for-like hardware here?

Going from 3 to 4 nodes is a big enough difference to start with.

Is the Lenovo cluster hooked up to something nearer $10k worth of switch, and/or two switches, and over how many links?

The LSI SAS 9207-8i has a *relatively* low queue depth compared to most newer controllers and is not supported for vSAN 6.5, so I can't say whether the driver/firmware holds up performance-wise (I never checked whether it was deemed unsupportable on 6.5 for performance/compatibility reasons or was EOL'd by LSI). I'm going to assume you have it correctly in pass-through mode for the disks.

Try comparing performance tests with different IO profiles. Your home lab may only be far behind in certain areas.

Make sure all drivers and firmware are up to scratch.
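
On the 3 versus 4 node point, a rough way to put the two clusters on the same footing is to divide the quoted totals by node count. This is only a sketch: it assumes IOPS scale linearly with the number of nodes, which real vSAN clusters will not do exactly, and it uses only the 35k and 125k figures from the first post.

```python
# Total IOPS and node counts as quoted in the first post of this thread.
clusters = {
    "home (3x R720)": {"iops": 35_000, "nodes": 3},
    "work (4x Lenovo)": {"iops": 125_000, "nodes": 4},
}


def iops_per_node(iops: int, nodes: int) -> float:
    """Naive per-node figure; assumes IOPS scale linearly with node count."""
    return iops / nodes


per_node = {name: iops_per_node(**c) for name, c in clusters.items()}
for name, value in per_node.items():
    print(f"{name}: {value:,.0f} IOPS per node")

# How far apart the clusters still are after normalising by node count.
ratio = per_node["work (4x Lenovo)"] / per_node["home (3x R720)"]
print(f"work/home per-node ratio: {ratio:.2f}x")
```

Even per node the work cluster comes out well ahead, so the extra node on its own does not account for the gap.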

If you want to get deeper into comparing the performance of any cluster more granularly I would advise using vSAN Observer:

https://kb.vmware.com/s/article/2064240

Or a 3rd-party alternative (don't mind the URL, the link is safe for work :D):

http://www.sexigraf.fr/

Hope this helps

Bob

TheBobkin · ‎11-22-2017

Hello mprazeres183

I wouldn't advise multiple vmknics per vSAN as first choice, there are caveats to configuring it this way - also redundancy/availability can be configured better lower in the stack (NIC):

"Virtual SAN does not support multiple VMkernel adapters on the same subnet. You can use multiple VMkernel adapters on different subnets, such as another VLAN or separate physical fabric. Providing availability by using several VMkernel adapters has configuration costs including vSphere and the network infrastructure. Network availability by teaming physical network adapters is easier to achieve with less setup."

https://docs.vmware.com/en/VMware-vSphere/6.0/com.vmware.vsphere.virtualsan.doc/GUID-031F9637-EE29-4...

Bob

wreedMH · ‎12-02-2017

I just found something very interesting when rebooting one of my nodes. My S3700 cache drives have negotiated to 3.0Gbps. What gives?
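
For a sense of what a 3.0Gbps link costs compared with 6.0Gbps, here is a back-of-the-envelope sketch. It relies on SATA's 8b/10b line encoding (10 bits on the wire per 8 bits of data). The 4 KiB I/O size is an assumption for illustration only, and the result is a ceiling for a single link, not a prediction of what the drive will actually deliver.

```python
def sata_payload_mb_per_s(line_rate_gbps: float) -> float:
    """Usable payload rate in MB/s for a SATA link with 8b/10b encoding."""
    payload_bits_per_s = line_rate_gbps * 1e9 * 8 / 10  # 10 line bits carry 8 data bits
    return payload_bits_per_s / 8 / 1e6  # bits -> bytes -> MB


def iops_ceiling(line_rate_gbps: float, io_size_bytes: int) -> float:
    """Upper bound on IOPS one link can carry at a given I/O size."""
    return sata_payload_mb_per_s(line_rate_gbps) * 1e6 / io_size_bytes


IO_SIZE = 4096  # assumed 4 KiB I/O, not taken from the thread

for rate in (3.0, 6.0):
    print(
        f"{rate} Gbps link: {sata_payload_mb_per_s(rate):.0f} MB/s payload, "
        f"about {iops_ceiling(rate, IO_SIZE):,.0f} IOPS at 4 KiB"
    )
```

So a cache drive that drops from 6.0 to 3.0Gbps has half the link bandwidth to work with, roughly 300 MB/s instead of 600 MB/s.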

wreedMH · ‎12-02-2017

I have Sexigraf running on the cluster at home. The servers at work do have a dedicated 10GbE link for vSAN, while the ones at my house do not, but I am not running those links anywhere near capacity and vSAN is free to use them all when it needs to.
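
As a rough sanity check on the link capacity point, the sketch below converts an IOPS figure into payload bandwidth. The 4 KiB I/O size is an assumption (the thread does not say what the stress test uses), and the estimate ignores replication traffic and protocol overhead, so treat it as a lower bound on what crosses the wire.

```python
LINK_GBPS = 10.0   # one 10GbE link
IO_SIZE = 4096     # assumed 4 KiB I/O, not stated in the thread


def payload_gbps(iops: int, io_size_bytes: int) -> float:
    """Payload bandwidth in Gbit/s for a given IOPS rate and I/O size."""
    return iops * io_size_bytes * 8 / 1e9


for label, iops in (("home", 35_000), ("work", 125_000)):
    gbps = payload_gbps(iops, IO_SIZE)
    print(f"{label}: {iops:,} IOPS is about {gbps:.2f} Gbit/s "
          f"({gbps / LINK_GBPS:.0%} of one 10GbE link)")
```

Under those assumptions neither figure comes close to filling a single 10GbE link, though larger I/O sizes would change the picture quickly.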

The switch at work is a Brocade Fabric VDX, while mine at home is a Brocade ICX. A little different, but I definitely don't have a D-Link at home. The ICX is an enterprise-grade switch that can be used at the distribution or access layer. It has lots of capacity.

srodenburg · ‎12-03-2017

What Firmware version are your LSI 9207-8i cards on? By the look of that screenshot, that is not FW19 but something newer. I had major issues with FW20 and went back to FW19. Now all is good again. I had shitty performance and disks dropping out at random, especially at heavy load.

I don't know which donkey put FW20 on the HCL, but it should be removed.

srodenburg · ‎12-03-2017

About switches: I don't know your Brocade, but in general, switches with small per-port buffers will cause issues when going full-monty with NFS, iSCSI, FCoE, vSAN, etc.

Storage over IP is very demanding, especially under prolonged load.

wreedMH · ‎12-03-2017

I am running FW 20. I will go back to FW 19 and see if that helps. I am still concerned that my cache S3700s are showing up as 3Gbps drives...

srodenburg · ‎12-03-2017

"S3700s are showing up as 3Gbps drives"

I had HGST 600GB 10k 6G SAS drives do that. With FW19, they are on 6 again.

wreedMH · ‎12-03-2017

OK, I just downloaded FW19. I'll flash all three of my cards today. I am just running the inbox driver, is that OK?

srodenburg · ‎12-04-2017

yes that's ok. I use the standard 6.5 U1 driver too.

wreedMH · ‎12-05-2017

Went to FW 19 on all the LSI 9207 controllers. Performance was worse. I am now back on FW 20.00.07.00 on the 9207s.

Take a look at my NAA latency. Don't these seem really high for Intel DC SSDs?

wreedMH · ‎12-06-2017

For anyone else seeing this problem, it was the lsiprovider VIB I had installed to manage the LSI controllers. I found another thread talking about that, so I removed it, and the latencies are back under 1ms.

srodenburg · ‎12-14-2017

FYI: VMware has removed FW20 from the HCL and is back at FW19. I guess I was not the only one with problems.

srodenburg · ‎12-14-2017

Can you provide a link to that thread?

I use the LSIProvider for 6.5 and have no performance issues :smileyconfused:

wreedMH · ‎12-14-2017

Also, Google "lsiprovider latency". There are lots of articles.

Since it didn't give me the info I needed, I just removed it.

LSI SAS 2308 terrible write performance in RAID 1, ESXi 5.5 u1 + updates

hkg2581 · ‎12-18-2017

@wreedMH

Can you please start your performance tests and also start the Observer by logging into the RVC console on your vCenter Server, then stop the Observer once done? To isolate networking issues, we need to see whether you have any out-of-order (OOO) packets and retransmits under the networking section. For how to log in to RVC, see "How to log into RVC or Ruby vSphere Console for vSAN?". The commands to start Observer on the vCSA and on Windows vCenter are below. Adjust --max-runtime, which is in hours, as needed; the example below runs for one hour.

Appliance:

vsan.observer . --run-webserver --force --generate-html-bundle /tmp --interval 30 --max-runtime 1

Windows:

vsan.observer . --run-webserver --force --generate-html-bundle c:\\temp --interval 30 --max-runtime 1
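
If you want to vary the run length without retyping the whole line, a small helper like the sketch below can assemble the same command from the flags shown above. The function names are hypothetical, it only builds the text to paste into RVC, and it assumes --interval is in seconds and --max-runtime is in hours.

```python
def observer_command(bundle_dir: str, max_runtime_hours: int = 1,
                     interval_s: int = 30, target: str = ".") -> str:
    """Build the vsan.observer line using only the flags quoted in this thread."""
    return (
        f"vsan.observer {target} --run-webserver --force "
        f"--generate-html-bundle {bundle_dir} "
        f"--interval {interval_s} --max-runtime {max_runtime_hours}"
    )


def sample_count(max_runtime_hours: int, interval_s: int) -> int:
    """Samples collected in a full run, assuming interval is in seconds."""
    return max_runtime_hours * 3600 // interval_s


# Reproduces the appliance example above: one hour at a 30 second interval.
print(observer_command("/tmp"))
print(sample_count(1, 30), "samples over the run")
```

A longer run or a shorter interval gives finer detail, at the cost of a bigger bundle to share.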

Please let me know if you can share the ESXi vm-support bundle and the Observer bundle, so I can take a look at them offline.

Regards,

Hareesh K G

Thanks, Hareesh K G Personal Blog : http://virtuallysensei.com

wreedMH · ‎12-26-2017

Hkg,

I do have some out-of-order packets and retransmits in my networking section. The cluster is currently doing a rebalance. I will post a full Observer bundle when I can.

I am running Brocade 1020 CNA 10GbE NICs.

wreedMH · ‎12-29-2017

Hkg,

I have some logs for you to analyze. How do I get them to you?