VMware Cloud Community
pfuhli
Enthusiast

VSAN performance testing: What's the behavior of your cluster under real stress?

Our VSAN node config:

- Dell R730xd

- 768 GB RAM

- PERC H730 controller

- capacity tier: 21 x 1.2 TB SAS disks

- caching tier: 3 x SanDisk SX350-3200

- dedicated 10 GbE NIC for VSAN
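
(For reference, a quick way to double-check on each host how the cache and capacity devices were actually claimed is the vSAN esxcli namespace; this is just a sketch and the grep fields are abbreviated.)

esxcli vsan storage list | grep -E "Device|Is SSD|Disk Group"   # which devices are claimed and into which disk group

vdq -q   # which local devices are eligible/ineligible for VSAN and why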

We run into issues when we drive a large number of VMs on the cluster, each of which generates a lot of IOPS (an IOmeter-like workload).

Issues we are facing:

- hosts become disconnected from vCenter

- .vmx files become corrupted

This does not occur if we use only one load-generating VM per host. In that case these are our observations:

Read % | Write % | Block Size KB | Outstanding IO | File Size GB | # Workers | Random % | Sequential % | FTT | # Components | Read Reservation | Total IOPS | Total MB/s | Avg Lat ms
45  | 55 | 4     | 5  | 10 | 4 | 95  | 5   | 1 | 6  | 0   | 8092  | 33.15 | 0.60
45  | 55 | 4     | 5  | 10 | 4 | 95  | 5   | 1 | 12 | 0   | 10821 | 44.33 | 0.46
45  | 55 | 4     | 5  | 10 | 4 | 95  | 5   | 1 | 6  | 100 | 12611 | 51.65 | 0.39
45  | 55 | 4     | 5  | 10 | 4 | 95  | 5   | 1 | 12 | 100 | 11374 | 46.59 | 0.43
45  | 55 | 4     | 64 | 10 | 4 | 95  | 5   | 1 | 12 | 100 | 29746 | 121.8 | 2.15
100 | 0  | 4     | 64 | 10 | 4 | 100 | 0   | 1 | 12 | 0   | 50576 | 207.1 | 1.26
100 | 0  | 4     | 64 | 10 | 4 | 100 | 0   | 1 | 12 | 100 | 50571 | 207.1 | 1.26
100 | 0  | 0.512 | 64 | 10 | 4 | 0   | 100 | 0 | 1  | 100 | 67330 | 34.47 | 3.80
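
(For anyone who wants to reproduce a run like the first row: our load generator is IOmeter-like, but roughly the same profile can be sketched with fio inside a Linux guest on the vSAN datastore; the file path below is just a placeholder.)

# 45% read / 55% write, 4 KB blocks, 5 outstanding IOs, 10 GB file, 4 workers, 95% random
fio --name=vsan-mixed --filename=/data/testfile --size=10g \
    --bs=4k --rw=randrw --rwmixread=45 --percentage_random=95 \
    --iodepth=5 --numjobs=4 --ioengine=libaio --direct=1 \
    --time_based --runtime=300 --group_reporting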

So these numbers look pretty good, don't they?

From our point of view it would be acceptable if latency went up as we add more VMs to increase the load. But as outlined above, the cluster breaks once the number of VMs passes a certain point.

Currently we're testing with a subset of four VSAN nodes with the config mentioned above. In this cluster we were able to power on 187 of the load-gen machines before three of the four hosts went into a "not responding" state.

So we wonder whether somebody else has done extensive performance testing. If so, we would be very interested in your observations. Maybe that helps us find the flaw in our setup.

Regards,

daniel


Accepted Solutions
stephan87
Enthusiast

Together with VMware Support we found out that the issues we are facing are caused by the load-gen VMs: each VM generates 2048 outstanding IOs, so with 187 VMs on the cluster we had an unrealistic number of outstanding IOs (382,976 in total; if I remember the Observer graphs correctly, the cluster got it down to about 13,000 per host), which we will never see in real-world scenarios. VSAN 6.1 has problems with such a high number of outstanding IOs and VMware is working on a solution; maybe it is resolved in 6.2 with QoS?

Some other changes:

- newer driver for our Intel X710 network cards

- newer driver and firmware (a beta from Dell) for our PERC H730 controller

and some advanced parameters:

esxcfg-advcfg -s 100000 /LSOM/diskIoTimeout

esxcfg-advcfg -s 4 /LSOM/diskIoRetryFactor

esxcfg-advcfg -s 2047 /LSOM/heapSize (only with this parameter were we able to create three disk groups per host, each with 7 magnetic disks and 1 flash device)
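
(To verify that the values actually took effect on every host, and to roll them back later, you can read them back; a minimal sketch:)

esxcfg-advcfg -g /LSOM/diskIoTimeout
esxcfg-advcfg -g /LSOM/diskIoRetryFactor
esxcfg-advcfg -g /LSOM/heapSize

# the same options are also visible via esxcli
esxcli system settings advanced list -o /LSOM/heapSize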

zdickinson
Expert

Good morning. If I understand correctly, everything looks good when testing with IOmeter, but under real-world load it falls over. I would be interested to see what you experience with a different storage controller. I think I would stay away from any PERC card and go with LSI. Thank you, Zach.
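
(If you do compare controllers, it may be worth recording first which driver each host loads for the HBA; a sketch, the grep pattern is just an example.)

esxcli storage core adapter list   # lists the HBAs and the driver that claims each of them

esxcli software vib list | grep -iE "lsi|megaraid|mr3"   # installed driver VIB versions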

pfuhli
Enthusiast

It seems that Dell only supports PERC controllers in the R730xd. Here is what the server's datasheet lists:

Internal:
PERC H330
PERC H730
PERC H730P

External:
PERC H830

The only LSI controller listed on the VSAN HCL for Dell servers is the LSI SAS 9207-8i.

I don't know if there are technical restrictions preventing this controller from being listed in the R730xd datasheet, or if Dell just wants to sell its own controllers.

Regards,

daniel

zdickinson
Expert

Good morning. I was not aware that the LSI controller is not available in the R730xd; we're running them in R720s. I forgot to ask whether this is a production environment. If it is, I'm not sure what you could try. If it's not, I would just try the LSI and see what happens. Thank you, Zach.

pfuhli
Enthusiast

Hi Zach,

because of all the trouble we are still not in production.

We will see if we can get hold of these LSI controllers and report what happens.

Regards,

daniel

stephan87
Enthusiast

Attached are some messages from vmkernel.log captured after putting the cluster under stress. So far the only information we have from support is: "We managed to exhaust a different Memory Heap". Now we are waiting for an answer from engineering.
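
(The relevant lines can be pulled out of the live log on a stressed host like this; a sketch, the exact wording of the heap warnings may differ.)

grep -i "heap" /var/log/vmkernel.log | tail -n 50

grep -iE "lsom|plog" /var/log/vmkernel.log | tail -n 50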

srodenburg
Expert

I run a 6-node cluster, completely HCL compliant (LSI 9207-8i). Each node has Sandy Bridge CPUs, plenty of RAM for the size of the environment, and SAS HDDs and SSDs.

One disk group per node, with 4 x 600 GB 10k SAS drives and one 200 GB SSD, all 6 Gb SAS.

Network: fast, very low-latency 10 GbE.

Policy:  FTT = 1 , SW = 1

During normal operation, the environment feels snappy and responsive. I'm really happy with the performance.

Until something goes wrong...

I lost a node once (back in the VSAN 6.1 days), which triggered a rebuild of its data on the other nodes. Utilization before the node crashed was around 50%, so there was plenty of space for the rebuild.

During the rebuild, the latencies seen by the VMs skyrocketed and one could simply go home. The VMs became sluggish and basically unworkable. It took many hours for the data to be rebuilt.

You basically get into a tail-biting situation: the rebuild drives latencies up, which in turn slows down the rebuild, which in turn makes the latencies even worse, and so on.

The lesson I learned is: do NOT build single-disk-group hybrid nodes, meaning one SSD with a bunch of spindles. Either go all-flash (which I cannot afford) or, when going hybrid, use smaller SSDs with fewer spindles per disk group. The more you can divide the data over disk groups, the lower the impact during a rebuild, as the system can run jobs in parallel.

Rebuilds are very ugly, at least in smaller hybrid environments with a single disk group per node. You need to be patient: send the folks home and wait it out. Don't touch it while it churns away, because it will byte you in the arse (pun intended).
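
(If you want to watch a rebuild like that while it runs, the Ruby vSphere Console can show the resync progress; a sketch, the cluster path is a placeholder.)

vsan.resync_dashboard ~/computers/<cluster>   # bytes left to resync per object

vsan.disks_stats ~/computers/<cluster>        # per-disk usage and health during the rebuild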

pfuhli
Enthusiast

We have planned several tests, including rebuild scenarios. Once our cluster is stable under load we will execute the rebuild tests and I'll let you know...
