Our vSAN node config:
- Dell R730xd
- 768GB RAM
- PERC H730
- capacity tier: 21 x 1.2TB SAS disks
- caching tier: 3 x SanDisk SX350-3200
- dedicated 10GbE NIC for vSAN
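As a quick sanity check on the sizing (assuming the SX350-3200 is the 3.2TB model; VMware's hybrid guidance is flash at roughly 10% of anticipated consumed capacity), the cache-to-capacity ratio works out comfortably:

```shell
# Cache-to-capacity ratio of the node config above.
# Assumption: SanDisk SX350-3200 = 3.2TB of flash per device.
awk 'BEGIN {
  capacity_tb = 21 * 1.2                 # 21 x 1.2TB SAS disks
  cache_tb    = 3 * 3.2                  # 3 x SX350-3200
  printf "capacity %.1f TB, cache %.1f TB, ratio %.0f%%\n",
         capacity_tb, cache_tb, 100 * cache_tb / capacity_tb
}'
```

At roughly 38% flash against raw capacity, the cache tier is well above the guideline, so undersized cache is unlikely to be the bottleneck here.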
We run into issues when driving a large number of VMs on our cluster, each of which generates a lot of IOPS (Iometer-style load).
Issues we are facing:
- hosts become disconnected from vCenter
- .vmx files become corrupt
This does not occur if we use only one load-generating VM per host. In that case, these are our observations:
Read % | Write % | Block Size KB | Outstanding IO | File Size GB | # Workers | Random % | Sequential % | FTT | # Components | Read Reservation % | Total IOPS | Total MB/s | Avg Lat ms |
45 | 55 | 4 | 5 | 10 | 4 | 95 | 5 | 1 | 6 | 0 | 8,092 | 33.15 | 0.60 |
45 | 55 | 4 | 5 | 10 | 4 | 95 | 5 | 1 | 12 | 0 | 10,821 | 44.33 | 0.46 |
45 | 55 | 4 | 5 | 10 | 4 | 95 | 5 | 1 | 6 | 100 | 12,611 | 51.65 | 0.39 |
45 | 55 | 4 | 5 | 10 | 4 | 95 | 5 | 1 | 12 | 100 | 11,374 | 46.59 | 0.43 |
45 | 55 | 4 | 64 | 10 | 4 | 95 | 5 | 1 | 12 | 100 | 29,746 | 121.8 | 2.15 |
100 | 0 | 4 | 64 | 10 | 4 | 100 | 0 | 1 | 12 | 0 | 50,576 | 207.1 | 1.26 |
100 | 0 | 4 | 64 | 10 | 4 | 100 | 0 | 1 | 12 | 100 | 50,571 | 207.1 | 1.26 |
100 | 0 | 0.512 | 64 | 10 | 4 | 0 | 100 | 0 | 1 | 100 | 67,330 | 34.47 | 3.80 |
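The IOPS and MB/s columns in the table above are internally consistent if block sizes are read as KiB-sized blocks and MB/s as decimal megabytes. A quick cross-check:

```shell
# Cross-check: MB/s ~= IOPS x block size (bytes) / 1e6
awk 'BEGIN {
  printf "%.2f MB/s\n", 8092  * 4096 / 1e6   # 4K row,   table: ~33.15
  printf "%.2f MB/s\n", 50576 * 4096 / 1e6   # read row, table: ~207.1
  printf "%.2f MB/s\n", 67330 * 512  / 1e6   # 512B row, table: ~34.47
}'
```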
So these numbers look pretty good, don't they?
From our point of view it would be OK if latency went up as we increase the load with more VMs. But as laid out above, our cluster broke after we increased the number of VMs past a certain point.
Currently we're testing with a subset of four vSAN nodes with the config mentioned above. In this cluster we were able to power on 187 of the load-gen machines before three of the four hosts went into a "not responding" state.
So we wonder if somebody else has done extensive performance testing. If so, we would be very interested in your observations. Maybe this will help us find the flaw in our setup.
Regards,
daniel
With VMware Support we found out that the issues we are facing are caused by the load-gen VMs: each VM generates 2,048 outstanding I/Os, so with 187 VMs on the cluster we get an unrealistic number of outstanding I/Os (382,976 in total; if I remember the Observer graphs correctly, our cluster could throttle that down to about 13,000 per host), which we will never see in real-world scenarios. vSAN 6.1 has problems with this many outstanding I/Os and VMware is working on a solution. Maybe it is resolved in 6.2 with QoS?
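The outstanding-I/O arithmetic from the support case, spelled out:

```shell
# Cluster-wide outstanding I/O generated by the load-gen VMs.
vms=187
oio_per_vm=2048
total=$(( vms * oio_per_vm ))
per_host=$(( total / 4 ))      # spread over the 4-node test cluster
echo "$total outstanding I/Os cluster-wide, ~$per_host per host"
```

Roughly 95,000 outstanding I/Os per host is orders of magnitude beyond what a real workload mix would queue against a single node.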
Some other changes:
- newer driver for our Intel X710 network cards
- newer driver and firmware (beta from Dell) for our PERC H730 controller
and some advanced parameters:
esxcfg-advcfg -s 100000 /LSOM/diskIoTimeout
esxcfg-advcfg -s 4 /LSOM/diskIoRetryFactor
esxcfg-advcfg -s 2047 /LSOM/heapSize (only with this parameter were we able to create 3 disk groups - each with 7 magnetic disks and 1 flash device - per host)
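For reference, these parameters have to be set on every host, and `esxcfg-advcfg -g` reads a value back, which is a quick way to verify the change took. A sketch, to be run on each ESXi host (values as worked out with VMware Support):

```shell
# Set the LSOM advanced parameters on an ESXi host and read them back.
esxcfg-advcfg -s 100000 /LSOM/diskIoTimeout
esxcfg-advcfg -s 4      /LSOM/diskIoRetryFactor
esxcfg-advcfg -s 2047   /LSOM/heapSize   # needed for 3 disk groups per host

# Verify each setting.
for p in diskIoTimeout diskIoRetryFactor heapSize; do
  esxcfg-advcfg -g /LSOM/$p
done
```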
Good morning. If I understand correctly, everything looks good in testing with Iometer, but under real-world load it falls over. I would be interested to see what you experience with a different storage card. I think I would stay away from any PERC card and go with LSI. Thank you, Zach.
It seems that Dell only supports PERC controllers in the R730xd. Here is what the datasheet of the server lists:
Internal:
PERC H330
PERC H730
PERC H730P
External:
PERC H830
The only LSI controller listed on the vSAN HCL for Dell servers is the LSI SAS 9207-8i.
I don't know if there are technical restrictions preventing this controller from being listed in the datasheet of the R730xd, or if Dell only wants to sell its own controllers.
Regards,
daniel
Good morning. I was not aware the LSI was not available in the R730xd; we're running them in R720s. I forgot to ask whether this is production. If it is, I'm not sure what you could try. If it's not, I would just try the LSI and see what happens. Thank you, Zach.
Hi Zach,
because of all the trouble we are still not in production.
So we will see if we can get these LSI controllers and find out what happens.
Regards,
daniel
I run a 6-node cluster, completely HCL compliant (LSI 9207-8i). Each node has Sandy Bridge CPUs, plenty of RAM for the size of the environment, and SAS HDDs and SSDs.
One disk group per node with 4 x 600GB 10k SAS drives and a 200GB SSD, all 6G SAS.
Network: Fast, very low latency 10G.
Policy: FTT = 1 , SW = 1
During normal operation, the environment feels snappy and responsive. I'm really happy with the performance.
Until something goes wrong...
I lost one node once (in vSAN 6.1 days), triggering a rebuild of its data on the other nodes. Capacity before the node crashed was around 50%, so there was plenty of space for the rebuild.
During the rebuild, the latencies seen by the VMs skyrocketed and one could simply go home. The VMs became so sluggish they were basically unworkable. It took many hours for the data to be rebuilt.
You basically get into a tail-chasing situation: the rebuild causes latencies to go boom, which in turn slows down the rebuild, which in turn makes the latencies worse, etc.
The lesson I learned is: do NOT build single-disk-group hybrid nodes, meaning 1 SSD with a bunch of spindles. Either go all-flash (which I cannot afford) or, when going hybrid, use smaller SSDs with fewer spindles per disk group. The more you can divide the data over disk groups, the lower the impact during a rebuild, as the system can run jobs in parallel.
Rebuilds are very ugly, at least in smaller hybrid environments with single disk groups per node. You need to be patient: send the folks home and wait it out. Don't touch it while it churns away, because it will byte you in the arse (pun intended).
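To put "many hours" in perspective, a back-of-the-envelope estimate for one failed node in a setup like the one above (the resync rate here is an illustrative assumption, not a measurement):

```shell
# Rough rebuild-time estimate for one failed hybrid node.
# Assumption: ~100 MB/s effective resync throughput (illustrative only).
awk 'BEGIN {
  capacity_gb = 4 * 600              # one disk group: 4 x 600GB SAS
  used_gb     = capacity_gb * 0.5    # ~50% full before the crash
  rate_mbs    = 100                  # assumed effective resync rate, MB/s
  hours = used_gb * 1000 / rate_mbs / 3600
  printf "~%.1f hours to resync %.0f GB\n", hours, used_gb
}'
```

With multiple disk groups per node, the same amount of data resyncs across more spindles and cache devices in parallel, which shrinks that window and its latency impact.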
We planned several tests including rebuilds. So, once our cluster is stable under load, we'll execute the rebuild tests and I'll let you know...