VMware Cloud Community
MtheG92
Enthusiast

2 vs 3 or more Disk Groups

I have been configuring vSAN all-flash, SAS-based systems with 2 disk groups by default since I started working with vSAN.

When it comes to the question of why I should use two disk groups, I usually argue with the doubled I/O paths available to the data streams and the additional write buffer capacity. This results in higher IOPS/throughput and lower latency overall.

I know that this does not double I/O performance, but according to some resources there can be an improvement of over 50% in write latency.

But how does this look with 3 or more disk groups? Is there a significant performance benefit (in numbers)?

A further question concerns the logical cache limitation of 600GB. As far as I know, this limit applies per disk group; is that correct?

Yes: in total, 3TB (5x 600GB) of write buffer cache can be provided per host with 5 disk groups. Source: https://blogs.vmware.com/virtualblocks/2019/10/01/write-buffer-sizing-vsan/
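
For reference, a quick back-of-the-envelope sketch of that arithmetic (assuming the 600GB per-disk-group write buffer cap; the device sizes are just examples):

# Usable vSAN write buffer per host, assuming the 600 GB per-disk-group cap.
PER_DISK_GROUP_CAP_GB = 600

def usable_write_buffer_gb(disk_groups, cache_device_gb=600):
    # Anything on the cache device above 600 GB is not used as write buffer.
    return disk_groups * min(cache_device_gb, PER_DISK_GROUP_CAP_GB)

print(usable_write_buffer_gb(2))   # 1200 GB with 2 disk groups
print(usable_write_buffer_gb(5))   # 3000 GB (3 TB) with 5 disk groups, as in the blog post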

Thanks in advance,

MtheG

4 Replies
TheBobkin
Champion

Hello MtheG92,

Welcome (back) to Communities.

"But how does this looks like with 3 or more disk groups? Is there a significant performance benefit (in numbers)?"

While it won't account for the performance improvements made in later versions (most significant being in 6.7 U3/7.0), this VMworld session has some decent comparisons of benchmarks with different numbers of DGs (Disk-Groups):

https://static.rainfocus.com/vmware/vmworldus17/sess/1489529911389001s06n/finalpresentationPDF/STO25...

https://videos.vmworld.com/global/2017/videoplayer/2582

Bear in mind that whether 2 vs 3/4 DGs will be beneficial or make a noticeable difference in performance may come down to whether you are pushing the system to its limits or not.

Another thing to consider is that if you, for instance, had 2 DGs, each composed of 1x600GB Cache-Tier + 4x2TB Capacity-Tier, adding a 3rd DG with the same layout may improve the overall node's/cluster's IOPS output, but if you are filling it to the same level as the others then from a VM perspective it won't necessarily be gainful (it is stronger, but it is lifting a heavier weight) - though, that being said, it may smooth performance out.

Where you are more likely to see a tangible performance improvement is if you split the same amount of capacity over more DGs, as you would then have a better usable:used cache:capacity ratio.
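
To put rough numbers on that ratio point, here is a quick sketch (the sizes are just the example layout above, not a recommendation):

# Cache:capacity ratio for the example layouts above (illustrative sizes only).
def cache_to_capacity_ratio(cache_total_gb, capacity_total_gb):
    return cache_total_gb / capacity_total_gb

# 2 DGs, each 1x600GB cache + 4x2TB capacity:
print(cache_to_capacity_ratio(2 * 600, 2 * 4 * 2000))   # ~0.075 (7.5%)
# A 3rd DG with the same layout keeps the ratio identical:
print(cache_to_capacity_ratio(3 * 600, 3 * 4 * 2000))   # ~0.075 (7.5%)
# The SAME 16TB of capacity split over 3 DGs instead improves the ratio:
print(cache_to_capacity_ratio(3 * 600, 16000))          # ~0.1125 (11.25%)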

"A further question is about the logical cache limitation of 600GB. This limitation is set on a per disk group level as far as I know, is that correct?"

Yes, but also consider that anything beyond 600GB will be used to extend the longevity of the device via wear-levelling (beyond what is already non-visible/reserved by the manufacturer for this purpose).

Is there any scope in this project for using NVMe Cache-Tier devices? I mention it because these can pull a lot of weight considering the smaller sizes generally used: a 375GB P4800X likely far outperforms 2x600GB of your average Class D/E write-intensive SSDs as Cache-Tier, so while they appear pricey they may be worth the investment.

Bob

MtheG92
Enthusiast

Hi Bob,

Thank you for your feedback.

"Bear in mind that whether 2 vs 3/4 DGs will be beneficial or make a noticeable difference in performance may come down to whether you are pushing the system to its limits or not."

"Another thing to consider is that if you, for instance, had 2 DGs, each composed of 1x600GB Cache-Tier + 4x2TB Capacity-Tier, adding a 3rd DG with the same layout may improve the overall node's/cluster's IOPS output, but if you are filling it to the same level as the others then from a VM perspective it won't necessarily be gainful (it is stronger, but it is lifting a heavier weight) - though, that being said, it may smooth performance out."

"Where you are more likely to see a tangible performance improvement is if you split the same amount of capacity over more DGs, as you would then have a better usable:used cache:capacity ratio."

I agree with you, and the VMworld 2017 video was also interesting.

"Is there any scope in this project for using NVMe Cache-Tier devices? I mention it because these can pull a lot of weight considering the smaller sizes generally used: a 375GB P4800X likely far outperforms 2x600GB of your average Class D/E write-intensive SSDs as Cache-Tier, so while they appear pricey they may be worth the investment."

Not yet, but in the next configuration I will definitely use 375GB NVMe SSDs for caching. What do you think about the destaging performance from NVMe cache to SAS capacity? I think there could be a potential problem with too much in-flight I/O (because the NVMe cache is faster and provides lower write latency) that cannot be destaged to the slower SAS-backed capacity tier as fast as new I/O hits the write buffer. Do you have any real-world experience with NVMe-based vSAN configurations?
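
To illustrate the concern, a simplified fill model of the write buffer (the rates below are made-up numbers, not measurements):

# Simplified write-buffer fill model: the buffer grows whenever the ingest rate
# (writes landing in the NVMe cache) exceeds the destage rate to SAS capacity.
# Both rates below are made-up numbers purely to illustrate the concern.
buffer_gb = 600.0
ingest_gb_s = 2.0     # sustained writes hitting the NVMe write buffer
destage_gb_s = 1.2    # what the SAS capacity tier can absorb

net_fill_gb_s = ingest_gb_s - destage_gb_s
if net_fill_gb_s > 0:
    print(f"Buffer full after ~{buffer_gb / net_fill_gb_s:.0f} s of sustained load; "
          "writes would then have to be throttled")
else:
    print("Destaging keeps up; the buffer never fills under this load")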

Kind regards,

MtheG

srodenburg
Expert

You are correct that NVMe cache-devices will be "waiting" for SAS Flash devices. Sure. But that will always be the case. So use the largest cache-devices you can afford. Very large NVMe cache devices can suck up a lot of incoming writes, serving them back from NVMe when those blocks are still "hot" (cache-hit probability is higher). At the same time, because hot blocks are served from cache, de-staging can afford to take longer.

The smaller the NVMe cache device, the more it is under pressure to de-stage as the write cache fills up so quickly in comparison.

Another approach is to have disk-groups with more SAS capacity devices per cache-device. SAS3 and SAS4 are not slow by any means and when the entire system is really put to work and data is spread over more capacity devices, de-staging can be done in parallel (per cache-device). Divide and conquer 😉
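
To put rough numbers on the divide-and-conquer idea (the per-device rate below is an assumption, not a benchmark):

# More capacity devices per cache device = more parallel destage headroom.
sas_device_write_gb_s = 0.4   # assumed sustained write rate per SAS flash device

for capacity_devices in (4, 6, 8):
    aggregate = capacity_devices * sas_device_write_gb_s
    print(f"{capacity_devices} capacity devices -> ~{aggregate:.1f} GB/s aggregate destage rate")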

To be honest, most of my customers worry about disk performance and buy NVMe+SAS flash (some even all-NVMe), only to find out that those devices are picking their noses all day and the load they put on those systems is nowhere near the limits of the flash devices. Even SAS3 gobbles it up without making a dent. One must have brutally heavy applications to impress modern-day flash, NVMe especially.

Also, buying networking equipment with relatively high inter-port latencies can ruin your NVMe happy day. When mirrored data, for example, is written, it has to be written to at least 2 nodes and thus 2 devices. That data goes over the network. You will notice it in the statistics if Switch A takes "X milliseconds" to forward packets between ports while Switch B takes half the time or less. Rule of thumb: "The more brains a switch has, the more it 'thinks', the higher the inter-port latency is." In other words, use "dumb as sh*t" switches with large per-port packet buffers, as they tend to be better suited for IP-based storage (which vSAN is).

I've seen customers buy very expensive 25Gbit networking equipment with a million features, only to stretch the solution between two datacenters 150 miles apart with 4ms latency between them, and then wonder why their super-duper all-flash NVMe stretched cluster "writes so slowly" (reads are local, so they are always fast).
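
To put numbers on that last point (the NVMe latency below is a rough assumption; the 4ms is the inter-site RTT from the example):

# Why inter-site latency dominates mirrored write latency, however fast the flash is.
nvme_write_ms = 0.05       # assumed local NVMe cache write latency
inter_site_rtt_ms = 4.0    # round trip between the two datacenters

# A mirrored write is acknowledged only after the remote copy has been written too.
local_ms = nvme_write_ms
stretched_ms = inter_site_rtt_ms + nvme_write_ms
print(f"local ~{local_ms:.2f} ms vs stretched ~{stretched_ms:.2f} ms "
      f"(~{stretched_ms / local_ms:.0f}x slower)")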

MtheG92
Enthusiast

"You are correct that NVMe cache-devices will be "waiting" for SAS Flash devices. Sure. But that will always be the case. So use the largest cache-devices you can afford. Very large NVMe cache devices can suck up a lot of incoming writes, serving them back from NVMe when those blocks are still "hot" (cache-hit probability is higher). At the same time, because hot blocks are served from cache, de-staging can afford to take longer."

What exactly do you mean by a large NVMe device? Because of vSAN's logical limitation, anything above 600GB would be a waste of money, since the NVMe drives are not the cheapest and the endurance is already more than enough. For example, I would recommend a 750GB NVMe cache rather than a 1600GB one.
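
Just the arithmetic behind that (assuming the 600GB logical limit and that the remainder only adds endurance headroom):

# Anything beyond the 600 GB logical write buffer only adds wear-levelling headroom.
USABLE_GB = 600

for device_gb in (750, 1600):
    spare_gb = device_gb - USABLE_GB
    print(f"{device_gb} GB device: 600 GB usable, "
          f"{spare_gb} GB ({spare_gb / device_gb:.0%}) left as endurance headroom only")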

"Another approach is to have disk-groups with more SAS capacity devices per cache-device. SAS3 and SAS4 are not slow by any means and when the entire system is really put to work and data is spread over more capacity devices, de-staging can be done in parallel (per cache-device). Divide and conquer 😉"

I absolutely agree with you.

"Also, buying networking equipment with relatively high inter-port latencies can ruin your NVMe happy day. When mirrored data, for example, is written, it has to be written to at least 2 nodes and thus 2 devices. That data goes over the network. You will notice it in the statistics if Switch A takes "X milliseconds" to forward packets between ports while Switch B takes half the time or less. Rule of thumb: "The more brains a switch has, the more it 'thinks', the higher the inter-port latency is." In other words, use "dumb as sh*t" switches with large per-port packet buffers, as they tend to be better suited for IP-based storage (which vSAN is)."

I also absolutely agree with you.
