VMware Cloud Community
LubomirZvolens1
Contributor

large capacity Disk Group sizing (all-flash) : rebuild times

Gentlemen, a little advice needed with disk group sizing design.

VSAN 6.6.1, all-flash, six servers with 2x 10Gbit connectivity each. I'm leaning towards a single disk group with a PCI-e NVMe cache device (1.2TB capacity, yes I know it's more than the 600GB that will be used ;  700k read IOPS, 180k write IOPS sustained)  with five 1.92TB SATA capacity drives (90k read ; 66k write IOPS), and a single controller to which the SATA SSDs are connected. No other drives, ESXi booted from SD card.

Raid6, compression, dedupe. Let's say the capacity layer is almost full, for example with 9TB of data after compression/dedupe 2.5 : 1

I'm considering creating 10TB capacity disk groups (5x 1.92TB SSDs), which clearly require a long time to rebuild in case of failure. A 10Gbit interface is capable of providing 1.2GB/s of throughput, we have two per server, and the VM load is next-to-nothing from a network perspective. That means 16Gbit/s = 2GB/s is easily available to VSAN traffic all the time. If I create 10TB capacity disk groups, that is 10,000 GB divided by a theoretical 2GB/s - let's count with only 1GB/s in reality... 10,000 seconds to rebuild components. Of course that is pretty optimistic, so let's make it 20,000 seconds = less than 6 hours. Being protected by Raid6, I feel pretty confident letting a rebuild process run for 6 hours.
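
For what it's worth, here is that back-of-the-envelope math as a tiny Python sketch - the throughput and the efficiency fudge factor are just my assumptions, not measured values :

# Rough rebuild-time estimate for a failed disk group (assumed numbers, not measurements)
def rebuild_hours(data_gb, link_gb_per_s, efficiency):
    """efficiency = how much of the raw link the resync actually gets."""
    seconds = data_gb / (link_gb_per_s * efficiency)
    return seconds / 3600.0

# 10,000 GB over a theoretical 2 GB/s, counting on only half of it in reality:
print(round(rebuild_hours(10_000, 2.0, 0.5), 1))    # ~2.8 h (10,000 s)
# Doubling the pessimism gives the 20,000 s = "less than 6 hours" figure:
print(round(rebuild_hours(10_000, 2.0, 0.25), 1))   # ~5.6 h (20,000 s)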

The question is what compression/dedupe will do to rebuild times. Is the data reconstructed, decompressed, transmitted over the network to the new destination where it is compressed and stored again ?  Or is it transferred over the network in compressed form, because VSAN knows "hey, blocks 1, 10, 19, 28, 37, 46... are missing because disk group X failed on server Y" and so it doesn't need to decompress and recompress it at the new location ?

Does anyone have real-life experience with how long it takes to rebuild components in case of a [disk group] failure in similar conditions ?  Clearly I won't find someone with an identical config, I would just appreciate hearing your real experience with all-flash rebuild times.  I'm eager to hear "we have 4x 1TB SSDs in a disk group and it takes 8 hours to rebuild components after a failure". Of course a different load on the subsystem would cause these times to differ extremely - someone with 10,000 IOPS on average hitting VSAN while it's rebuilding will see something different than someone with 80,000 IOPS.

Share your experience please.

A general question, sure, 10TB disk groups are hefty. These are 1U servers with 10 SAS bays available, so I wanted to use five of them for capacity right from the start, leaving the other five for future expansion. Would you recommend against 10TB disk groups ? What is a reasonable all-flash maximum according to you ?

7 Replies
TheBobkin
Champion

Hello Lubomir,

Just a couple of points that I hope help to clarify multiple things here:

"700k read IOPS, 180k write IOPS sustained" , "90k read ; 66k write IOPS"

- Vendors' spec-sheets for devices will generally have multiple strings attached to these stats, including terms such as "up to", and/or they are tested on file-systems or test types that are not the same as a functional vSAN, so take them with a grain of salt.

"capacity layer is almost full, for example with 9TB of data"

- Best practice is to utilize ~70% to allow for headroom (and as much as possible should be thin-provisioned to benefit from dedupe); vSAN starts proactively moving data between disks once they reach 81% (with default settings) - though this should be relatively balanced assuming R5 FTM and not too many huge Objects.
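
As a rough illustration with the disk-group size from your post (just my arithmetic, not vSAN-reported figures):

# What ~70% utilization and the 81% per-disk threshold mean for this layout (example figures)
dg_raw_tb = 5 * 1.92          # 9.6 TB raw per disk group
disk_raw_tb = 1.92
print(f"~70% of the DG: {dg_raw_tb * 0.70:.1f} TB")             # ~6.7 TB to stay under
print(f"81% of one capacity disk: {disk_raw_tb * 0.81:.2f} TB")  # ~1.56 TB before rebalancing kicks in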

"after compression/dedupe 2.5 : 1"

- Whether this is a feasible ratio depends on the data, its size and distribution and % space utilized.

"Are data calculated, uncompressed, transmitted over network to new destination where they are compressed and stored again ?"

- Data is deduped/compressed as it is written to disk and is deduped per disk-group, so no.

"Would you recommend against 10TB disk groups ?"

- I have seen larger ones with fewer disks and smaller cache that I wouldn't advise; yours seems reasonable enough and the NVMe cache should help (*maybe* the capacity drives are a smidge bigger than ideal, but again this depends on the data - if there are a lot of larger Objects/components then 2TB drives may be beneficial over anything smaller).

Regarding resync:

It depends on what the cause of the failure is and how fast it is resolved, e.g.:

- A physically faulted cache-tier or capacity-tier device means a new DG (Disk-Group) is created after replacement

- A controller/disk driver/firmware or other hardware/power/networking issue where the disk-group comes back intact *should* only require a partial delta resync, but how much depends on the time elapsed and the rate of data-change (+ as there are only four available DGs for components, it can't rebuild until it gets all four available).

As far as resync rates go and calculating them: unfortunately this is a massive 'it depends', including factors such as contention with VM workload (resync is by default lower priority), how it is resyncing (partial or full rebuild) and the presence of other issues.

I have seen resyncs in similar configurations go at 1TB+ an hour, but I don't keep track of specifics such as higher stripe-width or drive-type/quality that might improve this; multiple available nodes and DGs (and controllers if applicable) per node is definitely preferable if possible.

Bob

LubomirZvolens1
Contributor

Thank you very much for the reply, I noticed you are very active in the VSAN forum !

>> "700k read IOPS, 180k write IOPS sustained" , "90k read ; 66k write IOPS"

> - Vendors' spec-sheets for devices will generally have multiple strings attached to these stats, including terms such as "up to",

> and/or they are tested on file-systems or test types that are not the same as a functional vSAN, so take them with a grain of salt.

Right, but enterprise-class disks provide steady-state performance figures and they are real. The particular model I was mentioning here is the Micron 9100 PCIe, which has been tested with these figures; I didn't consider it important to write that before. I believe the filesystem has not too much to do with those performance figures, as the SSD does not understand OS filesystems, be it NTFS or VMFS.

Moreover, if NTFS performance is tested to be 180,000 random 4kB IOPS (because this is what you can find on the internet - Windows 2012 performance tests), then VMFS or other filesystem performance can't be only 50,000 IOPS - whether it is 155,000 or 175,000, honestly I don't care too much. BTW, these SSDs are often tested with virtual machines running on VMware and they achieve the specified numbers, so these are really achievable in a VMware environment. All in all, I have little reason to question the manufacturer's specifications.

>> "capacity layer is almost full, for example with 9TB of data"

> - Best practice is to utilize ~70% to allow for headroom (and as much as possible should be thin-provisioned to benefit from dedupe); vSAN starts proactively moving data between disks once they reach 81%

> (with default settings) - though this should be relatively balanced assuming R5 FTM and not too many huge Objects.

70% full is more important for spinning (magnetic) disks than it is for SSDs, because beyond that threshold their performance starts to heavily deteriorate due to the usage of inner tracks.

Yes, sure, the same 80%+ relocation principle applies to SSD capacity disks, but they are not as sensitive as magnetic disks from a performance perspective - and performance has always been the main reason for this 70% threshold and why migrations start after 80%.

>> "after compression/dedupe 2.5 : 1"

> - Whether this is a feasible ratio depends on the data, its size and distribution and % space utilized.

Sure it depends, I wrote "let's say".  So let's pretend we achieved that compression/dedup ratio :)

>> "Are data calculated, uncompressed, transmitted over network to new destination where they are compressed and stored again ?"

> - Data is deduped/compressed as it is written to disk and is deduped per disk-group, so no.

You are right that data is compressed/deduped as it is written to the capacity layer from the cache layer, and that this is disk-group specific - sure, this is written everywhere.

I didn't find a relevant source saying "you have X hosts with compression/dedup, one of them fails, data is reconstructed mathematically like in classical raid5 arrays so it doesn't care about being compressed or not".  No doubt VSAN has been created by an extremely capable team, I just didn't find anybody confirming or denying what I wrote with regard to compression/dedup. There might be decompression necessary for some specific reasons I have absolutely no clue about.

The reason I'm asking : in the case I made up (9TB of compressed data on each host, single disk group, 2.5:1 dedup/compression ratio), it makes a huge difference whether 9TB is transferred over the network or 22.5TB (2.5x 9TB) is transferred over the same network. This is essentially about the time necessary to recover from a disk group failure ; yes, I understand it depends on other factors, but the amount of data is still an extremely significant factor.
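
Just to illustrate how much the answer matters, under my own assumption of a 1 GB/s effective resync rate :

# Resync time if data stays compressed on the wire vs. being rehydrated first
# (the 1 GB/s effective rate is my assumption, not a vSAN guarantee)
RATE_GB_PER_S = 1.0
compressed_tb = 9.0                  # on-disk data after 2.5:1 dedup/compression
logical_tb = compressed_tb * 2.5     # 22.5 TB if it travels uncompressed

for label, tb in (("compressed on the wire", compressed_tb),
                  ("rehydrated on the wire", logical_tb)):
    hours = tb * 1000 / RATE_GB_PER_S / 3600
    print(f"{label}: {tb} TB -> ~{hours:.2f} h")
# compressed: ~2.50 h, rehydrated: ~6.25 h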

> Regarding resync:

> It depends on what the cause of the failure is and how fast it is resolved, e.g.:

> - A physically faulted cache-tier or capacity-tier device means a new DG (Disk-Group) is created after replacement

> - A controller/disk driver/firmware or other hardware/power/networking issue where the disk-group comes back intact *should* only require a partial delta resync, but how much depends on the time elapsed and the rate of data-change (+ as there are only four available DGs for components, it can't rebuild until it gets all four available).

Right. I was curious about the degraded state (= fatal failure) with a full component rebuild.

Partial delta resyncs in the case of the absent state are not my concern, because the amount of data to synchronize is vastly smaller compared to a full resync.

> I have seen resyncs in similar configurations go at 1TB+ an hour, but I don't keep track of specifics such as higher stripe-width or drive-type/quality

> that might improve this; multiple available nodes and DGs (and controllers if applicable) per node is definitely preferable if possible.

Phenomenal info, thank you very much. 1TB/hour is my dream because it means a full component resync overnight even with this huge capacity per host.

>>  "Would you recommend against 10TB disk groups ?"

> - I have seen larger ones with fewer disks and smaller cache that I wouldn't advise; yours seems reasonable enough and the NVMe cache should help (*maybe* the capacity drives are

> a smidge bigger than ideal, but again this depends on the data - if there are a lot of larger Objects/components then 2TB drives may be beneficial over anything smaller).

Only 600GB will be used from the cache layer, right. Seems like I can't create more than one disk group per host, because I'm going to use 1U rack servers with 10 SAS bays and three PCI-Express slots of which only a single one is free. At the same time, I see little point in SAS SSDs for the cache and capacity layer because

- they are hooked to the same disk controller, and I only have one controller in each host with no possibility to extend

- limited performance compared to PCIe NVMe devices

- I'm not going to hide five or six flash capacity devices with 66k write IOPS each behind a single cache device with 70k write IOPS

- questionable performance of the only disk controller I will have in the server, especially when it has to de-stage data from SAS cache to SAS capacity (queues, latencies, etc.)

- questionable performance of a single SAS cache disk during destaging (concurrent read and write operations, no longer "write only")

- usage of scarce 2.5" slots which I would rather dedicate to capacity drives.

I understand the design implications of one DG versus two DGs, such as a bigger failure domain, better performance, more data to rebuild in case of failure, etc. This is the specific reason why I'm asking about a 10TB all-flash disk group, as that is a little too much for my taste - in case of failure, that is a helluva lot of data to reconstruct !! From a performance perspective, I'm replacing two SAS cache drives with a single PCIe NVMe device with even better performance figures (two SAS drives are not going to provide 700k read IOPS and 180k write IOPS combined). I'm also playing the economy game here, so I don't have free hands in choosing drives.

Someone might ask "why do you want 10TB per host, why don't you do 5TB per host and twice as many hosts". I have to scale up because of costs - VSAN licensing is going to kill the economics. In the case I have, I'm going to pay more for VSAN licenses than for the hardware itself, and yes, I'm talking about 60TB of capacity SSDs in total plus another 7TB in NVMe drives !! Six hosts, about 10TB capacity each.

Every single big 10TB host costs more to license than to equip with 10TB of SSDs, phew. Every small 5TB host would cost about twice as much to license as to equip with SSDs. Unfortunately we are budget constrained, so I can't go the scale-out way.

An additional question, if I may : 10TB flash capacity layer, with only 600GB of cache (a 1.2TB PCI NVMe drive, but only 600GB used).  Yay or nay ? It's not 10% recommended, will it be real problem with ALL-FLASH environment ? Reads always go directly from the capacity layer.   Writes... no more than 600GB will be used regardless of capacity so... so what ?
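
For context, the ratio I'm talking about, with my own example numbers (the 600GB being the all-flash write-buffer limit, not the device size) :

# Usable cache vs. raw capacity for the proposed disk group (my example numbers)
raw_capacity_gb = 5 * 1920        # five 1.92 TB SATA SSDs = 9,600 GB raw
usable_cache_gb = 600             # all-flash write-buffer limit, regardless of device size
print(f"cache-to-capacity ratio: {usable_cache_gb / raw_capacity_gb:.2%}")   # 6.25%
# versus the classic 10% rule of thumb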

Favoritevmguy
Contributor

IMO I would go with two disk groups, because with dedupe and compression turned on, if you lose one disk out of that group you are down a full node of compute. Two disk groups would let your node still participate in the vSAN cluster while you fix the one failed disk.

TheBobkin
Champion

Hello Lubomir,

"Thank you very much for reply, I noticed you are very active in VSAN forum !"

Cheers - unfortunately vSAN can be a bit lacking in public-facing troubleshooting information, or it is just hard to find/figure out, so I put in what time I can spreading what I know.

Regarding disk benchmarking - I didn't mean file-system as in FS; rather, I meant caveats such as 'contiguous' etc. That Micron spec-sheet and the process used to gather those stats look far more legit and granular than other ones I have seen.

As for witnessing 1TB+ per hour resyncs in 6-node AF clusters - resyncs are often non-linear and it can be near impossible to determine how much of it is partials (which AFAIK can show as the entire component size needing resync and then be done with that component once the delta completes) or full rebuilds (and harder still when this is not my focus and I am trying to put out a fire :D ); more input from others with in any way similar setups may help get a better idea about this.

"It's not 10% recommended, will it be real problem with ALL-FLASH environment ?"

Actually this is not as straight-forward as it seems and this recommendation has changed over time for AF, these articles should help clarify:

yellow-bricks.com/2016/02/16/10-rule-vsan-caching-calculate-vm-basis-not-disk-capacity/

blogs.vmware.com/virtualblocks/2017/01/18/designing-vsan-disk-groups-cache-ratio-revisited/

If going R6 as the FTM then I would strongly recommend going with a minimum of 7 nodes (N+1 as always is relevant).

I suggest this mainly for two reasons:

- It might permit slightly lower DG and/or disk sizes to get a better cache-ratio and reduce the overall DG size, or alternatively allow more slack space.

- You can't rebuild degraded components of R6 with only 5 nodes/FDs available - you should consider what types of failures might occur; it might not always be as simple as swapping out a disk (+ are these always on-hand and is the site nearby?), and if a motherboard or some other vital component dies then the time until this is back up might be significant.
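
A quick sanity check of the node-count math behind that second point (just illustrative arithmetic, not output from any vSAN tool):

# RAID-6 (FTT=2) places 4 data + 2 parity components in separate fault domains
required_fds = 4 + 2
for nodes in (6, 7):
    fds_left = nodes - 1                      # one node/FD down
    print(f"{nodes} nodes: {fds_left} FDs left after a node failure, "
          f"rebuild possible: {fds_left >= required_fds}")
# 6 nodes -> 5 FDs left -> False; 7 nodes -> 6 FDs left -> True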

Are you planning on going with R6 as the FTM for everything on this cluster or just a few key critical applications? If the latter, then FTT=2 should have you covered even if resync does take longer than would be desired.

Bob

LubomirZvolens1
Contributor

>> IMO I would go with two disk groups, because with dedupe and compression turned on, if you lose one disk out of that group you are

>> down a full node of compute. Two disk groups would let your node still participate in the vSAN cluster while you fix the one failed disk.

Agree. The customer is not compute bound; they have very underutilized servers from a cpu/ram perspective (in fact, we will be doing 6 or 7 servers in the cluster just because of the Raid6 FTT=2 requirements).  From what I have written :

"Seems like I can't create more than one disk group per host, because I'm gonna use 1U rack servers with 10 SAS bays and three PCI-Express slots out of which only single one is free."

"At the same time, I see little meaning of SAS SSDs for cache and capacity layer because..."  {and several reasons explained}

Yes, agree, two disk groups per host would be better.  No, I'm not going to design one DG with a PCI-Express NVMe-based CACHE drive and another DG with a SAS/SATA-based CACHE drive due to the extremely different performance and latency characteristics, and then deal with the consequences later. No way.

I can't create two NVMe-cached disk groups because there is only a single PCI-Express slot free and there are no U.2 slots at all.  Moreover, no SAS drive capable of 100k+ IOPS exists for the price I'm going to pay for the much faster 180k write IOPS NVMe with latencies 5-10x lower than SAS SSDs, plus I'm not sure how to physically connect different SAS slots in the server to different controllers.

Even better, because I'm going to heavily overprovision that NVMe drive on top of what the manufacturer has already done, it will be spitting out fire in the form of approximately 400k write IOPS - tested in reality. Nothing like that exists in SAS form.

All in all, two disk groups are not easy to create in the environment I'm working with (the customer has an existing environment with 1U servers, and that's not going to change).

==============================

>> "It's not 10% recommended, will it be real problem with ALL-FLASH environment ?"

> Actually this is not as straight-forward as it seems and this recommendation has changed over time for AF, these articles should help clarify:

> yellow-bricks.com/2016/02/16/10-rule-vsan-caching-calculate-vm-basis-not-disk-capacity/

> blogs.vmware.com/virtualblocks/2017/01/18/designing-vsan-disk-groups-cache-ratio-revisited/

I know, I've seen them. I will use brutally fast cache drives, 600GB of cache per host, 6 or 7 hosts (will get to that in a second), so that is really a lot for a write buffer. Furthermore, my capacity flash drives are fast as hell, too : 66k sustained write IOPS, five per host == over 300k write IOPS of raw performance. Uhm. Per host.  Endurance of the capacity drives is also not a question, each of them is guaranteed to sustain 17PB written.

The 10% cache/capacity ratio is like burned into everyone's mind. The hardest part will be to explain to the customer, and convince them, that not hitting the 10% cache/capacity ratio is fine.

================================

>> If going R6 as the FTM then I would strongly recommend going with a minimum of 7 nodes (N+1 as always is relevant).

> I suggest this mainly for two reasons:

> - It might permit slightly lower DG and/or disk sizes to get a better cache-ratio and reduce the overall DG size, or alternatively allow more slack space.

> - You can't rebuild degraded components of R6 with only 5 nodes/FDs available - you should consider what types of failures might occur; it might not always be as simple as swapping out a disk (+ are these always on-hand and is the site nearby?), and if a motherboard or some other vital component dies then the time until this is back up might be significant.

Yes, they are onsite, the datacenter is in their building. They have a contract with Dell with a 6-hour fix time, and they have some spare hardware available, so in the worst case we will just relocate drives from one server to a different one. Because we are going for a home-grown solution, I will demand two capacity drives and one cache drive to be available on-site as spares.

Will see if I'm able to push the customer to a 6+1 configuration.

>> Are you planning on going with R6 as the FTM for everything on this cluster or just a few key critical applications?

>> If the latter, then FTT=2 should have you covered even if resync does take longer than would be desired.

I will be pushing them towards Raid6 FTT=2 for everything, but this will be their decision, with understandable consequences if they don't accept it.

I don't want to risk their operations just because of the raid5 versus raid6 difference, which amounts to two servers' worth of licenses (2x ~$7,500) and SSD drives (2x ~$4,000 for the capacity tier, ~$800 for NVMe), a grand total of circa $25,000. The peace of mind such additional redundancy provides is extraordinarily more valuable...  especially considering that with compression/dedup every drive failure brings down the whole disk group and requires a full resync, which takes time and puts the infrastructure at risk.

================================

General question : with Raid6 and FTT=2, the cluster is ALWAYS able to sustain the failure of TWO nodes / failure domains at the same time while remaining fully operational and having all the data available, right ? Aren't there any exclusions, any gotchas ?  With dedup/compression, a failure domain is taken out by every failed SSD (capacity or cache), by the single controller, or by the node itself.  The space overhead (1.5x) compared to Raid5 (1.33x) is minimal, the price to pay is pretty acceptable, and the benefits in terms of redundancy / availability are extremely huge.
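
My quick arithmetic behind those overhead figures (3+1 for Raid5, 4+2 for Raid6; the 9TB example is mine) :

# Raw-to-usable overhead of the erasure-coded FTMs: (data + parity) / data
def overhead(data_fragments, parity_fragments):
    return (data_fragments + parity_fragments) / data_fragments

print(f"Raid5 (3+1): {overhead(3, 1):.2f}x")   # 1.33x
print(f"Raid6 (4+2): {overhead(4, 2):.2f}x")   # 1.50x
# e.g. 9 TB of post-dedup data consumes ~12 TB raw under Raid5 and ~13.5 TB under Raid6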

Guys, anyone else with real-world all-flash resync times ?

LubomirZvolens1
Contributor

Is nobody using all-flash VSAN ?  Has nobody ever done a rebuild ?

wattpeter
Contributor

I have all flash, 8TB per disk group.

I haven't done explicit testing, but my general experience is that it would take 3 or 4 days to rebuild a disk group. Certainly more than 8 hours.
