VMware Cloud Community
undejj
Contributor

VSAN configuration after SSD failure

Here is our SDDC configuration.

•      Four VMware hosts in HA cluster

•      Each host has 2 processors with 8 cores each and 148 GB of memory

•      vSphere 5.5 and VSAN 5.5

•      40 VMs

•      Each host has 10 NICs with a 10 GbE network for VSAN

•      VSAN capacity is 31.83 TB with 12.47 TB free

•      Each host has 2 disk groups

•      Each disk group has a 200 GB SSD

•      vCenter Server appliance resides on VSAN

•      All company VMs reside on VSAN, including SQL Servers, Exchange Servers and a custom application used by the whole company

•      The IP phone system for our call center is virtualized and residing on the VSAN

We were VMware novices when this system was designed and deployed 3 years ago. The thought behind the design was resiliency for business continuity through the elimination of single points of failure.

Last week, one of the SSD devices failed. When that happened, we could no longer access vCenter Server and every one of our servers slowed to a crawl, effectively stopping our business – no custom application, no email and no phones.

We got on the phone with VMware and 6 hours later, we were back up and running. The disk group had to be removed from VSAN, the bad SSD removed from the disk group, the new SSD added to the disk group, and the disk group added back to VSAN. It was all done via ESXCLI.
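
For reference, the sequence described above can be sketched as a small Python dry run. It only builds `esxcli vsan storage` command lines and executes nothing. The device names are hypothetical placeholders, and mapping the steps to `remove -s` and `add -s ... -d ...` is an assumption about how the procedure translates to ESXCLI, not a record of what the support engineer actually ran. Check `esxcli vsan storage --help` on the host before relying on it.

```python
# Dry run: build the ESXCLI commands for swapping a failed cache SSD.
# Nothing is executed; all device names are hypothetical placeholders.

def ssd_replacement_commands(failed_ssd, new_ssd, capacity_disks):
    """Return the command lines, in order, as plain strings."""
    add_parts = ["esxcli", "vsan", "storage", "add", "-s", new_ssd]
    for disk in capacity_disks:
        add_parts += ["-d", disk]  # one -d per capacity disk in the group
    return [
        "esxcli vsan storage list",                      # inspect current disk groups
        "esxcli vsan storage remove -s " + failed_ssd,   # assumed: drops the SSD's disk group
        " ".join(add_parts),                             # recreate the group on the new SSD
        "esxcli vsan storage list",                      # confirm the result
    ]

if __name__ == "__main__":
    commands = ssd_replacement_commands(
        "naa.FAILED_SSD", "naa.NEW_SSD", ["naa.HDD_1", "naa.HDD_2"]
    )
    for line in commands:
        print(line)
```

Printing the commands first, rather than running them, leaves room to review the device IDs against the output of the `list` step before touching a production disk group.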

First, I believe we should (1) move our vCenter Server to a physical server and (2) learn ESXCLI.

I believe this affected us so severely for three reasons: (a) VMware recommends sizing SSD cache to about 10% of the expected VSAN load, and based on our current VSAN use we are at about 8%; (b) VMware’s recommendation assumes that about 10% of data is frequently accessed, but we have a call center phone system running on VSAN storage and I believe our rate of disk reads and writes may be higher than that assumption; and (c) we were not able to respond quickly to the SSD failure because we had no access to vCenter and less than adequate ESXCLI skills.

My question is one of validation for my plan going forward. I am recommending we move from 2 disk groups per host to 5 disk groups per host. In doing so, we will add 12 SSD cache devices, increasing our percentage of SSD cache in VSAN to 20.7% currently and 12.5% at full capacity.
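
The percentages above can be reproduced with a short Python sketch. It assumes every cache SSD, existing and new, is 200 GB, uses decimal units (1 TB = 1000 GB), and assumes the 31.83 TB datastore capacity stays the same after the disk groups are added. The function and constant names are illustrative only.

```python
# Cache-to-capacity ratios for the cluster described in this post.
# Assumptions: 200 GB per cache SSD, decimal TB, datastore capacity unchanged.

HOSTS = 4
SSD_GB = 200
TOTAL_TB = 31.83
FREE_TB = 12.47

def cache_percent(disk_groups_per_host, data_tb):
    """SSD cache as a percentage of data_tb, one cache SSD per disk group."""
    cache_tb = HOSTS * disk_groups_per_host * SSD_GB / 1000
    return 100 * cache_tb / data_tb

used_tb = TOTAL_TB - FREE_TB  # about 19.36 TB currently consumed

if __name__ == "__main__":
    print(f"2 disk groups, current use:   {cache_percent(2, used_tb):.1f}%")
    print(f"5 disk groups, current use:   {cache_percent(5, used_tb):.1f}%")
    print(f"5 disk groups, full capacity: {cache_percent(5, TOTAL_TB):.1f}%")
```

Under these assumptions the first two results match the "about 8%" and 20.7% figures, and the last comes out at roughly 12.6%, close to the 12.5% quoted.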

Does my plan make sense for helping us avoid another disaster the next time a VSAN SSD fails?


Accepted Solutions
admin
Immortal

Greetings!

Let's go through this query step by step:

1) If not already done, please have the VMware team perform a complete vSAN analysis to confirm that the issue you faced was caused by insufficient capacity in the vSAN cache tier.

2) As per the VMware sizing and design guidelines, the cache should ideally be big enough to hold the repeatedly used blocks in the workload. We call this the active working set. However, the active working set is not easy to measure, because typical workloads vary over time, which changes the working set and the associated cache requirements.

3) Looking at your environment specification, I am quite sure that the cache is not large enough for the active working set, and yes, your plan to move from 2 disk groups per host to 5 disk groups per host makes absolute sense, as it will add more capacity to the cache tier of the vSAN datastore. Please have a look at the blog post "One versus multiple VSAN disk groups per host". It should give you more info on why having multiple disk groups per host is a good thing.

4) Moving vCenter to a physical server is not a good option, as you will again have a single point of failure.

5) Also see if you can upgrade your environment to vSphere 6.5, or at least vSphere 6.0 U2. vSphere 6.5 and 6.0 U2 expose much more vSAN-level detail in the vCenter UI than vSphere 5.5. Also, needless to say, there are many bug fixes in the later releases of vSAN.

Hope this is helpful and answers your query.

Cheers!

-Shivam

undejj
Contributor

Thank you for your reply. How do I go about getting "a complete vSAN analysis done by VMware team"?

admin
Immortal

The engineer who worked with you on the recovery of the vSAN cluster should already have given some recommendations for avoiding this type of failure in the future. The same engineer should be able to help you understand your environment in more detail, so that you know whether the issue really was caused by insufficient capacity in the cache tier.

If you already have this analysis and you know that the issue was caused by insufficient capacity in the cache tier, then I think you are good to go with your plan.

Cheers!

-Shivam

undejj
Contributor

Although the engineer did not specifically point to insufficient storage in the cache tier, he recommended either more hosts or more disk groups.

elerium
Hot Shot

Upgrading to VSAN 6 is also supposed to provide a much higher performance benefit (the whitepaper claims between 2X and 3.5X), so it may be worthwhile to upgrade. Make sure all your hardware is still certified on the VSAN 6 HCL.

http://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/whats-new-perf-vsphere6-w... page 8

admin
Immortal

After looking at your environment specifications, I am quite confident that adding more disk groups on each host (which will add more capacity to the cache tier) will help to avoid this type of failure in the future. It's a good thing to have multiple disk groups per host, as explained in the blog post "One versus multiple VSAN disk groups per host".

Cheers!

-Shivam
