VMware Cloud Community
flaxseed
Contributor

VMware vSAN DiskGroup Failure

Hi all,

It seems that the disk groups of a particular vSAN node went offline, and we are trying to understand what exactly went wrong. VMware support once again proved incapable of handling a production-down issue, putting us on hold for 4 hours; sad and very disappointing, but true. Anyhow, these are the details of our vSAN cluster; feel free to share your thoughts.

5 vSAN nodes in total

HOST A

1x

Dell PowerEdge R620

- Image Profile: (Updated) Dell-ESXi-6.0U1-3073146-A01

- PERC Controller: PERC H710 Mini in RAID mode

- esxcfg-advcfg -g /LSOM/diskIoTimeout: Value of diskIoTimeout is 20000

- esxcfg-advcfg -g /LSOM/diskIoRetryFactor: Value of diskIoRetryFactor is 3

HOST B

1x

Dell PowerEdge R730xd

- Image Profile: (Updated) Dell-ESXi-6.0.0-2494585-A00 (**At the top of the Summary tab, the vSphere Client shows VMware ESXi, 6.0.0, 3073146)

- PERC Controller: PERC H730P Mini in HBA mode (Firmware Version: 25.4.0.0017, Driver Version: 6.903.85.00) - According to VMware's Compatibility Guide, these are the correct firmware and driver versions -> VMware Compatibility Guide - I/O Device Search

- SSDs: 800GB SSD SATA 6Gbps 2.5in DC S3710 (Revision: G201DL29) - According to VMware the Minimum Firmware Version is DL29 -> VMware Compatibility Guide - ssd

- HDDs: 1200GB SAS 2.5in ST1200MM0088 (Revision: TS04) - According to VMware the Minimum Firmware Version is N001 -> VMware Compatibility Guide - hdd

- esxcfg-advcfg -g /LSOM/diskIoTimeout: Value of diskIoTimeout is 100000 (recommended flags) - Required VSAN and ESXi configuration for controllers based on the LSI 3108 chipset (2144936) | VMwar...

- esxcfg-advcfg -g /LSOM/diskIoRetryFactor: Value of diskIoRetryFactor is 4 (recommended flags) - Required VSAN and ESXi configuration for controllers based on the LSI 3108 chipset (2144936) | VMwar...

HOST C - HOST D

2x

Dell PowerEdge R730xd

- Image Profile: (Updated) Dell-ESXi-6.0U1-3073146-A01

- PERC Controller: PERC H730P Mini in HBA mode (Firmware Version: 25.4.0.0017, Driver Version: 6.903.85.00) - According to VMware's Compatibility Guide, these are the correct firmware and driver versions -> VMware Compatibility Guide - I/O Device Search

- SSDs: 800GB SSD SATA 6Gbps 2.5in DC S3710 (Revision: G201DL29) - According to VMware the Minimum Firmware Version is DL29 -> VMware Compatibility Guide - ssd

- HDDs: 1200GB SAS 2.5in ST1200MM0088 (Revision: TS04) - According to VMware the Minimum Firmware Version is N001 -> VMware Compatibility Guide - hdd

- esxcfg-advcfg -g /LSOM/diskIoTimeout: Value of diskIoTimeout is 100000 (recommended flags) - Required VSAN and ESXi configuration for controllers based on the LSI 3108 chipset (2144936) | VMwar...

- esxcfg-advcfg -g /LSOM/diskIoRetryFactor: Value of diskIoRetryFactor is 4 (recommended flags) - Required VSAN and ESXi configuration for controllers based on the LSI 3108 chipset (2144936) | VMwar...

HOST E

1x

Dell PowerEdge R730xd

!!!!!!!!!!!!!!!!!!! - Image Profile: (Updated) Dell-ESXi-6.0.0-2494585-A00

- PERC Controller: PERC H730P Mini in HBA mode (Firmware Version: 25.4.0.0017, Driver Version: 6.903.85.00) - According to VMware's Compatibility Guide, these are the correct firmware and driver versions -> VMware Compatibility Guide - I/O Device Search

- SSDs: 800GB SSD SATA 6Gbps 2.5in DC S3710 (Revision: G201DL29) - According to VMware the Minimum Firmware Version is DL29 -> VMware Compatibility Guide - ssd

- HDDs: 1200GB SAS 2.5in ST1200MM0088 (Revision: TT31) - According to VMware the Minimum Firmware Version is N001 -> VMware Compatibility Guide - hdd

!!!!!!!!!!!!!!!!!!! - esxcfg-advcfg -g /LSOM/diskIoTimeout: Value of diskIoTimeout is 20000 (default flags) - Required VSAN and ESXi configuration for controllers based on the LSI 3108 chipset (2144936) | VMwar...

!!!!!!!!!!!!!!!!!!! - esxcfg-advcfg -g /LSOM/diskIoRetryFactor: Value of diskIoRetryFactor is 3 (default flags) - Required VSAN and ESXi configuration for controllers based on the LSI 3108 chipset (2144936) | VMwar...


As you can see above, the last vSAN node (HOST E) does not have the recommended vSAN I/O timeout settings for this PERC controller, and we believe this may have caused the issue. Beyond these settings, do you think the different ESXi version on HOST E is related? In general, does it matter if a vSAN cluster runs different ESXi builds, such as 3073146 and 2494585? Finally, any idea why HOST B shows one ESXi build at the top of the Summary tab and a different one next to Image Profile?
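To make the comparison above easier to repeat, here is a minimal Python sketch that checks each host's LSOM values against the recommended pair. The numbers are copied from the host list in this post; the recommended values (100000 and 4) are the ones cited here from KB 2144936 for LSI 3108-based controllers. The names `HOSTS`, `RECOMMENDED`, and `find_mismatches` are made up for illustration, and the values are typed in by hand rather than collected from the hosts.

```python
# Recommended values as cited in this post from KB 2144936
# (LSI 3108-based controllers, here the PERC H730P).
RECOMMENDED = {"diskIoTimeout": 100000, "diskIoRetryFactor": 4}

# Per-host values as reported by `esxcfg-advcfg -g` in the post above.
HOSTS = {
    "A": {"controller": "PERC H710 Mini", "diskIoTimeout": 20000, "diskIoRetryFactor": 3},
    "B": {"controller": "PERC H730P Mini", "diskIoTimeout": 100000, "diskIoRetryFactor": 4},
    "C": {"controller": "PERC H730P Mini", "diskIoTimeout": 100000, "diskIoRetryFactor": 4},
    "D": {"controller": "PERC H730P Mini", "diskIoTimeout": 100000, "diskIoRetryFactor": 4},
    "E": {"controller": "PERC H730P Mini", "diskIoTimeout": 20000, "diskIoRetryFactor": 3},
}


def find_mismatches(hosts, recommended, controller_prefix="PERC H730P"):
    """Return {host: {setting: (actual, recommended)}} for hosts that deviate.

    Only hosts whose controller name starts with controller_prefix are
    checked, because the recommendation is specific to that controller.
    """
    mismatches = {}
    for name, cfg in hosts.items():
        if not cfg["controller"].startswith(controller_prefix):
            continue  # e.g. HOST A (H710) is outside the scope of the KB
        diffs = {
            key: (cfg[key], wanted)
            for key, wanted in recommended.items()
            if cfg[key] != wanted
        }
        if diffs:
            mismatches[name] = diffs
    return mismatches


if __name__ == "__main__":
    for host, diffs in find_mismatches(HOSTS, RECOMMENDED).items():
        for key, (actual, wanted) in diffs.items():
            print(f"HOST {host}: {key} is {actual}, recommended {wanted}")
```

With the values from this post, only HOST E is flagged, for both settings. HOST A is skipped on purpose because it uses a different controller.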

Thank you in advance.

6 Replies
zdickinson
Expert

Good morning, I don't believe I can help with the "why". My guess on the version question: keeping all hosts on the same version is best practice, but in reality a mismatch should not matter.

What I do want to do is echo your thoughts on support. When I have called in, this is the response I get: "Yeah... the guy for that is on lunch". The support issue is real and I hope it is addressed. Thank you, Zach.

elerium
Hot Shot

The ESXi build on your cluster appears to be 3073146 (other than host E, which is running GA release 2494585). I'm not sure it's the greatest idea to run separate versions long term, but in the short term it should be no issue. It is still best to keep VSAN on the same ESXi version. Also, I go by the host version that vSphere reports; I think that's accurate regardless of how Dell labels things.
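As a rough illustration of spotting the odd host out, here is a small Python sketch that groups hosts by build number. The builds are taken from the original post, using the vSphere-reported build for HOST B; `BUILDS`, `group_by_build`, and `outliers` are hypothetical names, and the data is entered by hand rather than queried from vCenter.

```python
from collections import defaultdict

# Build numbers from the original post. HOST B uses the build shown on
# the vSphere Summary tab (3073146), not the one in its image profile name.
BUILDS = {"A": 3073146, "B": 3073146, "C": 3073146, "D": 3073146, "E": 2494585}


def group_by_build(builds):
    """Map each build number to the sorted list of hosts running it."""
    groups = defaultdict(list)
    for host, build in builds.items():
        groups[build].append(host)
    return {build: sorted(hosts) for build, hosts in groups.items()}


def outliers(builds):
    """Return hosts that are not on the most common build."""
    groups = group_by_build(builds)
    majority = max(groups, key=lambda build: len(groups[build]))
    return sorted(
        host
        for build, hosts in groups.items()
        if build != majority
        for host in hosts
    )


if __name__ == "__main__":
    for build, hosts in group_by_build(BUILDS).items():
        print(f"build {build}: hosts {', '.join(hosts)}")
    print("hosts off the majority build:", outliers(BUILDS))
```

For this cluster the only host off the majority build is HOST E.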


Build 3073146 would be VSAN 6.1, and it's possible you ran into this behavior (which exists only in 6.1 and was reverted in 6.2):

VSAN 6.1 New Feature - Handling of Problematic Disks - CormacHogan.com

Also, on which node did the disk group fail? If you still have logs for the host where the disk group resides, /var/log/vmkernel.log and /var/log/vmkwarning.log may have some useful info.

If you restarted the node with the failed disk group, did VSAN remount it or is it still offline?

evil_scooby
Contributor

I think your storage controller firmware might be out of date, as Dell has released a newer VSAN-compatible one with some performance fixes.

It could also have been an SSD failing (or VMware thinking it was going to fail), in which case it pulls the disk group down.

flaxseed
Contributor

Hi,

Thanks for your reply.

That is the thing, though: how long should we expect to keep updating firmware and drivers before vSAN works flawlessly?

flaxseed
Contributor

Hi,

Thanks for your reply.

It seems we were running older firmware on the PERC controller; after we updated the controller, the issue was resolved.

zdickinson
Expert

Our implementation of vSAN in DR taught me two things.  I love hyper converged infrastructure, I just don't want to manage it.  We'll be going with VxRail for production.  Thank you, Zach.
