VMware Cloud Community
fabio1975
Commander

vSAN Health: Failed Overall disks health

Hi guys,

I have a vSAN infrastructure (stretched cluster with 3+3 nodes, all-flash). After a hardware reset on all the disks of one node, vSAN put the disk groups hosted on that node into the failed state.
vSAN Health reports:
vSAN Health: Failed Overall disks health.

The disks of the node are not in hardware error.

The VMs are working, there appear to be no objects in error, and I cannot find any objects hosted on the failed disk groups.
I opened a case with support, but they have been analyzing the logs for at least a week and so far there is no news.
Almost certainly it is a firmware and driver problem with the RAID controller.
How can I reactivate the failed disk groups?

I thought of putting the node into maintenance mode (attempting to evacuate the data, although nothing should move because the disk groups are in error) and then simply rebooting the node to see whether that brings the disk groups back.

Thank you

Fabio


TheBobkin
Champion

Hello Fabio,

"that after an hardware reset on all the disks of a node,  the vSAN has put the diskgroups, hosted in this node, in the failed state."

What model of controller is it, and which driver and firmware versions are in use? (If it is any variant of the H730P: do you have any VMFS, logging, or dumps going to disks on the same controller?)

Are you seeing a lot of 'Power on reset' messages in vmkernel.log and vobd.log? Are you seeing reset sense codes (H:0x7)?
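A quick way to answer those two questions is to count the matching lines in copies of vmkernel.log and vobd.log. The following is only a minimal sketch: the helper names are made up for illustration, it does plain substring matching on the two strings mentioned above, and it makes no assumption about the rest of the log line format.

```python
from collections import Counter
from typing import Iterable, Tuple

# Substrings taken from the questions above; adjust if your logs word them differently.
MARKERS: Tuple[str, ...] = ("Power on reset", "H:0x7")


def count_markers(lines: Iterable[str], markers: Tuple[str, ...] = MARKERS) -> Counter:
    """Count how many lines contain each marker substring (hypothetical helper)."""
    counts = Counter({marker: 0 for marker in markers})
    for line in lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts


def count_markers_in_file(path: str, markers: Tuple[str, ...] = MARKERS) -> Counter:
    """Run count_markers over a log file on disk."""
    # errors="replace" keeps the scan going if the log contains undecodable bytes.
    with open(path, "r", errors="replace") as handle:
        return count_markers(handle, markers)
```

For example, `count_markers_in_file("vmkernel.log")` on a copy of the log returns a count per marker. A handful of hits is less interesting than a steady stream of them around the time the disk groups were marked failed.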

"How can I active  the failed diskgroups?"

Are the disks out of CMMDS? If yes, have you tried unmount + remount of the failed disks?
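If there are many disks, the CMMDS question can be answered by saving the output of `esxcli vsan storage list` on the affected host and filtering it. This is a rough sketch, not an official tool: the function names are hypothetical, and it assumes the output is a series of `Key: Value` lines in which each disk has a `Device` field and an `In CMMDS` field. Check that against the output on your own build before relying on it.

```python
from typing import Dict, List


def parse_disks(text: str) -> List[Dict[str, str]]:
    """Group 'Key: Value' lines into one dict per disk (hypothetical helper).

    Assumption: every disk block contains a 'Device' line, which is used
    here as the start of a new block.
    """
    disks: List[Dict[str, str]] = []
    for raw in text.splitlines():
        # Split on the first colon only, so device names containing colons survive.
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # not a key/value line
        key, value = key.strip(), value.strip()
        if key == "Device":
            disks.append({})
        if disks:
            disks[-1][key] = value
    return disks


def disks_out_of_cmmds(text: str) -> List[str]:
    """Return the devices whose 'In CMMDS' field reads false."""
    return [
        disk["Device"]
        for disk in parse_disks(text)
        if disk.get("In CMMDS", "").lower() == "false"
    ]
```

Any device this returns is a candidate for the unmount and remount attempt mentioned above.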

"The vms are working and it seems that there are no objects in error and I do not find objects hosted on diskgroups in the failed state."

Yes, that is expected: all the data would have been rebuilt on the remaining nodes, and while the disk group(s) are still marked as failed, vSAN isn't going to place any data components on them.

"I thought of trying to put the node in maintenance mode (trying to evacuate the data but it should not move anything because the diskgroups were in error) and then a simple node reboot  to see if it active the diskgroup everything."

If all the data from this node has already been rebuilt elsewhere, then placing it in MM with 'Ensure Accessibility' and rebooting it isn't going to negatively affect data availability, so yes, do perform this.

"I opened a call to the support but they are analyzing the logs for a weeks. Until today no news"

Sorry to hear that, can you PM me the Support Request number?

Bob

chris122686
Enthusiast

Q: What does the Physical Disk Health - Overall Disks Health check do?

Checks the physical disk operation status for all hosts in the vSAN cluster.

Q: What does it mean when it is in an error state?

If this check fails, vSAN can no longer use the disk. Possible reasons include physical disk damage, a problem reading the disk metadata, or a vSAN software issue that prevents vSAN from using the disk.

Q: What does it mean when the operational state is Impending permanent disk failure?

Dying Disk Handling (DDH) in vSAN continuously monitors the health of disks and disk groups in order to detect an impending disk failure or a poorly performing disk group. When such conditions are detected, vSAN marks the disk or disk group as unhealthy and might trigger data evacuation from the affected disk or disk group. Such disks and disk groups display an operational state of Impending permanent disk failure. For more information, see Dying Disk Handling (DDH) in vSAN 6.6 (2148358).

Q: How does one troubleshoot and fix the error state?

You need to examine the information displayed as part of the health check.

For example:

  • Is the disk offline or in permanent failure, indicating physical disk damage?
  • Is there a problem reading the metadata of the drive? This implies that the drive is offline and unavailable for use.
  • Is the vSAN software state the root cause? If so, it will in all likelihood affect all of the disks on this host.

Each of these individual checks must be considered to determine the corrective course of action. Some of the checks imply that the drive is offline, others imply that the drive is still online, but some corrective action might be needed.

Please mark this as correct if it answers your question and helps you.

Christopher Sibug