vulpineox
Contributor

vSAN Host PSoD from Capacity Disk Failure


Hi.

I am running a vSAN 7.0b cluster.

One capacity disk has failed in the environment, and the host has gone into a PSoD state.

Regarding this, I checked the KB below:

https://kb.vmware.com/s/article/71207

However, there are areas that I don't understand.

A PSoD does not occur every time a disk fails. Is there a criterion that determines when a PSoD occurs?

Additionally, I am curious about the meaning of 'wedged I/O' or 'lost I/O'.

Thank you for your guidance. 🙂

TheBobkin
VMware Employee
VMware Employee

@vulpineox PSODs don't occur for the vast majority of disk failures; judging by how rarely GSS sees cases of this compared with 'normal' failed disks, likely far fewer than 1% of them. Additionally, in the vast majority of the cases where we do see PSODs, there is some unsupported element (e.g. a controller/disk not on the vSAN HCL, or one not running the driver/firmware listed there).

As the KB says, this only occurs in scenarios where the disk, or the controller managing it, remains non-responsive to more graceful attempts to unmount the device or mark it as PDL for what in storage terms is a very long time (120 seconds). 'Wedged' or 'lost' IOs are those that have been transmitted to the end device but for which no acknowledgement of receipt/commit is returned.
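To make the 'wedged IO' idea concrete, here is a minimal conceptual sketch in Python. It is not how vSAN or ESXi is implemented: the names `OutstandingIO`, `find_wedged_ios` and `should_escalate` are hypothetical, and the only detail taken from the thread is the 120-second window. It flags IOs that were sent to the device but never acknowledged within that window, and treats the situation as escalating only when the graceful handling (unmount / marking PDL) did not succeed.

```python
from dataclasses import dataclass

# Hypothetical threshold, taken from the 120 seconds mentioned in the reply above.
WEDGED_IO_TIMEOUT_S = 120.0


@dataclass
class OutstandingIO:
    io_id: int
    issued_at: float     # time (seconds) the IO was transmitted to the device
    acked: bool = False  # True once the device acknowledges receipt/commit


def find_wedged_ios(ios, now, timeout=WEDGED_IO_TIMEOUT_S):
    """Return ids of IOs sent to the device but not acknowledged within `timeout`."""
    return [io.io_id for io in ios if not io.acked and now - io.issued_at >= timeout]


def should_escalate(ios, now, graceful_unmount_succeeded, timeout=WEDGED_IO_TIMEOUT_S):
    """Hypothetical rule: escalate only if graceful handling failed AND IOs are still wedged."""
    return not graceful_unmount_succeeded and bool(find_wedged_ios(ios, now, timeout))


if __name__ == "__main__":
    ios = [
        OutstandingIO(1, issued_at=0.0, acked=True),  # acknowledged: never wedged
        OutstandingIO(2, issued_at=0.0),              # unacknowledged for 130 s: wedged
        OutstandingIO(3, issued_at=100.0),            # unacknowledged for 30 s: still waiting
    ]
    print(find_wedged_ios(ios, now=130.0))
    print(should_escalate(ios, now=130.0, graceful_unmount_succeeded=False))
```

The point the sketch mirrors is the one made in the reply: a failed disk that is cleanly unmounted or marked PDL never reaches the escalation branch, which is why most disk failures do not end in a PSoD.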

vulpineox
Contributor

Thank you for your detailed and kind answer. Have a good time. 🙂