Have you had the same issue again?
No, it was only once so far.
We had the same issue 4 times in the last 2-3 weeks on one of the hosts.
Do you remember the commands of the missing KB link or do you have any other information?
Thanks a lot
"Is this behavior by design or misconfiguration?"
This is by design. While we test, as rigorously as possible, the components and driver+firmware combinations that we certify on the vSAN HCL, it is not feasible to test every possible combination of components used together. Thus there are scenarios where a disk or controller can end up in a non-responsive state that the various means by which ESXi and vSAN handle such failures cannot resolve gracefully.
Not dealing with such scenarios in a timely manner can have knock-on impacts on the performance and data-sync of the cluster, so PSODing the host after 120 seconds of the problematic component being unable to comply, while a bit brute-force, is in my opinion better than the alternative of doing nothing.
ralfthiel, I would advise against posting information that is (currently) internal-only and/or was shared with you privately by VMware. I am no lawyer, but this could potentially have legal implications and likely violates the terms of your organisation's support contract with VMware.
That being said, I do agree that this should be publicly documented so that a better understanding of the issue is easily accessible. I am currently discussing the feasibility of this with my colleagues.
"have you had the same issue again?"
Ice_Dog_M, you are asking the wrong question here. PSODs are often just the outcome of a different issue with the host; they are not the actual problem. The PSOD here occurs because there was an issue with a disk and/or controller that could not be managed by our array of conventional (and less disruptive) means. Thus the underlying problem needs to be addressed if you do not wish this to recur.
"We had the same issue 4 times in the last 2-3 weeks on one of the hosts."
cblochi, please open a Support Request with us at GSS so that we can assist with identifying and resolving the underlying issue. I would not advise disabling this mechanism unless you have clear evidence that it is being triggered in a scenario where it should not be.
I'm sorry, you're right of course.
I just removed my post.
As I said I would, I have worked with my colleagues to 'un-disappear' the KB article that explains in better detail the rationale and mechanisms of this PSOD:
If this PSOD is encountered, I would advise engaging GSS vSAN and likely your hardware vendor, as the PSOD is not the root problem: the host was PSODed to avoid further impact from controller/disk issues that could not be remediated by any other means.
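As a first triage step before (or alongside) opening the SR, it can help to look through the host's vmkernel log for disk-health messages leading up to the PSOD. A minimal sketch, assuming SSH access to the host and the default log path; the search strings below are examples of messages that commonly precede device problems, and the exact wording varies by ESXi build:

```shell
# Hedged triage sketch: scan a vmkernel log for messages that often
# precede disk/controller-related PSODs. Not an official procedure;
# GSS will want the full log bundle regardless.
scan_vmk_log() {
    # $1: path to a vmkernel log (defaults to the live host log)
    log="${1:-/var/log/vmkernel.log}"
    # Case-insensitive match on a few common disk-health symptoms,
    # keeping only the most recent 20 hits.
    grep -iE "performance has deteriorated|permanently inaccessible|valid sense data" "$log" | tail -n 20
}

# Example (run on the ESXi host):
#   scan_vmk_log /var/log/vmkernel.log
```

Anything matching here points at the device or controller rather than at vSAN itself, which is exactly the kind of evidence the hardware vendor will ask for.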
At the time of my first post, I was using SSDs that are not on the compatibility list.
I reorganized my vSAN and substituted the old SSDs with supported SSDs. So far, it is working.
There might be a (partly) broken disk among the old SSDs; that may be the root cause of the PSOD. I will check this in the coming weeks.
Thanks for your help!
Hello, and thank you for making this KB available again. I had the same issue with the other host and filed a support ticket this time. I got a copy/paste of the article as an answer, plus advice to work this out with the hardware vendor. I think the only fast and reliable option is to remove all disk groups from a couple of hosts and pin the important VMs to them. How would IO/latency perform on compute-only nodes in a vSAN cluster? Does vSAN try to hold VM data near the VM's host?