VMware Cloud Community
Ice_Dog_M
Contributor

PSoD with suspended I/Os on capacity SSD

Hello,

We run a 7-node all-flash vSAN cluster. One host went into a PSoD with the following error. All VMs on the host crashed and were restarted on another host by HA, but we still had to perform some manual steps to start the databases. Is this behavior by design or misconfiguration? I have FTT2 and FTT3 policies, so can VMs still crash because of a single disk problem?
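
For context on the FTT figures mentioned here, a minimal sketch of the arithmetic, assuming the policies use RAID-1 mirroring (the post does not say which RAID level is configured). The function name `raid1_layout` is hypothetical and only illustrates the standard rule that tolerating n failures with mirroring needs n + 1 data replicas and at least 2n + 1 hosts.

```python
def raid1_layout(ftt: int) -> dict:
    """Replica and host counts for a mirrored object that tolerates `ftt` failures.

    Assumption: RAID-1 mirroring. Erasure-coded policies follow different rules.
    """
    if ftt < 0:
        raise ValueError("ftt must be non-negative")
    return {
        "replicas": ftt + 1,       # full copies of the data
        "min_hosts": 2 * ftt + 1,  # hosts needed to keep a quorum after ftt failures
    }


for ftt in (2, 3):
    print(ftt, raid1_layout(ftt))
```

Under that assumption, FTT3 needs exactly the 7 hosts this cluster has. FTT only governs how many copies of the data survive a failure; it does not keep a VM running when the host it runs on halts, which is why HA had to restart the VMs elsewhere.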


Accepted Solutions
TheBobkin
Champion

Hello All,

As I said I would, I have worked with my colleagues to 'un-disappear' the KB article that explains the rationale and mechanisms of this PSOD in better detail:

VMware Knowledge Base

If this PSOD is encountered, I would advise engaging GSS vSAN and likely your hardware vendor as the PSOD is not the root problem - the host was PSODed to avoid further impact due to controller/disk issues that could not be remediated by any other means.

Bob

8 Replies
Ice_Dog_M
Contributor

have you had the same issue again?

ralfthiel
Contributor

No, it was only once so far.

cblochi
Contributor

Hi ralfthiel

We had the same issue 4 times in the last 2-3 weeks on one of the hosts.

Do you remember the commands from the missing KB link, or do you have any other information?

Thanks a lot
cblochi

TheBobkin
Champion

Hello All,

"Is this behavior by design or misconfiguration?"

This is by design - while we test as rigorously as possible any components and driver+firmware combinations that we certify on the vSAN HCL, it is not feasible to test every possible combination of components used together. Thus there are scenarios where a disk or controller can enter a non-responsive state that the various means by which ESXi and vSAN normally deal with such problems cannot handle gracefully.

Not dealing with such scenarios in a timely manner can have knock-on impacts on the performance and data-sync of the cluster. PSODing the host after the problematic component has been unable to comply for 120 seconds, while a bit brute-force, is in my opinion better than the alternative of doing nothing.
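
The timeout behaviour described above can be pictured with a small sketch. This is not VMware's implementation: the class name `StuckIoWatchdog`, its method, and the returned strings are all hypothetical, and the only value taken from this thread is the 120-second threshold. It simply shows the idea of halting a host once a device has stayed unresponsive for longer than a fixed window.

```python
import time

STUCK_IO_TIMEOUT_S = 120.0  # threshold quoted in this thread


class StuckIoWatchdog:
    """Illustrative only: tracks how long I/O on a device has been suspended."""

    def __init__(self, timeout_s=STUCK_IO_TIMEOUT_S, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock            # injectable so the logic can be tested
        self.suspended_since = None   # None means I/O is currently flowing

    def observe(self, io_suspended: bool) -> str:
        now = self.clock()
        if not io_suspended:
            # Device recovered on its own: reset the timer, nothing drastic needed.
            self.suspended_since = None
            return "ok"
        if self.suspended_since is None:
            self.suspended_since = now
        if now - self.suspended_since >= self.timeout_s:
            # Less disruptive remediation did not help within the window.
            return "halt-host"
        return "waiting"
```

The point of the design is that a recovery at any time before the window expires resets the timer, so only a device that stays stuck for the whole window triggers the drastic action.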

ralfthiel, I would advise against posting information that is (currently) internal-only and/or was shared with you privately by VMware. I am no lawyer, but this could potentially have legal implications and/or likely violates the terms of your organisation's support contract with VMware.

That being said, I do agree that this should be publicly documented so that a better understanding of the issue is easily accessible - I am currently engaged with my colleagues regarding the feasibility of this.

"have you had the same issue again?"

Ice_Dog_M, you are asking the wrong question here - PSODs are often just the outcome of a different issue with the host, they are not the actual problem. The PSOD here is occurring because there was an issue with a disk and/or controller that could not be managed by our array of conventional (and less disruptive) means. Thus the underlying problem needs to be addressed if you do not wish this to re-occur.

"We had the same issue 4 times in the last 2-3 weeks on one of the hosts."

cblochi, please open a Support Request with us at GSS so that we can assist with identifying and resolving the underlying issue - I would not advise disabling this mechanism unless you have clear evidence that it is being triggered in some scenario where it should not be.

Bob

ralfthiel
Contributor

I'm sorry, you're right of course.

I just removed my post.

cblochi
Contributor

Hello all,

At the time of my first post, I was using SSDs that are not on the compatibility list.

I reorganized my vSAN and replaced the old SSDs with supported ones. So far, it is working.

There might be a (partly) broken disk among the old SSDs, which may be the root cause of the PSoD. I will check this in the coming weeks.

Thanks for your help!

Best regards

cblochi

Ice_Dog_M
Contributor

Hello, and thank you for making this KB available again. I had the same issue on the other host and filed a support ticket this time. I got a copy/paste of the article as an answer, along with advice to work this out with the hardware vendor. I think the only fast and reliable option is to remove all disk groups from a couple of hosts and pin important VMs to them. How would I/O latency perform on compute-only nodes in a vSAN cluster? Does vSAN try to keep VM data close to the host the VM runs on?
