8 Replies Latest reply on May 6, 2020 6:49 AM by Ice_Dog_M

    PSoD with suspended I/Os on capacity SSD

    Ice_Dog_M Lurker

      Hello,

      We run a 7-node all-flash vSAN cluster. One host went into a PSoD with the following error. All VMs on the host crashed and were restarted on another host by HA, but we still had to perform some manual steps to start the databases. Is this behavior by design or a misconfiguration? I have FTT2 and FTT3 policies, yet VMs can still crash because of a single disk problem?

        • 1. Re: PSoD with suspended I/Os on capacity SSD
          Ice_Dog_M Lurker

          Have you had the same issue again?

          • 2. Re: PSoD with suspended I/Os on capacity SSD
            ralfthiel Novice

            No, it was only once so far.

            • 3. Re: PSoD with suspended I/Os on capacity SSD
              cblochi Lurker

              Hi ralfthiel

               

              We had the same issue 4 times in the last 2-3 weeks on one of the hosts.

               

              Do you remember the commands of the missing KB link or do you have any other information?

               

              Thanks a lot
              cblochi

              • 4. Re: PSoD with suspended I/Os on capacity SSD
                TheBobkin Virtuoso
                vExpert, VMware Employees

                Hello All,

                 

                "Is this behavior by design or misconfiguration?"

                This is by design. While we test as rigorously as possible any components and driver+firmware combinations that we certify on the vSAN HCL, it is not feasible to test every possible combination of components used together. Thus there are scenarios where a disk or controller can enter a non-responsive state that the various means by which ESXi and vSAN deal with such problems cannot handle gracefully.

                Not dealing with such scenarios in a timely manner can have knock-on impacts on the performance and data-sync of the cluster. PSODing the host after the problematic component has been unable to comply for 120 seconds is a bit brute-force, but in my opinion it is better than the alternative of not doing anything.
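
As an illustration only, the "escalate after 120 seconds without progress" idea can be sketched as a simple stall watchdog. This is not how ESXi or vSAN implements it: the class name `StallWatchdog`, its methods, and the device name `ssd0` are hypothetical, and only the 120-second figure comes from this thread.

```python
from dataclasses import dataclass, field
from typing import Dict

# The 120-second limit quoted in this thread; everything else is hypothetical.
STALL_LIMIT_SECONDS = 120.0


@dataclass
class StallWatchdog:
    """Tracks how long each device has gone without completing an I/O."""

    limit: float = STALL_LIMIT_SECONDS
    last_progress: Dict[str, float] = field(default_factory=dict)

    def record_progress(self, device: str, now: float) -> None:
        """Note that `device` completed an I/O at time `now` (seconds)."""
        self.last_progress[device] = now

    def stalled_for(self, device: str, now: float) -> float:
        """Seconds elapsed since `device` last completed an I/O."""
        return now - self.last_progress[device]

    def should_escalate(self, device: str, now: float) -> bool:
        """True once the device has made no progress for `limit` seconds."""
        return self.stalled_for(device, now) >= self.limit


# Usage: timestamps are passed in explicitly so the logic is easy to test.
watchdog = StallWatchdog()
watchdog.record_progress("ssd0", now=0.0)
still_waiting = watchdog.should_escalate("ssd0", now=119.0)   # below the limit
limit_reached = watchdog.should_escalate("ssd0", now=120.0)   # limit reached
```

The point of the sketch is that the drastic action is tied to a lack of progress over a fixed window, not to a single failed I/O, so less disruptive recovery attempts have time to work first.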

                 

                ralfthiel, I would advise against posting information that is (currently) internal-only and/or was shared with you privately by VMware. I am no lawyer, but this could potentially have legal implications and/or likely violates the terms of your organisation's support contract with VMware.

                That being said, I do agree that this should be publicly documented so that a better understanding of the issue is easily accessible. I am currently engaged with my colleagues with regard to the feasibility of this.

                 

                "have you had the same issue again?"

                Ice_Dog_M, you are asking the wrong question here. PSODs are often just the outcome of a different issue with the host; they are not the actual problem. The PSOD here occurred because there was an issue with a disk and/or controller that could not be managed by our array of conventional (and less disruptive) means of doing so. Thus the underlying problem needs to be addressed if you do not wish this to recur.

                 

                "We had the same issue 4 times in the last 2-3 weeks on one of the hosts."

                cblochi, please open a Support Request with us at GSS so that we can assist with identifying and resolving the underlying issue - I would not advise disabling this mechanism unless you have clear evidence that it is being triggered in some scenario where it should not be.

                 

                Bob

                • 5. Re: PSoD with suspended I/Os on capacity SSD
                  ralfthiel Novice

                  I'm sorry, you're right of course.

                  I just removed my post.

                  • 6. Re: PSoD with suspended I/Os on capacity SSD
                    TheBobkin Virtuoso
                    VMware Employees, vExpert

                    Hello All,

                     

                    As I said I would, I have worked with my colleagues to 'un-disappear' the KB article that explains the rationale and mechanisms of this PSOD in better detail:

                    VMware Knowledge Base

                     

                    If this PSOD is encountered, I would advise engaging GSS vSAN and likely your hardware vendor as the PSOD is not the root problem - the host was PSODed to avoid further impact due to controller/disk issues that could not be remediated by any other means.

                     

                    Bob

                    • 7. Re: PSoD with suspended I/Os on capacity SSD
                      cblochi Lurker

                      Hello all,

                       

                      At the time of my first post, I was using SSDs that are not on the compatibility list.

                       

                      I reorganized my vSAN and substituted the old SSDs with supported SSDs. So far, it is working.

                       

                      There might be a (partly) broken disk among the old SSDs, which may be the root cause of the PSoD. I will check this in the coming weeks.

                       

                      Thanks for your help!

                       

                      Best regards

                      cblochi

                      • 8. Re: PSoD with suspended I/Os on capacity SSD
                        Ice_Dog_M Lurker

                        Hello and thank you for making this KB available again. I had the same issue with the other host and filed a support ticket this time. I got a copy/paste of the article as an answer, along with advice to work this out with the hardware vendor. I think that the only fast and reliable option is to remove any disk groups from a couple of hosts and pin important VMs to them. How would I/O latency perform on compute-only nodes in a vSAN cluster? Does vSAN try to keep VM data close to the host the VM runs on?