Jun 7, 2020

    ESXI VMs Getting I/O Errors Every Few Months

    MrZed411 Lurker

      Hi there,


      I have been running two Dell R710s and one Dell R910 with ESXI 6.5 for just over a year now. All of the servers are in a cluster and everything normally works properly. A few months in, my VMs suddenly threw errors and my VCSA VM stopped working. I never managed to diagnose the issue, and it has now occurred again even though everything was reinstalled. All of my Linux VMs logged errors similar to:


      blk_update_request: I/O error, dev sda, sector 424280328

      sd 0:0:0:0: timing out command, waited 1080s

      Buffer I/O error on dev dm-0, logical block 609048, lost async page write


      Each of these lines was repeated around 10 times on all of my Linux VMs across all of my hosts. The Linux VMs also stopped responding to anything, including the services they ran and the local console. For some of them, including the VCSA VM, I had to run fsck before they would work again after a restart.
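
      For anyone wanting to tally these messages per device, here is a minimal Python sketch. The regular expressions are modeled only on the three lines quoted above, and the names (PATTERNS, tally_io_errors) are made up for illustration; real kernel logs may carry timestamps or other prefixes, which the search-based matching should tolerate.

```python
import re
from collections import Counter

# Patterns modeled on the three message shapes quoted in the post.
# They are an assumption about the format, not an exhaustive list.
PATTERNS = {
    "blk_io_error": re.compile(
        r"blk_update_request: I/O error, dev (?P<dev>[^,\s]+), sector \d+"
    ),
    "scsi_timeout": re.compile(
        r"sd (?P<dev>[\d:]+): timing out command, waited \d+s"
    ),
    "buffer_io_error": re.compile(
        r"Buffer I/O error on dev (?P<dev>[^,\s]+), logical block \d+"
    ),
}


def tally_io_errors(lines):
    """Count matching log lines, keyed by (message kind, device)."""
    counts = Counter()
    for line in lines:
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(kind, match.group("dev"))] += 1
                break  # a line matches at most one pattern
    return counts


# The three lines from the post, used here as sample input.
sample = [
    "blk_update_request: I/O error, dev sda, sector 424280328",
    "sd 0:0:0:0: timing out command, waited 1080s",
    "Buffer I/O error on dev dm-0, logical block 609048, lost async page write",
]

for (kind, dev), n in sorted(tally_io_errors(sample).items()):
    print(kind, dev, n)
```

      Running the same tally on each VM's log would show whether the counts and devices line up across guests.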


      Once the VMs are restarted they typically work, but I am afraid of data loss from this issue. My storage is a NAS running FreeNAS, but the VCSA VM, which also had this issue, uses the local storage on one of the hosts, so I know that this is not a network storage problem. This happened across all hosts and VMs at what appears to be the same time.
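
      One way to turn "what appears to be the same time" into something checkable is to compare the first error timestamp from each VM. This is only a sketch: the function name, the five-minute window, and the timestamps are placeholders, not values from my logs.

```python
from datetime import datetime, timedelta


def errors_coincide(first_error_by_vm, window=timedelta(minutes=5)):
    """Return True if all first-error times fall within `window` of each other.

    `first_error_by_vm` maps a VM name to the datetime of its first I/O error.
    With fewer than two VMs there is nothing to compare, so return False.
    """
    times = list(first_error_by_vm.values())
    if len(times) < 2:
        return False
    return max(times) - min(times) <= window


# Placeholder timestamps for illustration only (not real log data).
example = {
    "vcsa": datetime(2020, 1, 1, 3, 0, 10),
    "linux-a": datetime(2020, 1, 1, 3, 0, 42),
    "linux-b": datetime(2020, 1, 1, 3, 2, 5),
}

print(errors_coincide(example))
```

      If the first errors really do land within a few minutes of each other on VMs that share no storage, that points toward something common to every host rather than a single datastore.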


      If you have any ideas or suggestions about what it might be, please let me know.

        • 1. Re: ESXI VMs Getting I/O Errors Every Few Months
          daphnissov Guru
          At any point, did all the VMs now exhibiting this I/O error reside on the same backend storage device, local or networked? Did they all run on the same host? Was there a power outage or some other outage? If not, you might have a hardware failure that impacted them all at once, possibly even a DIMM.

          • 2. Re: ESXI VMs Getting I/O Errors Every Few Months
            MrZed411 Lurker

            No, not all of these VMs started on or even resided on the same storage. My VCSA VM has always been on the local storage of my ESXI-1 Host, while 2 of the Linux VMs that exhibited the same issue were always on the NAS datastore.

            These VMs were on different Hosts. The VCSA VM was on my ESXI-2 host along with a Linux VM that had this issue, while I had 2 Linux VMs on my ESXI-1 host with the issue.


            There was neither a power outage nor a network outage, as everything is on redundant UPSes on separate breakers to minimize the possibility of that happening.


            So to summarize, this happened across all hosts in my cluster, across local and network datastores, and for all of my Linux VMs (I believe it affects Windows too, but Windows may tolerate and recover from this error better, so I can't see an issue with those VMs).

            • 3. Re: ESXI VMs Getting I/O Errors Every Few Months
              daphnissov Guru
              Logically, I can't think of an example event that could affect these disparate VMs in such a similar way if they were indeed as separate as you claim. I think root cause identification, however, is only a secondary concern here, since you're likely more concerned with fixing any corruption that might have occurred. If, once fixed, it appears again, then you've got something to chase.

              • 4. Re: ESXI VMs Getting I/O Errors Every Few Months
                MrZed411 Lurker

                This is actually the 3rd time that this has happened. It happened twice at my old house, and after I moved I reinstalled ESXI and vCenter, set up a new NAS, etc. There doesn't seem to be a large issue with my data currently, as most of the VMs appear to be working after a restart. I did manage to get my VCSA working by following https://vuptime.io/2017/05/10/VMware-VCSA-PSC-wont-boot/. After running fsck, vCenter came back up and appears to be working, but I want to ensure that this problem doesn't occur again.

                • 5. Re: ESXI VMs Getting I/O Errors Every Few Months
                  daphnissov Guru
                  Then start looking for commonalities in the environment. Here are some questions to ask yourself and answer:


                  • What has been done "non-stock" to these 11th-generation Dell servers? Meaning: What hardware has been added, subtracted, or altered?
                    • As part of this, for good measure I would do basic hardware diagnostics on the components which are the lowest-hanging fruit (ex. CPU and memory).
                  • What applications are you using against this vSphere environment? Could be backup, replication, or a VAIO filter. Anything unusual/old/non-compatible/hacky?
                  • Are you using any of those integrated applications strangely/unusually? I.e., are you doing unsupported things because this may be a homelab situation?
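
                  A quick way to work through those questions is to write down the attributes of each affected VM and see which ones every VM shares. A minimal Python sketch, where the function name and the inventory fields are hypothetical; the host and datastore values follow what was described above, and the Linux VM names are placeholders:

```python
def shared_attributes(affected_vms):
    """Return the attribute/value pairs that every affected VM has in common."""
    if not affected_vms:
        return {}
    first, *rest = affected_vms
    return {
        key: value
        for key, value in first.items()
        if all(vm.get(key) == value for vm in rest)
    }


# Hypothetical inventory; extend it with backup tools, firmware, NICs, etc.
affected = [
    {"name": "vcsa", "host": "ESXI-2", "datastore": "local", "esxi_version": "6.5"},
    {"name": "linux-a", "host": "ESXI-1", "datastore": "nas", "esxi_version": "6.5"},
    {"name": "linux-b", "host": "ESXI-1", "datastore": "nas", "esxi_version": "6.5"},
]

print(shared_attributes(affected))
```

                  Whatever survives the intersection (here only the ESXi version, given these placeholder fields) is where to start chasing; the more attributes you record, the more useful the result.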