Had an issue this past week where a customer reported being unable to access a database on their server. While investigating, I could not log on to the server VM (Windows Server 2012 R2), though I could still see its file shares and ping it (so partially functional?). I then tried to log on to the ESXi host and could not reach the <ip>/ui landing page, which led me to suspect the issue lay with the host. The host IP address still answered pings at this point. I was eventually able to access the host over SSH and issued a reboot command. Many minutes later the host rebooted and seemed okay again. The VM was also given a second reboot as a precaution. Everything is functional at this time. This is the first time I can recall the host doing this, though a few times this year the USB drives have failed to pass through to the VM until the host was rebooted.
I have been trying to figure out what happened, but I have not been able to nail down a culprit to my satisfaction, though I do suspect it is related to the storage controller (this machine has a history of poor I/O). The host controller is showing a disk queue length of 242 (image attached), and it never seems to change. The Windows Server VM shows a disk queue length under 2 unless the search service is running, which I have now disabled. All hardware sensors in the host UI are green. There are a couple of warning events on the event history page, but those seem to be related to the USB backup drives that are swapped every day (disconnect/reconnect type events). The USB drives are passed through to the server VM so it can use them.
In the VMware observer daemon log there are a number of these (the time coincides with when the USB drive was swapped today, so probably not important):
2022-10-23T08:49:06.001Z: Successfully sent event (esx.audit.net.firewall.config.changed) after 1 failure.
In the VMkernel warnings log there are a lot of identical events like this (as if an I/O transfer is timing out?):
2022-10-24T21:08:53.594Z cpu2:36056)WARNING: LinScsi: SCSILinuxProcessCompletions:826: Error BytesXferred > Requested Length Marking transfer length as 0 - vmhba = vmhba1, Driver Name = hpvsa, Requested length = 0, Resid = 24
In vmkernel.log there are a LOT of events like these (which is what leads me to suspect the storage controller):
2022-10-24T20:54:00.910Z cpu2:36561)FS3Misc: 1759: Long VMFS rsv time on 'ESX01-DS01-RAID1' (held for 285 msecs). # R: 1, # W: 1 bytesXfer: 5 sectors
2022-10-24T20:59:54.092Z cpu0:36054)<6>xhci_hcd 0000:00:14.0: Waiting for status stage event
2022-10-24T20:59:54.095Z cpu0:36054)<6>xhci_hcd 0000:00:14.0: Waiting for status stage event
2022-10-24T20:59:54.097Z cpu0:36054)<6>xhci_hcd 0000:00:14.0: Waiting for status stage event
2022-10-24T20:59:54.100Z cpu0:36054)<6>xhci_hcd 0000:00:14.0: Waiting for status stage event
2022-10-24T21:08:53.570Z cpu2:36056)WARNING: LinScsi: SCSILinuxProcessCompletions:826: Error BytesXferred > Requested Length Marking transfer length as 0 - vmhba = vmhba1, Driver Name = hpvsa, Requested length = 0, Resid = 24
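To get a feel for how often each of these messages shows up, something like the following rough Python sketch could tally them. It only knows about the three message types quoted above; the category names, the `classify` and `summarize` helpers, and the sample text are my own illustration, not anything provided by ESXi, and the line pattern is assumed from the excerpts rather than from a documented format.

```python
import re
from collections import Counter

# Sample lines copied from the excerpts above (one of each message type).
SAMPLE = """\
2022-10-24T20:54:00.910Z cpu2:36561)FS3Misc: 1759: Long VMFS rsv time on 'ESX01-DS01-RAID1' (held for 285 msecs). # R: 1, # W: 1 bytesXfer: 5 sectors
2022-10-24T20:59:54.092Z cpu0:36054)<6>xhci_hcd 0000:00:14.0: Waiting for status stage event
2022-10-24T21:08:53.570Z cpu2:36056)WARNING: LinScsi: SCSILinuxProcessCompletions:826: Error BytesXferred > Requested Length Marking transfer length as 0 - vmhba = vmhba1, Driver Name = hpvsa, Requested length = 0, Resid = 24
"""

# Assumed layout: "<UTC timestamp> cpu<N>:<world id>)<message>"
LINE_RE = re.compile(r"^(\S+Z) cpu\d+:\d+\)(.*)$")
RSV_RE = re.compile(r"Long VMFS rsv time on '([^']+)' \(held for (\d+) msecs\)")


def classify(message):
    """Bucket a message into one of the three types seen in this thread."""
    if "Long VMFS rsv time" in message:
        return "vmfs_long_reservation"
    if "xhci_hcd" in message:
        return "usb_xhci"
    if "Driver Name = hpvsa" in message:
        return "hpvsa_transfer_length"
    return "other"


def summarize(text):
    """Return (counts per category, list of VMFS reservation hold times in ms)."""
    counts = Counter()
    rsv_msecs = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip anything that does not look like a vmkernel line
        message = match.group(2)
        counts[classify(message)] += 1
        rsv = RSV_RE.search(message)
        if rsv:
            rsv_msecs.append(int(rsv.group(2)))
    return counts, rsv_msecs


if __name__ == "__main__":
    counts, rsv_msecs = summarize(SAMPLE)
    for category, count in counts.most_common():
        print(category, count)
    print("reservation hold times (ms):", rsv_msecs)
```

Run against a copy of the full vmkernel.log instead of `SAMPLE`, this would show whether the hpvsa warnings and the long VMFS reservations cluster together or are spread evenly through the day.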
This host was configured with a software RAID 1 (mirror) array (VMware mirror drive).
Any thoughts on how to diagnose the issue further?
I was going to recommend that the customer obtain a new server anyway (this one is 7+ years old and it is their only server). They could use some redundancy, but it will be the usual challenge to convince them to approve the budget, and I'd like to be confident that this one can serve as a 'hot standby' (if it can).
Found this after further digging:
Device or filesystem with identifier mpx.vmhba34:C0:T0:L0 has entered the All Paths Down state.
Sunday, October 23, 2022, 01:49:06 -0700
AND this one:
Device or filesystem with identifier mpx.vmhba32:C0:T0:L0 has entered the All Paths Down state.
Sunday, October 23, 2022, 01:49:06 -0700
These two devices are indeed the primary storage drives, and both are attached to the same controller. The server and VMs all seem fully functional at this time, so I presume the events above were followed by reset events a few seconds later (still digging, but now I have a time frame to focus on).
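Since the logs are stamped in UTC and the host UI events carry a local offset, a small Python sketch like this one can line the two up when narrowing down that time frame. The helper names and the 60-second window are my own choices for illustration, and the event format string is assumed from the two All Paths Down entries quoted above (English month and day names).

```python
from datetime import datetime, timedelta, timezone


def parse_log_ts(ts):
    """Parse a log timestamp such as 2022-10-23T08:49:06.001Z (UTC)."""
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def parse_event_ts(ts):
    """Parse a host UI event time such as 'Sunday, October 23, 2022, 01:49:06 -0700'."""
    return datetime.strptime(ts, "%A, %B %d, %Y, %H:%M:%S %z")


def within(log_ts, event_ts, seconds=60):
    """True if the log line falls within +/- `seconds` of the UI event."""
    delta = abs(parse_log_ts(log_ts) - parse_event_ts(event_ts))
    return delta <= timedelta(seconds=seconds)


if __name__ == "__main__":
    apd_event = "Sunday, October 23, 2022, 01:49:06 -0700"
    for log_ts in ("2022-10-23T08:49:06.001Z", "2022-10-24T21:08:53.594Z"):
        print(log_ts, within(log_ts, apd_event))
```

With the timestamps quoted in this thread, the observer daemon line at 2022-10-23T08:49:06.001Z lands on the same second as both All Paths Down events once the -0700 offset is applied, while the hpvsa warning from the evening of the 24th falls well outside the window.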