I have an ESXi issue where it hangs doing pretty much anything: VM power on/off, vMotion, etc. For example, I just told a VM to shut down the guest OS and that task has been sitting at 0% for the last 10 mins. Meanwhile, if I connect via SSH and do something simple like "esxcli vm process list" it hangs on that pretty much forever. Another weird behavior is that if I do "reboot" or "poweroff" it appears to work - as in, it accepts the command and returns to the prompt - but it never reboots or shuts down.
This happened with one of my hosts and I resorted to pulling the plug because nothing else worked. About half an hour after that host came back up, apparently without issue, the exact same thing is now happening to the second host.
Can anyone help me fix this?
It could be a problem with the external storage, the HBA, or the cables. Check the logs on both the storage and the servers.
It would also be good to check the firmware levels and the HBA firmware recommendations for the Synology DS720 against VMware 7.0.2.
Logs attached for the period 01:00 to 04:30 UTC on Apr 15, which covers the time when the second host was acting up.
I'm not very familiar with ESXi log analysis but I did notice a lot of messages in the format:
...access("[path]") took over [high number] sec.
Where "high number" was regularly in the hundreds of seconds.
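In case it helps anyone triaging the same symptom: here's a quick sketch I put together to pull out the worst of those "took over N sec" lines from a saved hostd.log. The message format is assumed from the lines above, so adjust the regex if yours differ.

```python
import re

# Matches IoTracker-style lines such as:
#   hostd [Originator@6876 sub=IoTracker] In thread 536555,
#   access("/vmfs/volumes/.../catalog") took over 985 sec.
PATTERN = re.compile(r'access\("(?P<path>[^"]+)"\) took over (?P<secs>\d+) sec')

def worst_access_delays(lines, top=5):
    """Return (seconds, path) pairs for the longest reported delays, worst first."""
    hits = [(int(m.group("secs")), m.group("path"))
            for m in (PATTERN.search(line) for line in lines)
            if m]
    return sorted(hits, reverse=True)[:top]
```

Feed it an open file or a list of lines, e.g. `worst_access_delays(open("hostd.log"))`, and it returns the slowest accesses so you can see which datastore paths are stalling.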
No warnings or errors in the logs on the NAS side at all.
Searching on 'Warning' in vCenter events I see a lot of stuff like this (really confused about the bootbank warning since that's referring to the USB key, which is visible in storage devices):
"It would also be good to check the firmware levels and the HBA firmware recommendations for the Synology DS720 against VMware 7.0.2."
Are HBAs relevant if this is software iSCSI? I'm assuming they're not. The hosts are connected to a basic layer 2 switch via a single CAT5 each. The NAS has two NICs connected to the same switch, also via CAT5.
Looks like the DS720+ isn't on the compatibility list for iSCSI. However, the version of the Synology firmware I have is on the list for NFS. So maybe one possible solution is to just forget about trying to do this with iSCSI and use NFS instead?
That's a clear indication of a latency problem. HBAs don't play a role here. Check NIC driver/firmware and storage firmware compatibility. If there is congestion at the switch or elsewhere, moving to NFS will not improve the situation.
You can begin by following the instructions in VMware KB article https://kb.vmware.com/s/article/1027206 to identify the driver and firmware of your I/O devices. Then search the VMware Compatibility Guide for the Vendor ID (VID), Device ID (DID), Sub-Vendor ID (SVID), and Sub-Device ID (SDID). The video linked at the end of the article walks through the steps.
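If you end up doing this for several devices, a small parser can save some squinting. This is a sketch only: the sample line format below is an assumption based on typical `vmkchdev -l` output on an ESXi host (position of the VID:DID and SVID:SDID pairs), not something taken from this thread, so verify against your own output first.

```python
def parse_vmkchdev_line(line):
    """Split one assumed `vmkchdev -l` output line into the four VCG lookup IDs.

    Assumed format (hex ID pairs, colon-separated):
      0000:02:00.0 8086:10fb 15d9:0611 vmkernel vmnic0
    """
    fields = line.split()
    vid, did = fields[1].split(":")    # Vendor ID : Device ID
    svid, sdid = fields[2].split(":")  # Sub-Vendor ID : Sub-Device ID
    return {"VID": vid, "DID": did, "SVID": svid, "SDID": sdid,
            "device": fields[-1]}
```

The returned dict gives you the four values in the order the Compatibility Guide search form asks for them.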
The suspicion of congestion stands out here: hostd is exhausting its resources while trying to keep up with the virtual machines' demands and pass their I/O through to the storage device. Start with device validation, then work through the other components of your datacenter to isolate the issue.
Can you also share log snippets from hostd and vmkernel at the time of the alert notification?
"You can begin by following the instructions in VMware KB article https://kb.vmware.com/s/article/1027206 to identify the driver and firmware of your I/O devices. Then search the VMware Compatibility Guide for the Vendor ID (VID), Device ID (DID), Sub-Vendor ID (SVID), and Sub-Device ID (SDID). The video linked at the end of the article walks through the steps."
Thanks, this looks like a better method of assessing compatibility than my previous attempts with the compatibility guide.
"Can you also share log snippets from hostd and vmkernel at the time of the alert notification?"
I already did when you asked me last time. 😉 See this post back up the page.
Apologies, I overlooked the attachment.
There is log spew in the hostd logs. One message comes from IoTracker, which checks the status of the storage device while updating performance stats to vCenter:
“hostd [Originator@6876 sub=IoTracker] In thread 536555, access("/vmfs/volumes/606a4f4a-5c745a38-278c-1c697a671244/catalog") took over 985 sec.”
“Task Created : haTask-ha-host-vim.HostSystem.retrieveInternalCapability-41285”
Let me know if this helps improve the situation.
Apologies for not replying sooner on this one!
Before I saw your previous reply, I decided to reinstall ESXi on the SSDs in each node rather than continuing to boot from the USB flash drives. Ever since then, both nodes have been 100% fine.
I can see no obvious reason why this should be the case since the installation and configuration was identical both times and since, after booting from a flash drive, I don't think ESXi uses it at all. But my gut was telling me that something about those drives wasn't kosher. It looks like my gut was right!
I may be having a similar issue. I upgraded two servers last week from 6.7 to 7.0.2 build 17867351. Today I came in and couldn't manage the hosts or VMs through vCenter. I could connect to a host directly, but couldn't manage the VMs there either; every command just sat at 0%. The only thing that worked was remote desktop into the virtual machines, so I know they were still running fine. I couldn't enter maintenance mode or reboot the host from vCenter or the ESXi UI. I had to do a hardware reset, and now that it's back up, everything is fine.
I used the Dell image, DEL-ESXi-702_17630552-A00 (Dell Inc.), and then patched (updated) it.
My servers run on mirrored SD cards. I created a baseline for the hardware vendor updates and will try to see if installing those helps.