I have an ESXi issue where it hangs doing pretty much anything: VM power on/off, vMotion, etc. For example, I just told a VM to shut down the guest OS and that task has been sitting at 0% for the last 10 mins. Meanwhile, if I connect via SSH and do something simple like "esxcli vm process list" it hangs on that pretty much forever. Another weird behavior is that if I do "reboot" or "poweroff" it appears to work - as in, it accepts the command and returns to the prompt - but it never reboots or shuts down.
This happened with one of my hosts and I resorted to pulling the plug because nothing else worked. About half an hour after that host came back up, apparently without issue, the exact same thing is now happening to the second host.
Can anyone help me fix this?
It could be a problem with the external storage, the HBA, or the cabling. Check the logs on both the storage and the servers.
it would also be good to check the firmware levels and the HBA firmware recommendations for the Synology DS720 in VMware 7.0.2
Logs attached for the period 01:00 to 04:30 UTC on Apr 15, which covers the time when the second host was acting up.
I'm not very familiar with ESXi log analysis but I did notice a lot of messages in the format:
...access("[path]") took over [high number] sec.
Where "high number" was regularly in the hundreds of seconds.
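In case it helps anyone reproduce this check, a small grep/awk pipeline will pull those latency numbers out of hostd.log. A rough sketch: the sample line below stands in for the real log so the pipeline runs anywhere, and the path and value are illustrative (on a host you would feed it /var/log/hostd.log):

```shell
# Sketch: extract the "took over N sec" latency from hostd-style log lines.
# A sample line stands in for /var/log/hostd.log so this runs anywhere.
sample='hostd [Originator@6876 sub=IoTracker] In thread 536555, access("/vmfs/volumes/xyz/catalog") took over 985 sec.'
printf '%s\n' "$sample" |
  grep -o 'took over [0-9]* sec' |   # isolate the latency phrase
  awk '{print $3}'                   # keep just the number of seconds
```

On a live host you could sort the extracted numbers to see how bad the worst stalls are.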
No warnings or errors in the logs on the NAS side at all.
Searching on 'Warning' in vCenter events I see a lot of stuff like this (really confused about the bootbank warning since that's referring to the USB key, which is visible in storage devices):
"it would also be good to check the firmware levels and the HBA firmware recommendations for the Synology DS720 in VMware 7.0.2"
Are HBAs relevant if this is software iSCSI? I'm assuming they're not. The hosts are connected to a basic layer 2 switch via a single CAT5 each. The NAS has two NICs connected to the same switch, also via CAT5.
Looks like the DS720+ isn't on the compatibility list for iSCSI. However, the version of the Synology firmware I have is on the list for NFS. So maybe one possible solution is to just forget about trying to do this with iSCSI and use NFS instead?
This is a clear indication of a latency problem. The HBA doesn't play a role here. Check NIC driver/firmware and storage firmware compatibility. If there is congestion at the switch or elsewhere, moving to NFS will not improve the situation.
You can begin by following instructions in VMware KB article https://kb.vmware.com/s/article/1027206 to learn about driver and firmware of IO devices. Then you need to Search the VMware Compatibility Guide for the Vendor ID (VID), Device ID (DID), Sub-Vendor ID (SVID), and Sub-Device ID (SDID). Please watch the video linked at the end of the article explaining the steps.
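To illustrate that step: the IDs the Compatibility Guide search asks for can be read from `vmkchdev -l` on the host, where each PCI device line carries a VID:DID pair and an SVID:SDID pair. A sketch that splits one such line apart (the sample line and IDs below are made up; on a real host you'd run something like `vmkchdev -l | grep vmnic0`):

```shell
# Sketch: split a vmkchdev -l style line into the four IDs the VMware
# Compatibility Guide wants. The sample line uses made-up IDs.
line='0000:02:00.0 8086:10fb 103c:17d3 vmkernel vmnic0'
set -- $line                   # $2 = VID:DID, $3 = SVID:SDID
vid=${2%%:*};  did=${2##*:}
svid=${3%%:*}; sdid=${3##*:}
echo "VID=$vid DID=$did SVID=$svid SDID=$sdid"
```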
The suspicion of congestion is reinforced here, as hostd is exhausting its resources while trying to keep up with the demands of the virtual machines and pass their IO through to the storage device. Start with device validation, then evaluate the other components of your environment to narrow down the issue.
Can you also share log snippets of hostd and vmkernel at the time of the alert notification.
"You can begin by following instructions in VMware KB article https://kb.vmware.com/s/article/1027206 to learn about driver and firmware of IO devices. Then you need to Search the VMware Compatibility Guide for the Vendor ID (VID), Device ID (DID), Sub-Vendor ID (SVID), and Sub-Device ID (SDID). Please watch the video linked at the end of the article explaining the steps."
Thanks, this looks like a better method of assessing compatibility than my previous attempts with the compatibility guide.
"Can you also share log snippets of hostd and vmkernel at the time of the alert notification."
I already did when you asked me last time. 😉 See this post back up the page.
Apologies, I overlooked the attachment.
Log spew is visible in the hostd logs. One source is IoTracker querying the status of a storage device; another is a task updating performance stats to vCenter.
“hostd [Originator@6876 sub=IoTracker] In thread 536555, access("/vmfs/volumes/606a4f4a-5c745a38-278c-1c697a671244/catalog") took over 985 sec.”
“Task Created : haTask-ha-host-vim.HostSystem.retrieveInternalCapability-41285”
Let me know if this helps in improving the situation.
Apologies for not replying sooner on this one!
Before I saw your previous reply, I decided to reinstall ESXi on the SSDs in each node rather than continuing to boot from the USB flash drives. Ever since then, both nodes have been 100% fine.
I can see no obvious reason why this should be the case since the installation and configuration was identical both times and since, after booting from a flash drive, I don't think ESXi uses it at all. But my gut was telling me that something about those drives wasn't kosher. It looks like my gut was right!
I may be having a similar issue. I upgraded two servers last week from 6.7 to 7.0.2 build 17867351. Today I came in and couldn't manage the hosts or VMs through vCenter. I could connect to the host directly but couldn't manage the VMs there either. Every command just sits at 0%. The only thing I could do was remote desktop to the virtual machines, so I know they are still working fine. I couldn't enter maintenance mode or reboot the host from vCenter or the ESXi UI. I had to do a hardware reset, and now that it's back up, everything is fine.
I used the Dell image and then patched (Updated) DEL-ESXi-702_17630552-A00 (Dell Inc.)
My servers run on mirrored SD cards. I created a baseline for the hardware vendor updates and will try to see if installing those helps.
@MattGoddard I know you have since started booting from SSD and the issue is no longer present... but if you want to experiment with this issue: the cause of your "Bootbank cannot be found at path /bootbank" warning and the extreme latency/hanging is likely an APD (all-paths-down) condition on your USB boot media, which in turn could be due either to a hardware/firmware issue OR to media corruption. As you already identified, 100% this is because your boot media is USB.

Your assumption that "after booting from a flash drive, I don't think ESXi uses it at all" is no longer correct by default in vSphere 7, which uses a modified boot partition layout. In ESXi 7, the small and large core dumps, the locker, and scratch are now located on a new "ESX-OSData" partition on your boot device (provided it is large enough), formatted as VMFS-L. So unless you manually redirect coredump and scratch to a high-endurance storage device, you are effectively overwhelming the I/O capabilities of your boot device by streaming your system logs to it during normal operations - this sustained high data rate will eventually corrupt the boot media.

Booting from a non-high-endurance device like a USB stick or SD card is still a supported way of running ESXi, but you MUST create a core dump file on a datastore backed by high-endurance media, and also assign scratch to a directory on a datastore backed by high-endurance media such as a local HDD or SSD, a shared storage datastore, etc. I would encourage you to read Niels Hagoort's excellent blog posts on the subject here:
and also read the following KB article which describes the risk of boot media corruption:
Sadly, I feel this is a ticking time bomb for many who upgrade to vSphere 7 without knowing about the boot device formatting changes. The recommended boot device is now an HDD or SSD. Too bad for me and a lot of vSphere admins who spent considerable time over the last five years building diskless host environments and getting rid of all our spinning rust.
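For reference, the redirection described above can be done from the ESXi shell. A sketch only - "datastore1" and the scratch directory name are placeholders for your own high-endurance datastore, and you should follow the KB for the authoritative steps:

```shell
# Sketch: point coredump and scratch at a high-endurance datastore.
# "datastore1" and the .locker directory name are placeholders.

# Create a core dump file on the datastore and let ESXi activate it
esxcli system coredump file add -d datastore1 -f coredump
esxcli system coredump file set --smart --enable true

# Redirect scratch (takes effect after the next reboot)
esxcli system settings advanced set \
  -o /ScratchConfig/ConfiguredScratchLocation \
  -s /vmfs/volumes/datastore1/.locker-esxi01
```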
"unless you manually redirect coredump and scratch to a high-endurance storage device, you are effectively overwhelming the I/O capabilities of your boot device by streaming your system logs to it during normal operations - this sustained high data rate will eventually corrupt the boot media"
This makes 100% sense as an explanation for both the behavior at the time and why it's no longer an issue since I moved the install to the SSD.
Ironically, I did redirect the coredump to shared HDD storage. I must not have moved the scratch. I may attempt this again and see if I can get it working in a stable fashion.
"Sadly, I feel this is a ticking time bomb for many who upgrade to vSphere 7 without knowing about the boot device formatting changes."
It's quite possible that this would've been a ticking time bomb for me had I not read your post! At work, I have a production cluster of Dell rack servers that need to be upgraded from 6.7 to 7.0 and they all boot from SD cards. So, thanks for this info!
According to https://kb.vmware.com/s/article/83376
Hasn't that always been the case? I checked my 7.0.2 host and scratch is still set to my datastore, not my SD card, which is what I had to do in 6.7 as well. Is there anything else we have to do to avoid corruption?
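For what it's worth, the current settings are easy to double-check from the shell. A sketch (all read-only commands, so safe to run on a live host):

```shell
# Sketch: inspect where scratch and coredump currently point (read-only).
esxcli system settings advanced list -o /ScratchConfig/ConfiguredScratchLocation
esxcli system settings advanced list -o /ScratchConfig/CurrentScratchLocation
esxcli system coredump file get        # active dump file, if any
esxcli system coredump partition get   # active dump partition, if any
```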