ESXi hanging issues on multiple hosts

MattGoddard · ‎04-14-2021

I have an ESXi issue where it hangs doing pretty much anything: VM power on/off, vMotion, etc. For example, I just told a VM to shut down the guest OS and that task has been sitting at 0% for the last 10 mins. Meanwhile, if I connect via SSH and do something simple like "esxcli vm process list" it hangs on that pretty much forever. Another weird behavior is that if I do "reboot" or "poweroff" it appears to work - as in, it accepts the command and returns to the prompt - but it never reboots or shuts down.

This happened with one of my hosts and I resorted to pulling the plug because nothing else worked. About half an hour after that host came back up, apparently without issue, the exact same thing is now happening to the second host.

Can anyone help me fix this?

Environment details:

ESXi 7.0.2 build-17630552
vCenter 7.0.2 build-17694817
Hosts: 2 x Intel NUC10I5FNH, booting from USB keys
Shared storage: Synology DS720+ NAS (connected to hosts via iSCSI)

Virt-aid · ‎04-15-2021

Matt,

Your issue is most likely related to hostd exhausting its resources. Could you please share the vmkernel and hostd logs from the time the issue occurred?

e_espinel · ‎04-15-2021

Hello.
It could be a problem in the external storage, in your HBA, or in the cables. Check the logs of the storage and the servers.

It would also be good to check the firmware levels and the HBA firmware recommendations for the Synology DS720 in VMware 7.0.2.

Enrique Espinel
Senior Technical Support on IBM, Lenovo, Veeam Backup and VMware vSphere.
VSP-SV, VTSP-SV, VTSP-HCI, VTSP
Please mark my comment as Correct Answer or assign Kudos if my answer was helpful to you, Thank you.

MattGoddard · ‎04-15-2021

Logs attached for the period 01:00 to 04:30 UTC on Apr 15, which covers the time when the second host was acting up.

I'm not very familiar with ESXi log analysis but I did notice a lot of messages in the format:

...access("[path]") took over [high number] sec.

Where "high number" was regularly in the hundreds of seconds.
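
A rough way to pull these messages out of a hostd log and list the slowest ones first. This is only a sketch that assumes the message format quoted above; slow_access is a made-up helper name, not an ESXi tool:

```shell
# Sketch: list hostd 'access("...") took over N sec' messages, slowest first.
# slow_access is a hypothetical helper; the sed pattern assumes the
# message format quoted in this thread.
slow_access() {
    # $1 = path to a hostd log file, e.g. /var/run/log/hostd.log
    grep 'took over' "$1" \
        | sed -n 's/.*access("\([^"]*\)") took over \([0-9][0-9]*\) sec.*/\2 \1/p' \
        | sort -rn
}
```

Usage would be something like `slow_access /var/run/log/hostd.log | head`, which prints the number of seconds followed by the path for each match.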

No warnings or errors in the logs on the NAS side at all.

Searching on 'Warning' in vCenter events I see a lot of stuff like this (really confused about the bootbank warning since that's referring to the USB key, which is visible in storage devices):

MattGoddard · ‎04-15-2021

"it would also be good to check the firmware levels and the HBA firmware recommendations for the Synology DS720 in VMware 7.0.2"

Are HBAs relevant if this is software iSCSI? I'm assuming they're not. The hosts are connected to a basic layer 2 switch via a single CAT5 each. The NAS has two NICs connected to the same switch, also via CAT5.

Looks like the DS720+ isn't on the compatibility list for iSCSI. However, the version of the Synology firmware I have is on the list for NFS. So maybe one possible solution is to just forget about trying to do this with iSCSI and use NFS instead?

Virt-aid · ‎04-16-2021

That is a clear indication that latency is a factor here. The HBA doesn't play a role here. Check on NIC driver/firmware and Storage firmware compatibility. If there is congestion at switch or elsewhere, moving to NFS will not improve the situation.

Virt-aid · ‎04-16-2021

Did you have a chance to figure out the cause?

MattGoddard · ‎04-17-2021

"Check on NIC driver/firmware and Storage firmware compatibility."

How would I do that?

"congestion at switch or elsewhere"

It's definitely not a network congestion issue.

Virt-aid · ‎04-17-2021

You can begin by following instructions in VMware KB article https://kb.vmware.com/s/article/1027206 to learn about driver and firmware of IO devices. Then you need to Search the VMware Compatibility Guide for the Vendor ID (VID), Device ID (DID), Sub-Vendor ID (SVID), and Sub-Device ID (SDID). Please watch the video linked at the end of the article explaining the steps.

I raised the suspicion of congestion because hostd is exceeding its resources while coping with the demands from the virtual machines and transmitting their I/O to the end device. Begin with device validation, then proceed to evaluating the other components of your datacenter to fix your issue.

Can you also share log snippets of hostd and vmkernel at the time of the alert notification.
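
As a rough illustration of the ID lookup, the KB has you run `vmkchdev -l | grep vmnic` on the host and read the IDs off each line. The sketch below splits such a line into the four values to search for in the Compatibility Guide. pci_ids is a hypothetical helper name, and the field layout it expects is an assumption based on the KB's example output:

```shell
# Sketch: turn lines of `vmkchdev -l` output into the four IDs.
# pci_ids is a hypothetical helper; assumed field layout per line:
#   <pci address> <VID>:<DID> <SVID>:<SDID> <owner> <device name>
pci_ids() {
    awk '{
        split($2, dev, ":"); split($3, sub_ids, ":")
        printf "%s VID=%s DID=%s SVID=%s SDID=%s\n", $NF, dev[1], dev[2], sub_ids[1], sub_ids[2]
    }'
}

# On the host (not run here): vmkchdev -l | grep vmnic | pci_ids
```

The driver and firmware versions for a given NIC can then be read with `esxcli network nic get -n vmnic0` (substitute your own vmnic name).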

MattGoddard · ‎04-25-2021

"You can begin by following instructions in VMware KB article https://kb.vmware.com/s/article/1027206 to learn about driver and firmware of IO devices. Then you need to Search the VMware Compatibility Guide for the Vendor ID (VID), Device ID (DID), Sub-Vendor ID (SVID), and Sub-Device ID (SDID). Please watch the video linked at the end of the article explaining the steps."

Thanks, this looks like a better method of assessing compatibility than my previous attempts with the compatibility guide.

"Can you also share log snippets of hostd and vmkernel at the time of the alert notification."

I already did when you asked me last time. 😉 See this post back up the page.

Virt-aid · ‎04-25-2021

Apologies, I overlooked the attachment.

Log spew has been observed in the hostd logs.

One source is IoTracker, inquiring about the status of the storage device, and another is updating performance stats to vCenter.

“hostd[527312] [Originator@6876 sub=IoTracker] In thread 536555, access("/vmfs/volumes/606a4f4a-5c745a38-278c-1c697a671244/catalog") took over 985 sec.”

“Task Created : haTask-ha-host-vim.HostSystem.retrieveInternalCapability-41285”

Let me know if this helps in improving the situation:

Log in to the affected ESXi host via SSH.
Take a backup of the config file by running the command below:

cp /etc/vmware/hostd/config.xml /etc/vmware/hostd/config.xml-backup

Edit /etc/vmware/hostd/config.xml using the vi editor.
Add the line below, directly after the opening <config> tag:

<ioTrackers> false </ioTrackers>

Example:
Before change
<config>


<version>6.6.0.0</version>

After change
<config>

<ioTrackers> false </ioTrackers>

<version>6.6.0.0</version>

Save and close the file.
Restart the hostd service by running the following command:

/etc/init.d/hostd restart
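
If you would rather script the edit than use vi, the backup and insert steps above could be sketched roughly as follows. disable_iotrackers is a made-up function name, and it assumes the opening <config> tag sits on a line of its own as in the example; it does not restart hostd:

```shell
# Sketch of the backup + edit steps as a function. disable_iotrackers is a
# hypothetical name. It does not restart hostd.
disable_iotrackers() {
    cfg="$1"    # e.g. /etc/vmware/hostd/config.xml
    # Do nothing if the setting is already present.
    grep -q '<ioTrackers>' "$cfg" && return 0
    # Back up first, then rewrite the file from the backup, adding the
    # new line right after the first line containing <config>.
    cp "$cfg" "$cfg-backup" || return 1
    awk '{ print }
         /<config>/ && !done { print "<ioTrackers> false </ioTrackers>"; done = 1 }' \
        "$cfg-backup" > "$cfg"
}
```

Run it as `disable_iotrackers /etc/vmware/hostd/config.xml`, check the result against the backup, and then restart hostd as described above.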

MattGoddard · ‎05-07-2021

Apologies for not replying sooner on this one!

Before I saw your previous reply, I decided to reinstall ESXi on the SSDs in each node rather than continuing to boot from the USB flash drives. Ever since then, both nodes have been 100% fine.

I can see no obvious reason why this should be the case since the installation and configuration were identical both times and since, after booting from a flash drive, I don't think ESXi uses it at all. But my gut was telling me that something about those drives wasn't kosher. It looks like my gut was right!

kurtd · ‎05-10-2021

I may be having a similar issue. I upgraded two servers last week from 6.7 to 7.0.2 build 17867351. Today I came in and couldn't manage the hosts or vms through vcenter. I could connect to the host directly but couldn't manage the vms there either. Every command just sits at 0%. The only thing I could do is remote desktop to the virtual machines so I know they are still working fine. Couldn't enter maintenance mode or reboot the host from vcenter or esxi ui. I had to do a hardware reset and now that it's back up, everything is fine.

I used the Dell image and then patched (Updated) DEL-ESXi-702_17630552-A00 (Dell Inc.)

My servers run on mirrored SD cards. I created a baseline for the hardware vendor updates and will try to see if installing those helps.

Virt-aid · ‎05-18-2021

Let’s do this:

Filter the vmkwarning, vmkernel and hostd logs for bootbank messages and share the output here:

grep -i bootbank /var/run/log/vmkwarning.log

Do the same for the other logs as well.
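
To run the same search over several logs in one go, something like the sketch below should do. grep_logs is a hypothetical helper name, and the log paths in the usage line are assumptions about where your host keeps them:

```shell
# Sketch: case-insensitive search for one pattern across several log files.
# grep_logs is a hypothetical helper; pass whichever log paths exist on your host.
grep_logs() {
    pattern="$1"; shift
    for f in "$@"; do
        [ -r "$f" ] || continue          # skip logs that are missing
        echo "== $f"
        grep -i -- "$pattern" "$f" || echo "(no matches)"
    done
}
```

For example: `grep_logs bootbank /var/run/log/vmkwarning.log /var/run/log/vmkernel.log /var/run/log/hostd.log`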

PatrickDLong · ‎05-20-2021

@MattGoddard I know you have since started booting from SSD and the issue is no longer present, but if you want to experiment with this issue: the cause of your "Bootbank cannot be found at path /bootbank" warnings and the extreme latency/hanging is likely an APD (All Paths Down) condition on your USB boot media, which in turn could be due either to a hardware/firmware issue or to media corruption. As you already identified, 100% this is because your boot media is USB.

Your assumption that "after booting from a flash drive, I don't think ESXi uses it at all" is no longer correct by default for vSphere 7, which uses a modified boot partition layout - and unless you manually redirect coredump and scratch to a high-endurance storage device, you are effectively overwhelming the I/O capabilities of your boot device by streaming your system logs to it during normal operations - this sustained high data rate will eventually corrupt the boot media. In ESXi 7, the small and large core dump, locker, and scratch are now located on a new "ESX-OSData" partition on your boot device (provided it is large enough), which is formatted as VMFS-L.

Booting from a non-high-endurance device like USB, SD card, etc. is still a supported method of running ESXi, but you MUST create a core dump file on a datastore backed by high-endurance media, and also assign scratch to a directory on a datastore backed by high-endurance media such as a local HDD or SSD, a shared storage datastore, etc.

I would encourage you to read Niels Hagoort's excellent blog posts on the subject here:

https://blogs.vmware.com/vsphere/2020/05/vsphere-7-esxi-system-storage-changes.html

and

https://blogs.vmware.com/vsphere/2020/07/vsphere-7-system-storage-when-upgrading.html

and also read the following KB article which describes the risk of boot media corruption:

https://kb.vmware.com/s/article/83376

Sadly, I feel this is a ticking time bomb for many who upgrade to vSphere 7 without knowledge of the boot device formatting changes. The recommended boot device is now an HDD or SSD. Too bad for me and a lot of vSphere admins who spent considerable time building diskless host environments over the last 5 years and getting rid of all our spinning rust.
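
For reference, redirecting the coredump and scratch as described above could look roughly like the sketch below. Treat it as a starting point under assumptions rather than a tested procedure: the function names, the dump file name and the .locker-<host> directory are placeholders I made up, and by default it only prints the commands so they can be reviewed before anything is changed.

```shell
# Sketch only: prints the commands by default so they can be reviewed first.
# Set DRY_RUN=0 on the ESXi host to actually run them. The function names,
# the dump file name and the .locker-<host> directory are placeholders.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

redirect_coredump() {   # $1 = datastore name, $2 = dump file name
    run esxcli system coredump file add -d "$1" -f "$2"
    run esxcli system coredump file set --smart --enable true
    run esxcli system coredump file list
}

redirect_scratch() {    # $1 = datastore name, $2 = host name
    dir="/vmfs/volumes/$1/.locker-$2"
    run mkdir -p "$dir"
    run vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string "$dir"
    # The new scratch location only takes effect after a reboot.
}
```

For example, `redirect_scratch datastore1 esx01` prints the mkdir and vim-cmd lines for a datastore called datastore1. The location currently in use can be checked with `vim-cmd hostsvc/advopt/view ScratchConfig.CurrentScratchLocation`.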

Virt-aid · ‎05-21-2021

This is what I anticipated, which is why I asked you to share log snippets filtered for bootbank messages, to confirm whether this was the case.

MattGoddard · ‎05-21-2021

@PatrickDLong:

"unless you manually redirect coredump and scratch to a high-endurance storage device, you are effectively overwhelming the I/O capabilities of your boot device by streaming your system logs to it during normal operations - this sustained high data rate will eventually corrupt the boot media"

This makes 100% sense as an explanation for both the behavior at the time and why it's no longer an issue since I moved the install to the SSD.

Ironically, I did redirect the coredump to shared HDD storage. I must not have moved the scratch. I may attempt this again and see if I can get it working in a stable fashion.

"Sadly, I feel this is a ticking time bomb for many who upgrade to vSphere 7 without knowledge of the boot device formatting changes."

It's quite possible that this would've been a ticking time bomb for me had I not read your post! At work, I have a production cluster of Dell rack servers that need to be upgraded from 6.7 to 7.0 and they all boot from SD cards. So, thanks for this info!

kurtd · ‎05-21-2021

According to https://kb.vmware.com/s/article/83376:

"Alternatively, once the new drive is installed, and ESXi has been reinstalled, you can immediately move the /scratch partition to a location not on the boot drive, per directions in System logs are stored on non-persistent storage"

Hasn't that always been the case? I checked my 7.0.2 host and scratch is still set to my datastore, not my SD Card which is what I had to do in 6.7 as well. Is there anything else we have to do to avoid corruption?
