MattGoddard
Enthusiast

ESXi hanging issues on multiple hosts

I have an ESXi issue where it hangs doing pretty much anything: VM power on/off, vMotion, etc. For example, I just told a VM to shut down the guest OS and that task has been sitting at 0% for the last 10 mins. Meanwhile, if I connect via SSH and do something simple like "esxcli vm process list" it hangs on that pretty much forever. Another weird behavior is that if I do "reboot" or "poweroff" it appears to work - as in, it accepts the command and returns to the prompt - but it never reboots or shuts down.

This happened with one of my hosts and I resorted to pulling the plug because nothing else worked. About half an hour after that host came back up, apparently without issue, the exact same thing is now happening to the second host.

Can anyone help me fix this?

Environment details:

  • ESXi 7.0.2 build-17630552
  • vCenter 7.0.2 build-17694817
  • Hosts: 2 x Intel NUC10I5FNH, booting from USB keys
  • Shared storage: Synology DS720+ NAS (connected to hosts via iSCSI)
12 Replies
Virt-aid
Contributor

Matt,

Your issue most likely relates to hostd exhausting its resources. Could you please share the vmkernel and hostd logs from the time the issue occurred?
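
In case it helps, on a default ESXi 7.x install the files are /var/log/hostd.log and /var/log/vmkernel.log (those paths are an assumption on my part; adjust if your logs go to a remote syslog). A rough way to grab a snippet, or a full bundle, from an SSH session:

tail -n 2000 /var/log/hostd.log > /tmp/hostd-snippet.log        # recent hostd management agent log
tail -n 2000 /var/log/vmkernel.log > /tmp/vmkernel-snippet.log  # recent VMkernel log (storage latency messages land here)
vm-support                                                      # or generate a complete support bundle to attach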

e_espinel
Expert

Hello.
It could be a problem with the external storage, your HBA, or the cables. Check the logs on both the storage and the servers.

It would also be good to check the firmware levels and the HBA firmware recommendations for the Synology DS720+ with VMware 7.0.2.

 

Enrique Espinel
Senior Technical Consultant IBM, Lenovo and VMware.
VMware VSP-SV 2018, VTSP-SV 2018 VMware Technical Solutions Professional Hyper-Converged Infrastructure (VTSP-HCI 2018)
VMware Technical Solutions Professional (VTSP) 4 / 5.
Please mark my comment as the Correct Answer and give Kudos if this solution resolved your problem. Thank you.
MattGoddard
Enthusiast

Logs attached for the period 01:00 to 04:30 UTC on Apr 15, which covers the time when the second host was acting up.

I'm not very familiar with ESXi log analysis but I did notice a lot of messages in the format:

...access("[path]") took over [high number] sec.

Where "high number" was regularly in the hundreds of seconds.

No warnings or errors in the logs on the NAS side at all.

Searching on 'Warning' in the vCenter events, I see a lot of stuff like this (I'm really confused about the bootbank warning, since that refers to the USB key, which is visible in storage devices):

[Attached screenshot: warning.png]

MattGoddard
Enthusiast

"it would also be good to check the firmware levels and the HBA firmware recommendations for the Synology DS720 in VMware 7.0.2"

Are HBAs relevant if this is software iSCSI? I'm assuming they're not. The hosts are connected to a basic layer 2 switch via a single CAT5 each. The NAS has two NICs connected to the same switch, also via CAT5.

Looks like the DS720+ isn't on the compatibility list for iSCSI. However, the version of the Synology firmware I have is on the list for NFS. So maybe one possible solution is to just forget about trying to do this with iSCSI and use NFS instead?
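
(If I did switch, my rough understanding is that it would just mean dropping the software iSCSI target and mounting an NFS export on each host instead. The host name, share path, and datastore name below are made up for illustration, not my actual config.)

esxcli iscsi adapter list        # confirm it's the software iSCSI adapter in play
esxcli storage nfs list          # any existing NFS v3 mounts
esxcli storage nfs add -H nas.local -s /volume1/datastore1 -v nfs-ds1   # hypothetical NFS mount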

Virt-aid
Contributor

That is a clear indication of a latency factor here. An HBA doesn't play a role in this setup. Check NIC driver/firmware and storage firmware compatibility. If there is congestion at the switch or elsewhere, moving to NFS will not improve the situation.

Virt-aid
Contributor

Did you have a chance to figure out the cause?

MattGoddard
Enthusiast

"Check on NIC driver/firmware and Storage firmware compatibility."

How would I do that?

"congestion at switch or elsewhere"

It's definitely not a network congestion issue.

Virt-aid
Contributor

You can begin by following the instructions in VMware KB article https://kb.vmware.com/s/article/1027206 to identify the driver and firmware of your I/O devices. Then search the VMware Compatibility Guide for the Vendor ID (VID), Device ID (DID), Sub-Vendor ID (SVID), and Sub-Device ID (SDID). Please watch the video linked at the end of the article, which explains the steps.
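
For a software iSCSI setup like yours, the I/O devices in question are really the NICs. A rough sketch of those steps (vmnic0 is only an example name; substitute whatever your host actually lists):

esxcli network nic list            # NICs with their drivers and link state
esxcli network nic get -n vmnic0   # driver and firmware details for one NIC
vmkchdev -l | grep vmnic0          # VID:DID SVID:SDID values for the Compatibility Guide lookup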

Congestion is suspected because hostd is exhausting its resources while trying to cope with the demands from the virtual machines and push their I/O through to the end device. Start with device validation and then proceed to evaluate the other components of your datacenter.

Can you also share log snippets from hostd and vmkernel at the time of the alert notification?

MattGoddard
Enthusiast

"You can begin by following instructions in VMware KB article https://kb.vmware.com/s/article/1027206 to learn about driver and firmware of IO devices. Then you need to Search the VMware Compatibility Guide for the Vendor ID (VID), Device ID (DID), Sub-Vendor ID (SVID), and Sub-Device ID (SDID). Please watch the video linked at the end of the article explaining the steps."

 

Thanks, this looks like a better method of assessing compatibility than my previous attempts with the compatibility guide.

 

"Can you also share log snippets of hostd and vmkernel at the time of the alert notification."

 

I already did when you asked me last time. 😉 See this post back up the page.

Virt-aid
Contributor

Apologies, I overlooked the attachment. 

Log spew has been observed in the hostd logs.

One source is IoTracker inquiring about the status of a storage device; another is a task updating performance stats to vCenter.

“hostd[527312] [Originator@6876 sub=IoTracker] In thread 536555, access("/vmfs/volumes/606a4f4a-5c745a38-278c-1c697a671244/catalog") took over 985 sec.”

“Task Created : haTask-ha-host-vim.HostSystem.retrieveInternalCapability-41285”

Let me know if this helps to improve the situation:

  1. Log in to the affected ESXi host via SSH.
  2. Take a backup of the config file by running the command below:
cp /etc/vmware/hostd/config.xml /etc/vmware/hostd/config.xml-backup
  3. Edit /etc/vmware/hostd/config.xml using the vi editor.
  4. Add the line below:
<ioTrackers> false </ioTrackers>

Example:
Before the change:
<config>
<!-- Host agent configuration file for ESX/ESXi -->
<!-- the version of this config file -->
<version>6.6.0.0</version>


After the change:
<config>
<!-- Host agent configuration file for ESX/ESXi -->
<ioTrackers> false </ioTrackers>
<!-- the version of this config file -->
<version>6.6.0.0</version>
  5. Save and close the file.
  6. Restart the hostd service by running the following command:
/etc/init.d/hostd restart
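
Once hostd is back up, a quick sanity check that the change took effect and that the agent responds again (these are standard ESXi commands offered as a suggestion, not part of any KB):

grep ioTrackers /etc/vmware/hostd/config.xml   # confirm the new line is present
/etc/init.d/hostd status                       # hostd should report that it is running
vim-cmd vmsvc/getallvms                        # goes through hostd, so it should no longer hang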

 

MattGoddard
Enthusiast

Apologies for not replying sooner on this one!

Before I saw your previous reply, I decided to reinstall ESXi on the SSDs in each node rather than continuing to boot from the USB flash drives. Ever since then, both nodes have been 100% fine.

I can see no obvious reason why this should be the case, since the installation and configuration were identical both times and since, after booting from a flash drive, I don't think ESXi uses it at all. But my gut was telling me that something about those drives wasn't kosher. It looks like my gut was right!

kurtd
Enthusiast

I may be having a similar issue. I upgraded two servers last week from 6.7 to 7.0.2 build 17867351. Today I came in and couldn't manage the hosts or VMs through vCenter. I could connect to a host directly, but couldn't manage the VMs there either. Every command just sits at 0%. The only thing I could do was remote desktop to the virtual machines, so I know they are still working fine. I couldn't enter maintenance mode or reboot the host from vCenter or the ESXi UI. I had to do a hardware reset, and now that it's back up, everything is fine.

I used the Dell image, DEL-ESXi-702_17630552-A00 (Dell Inc.), and then patched (updated) it.

My servers run on mirrored SD cards. I created a baseline for the hardware vendor updates and will see if installing those helps.

 
