Took another outage, even with SIOC and the storageRM service disabled. We're now working on downgrading the hosts to 6.5 U2. Migrating the VMs to the 6.5 hosts requires a reboot because of the lower EVC support level, and we also have to downgrade their hardware version.
Engineering conceded that they have been working on this for many months without finding a root cause, and that it's affecting multiple customers. Update 2 has been pushed back to early April and will not include a fix. They're now hoping to have a fix ready by the time Update 3 comes out this summer.
We are seeing the same issue on some of our 6.7 EP6 hosts. The recommendations we have gotten from VMware Support are:
1. Disable ATS heartbeat
2. Upgrade drivers/fw on HBA (Emulex)
3. Migrate from VMFS5 to VMFS6 datastores
4. Upgrade NIC drivers to latest
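For step 1 above, disabling the ATS heartbeat is done with `esxcli` advanced settings, per VMware KB 2113956. A minimal sketch for an ESXi shell (verify the option names against the KB for your build before applying; this is a host-side config change, not something to run blindly):

```shell
# Check the current ATS heartbeat settings (1 = enabled, 0 = disabled)
esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5
esxcli system settings advanced list -o /VMFS3/useATSForHBonVMFS6

# Disable ATS-only heartbeating on VMFS5 and VMFS6 datastores
esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5
esxcli system settings advanced set -i 0 -o /VMFS3/useATSForHBonVMFS6
```

Setting the value back to 1 re-enables ATS heartbeating if the change doesn't help.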
They also told me yesterday (27 March) that a fix would be included in 6.7 U2, which will most likely be released within four weeks.
EDIT: We are now trying EP7 on the affected hosts to see if that helps. No specific fixes for this are mentioned in the release notes, though.
I came across the KB below, which describes a different defect in 6.7 affecting Dell EMC SC storage. We use Dell EMC SC storage, so this may be part of the equation.
ESXi 6.7 hosts with active/passive or ALUA based storage devices may see premature APD events during storage controller fail-over scenarios
We also have SC storage, but after implementing all the changes mentioned above we haven't had this problem anymore...
The VMware engineering team is working on this issue, and a permanent fix will hopefully be included in an upcoming release.
As a temporary fix, please follow the workaround steps in VMware KB 67543: https://kb.vmware.com/s/article/67543
The error you are experiencing is a known issue in vSphere 6.7. The bug is present in ESXi 6.7 EP 07 and ESXi 6.7 EP 09 and results in the host becoming unresponsive.
The root cause is SIOC running out of memory.
Please wait for VMware to release the fix, which will be included in 6.7 U3; the estimated release date is around July/August 2019.
Note: there is currently no permanent fix available for the above-mentioned issue. You can work around it by restarting the SIOC service with the following steps on the affected ESXi hosts:
1. Check the status of the storageRM and sdrsInjector services
2. Stop the services
3. Start the services
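The three steps above map to the service scripts in an ESXi shell, per VMware KB 67543 (a sketch; confirm against the KB for your build):

```shell
# 1. Check the status of the SIOC-related services
/etc/init.d/storageRM status
/etc/init.d/sdrsInjector status

# 2. Stop the services
/etc/init.d/storageRM stop
/etc/init.d/sdrsInjector stop

# 3. Start the services again
/etc/init.d/storageRM start
/etc/init.d/sdrsInjector start
```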
If the issue persists even after the SIOC service is restarted, you can temporarily disable SIOC by turning off the feature from VMware vCenter.
Please mark as helpful or correct if this resolved your issue.
I experienced the same issues in 6.7 U2 and found this post. I reverted back to 6.5 and all the issues disappeared. Has anyone tried Update 3 to see if it's fixed?
It isn't fixed. I'm running 10 hosts and all have the same issues as of 6.7 U3, so the problem persists even though the fix was supposed to be in U3. I'm not looking forward to downgrading all my hosts, but I AM looking forward to dumping VMware and going with a Microsoft virtual environment. I've had enough of losing VMs and having zero access to the ESXi hosts when trying to recover them. I wish I had never upgraded to 6.7; it's a POS.
Hey all - does anyone know if this error still occurs in 6.7 U3, or has it been hotfixed since?
Please provide:
1. The vmkernel.log and hostd.log files
2. The exact date/time when the server became unresponsive
3. The back-end storage type (boot from SAN or local)
4. The hardware type
This issue is fixed in 6.7 Update 2
PR 2235031: An ESXi host becomes unresponsive and you see warnings for reached maximum heap size in the vmkernel.log
Due to a timing issue in the VMkernel, buffers might not be flushed, and the heap gets exhausted. As a result, services such as hostd, vpxa and vmsyslogd might not be able to write logs on the ESXi host, and the host becomes unresponsive. In the
/var/log/vmkernel.log, you might see a similar warning:
WARNING: Heap: 3571: Heap vfat already at its maximum size. Cannot expand.
This issue is resolved in this release.
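To check whether a host is hitting this heap exhaustion, you can search the VMkernel log for that warning. A minimal sketch; `/var/log/vmkernel.log` is the standard location on an ESXi host, and the `LOG` override is just for convenience:

```shell
# Count heap-exhaustion warnings in a VMkernel log.
# LOG defaults to the standard ESXi location; override it to scan another file.
LOG="${LOG:-/var/log/vmkernel.log}"
grep -ci "already at its maximum size" "$LOG" 2>/dev/null || echo 0
```

A non-zero count means the host has logged the "Heap ... already at its maximum size" warning from the release note above.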
An unresponsive host can happen for multiple reasons. If you experience this issue on 6.7 U2 or above, GSS needs to validate whether you are hitting the same bug. Most likely it's a different issue, I believe.