VMware Cloud Community
isaacwd
Contributor

Strange Host Responsiveness Issues

- Recently upgraded hosts to 6.7 and vCenter to 6.7a

- Hosts are 'not responding' in vCenter Server

- Can ping

- Cannot access the web interface or log in via SSH

- Can access the console, but after you enter login information and press Enter it freezes (the cursor is still blinking)

- If you remove the host from the inventory and shut down a virtual machine on the host it brings everything back online and the host can be re-added to vCenter

- Four identical hosts; this has happened on three of the four (twice on one)

- The host that had this issue twice now will not come back after trying the above method and is completely unresponsive at the console

41 Replies
dzampino
Contributor

We are still waiting for a fix.

FernandoAS
Contributor

We've had the exact same problem with a fresh install of VMware ESXi 6.7 Update 1 on a Dell R740 server.

Still waiting on VMware support, but any update on this subject would be appreciated.

Thanks!

dzampino
Contributor

We were told the fix would be made available in Update 2 which is scheduled for Q1 of 2019.

FernandoAS
Contributor

We were also told that the fix will be available in Update 2. The workaround in our case is to disable SIOC on all datastores.

VirtualCop
Contributor

Hi MightyGorilla,

After we disabled all 4 onboard Broadcom NICs in the BIOS/RBSU, these warnings that had been flooding the log disappeared:

2018-09-18T13:23:34.015Z cpu25:2100568)MemSchedAdmit: 470: Admission failure in path: nicmgmtd/nicmgmtd.2100568/uw.2100568

2018-09-18T13:23:34.015Z cpu25:2100568)MemSchedAdmit: 477

Thx for info!

Regards

Cop

DarkSaber
Contributor

Is there a bug number for this issue that anyone took note of?

JimL1651
Contributor

We're having the identical issue with Dell PowerEdge 740xd hosts and Dell Compellent SC5020 iSCSI storage. Multiple hosts go into this zombie state and some or all of the VMs lose their connection to the storage. HA doesn't move the VMs to healthy hosts. Initially we were power cycling the hosts because we could not access their Web UI or console (DCUI). Once power cycled, HA will move the VMs to other healthy hosts. We've since learned that if SSH is enabled, you can connect to the host with SSH and run "services.sh restart". After several minutes the command will complete and the host will return to normal. We can then vMotion the VMs and gracefully restart the host.
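For anyone hitting this, the recovery sequence above can be sketched roughly like this (an illustration only; the host name is a placeholder, and it assumes SSH was already enabled on the host before it wedged):

```shell
# Recover a "zombie" ESXi host whose Web UI / DCUI is unresponsive.
# HOST is a placeholder -- substitute your affected host.
HOST=esxi01.example.local

# Restart all ESXi management agents (hostd, vpxa, etc.).
# On an affected host this can take several minutes to return.
ssh root@"$HOST" 'services.sh restart'

# Once the agents are back, the host should reconnect to vCenter;
# VMs can then be vMotioned off and the host gracefully rebooted.
```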

Support confirmed yesterday there's a "firmware / driver / vmkernel" bug in 6.7 Update 1 and the fix will be included in 6.7 Update 2. They indicated Update 2 was not expected out for another 2 months. They're very tight-lipped so far about the details of the bug, its triggers, and any possible workarounds.

I cannot wait 60 days for an Update 2 that has already been delayed from Q1 to Q2/Q3. I'm really hoping VMware does the right thing and makes the details of this critical defect public so we and other customers can make informed decisions about updating.

Dsoczek
Contributor

I currently have the same issue, with a slight variation in our infrastructure.

We have four ESXi 6.7 hosts, fully patched. All hosts are connected to an iSCSI Nimble SAN. Two hosts carry the majority of the workload and iSCSI connections, and these are the only two hosts that experience the issue. I have managed to find a workaround, at least in our setup: we have been restarting the vpxa and hostd services at least once a day, if not twice, and have not had the issue for 5 days. The services are restarted regardless of the host's health state.

Some of the logs reported “out of memory” errors and storage disconnects.
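For reference, the periodic restart described above amounts to something like the following, run on each affected host (a sketch; these agents can also be restarted from the DCUI's "Restart Management Agents" option):

```shell
# Restart the two management agents that this leak appears to affect.
# Run on the ESXi host itself (via SSH or the ESXi Shell).
/etc/init.d/hostd restart   # host management daemon
/etc/init.d/vpxa restart    # vCenter agent
```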

Himadri
Contributor

Upgrade the hosts to 6.5u2

JimL1651
Contributor

Support confirmed there is a bug in SIOC that causes it to consume large amounts of memory and CPU when handling storage I/O anomalies. Anomalies appear to include normal latency increases during backup periods and Windows updates. The workaround from engineering was to disable SIOC on all datastores and stop the StorageRM service on the host (/etc/init.d/storageRM stop).

Apparently this workaround is not 100% effective, because we had another outage last night. Support suggested downgrading to 6.5 until 6.7 Update 2 is ready. A downgrade will be painful because we'll also need to downgrade the hardware version on a large number of VMs. If I downgrade, I'm not likely to go back to 6.7 knowing its history of instability.

I'm ready to dump VMware for a more reliable virtualization platform.

JimL1651
Contributor

Took another outage with SIOC and the StorageRM service disabled. We're now working on downgrading the hosts to 6.5u2. Migrating the VMs to the 6.5 hosts requires a reboot, due to the lower EVC support level and the need to downgrade their hardware version.

Engineering conceded they have been working on this for many months and have not found a root cause, and that it's affecting multiple customers. Update 2 is pushed until early April and will not include a fix. They're now hoping to have a fix by the time Update 3 comes out this summer.

jawad
Contributor

We are seeing the same issue on some of our 6.7 EP6 hosts. The recommendations we have gotten from VMware support are:

1. Disable ATS heartbeat

2. Upgrade drivers/fw on HBA (Emulex)

3. Migrate from VMFS5 to VMFS6 datastores

4. Upgrade NIC drivers to latest
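For what it's worth, recommendation 1 (disabling the ATS heartbeat) is typically done via an advanced setting like the one below. This is a sketch based on VMware's general guidance for VMFS5 datastores; verify the exact option name and implications for your version in the relevant KB before applying it:

```shell
# Disable ATS-only heartbeating on VMFS5 datastores (heartbeat I/O
# falls back to plain SCSI reads/writes). Run on each ESXi host.
esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5

# Verify the current value.
esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5
```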

They also told me yesterday (27th of March) that a fix would be in place in 6.7U2, which will most likely be released within 4 weeks.

EDIT: We are now trying EP7 on the affected hosts to see if that helps. No specific fixes for this are mentioned in the release notes, though.

In the shadows...
JimL1651
Contributor

VMware finally published a KB for this defect: https://kb.vmware.com/s/article/67543

The fix will NOT be in 6.7 update 2.

JimL1651
Contributor

I came across the KB below, which talks about a different defect in 6.7 affecting Dell EMC SC Storage. We use Dell EMC SC storage, so this may be part of the equation.

ESXi 6.7 hosts with active/passive or ALUA based storage devices may see premature APD events during storage controller fail-over scenarios

https://kb.vmware.com/s/article/67006

jawad
Contributor

We also have SC storage. But after implementing all the mentioned changes we haven't had this problem anymore...

In the shadows...
VMassociate
Contributor

The VMware engineering team is working on this issue, and hopefully a permanent fix will be included in upcoming versions.

As a temporary fix, please follow the workaround steps in VMware KB https://kb.vmware.com/s/article/67543

Regards

Jitendra Singh

Nawals
Expert

Hi Isaacwd,

The error you are experiencing is a known issue in vSphere 6.7. The bug exists in ESXi 6.7 EP 07 and ESXi 6.7 EP 09 and results in hosts becoming unresponsive.

The root cause is SIOC running out of memory.

Please wait for VMware to release the fix, which will be included in 6.7U3. The estimated release date is around July/August 2019.

Note: there is currently no permanent fix available for the above-mentioned issue, only the workaround below.

Workaround:

You can work around the issue by restarting the SIOC services using the following commands on the affected ESXi hosts:

1. Check the status of storageRM and sdrsInjector

/etc/init.d/storageRM status
/etc/init.d/sdrsInjector status


2. Stop the services

/etc/init.d/storageRM stop
/etc/init.d/sdrsInjector stop


3. Start the services

/etc/init.d/storageRM start
/etc/init.d/sdrsInjector start

If the issue persists even after the SIOC services are restarted, you can temporarily disable SIOC by turning the feature off in vCenter.

Refer to VMware KB 67543.

NKS. Please mark Helpful/Correct if my answer resolves your query.
CharlieM1
Contributor

I experienced the same issues in 6.7u2 and found this post. I reverted back to 6.5 and all the issues disappeared. Has anyone tried Update 3 to see if it's fixed?

Truckstop
Contributor

It isn't fixed. I'm running 10 hosts and all have the same issues as of 6.7 U3, so the problem persists even though the fix was supposed to be in U3. I'm not looking forward to downgrading all my hosts; I AM looking forward to dumping VMware and going with a Microsoft virtual environment. I've had enough of losing VMs and having zero access to the ESXi hosts when it comes to trying to recover them. I wish I had never upgraded to 6.7; it's a POS.

RJB

NinjaNitrate
Contributor

Hey all - Does anyone know if this error is still occurring in 6.7 U3, or has it been hotfixed since?

Cheers!
