VMware Cloud Community
larry96
Enthusiast

ESXi 6.0 becomes unresponsive at random times even after 6.0U2

We have been having problems with ESXi 6.0 hosts becoming unresponsive at random times, and the issue seems to have gotten worse after updating to 6.0U2. I had three hosts become unresponsive over the weekend.

- Seems to happen while Veeam backups are running

- Very random; it doesn't always happen to the same host

- VMs continue to run, but I cannot migrate them, so I have to reboot the host and hope everything fails over

- Support had said this would be fixed in Update 2. Apparently it wasn't.

- Support just asks for log files.

- Been a VMware customer for some time. Never seen these types of serious issues, especially after an Update 1 or Update 2 release.

- Have seen it on different hardware: both UCS using Fibre Channel and standalone servers using iSCSI.

Any help would be greatly appreciated.

Thanks,

Larry

18 Replies
DavoudTeimouri
Virtuoso

Hi,

We need your log files to investigate and find the issue.

Please post the log files.

-------------------------------------------------------------------------------------
Davoud Teimouri - https://www.teimouri.net - Twitter: @davoud_teimouri Facebook: https://www.facebook.com/teimouri.net/
cesprov
Enthusiast

You say the VMs continue to run so the host itself hasn't become unresponsive.  I assume you are using vCenter?  Is it just that the host is becoming disconnected from vCenter?  If you restart the management agents does it rejoin vCenter without having to restart the entire host?

larry96
Enthusiast

Yes, we use vCenter. I can't restart the management agents; the host becomes completely unresponsive. I can log into the console, but that is it. It is as if the host becomes disconnected from whatever media the host OS is running from (disk, SSD, boot from SAN, etc.). When the problem first happened, support thought it was a bad SSD card, but we have seen the issue happen on hosts that boot from SAN as well as from disk.

buckmaster
Enthusiast

I can vouch for you when you say hostd becomes unresponsive since 6.0U2. I rarely, if ever, saw this error before, but it seems to be happening often now.

It's ugly when it does, and a host reboot is all that fixes it.

services.sh restart just sits there and never finishes.

It gets to the point of "Exclusive access granted." and then nothing happens.

Tom Miller
larry96
Enthusiast

Tom,

Thanks for letting me know as I thought I was going crazy. I currently have a support case open with VMware and they have just escalated it. By leaving SSH enabled on the hosts, they were able to grab some logs which should be helpful.

I'll post updates as I get them.

Larry

vLarus
Enthusiast

Some questions:

What are the QoS settings on the UCS, and which vNIC profiles do you have set up?

Does the management network (for the management vmkernel on the ESXi hosts) include other workloads?

Is vMotion on that same network?

When you say Fibre Channel, do you mean FCoE?

What protocol are you using for the Veeam backups? Is it NBD or Direct SAN?

What version of Veeam are you using, and how is it set up? A separate physical host or a VM?

Larus.

vmice.net
buckmaster
Enthusiast

Larry, no problem. I'll be watching this thread to see the outcome. I've run 6.0 since beta and through all the iterations; I'm currently running 6.0U2. The hostd issue did not arise until 6.0U2.

FYI

Tom

larry96
Enthusiast

Hi Larus,

Let's see if I can answer your questions....

What are the QoS settings on the UCS, and which vNIC profiles do you have set up?

Silver CoS=2 Weight=8

Best Effort CoS=Any Weight=5

Fibre Channel CoS= 3 Weight=5

Do you mean vNIC placement profiles?

Does the management network (for the management vmkernel on the ESXi hosts) include other workloads?

No

Is vMotion on that same network?

No

When you say Fibre Channel, do you mean FCoE?

No, FC between the UCS, the EMC storage, and the switch stack

What protocol are you using for the Veeam backups? Is it NBD or Direct SAN?

Checking with our backup admin

What version of Veeam are you using, and how is it set up? A separate physical host or a VM?

9.0, set up as a VM

vLarus
Enthusiast

I'm assuming you have:

2 vNICs just for management.

2 vNICs just for vMotion.

2 vNICs for VM traffic.

1 to 3 switches.

Where does the Veeam VM reside? I'm assuming a different cluster than the one that it is backing up?

Using NBD will, by default, move the data through the management vmkernel NIC to the Veeam backup proxy/server, depending on the backup design.

Using Direct SAN allows the Veeam server direct access to the volumes that contain the VMFS datastores. I'm guessing you are using NBD, since Direct SAN gets tricky to configure unless you have passthrough set up.

So you have an FC EMC array connected directly to the Fabric Interconnect and then use vHBAs to map to the corresponding array? That is FCoE from host to Fabric Interconnect and then FC to the array.

I'm not sure what it could be. It seems hostd or vpxa is getting overwhelmed. How many jobs does the Veeam backup throw at the ESXi hosts concurrently? Try running them sequentially: that gives each job the most bandwidth available, so they should still finish within a reasonable time (maybe :) ).

vmice.net
pbardaville
Contributor

Hi Larry

I'm an MCS TAM at VMware and am curious about this issue. Can you post your VMware ticket number so I can take a look at it?

Thanks!

Phil

larry96
Enthusiast

Hi Phil,

Sure, the ticket number is 16893661502.

Thanks,

Larry

larry96
Enthusiast

One thing we have found so far that makes these issues less painful is to keep SSH enabled on the hosts. When a host becomes unresponsive, we can at least restart the services and the host comes back online. Support just suggested this; I've already had to use it, and it works.

/etc/init.d/hostd restart

/etc/init.d/vpxa restart
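For what it's worth, those two restarts can also be scripted from a management box. This is just a sketch of my own (the helper name and the SSH setup are my assumptions, not anything support provided); it assumes SSH is enabled on the host with key-based root login:

```python
# Hypothetical helper (not a VMware tool): automate the workaround of
# restarting the ESXi management agents over SSH when a host drops out
# of vCenter. Assumes SSH is enabled and key auth to root is set up.
import subprocess

# Standard ESXi 6.x init scripts for the management agents.
AGENT_RESTART_CMDS = [
    "/etc/init.d/hostd restart",
    "/etc/init.d/vpxa restart",
]

def restart_management_agents(host, run=subprocess.run):
    """Issue both restarts on `host`; return (command, returncode) pairs.

    The `run` callable is injectable so the logic can be exercised
    without a live ESXi host.
    """
    results = []
    for cmd in AGENT_RESTART_CMDS:
        # Each restart is issued as its own SSH invocation so a hung
        # hostd restart does not prevent the vpxa restart from running.
        proc = run(["ssh", "root@" + host, cmd])
        results.append((cmd, proc.returncode))
    return results
```

Obviously, only use something like this as a stopgap until the actual patch lands.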

Larry

larry96
Enthusiast

So here is an update on the issue......

Support tells me there should be a patch within the next week, so around June 20th. They say the issue is caused by the Likewise services(?).

To apply the patch, the hosts need to be at build # 6.0.0, 3825889.

I'll post an update after the patch is applied.

Larry

cesprov
Enthusiast

I'm still wondering if the issue you are seeing is this?

ESXi 5.5 or 6.0 host disconnects from vCenter Server with the syslog.log error: Unable to allocate m...

That issue is also fixed by restarting the management agents, specifically vpxa and hostd, which, as you said, fixes your issue too. The ESXi host isn't disconnected from the network, just from vCenter, because vpxa crashes. If SSH is enabled, you can log in, and restarting hostd/vpxa allows the host to rejoin vCenter. The issue you're having sounds awfully similar and may be the same bug, just in 6.0U2. I first noticed it in 6.0U1a, which is where I still am; it's still not fixed. It will be interesting to see what comes out June 20th, as I've been waiting on the fix for KB2144799 since I was first affected by it in November 2015.

EDIT:  There's another thread on KB2144799 here:  ESXi 6.0U1a build 3073146 vpxa crashing randomly, host disconnected from vCenter.

People in that thread are reporting this issue still happens on 6.0U2 also.

Bleeder
Hot Shot

I noticed that if I have any of the vCenter statistics levels set higher than 2, the problem occurs more frequently.

cesprov
Enthusiast

I recently turned down the stats levels across the board for another issue I was having, and this problem seems to have all but gone away since. Or I just haven't happened to catch it in the act since then.

Bleeder
Hot Shot

Still no patch?

cypherx
Hot Shot

vSphere Replication was causing hostd to crash in our instance. A patch is supposed to be out toward the end of July for our specific issue. It seems that in 6.0, hostd is a little finicky. We have 8 ESXi servers; we upgraded vCenter to 6.0U2, SRM to 6.1, and vR to 6.1.1, but as for hosts, we only did 3 of the 8. The other 5 will remain on 5.0.0, 3086167 for some time, until we can be assured that VMware QA turns around for the better. We also use Veeam and its NFS client to back up machines directly from their datastores and write to an Exagrid deduplicating appliance. It works great; no issues with Veeam at this point. But it's funny how, every few weeks in the Veeam newsletter, Gostev has had some VMware bug/showstopper/patch issue to report since 6.0 came out. I'm glad that, at least from their angle, they were having some good discussions on the bigger issue of all the VMware QC problems we've seen in the past year.