VMware Cloud Community
dsohayda
Enthusiast

At wit's end with vCenter Server Connectivity Issue

We've had a support ticket open since August 2015 regarding hosts spontaneously disconnecting from our Windows-based vCenter Server. It's actually the latest in a long line of support tickets related to the same issue, the original going back to November 2014!
The ticket was created as high/critical, yet support has been very lax, and the prevailing suggestion is basically just to move our vCenter to a new server. I'd like to avoid such an ordeal.
It was suggested that we create a ticket with Microsoft because VMware thought the issue was with the OS, but Microsoft has been even less helpful, suggesting a single hotfix ( https://support.microsoft.com/en-us/kb/2775511 ) to remedy the situation, and it didn't work. Originally we were told we needed another hotfix covering kernel socket leaks, http://support.microsoft.com/kb/2577795, which was applied in December 2014 but did not help.
Multiple times a day we get 'Hosts not responding' alerts from Log Insight. These emails are basically just a summary of the alarms on vCenter that match the event alarm for a host not responding. If you happen to be logged into vCenter via the vSphere Client, you'll see the host get disconnected along with all of its VMs. It lasts for under a minute, but when checking logs on the host itself, it doesn't appear to be aware of any networking issues. This leads us to believe the issue is with the vCenter server.
We've run perfmon and special scripts created by VMware support to monitor resource usage and network connectivity/port exhaustion while these events happen, but no smoking gun has been found. At one point VMware suggested we add more CPUs to the VM, going from an already over-provisioned 8 vCPU to 10, but that didn't help.
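For what it's worth, the port check support scripted amounted to roughly the sketch below (my own approximation, not their script): it samples TCP connection counts by state once a second and appends them to a CSV so the numbers can be lined up against a disconnect window. It assumes the psutil package is available on the vCenter Windows server.

# Rough approximation of the port-exhaustion check (not VMware's actual script).
# Samples TCP connection counts by state once a second and appends them to a
# CSV so they can be compared against a host-disconnect window.
# Assumes the psutil package is installed on the vCenter Windows server.
import csv
import time
from collections import Counter
from datetime import datetime

import psutil

OUTPUT = "tcp_port_usage.csv"
INTERVAL_SECONDS = 1

with open(OUTPUT, "a", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["timestamp", "total", "established", "time_wait", "close_wait"])
    while True:
        states = Counter(conn.status for conn in psutil.net_connections(kind="tcp"))
        writer.writerow([
            datetime.now().isoformat(),
            sum(states.values()),
            states.get(psutil.CONN_ESTABLISHED, 0),
            states.get(psutil.CONN_TIME_WAIT, 0),
            states.get(psutil.CONN_CLOSE_WAIT, 0),
        ])
        handle.flush()
        time.sleep(INTERVAL_SECONDS)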
This is a relatively small environment of under 60 hosts and approximately 425 VMs, yet the vCenter server is configured with 8 vCPU and 20GB of memory.
No other products are installed on the vCenter server besides the related VMware components: Web Client, VUM, dump collector, SSO, and the Inventory Service. We do run the McAfee MOVE off-server virus scan agent, which was called out by VMware support, but we need to run something, and this should be the best option since scanning is not done on the server itself. We updated the agent to the latest version at their suggestion, with no change. We also have the Veritas NetBackup client installed for backups; another item we need, though we have tried removing it temporarily with no improvement.
Support also thought that having vROps tied into this vCenter might be a problem, but we temporarily stopped data collection and noticed no improvement.
At one point they called out VDP as being to blame and said an update to it would fix things. It didn't.

At this point I believe they have given up on any further troubleshooting and would like us to just move this to a new server. I feel like that's an easy way out for them and a lot of work on my part. I'm fairly confident the issue lies with vCenter Server, and surely more could be done to narrow things down and possibly fix the issue.
This VM was originally deployed as vCenter Server 5.1 and has been upgraded over the years to the current 5.5U3b. It runs Windows Server 2008 R2 Enterprise SP1, and SQL is on a separate VM.

Has anybody had a similar issue they were able to solve, or have any suggestions other than a slash-and-burn approach?
Thanks

13 Replies
unsichtbare
Expert

One ESXi Host at a time, or all ESXi Hosts at once?

You have probably been all over the network already; however, there is a free utility called WinMTR, which you can run multiple instances of simultaneously. Consider running simultaneous pings to: A. the ESXi hosts, B. the Management Network gateway, and C. other objects on the Management Network.

P.S. I agree, the quality of VMware Support is abysmal. Methodology seems to be: Send customer on a wild goose chase > wait for customer response > wait 24 to 48 hours before responding > send customer on another wild goose chase > wait for customer to give up.

+The Invisible Admin+ If you find me useful, follow my blog: http://johnborhek.com/
jasoncain_22
Enthusiast

dsohayda

I have seen this issue several times (each time it was something slightly different).

But just to track it down...

Do you see any latency warnings in your vmkernel.log or vmkwarning.log files?

What kind of storage do you have? FC or iSCSI / Which vendor?

Which version of VMware Tools? (This assumes all of your hosts have been upgraded to 5.5U3.)

How often are the disconnects? Same time each time?
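On the vmkernel.log question: if you already have a log bundle exported, a quick scan like the rough sketch below will pull the common storage latency warnings out of vmkernel.log / vmkwarning.log. The match strings are just the usual 5.5 wording, so adjust them to whatever your build actually logs.

# Quick scan of exported vmkernel.log / vmkwarning.log files for storage
# latency warnings. The match strings ("performance has deteriorated",
# "latency") cover the common SCSI latency messages on 5.5; the exact
# wording varies by build.
import sys

PATTERNS = ("performance has deteriorated", "latency")

def scan(path):
    with open(path, errors="replace") as log:
        for line in log:
            lowered = line.lower()
            if any(pattern in lowered for pattern in PATTERNS):
                print(line.rstrip())

if __name__ == "__main__":
    # e.g. python scan_latency.py vmkernel.log vmkwarning.log
    for log_path in sys.argv[1:]:
        scan(log_path)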

dsohayda
Enthusiast

Exactly! Except in our experience we're told our drivers are out of date, so we need to update all drivers and firmware or else they can't continue to troubleshoot. Who actually runs the latest version of every driver and firmware?

Anyway, we get bunches of hosts at a time: 2-3, or as many as 10. It varies.

dsohayda
Enthusiast

I think we may have storage latency warnings in our logs. I'll have to check to be sure we still do.

We use FC storage on HP EVAs and EMC VNX arrays.

vCenter is currently at 5.5U3b, while the hosts are still at 5.5U3a until next month.

Here is a snippet of the Log Insight alert emails showing some of the alerts. You can even see vCenter itself failing to connect.

[Screenshot attachment: 2016-04-06_15-59-28.jpg]

dsohayda
Enthusiast

This morning's batch of alerts: they're seconds apart, and the whole ordeal takes about 10 seconds.

[Screenshot attachment: 2016-04-07_09-04-54.jpg]

[Screenshot attachment: 2016-04-07_09-08-23.jpg]

adamjg
Hot Shot

Wow, I thought I'd never hear of someone else having this same issue. I've had the same alerts on and off since last fall; sometimes it's multiple times a day, sometimes only once a week or so. The host and VMs never actually go down, and most of the time the host is back in by the time I get the email alert. I opened a case with VMware and they said the host is losing its heartbeat connection to vCenter. They recommended contacting Cisco for UCS and/or network issues, or our storage vendor, neither of whom was able to find anything. One difference from your setup is that we're running the vCenter appliance. Our vCenters are at 5.5U3b now, but this goes back to U1.

I'm going to upgrade to vSphere 6 in the somewhat near future, so I'm hoping whatever bug is present will go away. Wish I had a better answer for you, but at least you know someone else is having the same issue.

dsohayda
Enthusiast

That does sound similar. VMware told us to open a ticket with Microsoft because they were blaming Windows Server 2008. Microsoft couldn't find anything and said to open a ticket with VMware. Typical.

We hope to migrate to vSphere 6 and leave the Windows-based vCenter Server for dead. Hopefully the VCSA doesn't have the same problem for us on version 6.

unsichtbare
Expert

I would use a free utility called WinMTR to ping various entities on your vSphere management network from your vCenter. You can run multiple windows of WinMTR simultaneously and ping:

  • Several/all ESXi Hosts
  • The gateway

If you observe that vCenter is disconnecting from both the ESXi hosts and the gateway equally, then VMware may be correct and the problem lies on the Microsoft side.

If you observe that vCenter is becoming disconnected from the ESXi hosts disproportionately compared to the gateway, then the problem is more likely the physical switch and/or the Management Network VMkernel teaming and failover policy.
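If you would rather script it than juggle several WinMTR windows, a rough equivalent is sketched below: it pings every target in parallel from the vCenter server and tallies loss per target, so you can compare the ESXi hosts against the gateway. The target names/addresses are placeholders.

# Scripted stand-in for several simultaneous WinMTR windows: ping each target
# in parallel from the vCenter server and tally loss per target, so loss to
# the ESXi hosts can be compared against loss to the management gateway.
# The target names/addresses below are placeholders.
import subprocess
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

TARGETS = ["esx01.example.local", "esx02.example.local", "10.0.0.1"]  # hosts + gateway
ROUNDS = 600                               # roughly 10 minutes at one ping per second
results = defaultdict(lambda: [0, 0])      # target -> [sent, lost]

def ping_once(target):
    # Windows ping: -n 1 = one echo request, -w 1000 = 1 second timeout
    rc = subprocess.call(["ping", "-n", "1", "-w", "1000", target],
                         stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    results[target][0] += 1
    if rc != 0:
        results[target][1] += 1

for _ in range(ROUNDS):
    with ThreadPoolExecutor(max_workers=len(TARGETS)) as pool:
        pool.map(ping_once, TARGETS)
    time.sleep(1)

for target, (sent, lost) in results.items():
    print(f"{target}: {lost}/{sent} lost ({100.0 * lost / sent:.1f}%)")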

+The Invisible Admin+ If you find me useful, follow my blog: http://johnborhek.com/
aaronwsmith
Enthusiast

Have you increased vCenter's log level to verbose to help troubleshoot?

VMware KB: Increasing VMware vCenter Server and VMware ESX/ESXi logging levels

Guessing this may have been tried by GSS, but if not you could see if increasing the heartbeat timeout between vCenter and ESXi hosts would help resolve the issue:

https://kb.vmware.com/kb/1005757 (note the symptoms described in this KB require verbose logging for vCenter to identify the missed heartbeat messages.)
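If GSS already has you at verbose, something like the sketch below can pull the heartbeat chatter for a single host out of an exported vpxd.log and bucket it per minute, so a gap around a disconnect stands out. The host name and search strings are placeholders, and the exact message wording differs between vCenter builds, so adjust them to what your log actually shows.

# Count heartbeat-related vpxd.log lines for one host per minute, so a minute
# with no heartbeats stands out next to a disconnect. Assumes verbose vCenter
# logging and an ISO-style timestamp at the start of each line; the host name
# and keywords are placeholders to adjust for your environment.
import sys
from collections import Counter

HOST = "esx01.example.local"                 # placeholder host name
KEYWORDS = ("heartbeat", "not responding")   # adjust to your build's wording

per_minute = Counter()
for path in sys.argv[1:]:                    # e.g. python heartbeat_gaps.py vpxd.log vpxd-1.log
    with open(path, errors="replace") as log:
        for line in log:
            lowered = line.lower()
            if HOST in lowered and any(keyword in lowered for keyword in KEYWORDS):
                per_minute[line[:16]] += 1   # "YYYY-MM-DDTHH:MM" prefix

for minute in sorted(per_minute):
    print(minute, per_minute[minute])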

Are your vMotion and Management vmk# ports on separate uplinks, or are they shared? I've seen on 1 Gbps links that when vMotion and Management traffic share the same wire, multiple simultaneous vMotions (for example, from putting a host into maintenance mode) will cause ESXi hosts to drop, because the uplink is congested with vMotion traffic and the heartbeat UDP packets can't get through.

dsohayda
Enthusiast

Logging was increased at one point, and many log bundles have been put together. No clear smoking gun could be found.

I thought about increasing the heartbeat timeout after finding that same KB, but as it states, doing so is only a temporary workaround, not a fix.

At this point we're basically ignoring any hosts not responding alerts until we can bring our new vSphere 6 vCenter Server online to migrate to.

As for the vMotion and management VMkernels: yes, they are on completely separate uplinks. Management has two 1Gb uplinks, and vMotion runs off another set of two 1Gb uplinks. Both are active/active with IP hash load distribution on the VDS, due to Cisco EtherChannel on the physical switches.

It does seem, however, that when we're running patches and moving a lot of VMs around by putting hosts into maintenance mode, we get a lot more of the alerts than normal. If this were a result of the networking configuration, I'm not sure how it could be changed to help. VM traffic is on its own VDS and uses its own set of uplinks, the same as the management and vMotion traffic. All symptoms point to the vCenter server being to blame.
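One thing I may try to make that correlation easier is pulling the connection-lost events straight out of vCenter with pyVmomi and lining the timestamps up against our patch windows. A rough sketch (the vCenter name and credentials are placeholders, and QueryEvents only returns a capped batch of events, so longer ranges would need an event history collector):

# Pull the last week of host connection-lost/disconnected events out of
# vCenter with pyVmomi, to line their timestamps up against patch windows.
# The vCenter address and credentials are placeholders. QueryEvents returns
# a capped batch; use an event history collector for longer ranges.
from datetime import datetime, timedelta
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()       # lab-style certificate handling
si = SmartConnect(host="vcenter.example.local",
                  user="administrator@vsphere.local",
                  pwd="password",
                  sslContext=ctx)
try:
    event_manager = si.RetrieveContent().eventManager
    time_filter = vim.event.EventFilterSpec.ByTime(
        beginTime=datetime.utcnow() - timedelta(days=7))
    spec = vim.event.EventFilterSpec(time=time_filter)
    for event in event_manager.QueryEvents(spec):
        # Keep only the host disconnect / connection-lost events
        if isinstance(event, (vim.event.HostConnectionLostEvent,
                              vim.event.HostDisconnectedEvent)):
            print(event.createdTime, event.host.name, event.fullFormattedMessage)
finally:
    Disconnect(si)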

dsohayda
Enthusiast

I just downloaded WinMTR and will check it out. Thank you for the suggestion. Strange that VMware support never tried anything like this. They put together a script that would check for port exhaustion every second by outputting available ports to a log file, and had us run a perfmon capture, but nothing more in-depth than that, and neither ever found anything. The script didn't even really work: we ran it until the next event occurred, but when we went back to reference the log it created, nothing was there. Another waste of time.

All told, I think if we had been lucky enough to get a more capable tech on the case they would have been more helpful. As it was, we ended up with somebody who, early on in the case, focused on moving the vCenter Server install to a new Windows installation rather than trying to figure out a root cause we could apply a fix to. To me that's like the typical "update your drivers" support response: they're trying to push the slash-and-burn legwork off on you rather than finding out what's really going on. On top of that, the link he sent with the steps we should follow was completely wrong. Instead of the "move your vCenter to another server" KB, which I had found and sent to him, he sent instructions on how to reinstall vCenter on the same server. Super unhelpful, and not instilling great confidence in their capabilities.

J1mbo
Virtuoso

I presume you are using vmxnet3 vNIC?

dsohayda
Enthusiast

You presume correctly, sir.
