Hosts disconnect temporarily with VCSA 6.5u1

mark49808 · ‎12-11-2017

I've posted this elsewhere and also have a support case open, but a few months into the support case I'm running out of patience. Hoping someone else has had this problem.

Issue: ESXi hosts intermittently disconnect from vCenter after upgrading to VCSA 6.5U1, especially during template deployments between clusters or Veeam backups. At least two people are having this problem, so I know it's not just me. More details here, but I'll post the relevant bits: https://www.reddit.com/r/vmware/comments/7ht23s/vsphere_65u1_cluster_hosts_disconnecting_and/

My issue started when upgrading a 6.0 VCSA to 6.5u1. Everything worked fine before this upgrade. I did not upgrade any of the hosts at the time, and the issue started the next evening after the upgrade (during Veeam backups). I did not touch any of the hosts or config, just a by-the-book vCenter upgrade. I should add that these hosts were added to this vCenter when it was already a VCSA. I did not use the migration tool from Windows vCenter to the appliance; I started from scratch, so it should be pretty clean.

Veeam backups sometimes cause the issue, but I can reliably recreate it on demand by deploying a template between datacenters/clusters. VMware initially thought it was a network congestion issue. However, I disproved that theory by running tcpdump on the VCSA, which showed the heartbeats are in fact arriving at vCenter. They responded by saying:

“As we discussed before the heartbeats are reaching vCenter, but the "host sync" is not getting completed and timing out. Why? Because the ACK messages that vCenter is sending are not being processed by the ESXi host. We are currently investigating from the ESXi side which seems to be the top offender, the "miss heartbeats" messages in the vpxd.log are false positives.”
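For anyone who wants to repeat the heartbeat check, here is a rough Python sketch for spotting gaps in a capture. It assumes the capture was taken on the VCSA with something like `tcpdump -tt -n udp port 902 and host <esxi-ip>` (ESXi heartbeats go to vCenter on UDP 902, and `-tt` prints epoch timestamps) and saved to a text file. The function names and the 15-second threshold are my own illustrative choices, not anything from VMware; the threshold assumes a heartbeat roughly every 10 seconds, so adjust it to what you actually see.

```python
import re

# Matches the leading epoch timestamp that `tcpdump -tt` prints, e.g.
# "1511883500.000000 IP 10.0.0.5.40000 > 10.0.0.10.902: UDP, length 66"
TS_RE = re.compile(r"^(\d+\.\d+)\s")

def parse_epoch_timestamps(lines):
    """Collect packet timestamps; lines without a leading epoch value are skipped."""
    stamps = []
    for line in lines:
        m = TS_RE.match(line)
        if m:
            stamps.append(float(m.group(1)))
    return stamps

def find_gaps(stamps, max_gap=15.0):
    """Return (previous, next, gap_seconds) wherever two packets are > max_gap apart.

    max_gap is an assumed threshold, not a VMware-documented value.
    """
    ordered = sorted(stamps)
    return [(a, b, b - a) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]
```

Feed it the lines of the saved capture for one host. An empty result during a disconnect window suggests the heartbeats kept arriving and the problem is elsewhere, which is what I saw.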

Here are some relevant logs I see (some text redacted):

vpxd.log on vCenter when the host disconnects:

2017-11-28T15:39:10.827Z info vpxd[7F34FF162700] [Originator@6876 sub=InvtHostCnx opID=HeartbeatStartHandler-5d19501d] [VpxdIntHost] Missed 11 heartbeats for host <hostname>

2017-11-28T15:39:25.201Z error vpxd[7F34FFBF7700] [Originator@6876 sub=vmomi.soapStub[103] opID=781d2fe1] Error deserializating SOAP response body:
--> Unexpected exception reading HTTP response body: N7Vmacore15SystemExceptionE(Connection reset by peer)
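If it helps anyone compare notes, here is a rough Python sketch that tallies the "Missed N heartbeats" messages per host from a copy of vpxd.log. The regex is inferred from the single log line quoted above, so treat the pattern as an assumption (other builds may format the message differently). The function name is made up for illustration.

```python
import re
from collections import defaultdict

# Pattern inferred from the one vpxd.log line quoted in this post (assumption).
MISSED_RE = re.compile(
    r"^(?P<ts>\S+)\s+\w+\s+vpxd\[.*?\].*"
    r"Missed (?P<count>\d+) heartbeats for host (?P<host>\S+)"
)

def summarize_missed_heartbeats(lines):
    """Return {host: [(timestamp, missed_count), ...]} for matching log lines."""
    events = defaultdict(list)
    for line in lines:
        m = MISSED_RE.search(line)
        if m:
            events[m.group("host")].append((m.group("ts"), int(m.group("count"))))
    return dict(events)
```

Pass it an open file handle for vpxd.log (it only needs an iterable of lines). Seeing several hosts miss heartbeats at the same timestamps points away from any single host's configuration.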

This post https://communities.vmware.com/thread/389935 from 2012 has very similar logs, for what it’s worth. I tried its one suggestion of removing management network redundancy, to no effect. I have not yet tested disabling the firewall as mentioned in that post, because multiple hosts disconnect (even ones not involved in vMotions), so I don’t feel like this is an ESXi issue.

Veeam ONE also reports the following when hosts disconnect. My hunch is that vCenter is sending out some malformed data, but I can’t prove it.

"Failed to collect performance data for object <esxihostname>. A general system error occurred: Invalid response
Initiated by: Veeam ONE Monitor"

My setup: Multiple clusters tied to this one vCenter experience the issue, so a variety of hardware and software is involved across these two clusters. This is a good thing in my opinion, as it rules out a specific ESXi software version.

vCenter: 6.5u1 (no updates)

Hosts in cluster 1: Dell R620s running 6.0, connected to a Nimble iSCSI storage array via Solarflare 10G NICs. Pretty standard redundancy setup: two 10G connections shared for management/vMotion/VM traffic. Load balancing is set to Originating Virtual Port (no LACP/bonding/etc). The VCSA is hosted here.

Hosts in cluster 2: Dell R740XDs running all-flash vSAN on the very latest software/hardware/drivers/etc (6.5u1). Just built it a month ago. These hosts also have the disconnect issues. They cannot possibly blame the network for this; the hosts had issues even when loaded with a minimal number of idle VMs. Hosts are connected to standard Cisco Nexus 9300 switches.

jablue22 · ‎04-19-2018

Did you ever make any progress on this issue? I have had a ticket open for almost a month with very similar behavior: a brand new clean install of VCSA 6.5u1 with new ESXi hosts, and the hosts randomly appear to go offline. We have tried extending the heartbeat timeouts, which helped, but it's still occurring, more so when any changes are made.

mark49808 · ‎05-02-2018

Sorry for the delay. My issue was related to a bug in a NIC driver (Solarflare in my case).
