VMware Cloud Community
OptimalZ06
Contributor

Connectivity issues to hosts over MPLS

Hello,

We have been troubleshooting an issue that prevents our vCenter server from connecting to some of our remote hosts. This has impacted 2 different vCenter servers running 5.1 and 5.5 on Windows Server 2008 R2 and 2012 R2.

Process leading to the error

  • We are able to add hosts to a data center after a host reboot or fresh vCenter install
  • If the MPLS link at our primary data center goes down (for maintenance or otherwise), we lose connectivity to all remote hosts
  • Only one data center, our secondary data center, is able to reconnect without issue
  • No other remote sites are able to reconnect

Troubleshooting

  • Disabled IPv6 across VMware infrastructure (Windows Servers, ESXi hosts)
  • Increased handshakeTimeoutMs to 120000
  • Restarted management network
  • Cleared ARP table
  • Lockdown mode is disabled
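
One further check that fits this list is to repeat, from the vCenter server itself, the same kind of TCP connect that vpxd attempts. The following is a minimal sketch, not something taken from the original environment: `probe_tcp` and `probe_hosts` are hypothetical helper names, and the default port list assumes 443 (the port shown in the vpxd log) plus 902 (the port normally used between vCenter and the host agent).

```python
import socket
import time


def probe_tcp(host, port, timeout=5.0):
    """Attempt one TCP connect; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        # create_connection resolves the name and completes the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # covers timeouts, refusals and DNS failures
        return False, time.monotonic() - start, str(exc)


def probe_hosts(hosts, ports=(443, 902), timeout=5.0):
    """Probe every host/port pair and return {(host, port): result}."""
    return {
        (host, port): probe_tcp(host, port, timeout)
        for host in hosts
        for port in ports
    }


# Example use (host name is a placeholder, not a real system):
#   for target, (ok, secs, err) in probe_hosts(["esxi01.example.com"]).items():
#       print(target, "open" if ok else f"FAILED after {secs:.1f}s: {err}")
```

If the probe times out from the vCenter server while SSH from a workstation still works, that would suggest the path between vCenter and the host is what differs, rather than the host's management services.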

Notes

  • We have a single ESX 4.1 host that is able to reconnect without issue (it has experienced only one disconnect and, unlike its 5.5 counterpart, came back on its own)
  • We're able to connect to the hosts via the vSphere console and SSH without issue
  • The network team is troubleshooting the issue as well, but we've not been able to rule out VMware as the culprit

Logs


vpxd

2014-09-24T14:00:14.785-05:00 [05920 warning 'Default'] Failed to connect socket; <io_obj p:0x000000000d10a128, h:3876, <TCP '0.0.0.0:0'>, <TCP '10.x.x.16:443'>>, e: system:10060(A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond)

2014-09-24T14:00:14.785-05:00 [05920 error 'HttpConnectionPool-000001'] [ConnectComplete] Connect failed to <cs p:000000000ee4c730, TCP:xxxesxi01.xxx.com:443>; cnx: (null), error: class Vmacore::SystemException(A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond)

2014-09-24T14:00:14.785-05:00 [05852 error 'httphttpUtil' opID=6159800D-000000AB-d6] [HttpUtil::ExecuteRequest] Error in sending request - A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

2014-09-24T14:00:14.785-05:00 [05852 error 'vpxdvpxdHostAccess' opID=6159800D-000000AB-d6] [VpxdHostAccess::Connect] Failed to discover version: vim.fault.HttpFault

2014-09-24T14:00:14.786-05:00 [05852 info 'commonvpxLro' opID=6159800D-000000AB-d6] [VpxLRO] -- FINISH task-internal-5070 -- datacenter-31 -- vim.Datacenter.queryConnectionInfo --

2014-09-24T14:00:14.786-05:00 [05852 info 'Default' opID=6159800D-000000AB-d6] [VpxLRO] -- ERROR task-internal-5070 -- datacenter-31 -- vim.Datacenter.queryConnectionInfo: vim.fault.NoHost:
--> Result:
--> (vim.fault.NoHost) {
-->    dynamicType = <unset>,
-->    faultCause = (vmodl.MethodFault) null,
-->    name = "xxxesxi01.xxx.com",
-->    msg = "",
--> }
--> Args:
-->

Connection error

Call "Datacenter.QueryConnectionInfo" for object "XXX" on vCenter Server "VCENTER" failed.
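
The `system:10060` code in the first vpxd line is the Winsock timeout error (WSAETIMEDOUT), meaning the TCP connect to port 443 never got an answer. To see which hosts are affected and how often, the vpxd log can be summarized with a short script. This is only a sketch: the regular expression is written against the single "Failed to connect socket" line quoted above, so other vpxd versions may need a different pattern, and `count_connect_failures` is a hypothetical name.

```python
import re
from collections import Counter

# Pattern modeled on the one sample line in this thread (an assumption).
# The remote endpoint is the last <TCP '...'> before ">>, e: system:<code>(".
CONNECT_FAILURE = re.compile(
    r"^(?P<ts>\S+) \[\d+ \w+ '[^']*'[^\]]*\] Failed to connect socket; "
    r".*<TCP '(?P<remote>[^']+)'>>, e: system:(?P<code>\d+)\("
)


def count_connect_failures(lines):
    """Count failed connects per (remote endpoint, system error code)."""
    counts = Counter()
    for line in lines:
        match = CONNECT_FAILURE.match(line)
        if match:
            counts[(match.group("remote"), int(match.group("code")))] += 1
    return counts


# Example use (the log path is a placeholder):
#   with open("vpxd.log", encoding="utf-8", errors="replace") as fh:
#       for (remote, code), n in count_connect_failures(fh).most_common():
#           print(remote, code, n)
```

A count dominated by one error code across all remote sites except the secondary data center would be consistent with a routing or firewall state problem rather than a per-host fault.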

Thanks

Removed network details Message was edited by: OptimalZ06

10 Replies
OptimalZ06
Contributor

Does anybody have anything at all we can attempt on the VMware side?

OptimalZ06
Contributor

We have been unable to correct this issue, which is now impacting even more (4) remote sites. Does anybody out there have any insight whatsoever?

OptimalZ06
Contributor

You can add disabling proxy ARP on the Cisco firewalls to the list of potential fixes that did not solve the issue.

ramig
Contributor

Hi,

Any success with solving this issue?

Thanks.

OptimalZ06
Contributor

Sadly, no. We have determined that if you refresh the MAC address on the management VMkernel interface, you can re-add the host to vCenter. This does not require a host reboot. However, you will lose all connectivity to the host, so you must have some kind of remote management interface (such as iDRAC) to restart the management network.

I had created a script that would automate this process, but couldn't get past this roadblock. Still searching for answers...

lonni3b
Contributor

We have the same issue, except that our main site never goes down; only 3 of our 12 sites drop periodically. The only way we know to get reconnected is to restart the management network on the ESX server. We are going to try clearing the ARP cache on the switch the next time this happens; if that doesn't work, we'll try rebooting the switch. I'll post the results here.

Feuerio
Contributor

Hello,

Any update on this issue?

We have had the same problem for about four months, since updating vCenter 5.5 from U3d to U3e. Unlike yours, our affected hosts reconnect automatically after about 10 seconds. It occurs randomly: some days nothing, then suddenly one or more hosts per day, and only with hosts in secure zones that have a firewall between them and vCenter (but the Cisco firewall team sees nothing).

Any ideas are welcome.

GordonRamsay
Contributor

I have a few questions around your network layout.

Are you having a DNS failure around your VMware environment when your primary MPLS is down?

How is your network configured to handle your primary MPLS outage?  Are you using all Cisco hardware?  HSRP for redundancy?  Are your host management interfaces on separate physical NICs or on shared?
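
The DNS question can be checked directly during an outage by resolving the host names from the vCenter server. The sketch below is one hedged way to do that with the Python standard library; `resolve_all` is a hypothetical helper, and the names passed to it would be whatever FQDNs the hosts were added to vCenter with.

```python
import socket


def resolve_all(names, port=443):
    """Resolve each name; return {name: (sorted_addresses, error_text)}."""
    results = {}
    for name in names:
        try:
            infos = socket.getaddrinfo(name, port, type=socket.SOCK_STREAM)
            # Each entry's sockaddr starts with the resolved IP address
            addresses = sorted({info[4][0] for info in infos})
            results[name] = (addresses, "")
        except socket.gaierror as exc:
            results[name] = ([], str(exc))
    return results


# Example use (host name is a placeholder):
#   for name, (addrs, err) in resolve_all(["esxi01.example.com"]).items():
#       print(name, addrs if addrs else f"DNS FAILED: {err}")
```

Running this once while the primary MPLS is up and once while it is down would show whether name resolution changes between the two states.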

bas000m
Contributor

Hi Feuerio,

I have the same issue; vCenter is 5.5 U3e. Hosts are constantly disconnecting from and reconnecting to vCenter. How did you manage to fix that?

Thank you in advance

Basem
