VMware Cloud Community
chukarma
Enthusiast

ESX host intermittently losing connection

Hi all,

I need some help with this one. We have a 12-host cluster running ESX 3.5 U4. Occasionally, some of the hosts in this cluster trigger this message:

Host connection state changed status from Green to Red

When this happens, I look at VC and the host shows as disconnected. After a brief moment (5 seconds), the host reconnects to the cluster automatically. During this time, none of the VMs are affected.

I have scoured the logs for the cause and even opened a case with VMware support, to no avail. Has anyone experienced this issue? Any help is appreciated.

Thanks,

Daniel

5 Replies
Troy_Clavell
Immortal

I would ensure proper name resolution is set up not only between all your ESX hosts, but also between vCenter and the ESX hosts.

If name resolution is set up correctly, you can also restart hostd:

>service mgmt-vmware restart
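
For the name resolution check, here is a rough Python sketch that does a forward and a reverse lookup for each name and reports mismatches. It is only an illustration: the host names in the usage comment are made up, and the function name check_name_resolution is my own, not part of any VMware tool. Run it from the vCenter server and, if possible, from a machine on the same network as the service consoles.

```python
import socket


def check_name_resolution(hostnames, resolve=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Return a list of (hostname, problem) tuples.

    An empty list means every name resolved forward and back consistently.
    The resolve/reverse parameters exist only so the lookups can be swapped
    out (for example in a test); by default the system resolver is used.
    """
    problems = []
    for name in hostnames:
        try:
            ip = resolve(name)  # forward lookup: name -> IPv4 address
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            problems.append((name, f"forward lookup failed: {exc}"))
            continue
        try:
            reverse_name = reverse(ip)[0]  # reverse lookup: address -> name
        except OSError as exc:  # socket.herror is a subclass of OSError
            problems.append((name, f"reverse lookup of {ip} failed: {exc}"))
            continue
        # Compare short names only, so "esx01" and "esx01.example.local" match.
        if reverse_name.split(".")[0].lower() != name.split(".")[0].lower():
            problems.append((name, f"{ip} reverses to {reverse_name}"))
    return problems


# Example usage (hypothetical host names, replace with your own):
# for name, problem in check_name_resolution(["esx01.example.local",
#                                             "vcenter.example.local"]):
#     print(f"{name}: {problem}")
```

If the list comes back empty from both sides, DNS is probably not the culprit and the hostd restart is the next thing to try.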

EGRAdmin
Enthusiast

Is your management NIC plugged into separate switches than your VM networks?

I've seen this exact behavior when the above conditions are true (also on 3.5u4) and the Network Management team makes unscheduled changes during business hours.

Is it possible someone is running cables / power and disrupting existing cables that may have damaged RJ-45 ends?

If it's a Cisco switch with expansion modules, have your network team check that switch's error logs. When the modules or chassis start failing, they experience random module outages. I've seen that happen with Cisco MDS9216A SAN (fiber) switches. It's possible the issue is switch related.

chukarma
Enthusiast

Thanks, our management network is on a different Cisco switch than our VM network. I will have the network guys check the management switch. This is happening almost daily now. I have checked the DNS settings and everything appears to resolve correctly.

EGRAdmin
Enthusiast

No problem.

Another good check is to see which hosts are having the random issue and which switch they are plugged into. It could all trace back to a single switch having issues.

That would explain why some hosts have the issue but not all, and also why the guests on the other networks continue to operate without any problems.

I bet it's the Cisco rebooting a component module. When those parts fail, the module can recycle continuously, and each outage can be pretty quick (30 seconds, give or take), so it may be back online before the VC reports their connectivity as online.

If they give you a hard time about not finding any issues, and if you have any other systems plugged into that switch, you could run a small script with something like:

ping -n 20 ip-of-something-on-another-switch >> c:\somelogfile.txt

Set that to run every 5 minutes (or less) rather than running ping -t, so it doesn't flood the network with ping packets and lock down the network port (if they have security stuff configured).

It'll keep appending the log and give you a relative time as well as confirm if multiple systems on that same switch are having issues.

It would also rule out that the issue is isolated to your ESX hosts.
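
If you'd rather have explicit timestamps in the log than rely on relative timing, the same idea can be sketched in Python. This is just a rough example under a few assumptions: the function names, target address and log path are placeholders I made up, and the ping exit code is only a coarse signal (on Windows it can be 0 even when replies come back as unreachable), which is why the raw output is logged as well.

```python
import datetime
import platform
import subprocess


def build_ping_command(target, count=20):
    """Build a ping command for one short burst of echo requests."""
    # Windows ping takes -n for the count; Linux and macOS ping take -c.
    flag = "-n" if platform.system() == "Windows" else "-c"
    return ["ping", flag, str(count), target]


def log_ping(target, logfile, count=20, run=subprocess.run):
    """Run one ping burst and append a timestamped result to logfile.

    Returns True if ping exited with status 0. The run parameter exists only
    so the subprocess call can be replaced (for example in a test).
    """
    started = datetime.datetime.now().isoformat(timespec="seconds")
    result = run(build_ping_command(target, count),
                 capture_output=True, text=True)
    ok = result.returncode == 0
    with open(logfile, "a") as log:  # append, so earlier runs are kept
        log.write(f"{started} target={target} {'OK' if ok else 'FAILED'}\n")
        log.write(result.stdout)  # keep the raw output for the network team
    return ok


# Example usage (placeholder address and path, replace with your own):
# log_ping("ip-of-something-on-another-switch", r"c:\somelogfile.txt")
```

Schedule it every 5 minutes with Task Scheduler (or cron on a Linux box) and run it from a couple of systems on the suspect switch, then line the FAILED timestamps up against the disconnect events in VC.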

chukarma
Enthusiast

It's a little odd because only this host is behaving this way in the 12-host cluster. All 12 hosts are connected to the same management switch stack. I am still waiting for some sort of resolution from VMware (approaching 2 weeks now). In the meantime, I found this KB regarding a memory leak in the CIM service. I applied the fix, but unfortunately it hasn't fixed the problem yet. Hopefully it will be helpful for someone else.
