VMware Cloud Community
juchestyle
Commander

Host Disconnects

Hey Everyone,

Our environment has host disconnects on a regular basis, and I wanted to reach out and see whether everyone else gets this too.

We have 23 hosts, and on average we get several disconnects a day.

Does this happen in your environment often?

Any ideas on what to do and what to check to stop it from happening?

What I have tried:

Rebuilding from a virgin build (going from 3.5 U2 to 3.5 U4) with nothing added seems to help, but the disconnects still happen (much less often).

Added a hosts file to try to take DNS out of the picture; still happens.

Looked at CPU, memory, disk, and network; none of them seem to be hitting highs during the disconnects.

Ideas?

Matthew

Kaizen!

18 Replies
Chamon
Commander

We had this issue a while ago and our problem was a bad switch. Is everything there working properly?

RParker
Immortal

We have 23 hosts, and on average we get several disconnects a day.

Does this happen in your environment often?

Yes, but not now. DNS issues mostly. From your VC ping the hosts by name and by IP. Verify they are the same for each host.
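A rough way to script the name-versus-IP part of that check from the VC server is sketched below. This is only a sketch: `find_resolution_mismatches` is a hypothetical helper, the host name and address in the usage comment are made up, and it covers the forward lookup only (it does not ping anything).

```python
import socket


def find_resolution_mismatches(expected, resolve=socket.gethostbyname):
    """Return {hostname: resolved_ip_or_None} for hosts that do not resolve as expected.

    expected: dict mapping each ESX host name to the IP you believe it has.
    resolve:  function turning a name into an IPv4 string; defaults to the
              system resolver (DNS plus the local hosts file).
    """
    mismatches = {}
    for name, wanted_ip in expected.items():
        try:
            actual_ip = resolve(name)
        except OSError:
            # The name did not resolve at all (socket.gaierror is an OSError).
            actual_ip = None
        if actual_ip != wanted_ip:
            mismatches[name] = actual_ip
    return mismatches


# Usage with a hypothetical inventory (replace with your own names and IPs):
#   find_resolution_mismatches({"esx01.example.local": "10.0.0.11"})
# An empty dict back means every name resolved to the IP you expected.
```

Running the same list from the VC server and from a host's service console would show whether the two sides disagree about any name.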

Make sure the /etc/hosts for one of your disconnected hosts is set up properly. I know you, I know you did this.. but just in CASE you missed it :)

Another thing I did was to remove the certs in /etc/vmware/ssl (delete both files), then run service mgmt-vmware restart.

The issue I found was that the VC connected to an ESX host on one IP, and the ESX host had 2 distinct SC IP addresses.

Eventually I figured out what the problem was, but that was a long time ago, so I am trying to remember exactly what steps I took.

Datto
Expert

Are your NICs Broadcom NICs? If so, and you have the chance to take a server offline, investigate whether you're suffering from the conflict between Broadcom NICs and USB controllers/drivers described here:

http://www.tuxyturvy.com/blog/index.php?/archives/37-Troubleshooting-VMware-ESX-network-performance....

Also, another idea is that you may have a duplex mismatch between your NICs and your switch ports causing the ports to flap.

Datto

kjb007
Immortal

I had this type of issue on one of my clusters. As with Chamon, it was a hardware fault: one bad cable in a switch-to-switch ISL that was a LAG of 4 cables. It was very hard to track down the actual issue, but I would see ping failures from host to host. Just run extended pings from host to host and see if any drop.
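If you capture those extended pings to a file, something like the sketch below can pick out the dropped replies for you. It assumes Linux-style ping output with an `icmp_seq=` field on each reply line (Windows ping prints no sequence numbers), `missing_sequences` is a hypothetical helper, and it only finds gaps between the first and last reply it sees.

```python
import re

# Matches the sequence number in Linux-style ping replies, e.g.
# "64 bytes from 10.0.0.12: icmp_seq=7 ttl=64 time=0.21 ms"
_SEQ_RE = re.compile(r"icmp_seq=(\d+)")


def missing_sequences(ping_lines):
    """Return the sorted sequence numbers that never got a reply."""
    seen = set()
    for line in ping_lines:
        match = _SEQ_RE.search(line)
        # Only count real replies; error lines such as
        # "Destination Host Unreachable" also carry icmp_seq.
        if match and "bytes from" in line:
            seen.add(int(match.group(1)))
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]


# Usage, assuming the ping output was saved to a file named ping.log:
#   with open("ping.log") as fh:
#       print(missing_sequences(fh))
```

Lining up the missing sequence numbers with the time of a disconnect should show whether the two coincide.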

-KjB

VMware vExpert

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
juchestyle
Commander

Another disconnect happened on one of my hosts. I have been running a continuous ping from vCenter all day waiting for this. As soon as I saw the disconnect, I checked to see if anything dropped. Nope! The continuous ping stayed continuous and nothing dropped, so why the disconnect?

RParker, thanks for the shout-out. I double-checked the hosts files a third time just in case, and made sure DNS had the right IP addresses too!

So how does something disconnect when networking never dropped off?

Ideas!????

Matthew

Kaizen!

RParker
Immortal

RParker, thanks for the shout-out. I double-checked the hosts files a third time just in case, and made sure DNS had the right IP addresses too!

My thinking is it's 100% the vmware agent on the host. That's pretty much the only thing 'disconnecting'.

I still say that deleting the certs and disconnecting the hosts, removing them from the VC, and adding them back is the only way to truly fix that agent.
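To tell an agent problem apart from a network problem, it may help to test something above ICMP, since a ping only proves the host's IP stack is answering and says nothing about the management agent. A minimal sketch of a TCP-level check follows; `tcp_port_answers` is a hypothetical helper, and the ports in the usage comment (443 and 902) are my assumption about what the host's management services listen on, so confirm them for your build.

```python
import socket


def tcp_port_answers(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable names.
        return False


# Usage from the VC server during a disconnect (host name is made up,
# ports are assumed):
#   for port in (443, 902):
#       print(port, tcp_port_answers("esx01.example.local", port))
```

An open port only shows that something is listening, not that the agent is healthy, but a port that stops answering while pings keep flowing would point at the host services rather than the network.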

kjb007
Immortal

I agree with RParker as well. It has to be your agents. I would also remove the agents and the vpxuser after the disconnect from VC, just to get new vpx and aam agents installed by vCenter.

-KjB

VMware vExpert

juchestyle
Commander

Hey Rparker,

Deleting the certs to fix the agents. Interesting idea. Here is some background info though. I have rebuilt several ESX hosts from scratch using nothing but the virgin install files from VMware. Wouldn't you agree that a virgin rebuild should solve this?

I am thinking out loud now: some of the hosts disconnect and take a while to reconnect; sometimes I have to restart the management service several times. I wonder if a backlog of work builds up on the host and causes the disconnect, and the host only starts responding again once it has worked its way through that backlog. That would also explain why everything is still pingable but transparently not there.

I feel like Gregory House without the insults!

Matthew

Kaizen!

juchestyle
Commander

Did you guys get my response that I have rebuilt some of the hosts from scratch using a virgin install from VMware? Does that change your perception of the agents issue?

Kaizen!

Troy_Clavell
Immortal

The agent gets pushed from vCenter when the host is added, so a fresh ESX install really won't change the agent.

kjb007
Immortal

I would still remove the agents from the hosts after disconnecting the hosts from vCenter: search for the vpx and aam RPMs, and then remove them. Then delete the vpxuser and re-register the host. Another thing to try is to disable HA in the entire cluster, and then re-enable it.

-KjB

VMware vExpert

RParker
Immortal

Deleting the certs to fix the agents. Interesting idea. Here is some background info though. I have rebuilt several ESX hosts from scratch using nothing but the virgin install files from VMware. Wouldn't you agree that a virgin rebuild should solve this?

Yes it should. Have we ruled out an IP address conflict? Maybe some other machines on your network are somehow getting addresses from the same IP range?

Chamon
Commander

Do all of the hosts fall out at the same time or is it just one? Is the vCenter a VM?

On May 7, 2009, at 4:59 PM, kjb007 <communities-emailer@vmware.com

juchestyle
Commander

Hey Chamon,

No, just one will disconnect here and there. vCenter is a physical server.

Hey guys, I have to ask a dumb question. If I remove a host from vCenter, wipe it out by deleting all the partitions, then repartition and reinstall it, why wouldn't that be good enough to wipe out the agents?

Respectfully confused on the agent discussion,

Matthew

Kaizen!

AndreTheGiant
Immortal

>why wouldn't that be good enough to wipe out the agents

The VC agent is installed directly by the VC. Ensure that VC is at the latest version (U4).

To reinstall it, you can simply remove the host from VC, log in to the ESX console, and then run:

rpm -e VMware-vpxa

Andrea

Andrew | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
CiscoKid1981
Contributor

I am having what sounds to me to be a vaguely similar problem.

http://communities.vmware.com/thread/209075

Your network sounds like it's production, so this may not be feasible, but all I did was remove a switch and replace it with a hub, and everything works. I would still like to know why that works though....

kjb007
Immortal

A regular hub would not participate in STP as a switch does. It could be an STP blocking port situation in your case.

-KjB

VMware vExpert

Mikeluff
Contributor

Question for you - are you using HP hardware?
