Hey Everyone,
Hosts in our environment disconnect from vCenter on a regular basis, and I wanted to reach out and see whether anyone else gets this too.
We have 23 hosts, and on average we get several disconnects a day.
Does this happen in your environment often?
Any ideas on what to do and what to check to stop it from happening?
What I have tried:
Rebuilding from a virgin build (going from 3.5 U2 to 3.5 U4) with nothing added. This seems to help, but it still happens (much less often).
Added a hosts file to try to take DNS out of the picture; still happens.
Looked at CPU, memory, disk, and network, and they don't seem to be hitting highs during disconnects.
Ideas?
Matthew
Kaizen!
We had this issue a while ago and our problem was a bad switch. Is everything there working properly?
>We have 23 hosts, and on average we get several disconnects a day.
>Does this happen in your environment often?
Yes, but not now. DNS issues mostly. From your VC ping the hosts by name and by IP. Verify they are the same for each host.
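The name check above can be scripted rather than done by hand. Here is a minimal sketch, assuming you keep your own list of host names and the IPs you expect them to have (the names and addresses in the usage note are hypothetical). It only checks forward lookups from the machine it runs on, so run it on the VC server.

```python
import socket


def check_name_resolution(expected, resolve=socket.gethostbyname):
    """Compare what each name resolves to against the IP we expect.

    expected: dict mapping host name -> expected IP address (your own list).
    resolve:  lookup function; defaults to the system resolver, which also
              honours the local hosts file.
    Returns a dict of mismatches: name -> (expected_ip, resolved_ip or None).
    """
    mismatches = {}
    for name, expected_ip in expected.items():
        try:
            actual_ip = resolve(name)
        except OSError:
            # Lookup failed completely (socket.gaierror is an OSError).
            actual_ip = None
        if actual_ip != expected_ip:
            mismatches[name] = (expected_ip, actual_ip)
    return mismatches
```

For example, check_name_resolution({"esx01.example.com": "10.0.0.11"}) returns an empty dict when the name resolves to the expected address, and an entry for that host when it does not.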
Make sure the /etc/hosts file on one of your disconnected hosts is set up properly. I know you, and I know you did this, but just in case you missed it.
Another thing I did was remove the certs in /etc/vmware/ssl (delete both files), then run: service mgmt-vmware restart.
The issue I found was that the VC connected to an ESX host on one IP, and the ESX host had 2 distinct SC IP addresses.
Eventually I figured out what the problem was, but that was a long time ago, so I am trying to remember exactly which steps I took.
Are your NICs Broadcom NICs? If so, and you have the chance to take a server offline, investigate whether you're suffering from the conflict between Broadcom NICs and USB controllers/drivers described here:
Also, another idea is that you may have a duplex mismatch between your NICs and your switch ports causing the ports to flap.
Datto
I had this type of issue on one of my clusters. As with chamon, it was hardware: one bad cable on a switch-to-switch ISL link made up of a LAG of 4 cables. It was very hard to track down the actual issue, but I would get ping failures from host to host. Just run extended pings between your hosts and see if any drop.
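If you leave an extended ping running for hours, scanning the output by eye gets old fast. A rough sketch for finding gaps, assuming the output was saved to a file and uses the Linux-style "icmp_seq=N" field (Windows ping does not print sequence numbers, so this would not apply there):

```python
import re


def missing_sequences(ping_output):
    """Return the icmp_seq numbers that never got a reply, in ascending order.

    Only gaps between the first and last reply seen are reported, so drops at
    the very end of a run are not detected. Sequence-number wraparound on very
    long runs is not handled.
    """
    seen = sorted({int(n) for n in re.findall(r"icmp_seq=(\d+)", ping_output)})
    if not seen:
        return []
    received = set(seen)
    return [n for n in range(seen[0], seen[-1] + 1) if n not in received]
```

Read the saved ping log into a string and pass it in; an empty list means every sequence number in the range got a reply.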
-KjB
VMware vExpert
Another disconnect happened on one of my hosts. I have been running a continuous ping from vCenter all day waiting for this. As soon as I saw the disconnect, I checked to see if anything dropped. NOPE! The continuous ping was continuous; nothing dropped. So why the disconnect?
Rparker, thanks for the shout out, I double checked the host files just in case a third time, and made sure DNS had the right IP addresses too!
So how does something disconnect when networking never dropped off?
Ideas!????
Matthew
Kaizen!
>Rparker, thanks for the shout out, I double checked the host files just in case a third time, and made sure DNS had the right IP addresses too!
My thinking is it's 100% the vmware agent on the host. That's pretty much the only thing 'disconnecting'.
I still say that deleting the certs and disconnecting the hosts, removing them from the VC, and adding them back is the only way to truly fix that agent.
I agree with RParker as well. It has to be your agents. I would also remove the agents and the vpxuser after the disconnect from VC, just to get new vpx and aam agents installed by vCenter.
-KjB
VMware vExpert
Hey Rparker,
Deleting the certs to fix the agents. Interesting idea. Here is some background info though. I have rebuilt several ESX hosts from scratch using nothing but the virgin install files from VMware. Wouldn't you agree that a virgin rebuild should solve this?
I am thinking out loud now: it seems that some of the hosts disconnect and take a while to reconnect; sometimes I have to restart the management service several times. I wonder if there is a backlog of work on the host that causes the disconnect, and the host works its way through that backlog and finally gets back to responding. This would explain why everything is still pingable but, as far as vCenter is concerned, not there.
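One way to test the "pingable but not really there" idea is to probe the management port alongside the ping. A minimal sketch, with the assumption that TCP port 902 is the port vCenter uses to reach the host in this environment (adjust it if yours differs). A successful connect only shows that something is accepting connections on that port, not that the agent behind it is healthy.

```python
import socket


def tcp_port_responds(host, port=902, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    The default port of 902 is an assumption, not something confirmed in
    this thread.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False
```

Running this in a loop next to the continuous ping and logging the times when it returns False would show whether the port stops answering while ICMP keeps working.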
I feel like Gregory House without the insults!
Matthew
Kaizen!
Did you guys get my response, that I have rebuilt some of the hosts from scratch using a virgin install from VMware? Does that change your perception of the agents issue?
Kaizen!
The agent gets pushed from vCenter, so a fresh install really won't change the agent.
I would still remove the agents on the hosts after disconnecting the hosts from vCenter: search for the vpx and aam RPMs, and then remove them. Then delete the vpxuser and re-register the host. Another thing to try is to disable HA in the entire cluster, and then re-enable it.
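The "search for the vpx and aam RPMs" step can be sketched as a small filter over the text that rpm -qa prints on the host. This is only an illustration: it assumes the agent package names contain "vpx" or "aam", and it lists candidates without removing anything.

```python
def agent_packages(rpm_qa_output):
    """Pick out package names that look like vpx or aam agent packages.

    rpm_qa_output: the text printed by `rpm -qa`, one package per line.
    Matching on the substrings "vpx" and "aam" is an assumption about how
    the agent packages are named; review the result before removing anything.
    """
    matches = []
    for line in rpm_qa_output.splitlines():
        name = line.strip()
        if name and ("vpx" in name.lower() or "aam" in name.lower()):
            matches.append(name)
    return matches
```

Each name returned is a candidate for rpm -e on the host, once the host has been disconnected and removed from vCenter.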
-KjB
VMware vExpert
>Deleting the certs to fix the agents. Interesting idea. Here is some background info though. I have rebuilt from scratch several ESX hosts using nothing but the virgin install files from VMware. Wouldn't you agree that a virgin rebuild should solve this?
Yes it should. Have we ruled out IP address conflict? Maybe some other machines on your network are somehow getting the same IP range?
Do all of the hosts fall out at the same time, or is it just one? Is the vCenter a VM?
Hey Chamon,
No, just one will disconnect here and there. vCenter is a physical server.
Hey guys, I have to ask a dumb question. If I wipe out a host, delete all the partitions, after having removed that host from vCenter, and I repartition, and reinstall a host; why wouldn't that be good enough to wipe out the agents?
Respectfully confused on the agent discussion,
Matthew
Kaizen!
>why wouldn't that be good enough to wipe out the agents
The VC agent is installed directly by the VC. Ensure that VC is at the latest version (U4).
To reinstall it, you can simply remove the host from VC, log in to the ESX console, and run:
rpm -e VMware-vpxa
Andrea
I am having what sounds to me to be a vaguely similar problem.
http://communities.vmware.com/thread/209075
Your network sounds like it's production, so this may not be feasible, but all I did was remove a switch and replace it with a hub, and everything works. I would still like to know why that works though....
A regular hub would not participate in STP as a switch does. It could be an STP blocking port situation in your case.
-KjB
VMware vExpert
Question for you - are you using HP hardware?