VMware Cloud Community
nyjz1298
Contributor

Hosts disconnect/reconnect in vCenter within a few seconds

I have 70 ESX hosts in vCenter on several different continents running ESX 3.0.1 to 4.1 with patches.  I recently edited the vpxd.cfg file and added:

<heartbeat>
       <notRespondingTimeout>60</notRespondingTimeout>
</heartbeat>

I thought it would eliminate the false positives, which also alert many people, some of whom get very worried when they see an alert like this.  How can I stop my hosts from causing this sort of quick disconnect / reconnect?  The change I made to vCenter doesn't seem to be working.  I'm seeing alarms overnight that show a disconnect and a reconnect within the very same minute.  It happens to at least 3 hosts per day.  All of these false alarms are causing my department to ignore potentially valid alerts.  After I made the change to the cfg file I restarted the entire server, so I know it's in effect.

Here's the entire file... Let me know if there is a problem or what I can do.  Thanks!

<config>
  <level id="VmCheck">
    <logLevel>info</logLevel>
    <logName>VmCheck</logName>
  </level>
  <level id="CpuFeatures">
    <logLevel>info</logLevel>
    <logName>CpuFeatures</logName>
  </level>
  <log>
     <maxFileNum>10</maxFileNum>
     <level>info</level>
     <memoryLevel>verbose</memoryLevel>
     <compressOnRoll>true</compressOnRoll>
  </log>
  <alert>
    <log>
      <enabled>true</enabled>
    </log>
  </alert>
  <vmacore>
    <threadPool>
      <TaskMax>90</TaskMax>
    </threadPool>
    <ssl>
      <useCompression>true</useCompression>
    </ssl>
  </vmacore>
  <vpxd>
    <das>
      <serializeadds>true</serializeadds>
      <slotCpuMinMHz>256</slotCpuMinMHz>
      <slotMemMinMB>0</slotMemMinMB>
    </das>
    <filterOverheadLimitIssues>true</filterOverheadLimitIssues>
    <heartbeat>
      <notRespondingTimeout>60</notRespondingTimeout>
    </heartbeat>
  </vpxd>
</config>
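
As a quick sanity check that a hand-edited vpxd.cfg still parses and that the timeout sits under config/vpxd/heartbeat as in the file above, here is a minimal Python sketch. The function name and the trimmed sample are illustrative assumptions, not part of any VMware tool, and the sketch says nothing about whether vCenter actually honours the value.

```python
import xml.etree.ElementTree as ET


def heartbeat_timeout(cfg_text):
    """Return notRespondingTimeout (seconds) from vpxd.cfg text, or None if absent.

    Raises xml.etree.ElementTree.ParseError if the text is not well-formed XML.
    """
    root = ET.fromstring(cfg_text)  # root element is <config>
    node = root.find("./vpxd/heartbeat/notRespondingTimeout")
    if node is None or node.text is None:
        return None
    return int(node.text.strip())


# Trimmed-down copy of the file posted above (assumption: same nesting).
SAMPLE = """<config>
  <vpxd>
    <heartbeat>
      <notRespondingTimeout>60</notRespondingTimeout>
    </heartbeat>
  </vpxd>
</config>"""

if __name__ == "__main__":
    print(heartbeat_timeout(SAMPLE))
```

If the heartbeat block were placed outside the vpxd element, this lookup would return None, which is one way to spot a misplaced setting.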

17 Replies
Troy_Clavell
Immortal

Can we assume you are managing these ESX(i) hosts with vCenter 4.1?

nyjz1298
Contributor

Yes, 4.1.  These types of issues were happening prior to cutting over from 4.0 on 2003 to 4.1 on 2008 R2.

Troy_Clavell
Immortal

"These types of issues were happening prior to cutting over from 4.0 on 2003 to 4.1 on 2008 R2."

Do you have all your ESX(i) hosts added to the vCenter inventory using FQDN, and is proper name resolution set up in the environment?

nyjz1298
Contributor

Yes.  All are pointing to valid DNS servers and all have FQDNs.  Like I said, the server will drop out of vCenter and say "disconnected", then come back, sometimes within a matter of seconds.  I've seen this at my previous job.  This seems to be a somewhat common issue from what I can tell.

Troy_Clavell
Immortal

What is the health of your vCenter DB?  What are your Statistics Intervals set to?

Finally, maybe the article below will help.

http://kb.vmware.com/kb/1003409

nyjz1298
Contributor

DB is just fine.  We actually had some serious problems with our DB server about a month ago, but some heavy-hitting databases were removed and now things are running smoothly.  Before that, the vCenter DB would stop or the stat collection wouldn't complete.  During that time the disconnects/reconnects were no more frequent than they are now.  The vCenter DB is now always up and never stops.  I keep my vSphere Client open and logged on for weeks at a time with no logout.  Stats are at 3 / 3 minutes, 2 / 30 min, 2 / 2 hr, 2 / day.

sajitnair
Contributor

I would suggest checking the SSL settings and whether certificate verification is enabled. I have been battling this for a while now. What I have seen is that the disconnect problem shows up on the vCenter that has SSL verification disabled (hosts are using the default certs). The vCenter on which SSL verification is enabled (hosts also using default certs) has not had a host disconnect.

From the vpxd, vpxa and hostd logs it seems that when SSL verification is disabled and the hosts are using the default defective certs, the SSL handshake takes longer than usual, causing a timeout. This timeout issue has been addressed in KB http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=102244.... We made this change but still saw the disconnects happening. Looking through the logs, it seemed that the handshake was taking much longer than the recommended value in the KB.

I believe that either SSL verification should be enabled and the default certs accepted as they are, or proper certs installed and SSL verification enabled. If this setting is True for the hosts (it is by default) - Config > Defaults > security > host > ruissl > Require SSL to be used when communicating with the host over port 902 - then SSL verification is desired and proper certificates should be installed on the host. Otherwise all we can do is keep increasing timeout values.

Also, the good part is that the disconnect lasts only for a few seconds (3000 ms to be exact). The host is not affected and neither are the VMs, but it generates an upsetting alarm and questions are asked.

JDLangdon
Expert

Here is something to try, but it might not be relevant.  When I set up a cluster of ESX servers I always populate the HOSTS file with the FQDNs and IP addresses of all the hosts in the cluster and of the vCenter server that manages them.

Also, ensure that all of your ESX servers are using lowercase letters for their names and that they are connected to the vCenter server using lowercase letters.  I read somewhere that HA/DRS doesn't like capital letters and that this could be one of the many side effects.

sajitnair
Contributor

I would agree if it were an HA problem. Also, the lowercase requirement applies only to entries in the host, network and .conf files. In my case the host just disconnects for a few seconds and connects back.

Nergohs
Contributor

We are seeing the same thing and I'm seeing certificate warnings in the log files.  Did that end up being the cause of the problem?

davel1970
Contributor

Hi,

We're seeing the same thing, but only on our Dell R710 hosts. Our IBM 3850s are not showing this problem. For those who are seeing the same issue: what hardware are you using for your hosts?

mattslotten
Contributor

We too are experiencing the issue with vCenter.  We haven't yet modified vpxd.cfg to try setting the heartbeat to a longer interval.  The issue is impacting our View VDI infrastructure as well; View desktop recomposes fail because they cannot communicate with the ESX hosts during the 2-4 second "hiccups."

We're using Dell R610 hosts.  This issue just started cropping up last week, although we've had our environment up for a couple of months now.

I have an open ticket with VMware and am awaiting a response and will share what I learn.

Nergohs
Contributor

Was a resolution to this issue ever identified?

We are seeing the same behavior.

DSTAVERT
Immortal

vCenter Disconnects
http://kb.vmware.com/kb/1003409

-- David -- VMware Communities Moderator

sajitnair
Contributor

The KB helps to troubleshoot hosts that are disconnected/not responding and stay that way until fixed. In most of our cases hosts disconnect and reconnect within a minute (or even less), and by the time you respond to the alarm (email alert) by opening the VC, you see everything working with no issues, except for the fact that the events do indicate the disconnect.

VC determines host status from the host heartbeat signal, which is managed by the host management agent service. The heartbeat is sent over the network to VC's UDP port 902. If the host fails to send a heartbeat, or a network issue or VC performance issue occurs, so that VC doesn't receive the signal in time for whatever reason, VC will mark the host as disconnected or not responding.
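
To make the timing concrete, here is a minimal Python sketch of that idea: given the times at which heartbeats arrived, it lists the windows in which the gap exceeded a timeout. This is a deliberately simplified model and an assumption on my part, not vpxd's actual logic; the function name and the sample timestamps are made up for illustration.

```python
def missed_heartbeat_windows(arrivals, timeout):
    """Return (last_seen, next_seen) pairs where the gap exceeded `timeout` seconds.

    `arrivals` is a sorted list of heartbeat arrival times in seconds.
    """
    windows = []
    for prev, cur in zip(arrivals, arrivals[1:]):
        if cur - prev > timeout:
            windows.append((prev, cur))
    return windows


if __name__ == "__main__":
    # Hypothetical arrival times: a 75-second silence between t=20 and t=95.
    arrivals = [0, 10, 20, 95, 105]
    print(missed_heartbeat_windows(arrivals, 60))
    print(missed_heartbeat_windows(arrivals, 80))
```

In this model a 60-second timeout flags the silence between 20 and 95, while an 80-second timeout rides it out, which is the trade-off behind raising the timeout values.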

I have seen this caused by the following:

> SSL timeout setting http://kb.vmware.com/kb/1020210

> Heartbeat response time http://kb.vmware.com/kb/1005757

> Storage vMotion causing vpxa crash/recover http://kb.vmware.com/kb/1027919

> Network latency or drops

> VC performance issues (specifically Tomcat spiking CPU on 4.1)

Here are some of the settings I have changed, which seem to have given some relief.

      <vpxd>
        <heartbeat>
          <maxHandlers>10</maxHandlers>
          <queueTimeout>15</queueTimeout>
          <notRespondingTimeout>60</notRespondingTimeout>
          <queueWatermark>10</queueWatermark>
        </heartbeat>
      </vpxd>

/etc/opt/vmware/vpxa/vpxa.cfg

      <vpxa>
        <heartbeat>
          <interval>20</interval>
        </heartbeat>
      </vpxa>

/etc/vmware/hostd/config.xml

      <ssl>
        <handshakeTimeoutMs>300000</handshakeTimeoutMs>
      </ssl>
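
Since all of these files are edited by hand, a mistyped closing tag is easy to miss. Here is a minimal Python sketch that reports where a fragment stops being well-formed before you restart any services. The function name is an illustrative assumption, and the check covers XML syntax only, not whether the setting names are ones the services recognise.

```python
import xml.etree.ElementTree as ET


def first_xml_error(fragment):
    """Return None if `fragment` is well-formed XML, else (line, column, message)."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError as err:
        line, column = err.position  # position reported by the parser
        return (line, column, str(err))
    return None


GOOD = "<vpxa><heartbeat><interval>20</interval></heartbeat></vpxa>"
# Same idea, but the closing tag is missing its slash.
BAD = "<vpxa><heartbeat><interval>20<interval></heartbeat></vpxa>"

if __name__ == "__main__":
    print(first_xml_error(GOOD))
    print(first_xml_error(BAD))
```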

davel1970
Contributor

Hi,

Thanks for pointing those kb articles out.

In my environment I thought it was odd that the Dell R710s were showing the problem but the older IBM boxes weren't. However our Dell boxes are busier.

I've tried changing the timeouts.

Cheers,

Dave

nyjz1298
Contributor

Nope... Still an ongoing issue.  We have 99 ESX classic 4.1 hosts on vCenter 4.1.  The issue still affects at least 5 random hosts each week.  They drop out for literally 2 seconds, then pop back.  I need to dig deeper and review the thread suggestions this week.  Thanks everyone!
