jhunter11
Contributor

ESX Hosts randomly being disconnected from Virtual Center

Hi Guys,

I currently have 8 HP ProLiant DL580 G7 hosts (each with 4 x Intel Xeon X7560 @ 2.27 GHz, 64 logical processors, and 64 GB RAM) running ESXi 4.1.0 Build 260247 in an HA and semi-automated DRS cluster.

In the past month, 3 of these hosts have randomly disconnected from Virtual Center (a VM running on one of the hosts). When I try to reconnect them, it fails with the following error: "Cannot contact the specified host.  The host may not be available on the network, a network configuration problem may exist, or the management services on this host may not be responding."

Steps I have taken to resolve this issue:

1. Tried simply right-clicking the disconnected host and clicking Connect, which is when I get the aforementioned error.

2. Logged into the iLO of the hosts and tried restarting the management agents, at which point the host froze and I had to power the machine off and back on via the iLO.

Of the 3 servers, only 1 ever had an actual hardware issue.  One of the hosts had a bad RAM Cache Module which needed to be replaced.  It has since been replaced.

The first time I called VMware support, I had already hard booted the host and brought it and all the VMs up.  When I told them this, they told me that since we are running ESXi, whenever the server is rebooted it clears out all the logs.  First of all, is this actually true?  Am I missing some way to export those logs to another location?  Because that seems like a very bad model to me, but perhaps I just don't understand how I'm supposed to be getting these logs off.
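On the log question: a common way to keep ESXi logs across a reboot is to forward them to another machine over syslog (in ESXi 4.x this is, as far as I know, the Syslog.Remote.Hostname advanced setting on the host). The sketch below only illustrates the receiving side, assuming plain UDP syslog. The function names, the bind address, and the output format are choices made for this example, not anything VMware ships.

```python
import socket


def open_listener(bind_addr="0.0.0.0", port=514):
    """Open a UDP socket for incoming syslog datagrams.

    Port 514 is the conventional syslog port; binding to it usually
    needs elevated privileges. Both defaults are assumptions.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    return sock


def receive_syslog(sock, out_file, max_messages=None):
    """Write each received datagram to out_file, prefixed with the sender IP.

    Runs forever when max_messages is None; otherwise stops after that
    many datagrams. Returns the number of messages written.
    """
    count = 0
    while max_messages is None or count < max_messages:
        data, (src_ip, _src_port) = sock.recvfrom(8192)
        line = data.decode("utf-8", errors="replace").rstrip("\n")
        out_file.write(f"{src_ip} {line}\n")
        out_file.flush()  # flush so a crash of the collector loses little
        count += 1
    return count
```

Run with a listener from `open_listener()` and a file opened in append mode, this would keep a copy of each host's messages that survives a hard reboot of the host. A proper syslog daemon is the better long-term choice; this is only meant to show that the logs can live somewhere other than the host itself.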

The second time I called VMware support, for a different server than the first, the VMs were still up and running even though the host was disconnected, so I called support before I attempted to do anything.  They went in and looked at the logs and saw some errors suggesting there were problems reading the local disk.  So the VMs, which reside on a SAN, were still "running in memory" as he put it, but the host itself couldn't read its hard disks.  I checked the IML on the iLO and had a guy down in our datacenter check the LEDs, and there were no warnings or hardware failures of any kind.

Last night, this happened on a 3rd host, but when I logged into the iLO initially, the server was locked up, and when I tried to RDP or SSH to any of the VMs on the host I couldn't, so I'm not sure if it was a different issue.

This has happened on the first server 3 or 4 times, the second server 2 times, and now it happened again on this 3rd server.

One thing I read in a community post was that I need to specify the vCenter Server managed IP in Virtual Center.  That had not been done, so I did that.

So, to sum it all up, has anybody ever experienced this before?  Random disconnects without being able to reconnect?  And if anybody can shed some light on the log situation for me, that would be great as well.

Thanks guys, sorry for the Russian novel post.

James

idle-jam
Immortal

We had this before and insisted that the support person read through the logs in detail. We escalated and we shouted, because the initial reason given did not make sense. We have 10 hosts, and hosts were getting disconnected and going down because of the service console port almost daily. In the end it was a faulty network card (not totally dead, but it would die at certain hours of the day and revive after a reboot).

bulletprooffool
Champion

Hi,

Firstly a couple of questions:

  1. Are the 3 faulty hosts different in any way (older than the others . . or newer)?
  2. Are all hosts on the same patch versions?
  3. Are they on the same physical network - different hosts on different switches?
  4. Are all NICs on all hosts configured identically (both at the host end and at the switch port end: link speed, duplex, etc.)?

A few things to check:

  1. Verify all DNS records
  2. Try connecting to the IP rather than the hostname
  3. Try using the FQDN rather than just the hostname
  4. Verify that your network is not dropping
  5. Check for IP address conflicts on your network
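The DNS checks in the list above can be scripted so they are easy to repeat for every host. This is a minimal sketch, assuming you know the IP each host is supposed to have; `check_dns` and its parameters are hypothetical names for this example, and the resolver functions are passed in so the check can be tried without touching a real DNS server.

```python
import socket


def check_dns(name, expected_ip,
              resolve=socket.gethostbyname,
              reverse=socket.gethostbyaddr):
    """Return a list of problems found for one host (empty list means OK).

    Checks that `name` resolves to `expected_ip` and that the reverse
    lookup of `expected_ip` returns `name` (case-insensitive).
    """
    problems = []
    try:
        ip = resolve(name)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return [f"{name}: forward lookup failed ({exc})"]
    if ip != expected_ip:
        problems.append(f"{name}: resolves to {ip}, expected {expected_ip}")
    try:
        ptr_name, _aliases, _addrs = reverse(expected_ip)
    except OSError as exc:  # socket.herror is a subclass of OSError
        problems.append(f"{expected_ip}: reverse lookup failed ({exc})")
    else:
        if ptr_name.lower().rstrip(".") != name.lower().rstrip("."):
            problems.append(
                f"{expected_ip}: reverse lookup gives {ptr_name}, expected {name}"
            )
    return problems
```

Calling it once with the short hostname and once with the FQDN, from the vCenter server, covers items 1 to 3; any non-empty result is worth chasing before looking further.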
One day I will virtualise myself . . .

jhunter11
Contributor

Answers to your questions:

  1. Are the 3 faulty hosts different in any way (older than the others . . or newer)?
    1. No, they were purchased at the same time, and the hardware and ESX version are the same on all of them.
  2. Are all hosts on the same patch versions?
    1. Yes.
  3. Are they on the same physical network - different hosts on different switches?
    1. We have two hosts in each cabinet, so yes, the hosts are plugged into different switches, but they all have access to the same physical networks.
  4. Are all NICs on all hosts configured identically (both at the host end and at the switch port end: link speed, duplex, etc.)?
    1. Yes, they are all configured the same.

A few things to check:

  1. Verify all DNS records
    1. DNS looks good.
  2. Try connecting to the IP rather than the hostname
    1. I will try this if it happens again, hopefully not :)
  3. Try using the FQDN rather than just the hostname
    1. Tried both.
  4. Verify that your network is not dropping
    1. According to our network admin, he's not seeing any dropped packets from or to the hosts themselves.
  5. Check for IP address conflicts on your network
    1. No conflicts.

    xevrebyc
    Contributor

    Have you come up with a solution to your problem?

    I have had to run service mgmt-vmware restart and service vmware-vpxa restart every day over the last week.  Everything worked perfectly until upgrading to 4.1.  After restarting the agents, everything works fine for a short time, then one of the hosts will show as not responding.  Occasionally I am able to reconnect, but eventually one or both become disconnected and need the agents restarted from the console, as SSH will not work either.

    I have 2 Dell R710s, each with 8 NICs.

    4 each -> LAN (Cisco 3560)

    2 each -> SAN LAN (PC5424) -> EQL PS6000E

    2 each -> SAN LAN (PC5424) -> EQL PS6000E

    Right about now I am wishing I had stayed with version 4.

    jhunter11
    Contributor

    Nope, I have not found a solution yet.  It happens very sporadically and not all that frequently (which I guess is a good thing).  We've had these 8 servers up and running for almost 3 months now, and it's only happened 3 times.

    kendalf1
    Contributor

    Any news?

    I am on ESX (not ESXi) 4.1 Update 1 (build 348481).

    I have 3 Dell R710s connected to a LeftHand P4500 SAN on SAN/iQ v9.

    All my hosts disconnect randomly, one about every 3 - 7 days, and which one it will be is very random.

    My VMs continue to run, but sometimes I can't even SSH in or console in.  When I am able to SSH in, sometimes I can restart hostd and sometimes I can't.

    Virtualinfra
    Commander

    1. Is there a firewall between VC and ESXi? If so, make sure the following ports are open.
    The ports below are mandatory:
    902 - UDP/TCP - heartbeat between VC and the ESX host, and vice versa.
    903 - TCP.
    443 - vCenter agent, web access.
    27000, 27010 - license server.
    Download a port query tool to check UDP port connectivity; you can check TCP port status using telnet from the VC server.
    Make sure these are open as well:
    53 - DNS
    80 - HTTP
    22 - SSH
    The above was the issue in my case, and it was resolved after opening the ports.
    Also refer to the KB about entering the vCenter management IP in VC.
    Also check the firmware version of all the hosts.
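The telnet check on the TCP ports above can also be done in a few lines of Python from the VC server. This is a rough sketch: `check_tcp_ports` is a hypothetical helper, and the port list has to come from the list above. It only tests TCP, so the UDP side of 902 still needs a separate port query tool.

```python
import socket


def check_tcp_ports(host, ports, timeout=3.0):
    """Try a TCP connect to each port and return {port: True/False}.

    True means the connection was accepted; False means it was refused,
    timed out, or failed for any other socket-level reason.
    """
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

For example, `check_tcp_ports("esx01.example.local", [902, 903, 443])` (the hostname is a placeholder) would show at a glance whether a firewall is blocking any of the mandatory TCP ports towards that host.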
    Regards
    Dharshan S
    Thanks & Regards Dharshan S VCP 4.0,VTSP 5.0, VCP 5.0

    athlon_crazy
    Virtuoso

    Anyone here have the solution?

    My setups:

    • HQ Site : 5x ESX4.1 + 2x ESX3.5(different cluster)
    • DR Site : 7x ESX4.1

    All my ESX 4.1 hosts are HP blades with an EVA SAN. So far, all ESX 4.1 hosts have twice been marked with a red alarm (network issue), but the weird thing is that I could still access all hosts from vCenter, and this problem didn't happen on my ESX 3.5 hosts. After I reset the alarm to green, everything went back to normal. There has been no network issue so far, and everything has been running fine for the past 2 months.

    http://www.no-x.org

    Generious
    Enthusiast

    Need to see the logs from /var/log/vmkernel

    I have seen similar issues which are related to the storage backend.
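If you have a copy of the vmkernel log, a quick way to narrow it down before posting is to pull out only the lines that mention storage trouble. This is a minimal sketch; `find_matching_lines` is a hypothetical helper, and the keywords you pass in are your own guesses at what is relevant, since the exact wording of the messages depends on the build and the storage stack.

```python
def find_matching_lines(lines, keywords):
    """Return (line_number, text) for every line containing any keyword.

    Matching is case-insensitive and purely textual; line numbers start
    at 1 so they can be quoted back when sharing an excerpt.
    """
    lowered = [k.lower() for k in keywords]
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.lower()
        if any(k in text for k in lowered):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Passing an open file object as `lines` works, because Python file objects iterate line by line, so even a large log does not need to be loaded into memory at once.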

    jmerchan
    Contributor

    Maybe you need to apply this patch

    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=103509...

    I had the same symptoms, and applying this patch solved the disconnection problems.

    Read the third paragraph: "Resolves an issue where ESX host could intermittently lose connection with vCenter Server due to socket exhaustion"

    This patch was released on April 28, 2011.

    MattLeBlanc
    Contributor

    jhunter11,

    Did you ever find a resolution to this?

    We are experiencing the same thing...

    We have a new VMware installation running ESXi 4.1.0 on 2 Dell R710s.

    One host looks "OK" but the VMs are unreachable...

    The other host shows "Disconnected" but the VMs are still running...

    We have been told that our EMC SAN needs a firmware update...

    Thanks,

    Matt

    AlbertWT
    Virtuoso

    Any update on this issue, guys?

    I'm having a similar problem with HP Blade BL465c G8 servers using the Emulex Corp. HP FlexFabric 10Gb 2-port 554FLB Adapter: the ESXi 5.1 U1 hosts randomly get disconnected from vCenter.

    /* Any kind of comment or input would be greatly appreciated */