Hi Guys,
I currently have 8 HP ProLiant DL580 G7 hosts (each with 4 x Intel Xeon X7560 @ 2.27 GHz, 64 logical processors, and 64 GB RAM) running ESXi 4.1.0 Build 260247 in an HA and semi-automated DRS cluster.
In the past month we have had 3 of these hosts randomly disconnect from Virtual Center (a VM running on one of the hosts), and when I go in to reconnect them, it fails with the following error: "Cannot contact the specified host. The host may not be available on the network, a network configuration problem may exist, or the management services on this host may not be responding."
Steps I have taken to resolve this issue:
1. Tried simply right-clicking the disconnected host and clicking connect, which is when I get the aforementioned error.
2. Logged into the ILO of the hosts and tried restarting management agents, which is when the host froze and I had to turn the machine off and back on via the ILO.
Of the 3 servers, only 1 ever had an actual hardware issue. One of the hosts had a bad RAM Cache Module which needed to be replaced. It has since been replaced.
The first time I called VMware support, I had already hard booted the host and brought it and all the VMs up. When I told them this, they told me that since we are running ESXi, whenever the server is rebooted it clears out all the logs. First of all, is this actually true? Am I missing some way to export those logs to another location? Because that seems like a very bad model to me, but perhaps I just don't understand how I'm supposed to be getting these logs off.
The second time I called VMware support, for a different server than the first, the VMs were still up and running even though the host was disconnected, so I called support before I attempted to do anything. They went in and looked at the logs and saw some errors that suggested there were problems reading the local disk. So the VMs, which reside on a SAN, were still "running in memory" as he put it, but the host itself couldn't read its hard disks. I checked the IML on the ILO and had a guy down in our datacenter check the LEDs, and there were no warnings or hardware failures of any kind.
Last night, this happened on a 3rd host, but when I logged into the ILO initially, the server was locked up, and when I tried to RDP or SSH to any of the VMs on the host I couldn't, so I'm not sure if it was a different issue.
This has happened on the first server 3 or 4 times, the second server 2 times, and now it happened again on this 3rd server.
One thing I read in a community post was that I need to specify the vCenter Server managed IP in Virtual Center. That had not been done, so I did that.
So, to sum it all up, has anybody ever experienced this before? Random disconnects without being able to reconnect? And if anybody can shed some light on the log situation for me, that would be great as well.
Thanks guys, sorry for the Russian novel post.
James
We have had this before, and we insisted that the support person read through the logs in detail. We escalated and we shouted, as the initial reason given did not make sense. We have 10 hosts, and hosts were getting disconnected and going down due to the service console port almost daily. In the end it was a faulty network card (not totally dead, but it would die at certain hours of the day and revive after a reboot).
Hi,
Firstly a couple of questions:
Few things to check
Answers to your questions:
Firstly a couple of questions:
No, they were purchased at the same time and all the hardware as well as ESX is the same version.
1. Yes, they are all configured the same.
Few things to check
Have you come up with a solution to your problem?
I have had to run service mgmt-vmware restart and service vmware-vpxa restart every day over the last week. Everything worked perfectly until upgrading to 4.1. After restarting the agents everything works fine for a short time, then one of the hosts will show as not responding. Occasionally I will be able to reconnect, but eventually one or both will become disconnected and need the agents restarted from the console, as SSH will not work either.
I have 2 Dell R710s, each with 8 NICs.
4 each -> LAN(Cisco 3560)
2 each -> SAN LAN(PC5424) ->
EQL PS6000E
2 each -> SAN LAN(PC5424) ->
Right about now I am wishing I had stayed with version 4.
Nope, I have not found a solution yet. It happens very sporadically and not all that frequently (which I guess is a good thing). We've had these 8 servers up and running for almost 3 months now, and it's only happened 3 times.
Any news?
I am on ESX (not ESXi) 4.1 Update 1 (build 348481).
I have 3 Dell R710s connected to a LeftHand P4500 SAN on SAN/iQ v9.
All my hosts disconnect randomly, one about every 3 - 7 days, and it is very random as to which one it will be.
My VMs continue to run, but sometimes I can't even SSH in or console in. Sometimes I can restart hostd when I am able to SSH in, sometimes I can't do that.
Anyone here have the solution?
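One way to narrow this down when a host drops out is to check from another machine which management ports still answer. Below is a minimal sketch in Python, assuming the usual ports: 22 for SSH, 443 for the host's HTTPS management interface, and 902 for vCenter agent/console traffic. The function names are made up for illustration and the host name is whatever you pass on the command line. An open port only shows that something is listening, not that hostd or vpxa is healthy.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def probe(host, ports=(22, 443, 902)):
    """Check each port and return a {port: reachable} mapping."""
    return {port: port_open(host, port) for port in ports}


if __name__ == "__main__":
    import sys

    # Usage (script name is hypothetical): python probe_host.py <host-or-ip>
    if len(sys.argv) > 1:
        for port, ok in probe(sys.argv[1]).items():
            print(f"{sys.argv[1]}:{port} {'open' if ok else 'NOT reachable'}")
```

If all three ports stop answering while the VMs keep running, that points more toward the management network or the host itself than toward a single hung agent.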
My setup:
All my ESX 4.1 hosts are HP blades with an EVA SAN. So far, all the ESX 4.1 hosts have twice been marked with a red alarm (network issue), but the weird thing is that I could still access all the hosts from vCenter, and this problem didn't happen to my ESX 3.5 hosts. After I reset the alarm to green, everything went back to normal. There has been no actual network issue so far, and everything has been running fine for the past 2 months.
Need to see the logs from /var/log/vmkernel
I have seen similar issues that were related to the storage backend.
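If you want to pre-screen that log before posting it, here is a rough sketch in Python that counts how many lines mention a few storage- and network-related keywords. The keyword list and function names are my own guesses for illustration, not an official list of vmkernel messages, and it assumes you have copied the log off the host to a machine with Python 3.

```python
from collections import Counter

# Hypothetical keywords; adjust to whatever actually shows up in your log.
KEYWORDS = ("scsi", "nmp", "timeout", "failed", "link down")


def count_keyword_lines(lines, keywords=KEYWORDS):
    """Count how many lines mention each keyword (case-insensitive)."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for kw in keywords:
            if kw in lowered:
                counts[kw] += 1
    return counts


def scan_file(path, keywords=KEYWORDS):
    """Scan a log file copied from the host, e.g. a copy of /var/log/vmkernel."""
    with open(path, errors="replace") as fh:
        return count_keyword_lines(fh, keywords)


if __name__ == "__main__":
    import sys

    # Usage (script name is hypothetical): python scan_log.py <copied-log-file>
    if len(sys.argv) > 1:
        for kw, n in scan_file(sys.argv[1]).most_common():
            print(f"{n:6d}  {kw}")
```

A high count for one keyword is only a hint about where to start reading; the actual lines around the time of the disconnect are what matter.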
Maybe you need to apply this patch
I had the same symptoms and applying this patch solved the problems of disconnection
Read the third paragraph: "Resolves an issue where ESX host could intermittently lose connection with vCenter Server due to socket exhaustion"
This patch was released on April 28, 2011.
Jhunter1,
Did you ever find a resolution to this?
We are experiencing the same thing...
We have a new VM installation running ESXi 4.1.0 with 2 Dell R710s.
One host looks "OK" but the VMs are unreachable...
The other host shows "Disconnected" but the VMs are still running...
We have been told that our EMC SAN needs a firmware update...
Thanks,
Matt
Any update on this issue, guys?
I'm having a similar problem with HP BL465c G8 blades using the Emulex Corp. HP FlexFabric 10Gb 2-port 554FLB Adapter: the ESXi 5.1 U1 hosts randomly get disconnected from vCenter.