herbalrhyme
Contributor

ESX 5 Host Unresponsive after 48hrs

I have been searching the internet for over a week, but I cannot find a solution. I just provisioned a new ESXi 5 host on a Dell PowerEdge 1950 III. It worked great in my testing environment with DHCP and ran for weeks at a time before I moved it into my colocation. Now, after about 48 hours, the host stops responding to management: I can still ping it, and all the VMs on it run without issue, but I am unable to connect via CLI, the vSphere Client, or HTTP(S) to manage it. When I restart it, it works fine for about two days before the management interface becomes unavailable again. It used to be connected to a vCenter Server, but now it is purely on its own. The only difference now is that it is in a static IP environment. When I try the vSphere Client I get this error:

[Attached screenshot: Error.JPG]

mcowger
Immortal

Sounds like maybe your colo provider's firewall is doing naughty things!

--Matt VCDX #52 blog.cowger.us
herbalrhyme
Contributor

There are no firewalls in the way that would stop this, and I can ping the host without any problem anyway.

nspofadmin
Contributor

Has anyone found a solution to this issue? We're hitting the same problem, and since we moved to ESXi 5 with vCenter 5 it has happened three times now.

At the host console, the host does not seem to be locked up; we can still access the System Customization screen. We try to restart the server (<F12>), but it does not seem to work. After waiting over 15 minutes, we cold boot the host. Once the host is back up, it works as normal. We have three hosts and the problem seems to hit all three at random, so we don't think it's a hardware issue. Could someone point us to which logs we should be looking at?

Thank you
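
For anyone else chasing this before the cold boot: the standard ESXi 5 host logs to pull are below (default local paths; adjust if you log to a syslog server or a scratch datastore).

# management agent (hostd) log: an unresponsive management interface shows up here
tail -n 200 /var/log/hostd.log
# VMkernel log: storage and path errors usually land here
tail -n 200 /var/log/vmkernel.log
# vCenter agent (vpxa) log: host-to-vCenter communication
tail -n 200 /var/log/vpxa.log
# or grab everything in one support bundle for VMware
vm-support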

mathewford
Contributor

We are encountering the same issue. It has happened on three of our 12 ESXi 5.0 hosts in the past two weeks. Basically, performance of the VMs on the host becomes sluggish, and for brief periods vCenter is unable to contact the host. These gaps can easily be seen in the Performance -> Advanced -> System graph. Over a period of an hour or two, the stretches of unreachable time get worse and worse. vMotions of VMs begin failing with errors. Eventually the host becomes completely unreachable and requires a restart. After the host becomes unreachable, we see the same behavior where a restart from the console (F12) sits there and never completes, requiring a hard reset.

We upgraded all our hosts to v5.0 around 100 days ago and are only now seeing this phenomenon. It feels like it is related to a combination of uptime and load on the host. We have a ticket open with VMware and will post here as we discover anything.

Calypso971
Contributor

Same problem here. The system becomes unresponsive after two or three days; the management address of the ESXi 5 server responds to ping, and the guests on the server respond to ping as well, but nothing more. Rebooting from the console doesn't work: after logging in, the system reports that it is rebooting, but nothing happens. Only a power cycle gets the machine going again. When the machine was still on ESXi 4.1 it never had problems.

The latest patches (December) are applied, and memory tests don't reveal any problems. System: Intel 5520HC board with 2x Xeon 5520 processors, 32 GB of memory, and a Dell PERC 6 RAID controller with the VMs on local disk (it is a standalone test system).

Naydonov
Contributor

Hi Matthew

Have you been able to resolve this issue? The exact same thing happens to my ESXi hosts approximately once a month. Same symptoms.

mathewford
Contributor

Yes and no. VMware blames the storage vendor, and the storage vendor blames VMware. It is definitely a bug in v5, since it is happening across different storage providers. We were able to isolate that it occurs when we have iSCSI multipathing turned on to the arrays and/or hosts; once we disabled multipathing, we haven't had any more crashes. So, the good news is that we no longer lose hosts every couple of weeks. The bad news is that the redundancy we got by architecting multiple paths apparently can no longer be used.
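
If anyone wants to check whether their own devices are on Round Robin before going that far, a rough sketch with ESXi 5's esxcli (the naa ID below is just a placeholder for your own device):

# show every device's current Path Selection Policy; look for VMW_PSP_RR
esxcli storage nmp device list
# switch one device from Round Robin back to Fixed (placeholder device ID)
esxcli storage nmp device set --device naa.60012340000000000000000000000001 --psp VMW_PSP_FIXED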

Naydonov
Contributor

Thanks mate

This is not looking good. I haven't logged my support call yet; I will probably do it today.

May I ask which hardware vendors you use in your environment?

We use 3 x Cisco UCS C220 servers and an EMC VNX 5300 array connected in redundant mode via two Cisco Nexus 5000 switches.

mathewford
Contributor

All our hosts are Dell PowerEdge servers: a mix of R905, R715, and R815 models with 24 to 48 processor cores and 128 GB to 512 GB of RAM. For storage we primarily deploy Dell MD3000i and MD3200i/MD3220i iSCSI storage arrays. We were baffled that vSphere 5 ran great for three months and then the issue started happening. What clued us in was that it started a couple of weeks after we deployed a new array with redundant controllers and also upgraded all our existing single-controller arrays to redundant controllers (we did this as we upgraded all the datastores to VMFS5). All 10 of our vSphere 5 hosts experienced the problem, while the two vSphere 4.1 hosts we had left exhibited no issues. Since we disabled the redundant pathing to the arrays, we haven't had a host crash yet.

Until VMware figures this out and fixes it, we are simply crossing our fingers that we don't have any array controller failures! Not a good situation, but better than getting a hundred angry phone calls from clients when their VM instances go down.

Naydonov
Contributor

Thank you Matthew

I will update you if I find anything helpful.

gaborus01
Contributor

Hello,

We're having a similar issue for the very first time. Our environment consists of ESXi 5 U1 + HP BL460 G7 blades + NetApp storage (FC connected).

The VMs continue to run and all VMkernel interfaces answer to ping, but the host is disconnected from vCenter and all management communication (SSL/HTTPS...) seems to be down; nothing answers.
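
For the record, the usual first-line attempt when only the management stack is wedged (assuming the local ESXi shell or console still responds) is restarting the agents, which is the same thing the DCUI's Restart Management Agents option does:

# restart the host management agent and the vCenter agent
/etc/init.d/hostd restart
/etc/init.d/vpxa restart
# or restart all management services in one go
services.sh restart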

We cannot migrate the VMs off without interruption :( and that's really bad!

Uptime is below 60 days (we just migrated).

I haven't opened a case yet (the last SR was more time-consuming than helpful).

I have not been able to figure out a reason for this yet.

If you guys have any news on this, please let us know!

Regards

S.

golddiggie
Champion

Are you seeing latency events on the hosts? Have you checked to make sure the Fibre Channel card(s) have the absolute latest drivers (from VMware), even if they're async? How about setting the queue depth to a lower number (such as 64 instead of the default)?

Setting the queue depth on Emulex and QLogic Fibre Channel HBAs:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1267
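
On ESXi 5 that change looks roughly like this for a QLogic HBA; the module name and parameter differ for Emulex (lpfc_lun_queue_depth), and a reboot is required either way:

# confirm which HBA driver module is actually loaded
esxcli system module list | grep -i qla
# cap the queue depth at 64 for the qla2xxx module
esxcli system module parameters set -m qla2xxx -p ql2xmaxqdepth=64
# verify the parameter took, then reboot the host
esxcli system module parameters list -m qla2xxx | grep ql2xmaxqdepth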

We're using EMC CLARiiONs for our SANs (CX3 and CX4 models). We've had to go in and make sure the failover mode was the same across all hosts. This requires powering down the host server in order to make the change in the storage configuration for that host (even though that setting lives on the SAN).

Something else you can look into is KB article 1016626. It might not be part of your issue, but it's at least worth checking.

mathewford
Contributor

This still occurs when you have multipathing (like Round Robin) enabled on your datastore AND the host has VMs on that datastore. The only solution we found was to disable multipathing. This just happened to us once again on a new custom-built storage array we were testing, as well as on the various EMC and Dell PowerVault arrays we already had in our environment, so it does not appear to be an issue with the array vendor; it is definitely an ESXi 5 problem. This issue did not happen on version 4 hosts.
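
To find out which datastores are exposed, you can map each VMFS datastore to its backing device and then check that device's policy (the naa ID below is a placeholder):

# map each VMFS datastore to its backing device (naa ID)
esxcli storage vmfs extent list
# check which path selection policy that device is using
esxcli storage nmp device list --device naa.60012340000000000000000000000001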
