VMware Cloud Community
gbullochdst
Contributor

ESXi hosts become unresponsive, restarting management agent does not help

Hi all,

We have been experiencing a very strange issue with some of our hosts, here's a summary of the environment

Dell R600 / 80GB RAM / 4 x Quad Core CPU / QLogic iSCSI HBAs / Intel Dual Port ET NICs

Dell PS6000 iSCSI SAN Array

We were previously running ESXi 4.0 U2 and had encountered this issue, but we recently upgraded all hosts to 4.1, including upgrading firmware on the hosts and on the SAN, and built a new vCenter server.

A host (or hosts, if it's a really bad day) will become unresponsive in vCenter and will not allow direct connection from the vSphere Client. We can ping the hosts, but the VMs experience intermittent ping drops lasting around 2 minutes every 10 minutes or so.

Our monitoring solution also reports HTTP and HTTPS socket timeouts, and restarting the management agents can take 20+ minutes and does not fix the issue.

We can't log into the host, so we are unable to vMotion the VMs; they are all essentially held hostage by the unresponsive host.

We have seen the issue clear itself after multiple hours, but we are usually forced to hard reboot the host, which is a major problem if some of our more mission-critical VMs are on the affected host.

I am planning to enable FT (Fault Tolerance) for our mission-critical VMs to minimise the impact of this issue when it occurs, but it is quite serious and I am at a loss as to the cause.

Has anyone ever seen anything similar to this? I have attached some snippets of the logs below.

0 Kudos
19 Replies
idle-jam
Immortal

Normally restarting the management agents will help, or if it's faulty hardware the host will get a PSOD (purple screen of death). I would suggest generating a support log bundle and creating a ticket with VMware support ASAP, as the scope for troubleshooting is quite wide at this point.
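
If local or remote Tech Support Mode is enabled, the bundle can also be generated straight from the console with vm-support; it writes a compressed .tgz that you can attach to the ticket:

vm-support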

0 Kudos
gbullochdst
Contributor

Thanks, I have already raised a ticket through Dell, as our vSphere support contract is through them.

We aren't getting PSODs, but the DCUI has become quite slow to respond at times while we have been experiencing this issue.

0 Kudos
afertmann
Contributor

Is that the full log?  Can you attach all the logs of the affected servers?  Also how is ESXi installed?  On local disks? RAIDED?  Boot from SAN?  USB Stick?  Embedded?  Also what version of 4.1?  Build number?

What is the specific model of Dell?  R610?

Are these VMs in an HA/DRS cluster?

How many VMs per host?

Kinds of VMs? Amount of resources used? IO?

As much information about your config would be helpful in troubleshooting.

0 Kudos
gbullochdst
Contributor

ESXi is installed on local disks in hardware RAID 1, running 4.1 U1, latest build 348481

They are original generation R900s running BIOS version 1.2.0

They are configured in a HA/DRS cluster

Approximately 15 - 20 powered-on VMs per host, the majority Windows (all versions); all hosts are under 50% utilisation, with a DRS host load standard deviation of 0.09

I have attached a larger snippet of the log for the affected host

0 Kudos
Myshtigo
Contributor

Hello, we have been having a similar issue with Dell R900s running ESXi 4.1: three R900s in a cluster with BIOS 1.2.0 that had been running fine. We updated to 4.1 over the past few months, then in the past week or two started having hosts drop out and VMs become slow or unresponsive. We also recently added a tray to our Dell PV MD3220i, and that had been looking like the problem.

I found this post after trying to restart the management agents on a host tonight, and it's taking a looooong time. I'm pretty sure just this one host is the problem, perhaps a hardware fault that is manifesting as an overall issue. I was thinking this machine could be the problem, which is why I was looking to move VMs off it; I do remember this one host causing and being the focus of issues/troubleshooting last week, when it needed to be rebooted 3 or 4 times.

Now I'm having the same problem. The VMs on it show as (disconnected) and I cannot connect to the host, so I'm waiting on the management agent restart through the console.

I just checked and the host is now reachable (the management agent restart took about 30 minutes), so I'm adding it to another vCenter to see if it is any better. I'm also migrating some 'prd' level VMs off manually so we can restart/troubleshoot in the AM.

Hope this helps. If I do find something R900/Dell-specific I will reply.

0 Kudos
Dave_Mishchenko
Immortal

If you have a vmkernel port configured for other purposes such as storage, are you able to connect to the host through that IP address?
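
As a quick check (the addresses here are only examples): if the storage vmkernel port were 10.0.50.11 and an array target 10.0.50.20, you could ping 10.0.50.11 from a machine on that VLAN or point the vSphere Client at it, and from the host's Tech Support Mode test the outbound path with:

vmkping 10.0.50.20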

0 Kudos
Myshtigo
Contributor

Interesting question; I have not tried that. Our iSCSI networks are on private VLANs, and our management vCenter, etc. are not able to reach those. As a test I could connect and see if it works.

Is this a troubleshooting step or were you asking about a work-around?

0 Kudos
Dave_Mishchenko
Immortal

Just wondering about the extent of the host issue. I had a recent problem with ESXi Embedded taking out vmnic0 on some IBM hosts. The odd part was that the vSwitch thought vmnic0 was still OK when it was definitely not working, so failover didn't happen in the vSwitch. Network connectivity to another vmkernel port on the storage network was fine.

0 Kudos
gbullochdst
Contributor

Myshtigo,

Would you happen to be using QLogic iSCSI HBAs in your environment?

We have been seeing connection resets on our iSCSI array between hosts and LUNs.

From this thread, it appears others are experiencing similar issues;

http://solutions.qlogic.com/KanisaSupportSite/forum/viewthread.do?command=FRShowThread&threadId=Post...

0 Kudos
bulletprooffool
Champion

The first things you should always do when you can't connect to your host, or there is an issue with the connection between vCenter and ESXi/ESX, are:

1) Check the DNS configuration on the ESXi server and on the DNS server that ESX points to, making sure you have the appropriate entries.
2) Check the host files etc. in /etc/hosts, /etc/resolv.conf, /etc/sysconfig/network and /etc/vmware/esx.conf.
3) Try to disconnect and reconnect your ESXi host from your vCenter inventory; this uninstalls and reinstalls the vCenter agent. Use the FQDN first, then the IP address if the FQDN didn't work.
4) Try restarting both the vCenter management agent (vpxa) and the host management agent (hostd) on the ESX host (see the example commands below). Learn how to do this here: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100349...
5) If the above didn't do anything for you, it could be lost connectivity to a LUN, which can cause problems with ESX (less so now than in earlier versions such as ESX 2.x). Connect to the ESXi host directly with the VI Client and perform a rescan of your storage adapters and LUNs.
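
For reference, on ESXi 4.x with local or remote Tech Support Mode enabled, steps 4 and 5 can be done from the console roughly like this (the adapter name is only an example, list yours with esxcfg-scsidevs -a):

services.sh restart       # restarts hostd, vpxa and the other management agents
esxcfg-rescan vmhba33     # rescan a storage adapter for new/lost LUNs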

One day I will virtualise myself . . .
0 Kudos
NeilRHunter
Enthusiast

Just a question rather than an answer from me I'm afraid.

I was wondering if your Management Network is on its own vSwitch/NIC?

0 Kudos
gbullochdst
Contributor

Bulletproof,

We don't have network connectivity issues, but after further research I do believe your last point to be in the right area, although we cannot connect to our hosts directly either when experiencing this.

I believe the issue lies with our QLogic iSCSI HBAs; I am planning to transition one host to software iSCSI initiators and see if we still encounter the issue.

Neil,

Yes, our management network is on its own NICs/vSwitch, but it is on the same subnet, something I am trying to change.
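
For anyone trying the same switch to the software initiator, on ESX/ESXi 4.x it can be enabled from the console with something like the following (vmhba33 is only the name the software adapter typically gets and may differ); the targets and port binding then still need to be configured through the vSphere Client or esxcli:

esxcfg-swiscsi -e        # enable the software iSCSI initiator
esxcfg-swiscsi -q        # confirm it is enabled
esxcfg-rescan vmhba33    # rescan the software iSCSI adapter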

0 Kudos
bulletprooffool
Champion

Hi - did you solve this?

One day I will virtualise myself . . .
0 Kudos
Myshtigo
Contributor

gbullochdst - No, we are using software iSCSI on the Intel PCI cards / onboard Broadcom.

It has been two days and things look better in my environment after removing the host from vCenter and restarting the management agents (under Troubleshooting Options in the DCUI).

0 Kudos
gbullochdst
Contributor

Bulletproof,

Not yet, I have installed the driver from the QLogic website on two hosts and tweaked the VAAI settings on one host to test their stability.

Once I am happy with their stability I will implement on all hosts, hoping this fixes the issue.

FYI, as per the thread previously linked, to disable VAAI on a host, run the following commands:

esxcfg-advcfg -s 0 /DataMover/HardwareAcceleratedMove
esxcfg-advcfg -s 0 /DataMover/HardwareAcceleratedInit
esxcfg-advcfg -s 0 /VMFS3/HardwareAcceleratedLocking
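
To confirm the change took effect you can read the values back (0 means the primitive is disabled):

esxcfg-advcfg -g /DataMover/HardwareAcceleratedMove
esxcfg-advcfg -g /DataMover/HardwareAcceleratedInit
esxcfg-advcfg -g /VMFS3/HardwareAcceleratedLocking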

0 Kudos
simonjgreen
Contributor

We're having exactly the same issue, and have been for the last few months. We've got a massive case ongoing with Dell and VMware to try to resolve it.

We're running on:

  • Dell R610 Servers (w/Intel NICs)
  • Dell MD3200i SAN

We've managed to narrow the exact fault down quite a lot:

  1. We'll get reports from monitoring that any server which uses its disk has become unresponsive
  2. Running `esxtop` in datastore mode on all the hosts, we'll see one host showing extremely high (~10000 - 40000) GAVG round-trip time, and then 0 GAVG, alternating every couple of seconds (see the batch-capture sketch below).
  3. While the problem host's GAVG is high, all other hosts get 0 CMD/s, and when the problem host's GAVG is 0, all other hosts get a good CMD/s rate.
  4. The problem host will be totally unresponsive to any ESX control commands from the CLI or via the vSphere Client.
  5. As soon as we power cycle the host using the DRAC, all other hosts instantly perform fine again and the problem disappears.
  6. HA triggers and the guests from that host start on the remaining hosts.

It can be anything from a few hours to 2 weeks between occurrences.
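
For anyone wanting to capture the same evidence over time, a rough way to record the GAVG spikes (the datastore path and sample counts below are only examples) is to run esxtop in batch mode from Tech Support Mode and graph the columns afterwards in perfmon or Excel:

esxtop -b -d 5 -n 720 > /vmfs/volumes/datastore1/esxtop-host1.csv    # 5-second samples for one hour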

I'd like to team up with anyone having the same or a similar issue and get everyone pestering Dell/VMware. There is obviously something up here!

0 Kudos
caledunn
Contributor

I've had a similar issue with ESXi hosts on HP blades and NetApp storage. It started with one host and then over two weeks hit every host. I would get an alert about a host losing connection to vCenter, and no matter what I did I could not get the host to reconnect. If I tried to restart the management services in the DCUI it would hang, and I couldn't connect using the VI Client either. All the VMs on the host were up and functioning. I was able to SSH to the host and look around; I had plenty of space and memory. Each host acted slightly differently, as if there were some kind of resource issue. A reboot was the only way I could get everything working again.

My install was pretty vanilla: installed on local drives on the blade, and I let VMware set up the drive. I noticed that in the latest ESXi 4.1 patch they mention that the ESXi host could intermittently lose connection with vCenter Server due to socket exhaustion. They talk about malicious network traffic causing it, but I wonder if a network scan could cause this type of issue. I've applied the patch to my lab and haven't had issues yet, but my ESXi hosts were up for a few months before I started to have issues, so I'm not sure if there is still a problem or not.

0 Kudos
typsupp
Contributor

Hi,

I am having the same issue too.

2 x HP DL380 G7, 2 x CPU, 36 GB RAM, 2 x 146 GB RAID local storage

Running VMware ESXi 4.1 Update 1, build 381591.

iSCSI connections using 4 NICs in each host to 2 x 3Com switches

Connected to 2 x iSCSI DataCore SANs.

When we load 10 VMs on ESX01 (9 running from SAN storage and 1 running from local storage), all the VMs on the SAN will become unresponsive. The local VM is still OK. All storage paths are active and the SAN is listing active connections. When I browse the SAN datastore at the time of the issue it appears empty. There are 2 VMs on ESX02 running from the SAN and they are online at the time the problem occurs.

The only way to get ESX01 back online is to power it off. VMware HA then kicks in and all VMs migrate to ESX02. If I leave the VMs on ESX02 the issue will occur there within 24 hours.

It seems to happen when there is a lot of I/O occurring (data transfers or Veeam backups).

I have logged a call with VMware and they know nothing about this issue. They recommended that I bind the physical NICs to the software iSCSI initiator.

I did this last night and am currently monitoring.
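
For anyone else doing the same, on ESXi 4.1 the binding is done per vmknic with esxcli; vmhba33, vmk1 and vmk2 below are only example names for the software iSCSI adapter and the iSCSI vmkernel ports:

esxcli swiscsi nic add -n vmk1 -d vmhba33
esxcli swiscsi nic add -n vmk2 -d vmhba33
esxcli swiscsi nic list -d vmhba33    # verify the bound vmknics
esxcfg-rescan vmhba33                 # rescan after binding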

I was wondering, have VMware/Dell come back to you with any fix?

0 Kudos
gbullochdst
Contributor

Hi typsupp,

We haven't seen this issue for about a month now. I disabled VAAI on two of our five hosts (the two that most frequently had the issue originally described). The only other thing I did was disable one of the iSCSI initiators on a host that was continually disconnecting from and reconnecting to targets, so that host now has only three connections to the iSCSI SAN.

I am unfortunately unsure whether the issue is fixed, as we have had quiet periods before, but I suggest you check the logs on your iSCSI array and see if any of the initiators on your hosts are continually disconnecting and reconnecting. You may also want to disable VAAI on your hosts.
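
On the host side it is also worth looking for iSCSI session drops in the logs; on ESXi 4.1 the combined syslog is normally /var/log/messages (on classic ESX the equivalent is /var/log/vmkernel), so something like:

grep -i iscsi /var/log/messages | tail -50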

0 Kudos