VMware Cloud Community
zilog_jones
Contributor

ESXi 4.1 U1 host becomes unresponsive

I'm having this problem with an ESXi 4.1 host on an almost weekly basis - the host suddenly becomes unresponsive and the guests appear to be completely dead. I can still navigate through the vSphere Client but cannot perform any tasks, and the VM console screens are blank. The VMs appear to be running (according to their status in vSphere Client), but I cannot ping them or connect to them in any way. Rebooting from the DCUI does nothing - I end up having to power cycle the server to get it working again.

I have looked in /scratch/log/messages on the host and do not see anything obvious. Here are the last few minutes of the log before the most recent hang:

Oct 24 14:22:48 Hostd: [2011-10-24 14:22:48.055 343F0B90 error 'App'] Failed to read header on stream TCP(local=127.0.0.1:51337, peer=127.0.0.1:0): N7Vmacore15SystemExceptionE(Connection reset by p
Oct 24 14:22:48 Hostd: [2011-10-24 14:22:48.068 33F2EB90 verbose 'Proxysvc Req01002'] New proxy client SSL(TCP(local=193.120.91.121:60914, peer=193.120.91.2:443))                                  
Oct 24 14:22:58 nssquery: Group lookup failed for 'S3\ESX Admins'                                                                                                                                   
Oct 24 14:22:58 Hostd: [2011-10-24 14:22:58.866 33F2EB90 warning 'UserDirectory'] Group lookup failed for 'S3\ESX Admins'                                                                           
Oct 24 14:23:27 Hostd: [2011-10-24 14:23:27.863 33F2EB90 verbose 'Cimsvc'] Ticket issued for CIMOM version 1.0, user root                                                                           
Oct 24 14:23:58 nssquery: Group lookup failed for 'S3\ESX Admins'                                                                                                                                   
Oct 24 14:23:58 Hostd: [2011-10-24 14:23:58.923 33F6FB90 warning 'UserDirectory'] Group lookup failed for 'S3\ESX Admins'                                                                           
Oct 24 14:24:58 nssquery: Group lookup failed for 'S3\ESX Admins'                                                                                                                                   
Oct 24 14:24:58 Hostd: [2011-10-24 14:24:58.983 342DBB90 warning 'UserDirectory'] Group lookup failed for 'S3\ESX Admins'                                                                           
Oct 24 14:24:59 Hostd: [2011-10-24 14:24:59.304 FFEC5E80 verbose 'Cimsvc'] Ticket issued for CIMOM version 1.0, user root                                                                           
Oct 24 14:25:59 nssquery: Group lookup failed for 'S3\ESX Admins'                                                                                                                                   
Oct 24 14:25:59 Hostd: [2011-10-24 14:25:59.038 33F2EB90 warning 'UserDirectory'] Group lookup failed for 'S3\ESX Admins'                                                                           
Oct 24 14:26:29 Hostd: [2011-10-24 14:26:29.926 33EEDB90 verbose 'Cimsvc'] Ticket issued for CIMOM version 1.0, user root                                                                           
Oct 24 14:26:59 nssquery: Group lookup failed for 'S3\ESX Admins'                                                                                                                                   
Oct 24 14:26:59 Hostd: [2011-10-24 14:26:59.092 343F0B90 warning 'UserDirectory'] Group lookup failed for 'S3\ESX Admins'                                                                           
Oct 24 14:27:13 Hostd: [2011-10-24 14:27:13.010 33F2EB90 verbose 'Proxysvc Req01003'] New proxy client TCP(local=127.0.0.1:57757, peer=127.0.0.1:80)                                                
Oct 24 14:27:13 Hostd: [2011-10-24 14:27:13.011 344B1B90 info 'Vmomi'] Activation [N5Vmomi10ActivationE:0x34708c28] : Invoke done [waitForUpdates] on [vmodl.query.PropertyCollector:ha-property-coll
Oct 24 14:27:13 Hostd: [2011-10-24 14:27:13.011 344B1B90 verbose 'Vmomi'] Arg version:                                                                                                              
Oct 24 14:27:13 Hostd: "50"                                                                                                                                                                         
Oct 24 14:27:13 Hostd: [2011-10-24 14:27:13.012 344B1B90 info 'Vmomi'] Throw vmodl.fault.RequestCanceled                                                                                            
Oct 24 14:27:13 Hostd: [2011-10-24 14:27:13.012 344B1B90 info 'Vmomi'] Result:                                                                                                                      
Oct 24 14:27:13 Hostd: (vmodl.fault.RequestCanceled) {                                                                                                                                              
Oct 24 14:27:13 Hostd:    dynamicType = <unset>,                                                                                                                                                    
Oct 24 14:27:13 Hostd:    faultCause = (vmodl.MethodFault) null,                                                                                                                                    
Oct 24 14:27:13 Hostd:    msg = "",                                                                                                                                                                 
Oct 24 14:27:13 Hostd: }                                                                                                                                                                            
Oct 24 14:27:13 Hostd: [2011-10-24 14:27:13.012 342DBB90 error 'App'] Failed to read header on stream TCP(local=127.0.0.1:62851, peer=127.0.0.1:0): N7Vmacore15SystemExceptionE(Connection reset by p
Oct 24 14:27:13 sfcb-vmware_base[5907]: LsaFindUserByName: 40008                                                                                                                                    
Oct 24 14:27:13 sfcb-vmware_base[5907]: LsaFindUserByName: 40008                                                                                                                                    
Oct 24 14:27:13 sfcb-vmware_base[5907]: LsaFindUserByName: 40008                                                                                                                                    
Oct 24 14:27:13 sfcb-vmware_base[5907]: LsaFindUserByName: 40008                                                                                                                                    
Oct 24 14:27:13 sfcb-vmware_base[5907]: LsaFindUserByName: 40008                                                                                                                                    
Oct 24 14:27:13 sfcb-vmware_base[5907]: LsaFindUserByName: 40008                                                                                                                                    
Oct 24 14:27:13 sfcb-vmware_base[5907]: LsaFindUserByName: 40008                                                                                                                                    
Oct 24 14:27:13 sfcb-vmware_base[5907]: LsaFindUserByName: 40008                                                                                                                                    
Oct 24 14:27:13 sfcb-vmware_base[5907]: LsaFindUserByName: 40008                                                                                                                                    
Oct 24 14:27:13 sfcb-vmware_base[5907]: LsaFindUserByName: 40008                                                                                                                                    
Oct 24 14:27:46 Hostd: [2011-10-24 14:27:46.929 342DBB90 verbose 'DvsManager'] PersistAllDvsInfo called                                                                                             
Oct 24 14:27:59 nssquery: Group lookup failed for 'S3\ESX Admins'                                                                                                                                   
Oct 24 14:27:59 Hostd: [2011-10-24 14:27:59.148 3436DB90 warning 'UserDirectory'] Group lookup failed for 'S3\ESX Admins'                                                                           
Oct 24 14:28:00 Hostd: [2011-10-24 14:28:00.549 33EEDB90 verbose 'Cimsvc'] Ticket issued for CIMOM version 1.0, user root                                                                           
Oct 24 14:28:59 nssquery: Group lookup failed for 'S3\ESX Admins'                                                                                                                                   
Oct 24 14:28:59 Hostd: [2011-10-24 14:28:59.203 33F2EB90 warning 'UserDirectory'] Group lookup failed for 'S3\ESX Admins'                                                                           
Oct 24 14:29:31 Hostd: [2011-10-24 14:29:31.171 33EEDB90 verbose 'Cimsvc'] Ticket issued for CIMOM version 1.0, user root                                                                           
Oct 24 14:29:59 nssquery: Group lookup failed for 'S3\ESX Admins'                                                                                                                                   
Oct 24 14:29:59 Hostd: [2011-10-24 14:29:59.260 33F2EB90 warning 'UserDirectory'] Group lookup failed for 'S3\ESX Admins'                                                                           
Oct 24 14:30:59 nssquery: Group lookup failed for 'S3\ESX Admins'                                                                                                                                   
Oct 24 14:30:59 Hostd: [2011-10-24 14:30:59.316 342DBB90 warning 'UserDirectory'] Group lookup failed for 'S3\ESX Admins'                                                                           
Oct 24 14:31:02 Hostd: [2011-10-24 14:31:02.675 343F0B90 verbose 'Cimsvc'] Ticket issued for CIMOM version 1.0, user root                                                                           
Oct 24 14:31:59 nssquery: Group lookup failed for 'S3\ESX Admins'                                                                                                                                   
Oct 24 14:31:59 Hostd: [2011-10-24 14:31:59.374 34431B90 warning 'UserDirectory'] Group lookup failed for 'S3\ESX Admins'

Then there is nothing and at 14:43:35 I rebooted the machine. I don't really understand most of the above log, but none of it looks critical to me. I can't see any obvious errors on the VMs either, and they're not under any high load.
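
A minimal sketch of how a log excerpt like the one above could be summarized: it tallies entries per source and hostd severity, then measures the silent gap between the last entry and the 14:43:35 reboot. The `summarize` function, the `SAMPLE` excerpt, and the assumed year (2011, taken from the hostd timestamps) are illustrative and not part of any VMware tooling.

```python
import re
from collections import Counter
from datetime import datetime

# Syslog prefix, e.g. "Oct 24 14:31:59 Hostd: ..." or "... sfcb-vmware_base[5907]: ..."
PREFIX = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) ([^:\[\s]+)(?:\[\d+\])?: (.*)$")
# Bracketed hostd header, e.g. "[2011-10-24 14:31:59.374 34431B90 warning 'UserDirectory']"
HOSTD = re.compile(r"^\[\S+ \S+ \S+ (\w+) '([^']*)'\]")

# A few lines copied from the end of the excerpt above (illustrative sample).
SAMPLE = r"""
Oct 24 14:30:59 nssquery: Group lookup failed for 'S3\ESX Admins'
Oct 24 14:30:59 Hostd: [2011-10-24 14:30:59.316 342DBB90 warning 'UserDirectory'] Group lookup failed for 'S3\ESX Admins'
Oct 24 14:31:02 Hostd: [2011-10-24 14:31:02.675 343F0B90 verbose 'Cimsvc'] Ticket issued for CIMOM version 1.0, user root
Oct 24 14:31:59 nssquery: Group lookup failed for 'S3\ESX Admins'
Oct 24 14:31:59 Hostd: [2011-10-24 14:31:59.374 34431B90 warning 'UserDirectory'] Group lookup failed for 'S3\ESX Admins'
""".splitlines()


def summarize(lines, year=2011):
    """Count entries per (source, severity) and return the last timestamp seen."""
    counts = Counter()
    last_seen = None
    for line in lines:
        m = PREFIX.match(line.strip())
        if not m:
            continue  # skip blank or unrecognized lines
        stamp, source, message = m.groups()
        # The syslog prefix carries no year, so one is assumed.
        last_seen = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        h = HOSTD.match(message)
        level = h.group(1) if h else "-"  # "-" when there is no hostd header
        counts[(source, level)] += 1
    return counts, last_seen


counts, last_seen = summarize(SAMPLE)
for (source, level), n in sorted(counts.items()):
    print(f"{source:10s} {level:8s} {n}")

# Time between the last logged entry and the manual reboot at 14:43:35.
reboot = datetime(2011, 10, 24, 14, 43, 35)
print("silent gap in seconds:", (reboot - last_seen).total_seconds())
```

Run over the full excerpt, a tally like this makes it easier to see that only two entries are logged at error level and that logging simply stops well before the reboot.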

Host hardware:

  • Dell PowerEdge R210
  • Xeon X3440 (quad core + HT)
  • 8 GB RAM
  • Dell SAS 6/iR RAID controller
  • 2x 250 GB SATA disks in RAID 1 array
  • Broadcom BCM5716 onboard NIC (NIC teaming set up in ESXi)
  • BIOS 1.8.2, iDRAC 6 Express firmware 1.80, Lifecycle Controller firmware 1.4.0.445, RAID controller firmware up-to-date

VMs:

  • RHEL 5.7 desktop (64-bit)
  • CentOS 5.7 (64-bit)
  • Windows XP Professional SP3 (32-bit) - this is only used on occasions and was not running the last time the host failed

I have an iSCSI target set up on this (there was a Windows 2008 R2 domain controller on this too but I moved it to another host due to unreliability with this one) but it was failing before this was configured. I have installed patches on the host so it is currently running 4.1.0 Build 433742. Guests are also reasonably up to date. However this problem has been happening for a few months, even before I upgraded to U1.

On one occasion when the system failed, I noticed upon restarting that ESXi was reporting (in Configuration -> Health Status) that one of the disks in the array was rebuilding. I have not noticed this happen again, and the rebuild was successful.

I ran Dell Diagnostics and MemTest (one pass) on the machine and everything seemed ok. There are no errors in the iDRAC event logs.

Any ideas what could be wrong?

15 Replies
Jason_Stein
Contributor

I'm having an identical issue with my HP ProLiant DL380 G7. I see no issues with the hardware (Configuration -> Health Status). The physical host has needed a hard reboot (not a software reboot) every weekend for the last few weekends.

JHakimi
Enthusiast

Is all your storage connected via iSCSI? Losing your storage would yield the issues you are experiencing. I'll wait for more info before I continue to ramble on.

Jason_Stein
Contributor

No. Nothing too strange. All disks are internal to the physical host.

zilog_jones
Contributor

I am currently not using the iSCSI datastore for any VMs on that host, and this problem was happening before I created that datastore. I have two other hosts connected to the same iSCSI target with no issues; however, they are different hardware (PowerEdge 1950 and 2850).

This happened again this morning. Some things I forgot to mention in my original post:

  • TSM SSH console still works in this broken state, but it fails to reboot from that
  • I can still ping the host (but not the guests)

@Jason: What kind of disk controller does your server have? The SAS 6/iR is probably some re-branded LSI controller so it could be similar to what you have.

zilog_jones
Contributor

OK, something different has happened this morning. I noticed the following events:

Successfully restored access to volume 4cf7b225-1188abe8-668f-b8ac6f92e0c6 (datastore1) following
connectivity issues.
info
27/10/2011 08:34:02

Lost access to volume 4cf7b225-1188abe8-668f-b8ac6f92e0c6 (datastore1) due to connectivity issues. Recovery
attempt is in progress and outcome will be reported shortly.
info
27/10/2011 08:33:44

Successfully restored access to volume 4cf7b225-1188abe8-668f-b8ac6f92e0c6 (datastore1) following
connectivity issues.
info
27/10/2011 08:19:46

Lost access to volume 4cf7b225-1188abe8-668f-b8ac6f92e0c6 (datastore1) due to connectivity issues. Recovery
attempt is in progress and outcome will be reported shortly.
info
27/10/2011 08:19:16

Successfully restored access to volume 4cf7b225-1188abe8-668f-b8ac6f92e0c6 (datastore1) following
connectivity issues.
info
27/10/2011 00:48:03

Lost access to volume 4cf7b225-1188abe8-668f-b8ac6f92e0c6 (datastore1) due to connectivity issues. Recovery
attempt is in progress and outcome will be reported shortly.
info
27/10/2011 00:47:33

"datastore1" is the local SATA RAID 1 array. In Health Status now (times above are GMT+1) it is saying the array is degraded and disk 1 is in a "rebuild" state - something I have observed before. The VMs are currently responding ok, I don't know how they were acting during the above times as I was not here (this machine would have not been in use by anyone).

Is it reasonable to suspect this is an HDD or disk controller issue?
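
As a rough illustration, the lost/restored pairs in the event list above can be turned into outage durations. This is only a sketch: the `EVENTS` list is transcribed by hand from those events, and `outages` is a hypothetical helper, not something the vSphere Client provides.

```python
from datetime import datetime

# (timestamp, kind) pairs transcribed from the events listed above, oldest first.
EVENTS = [
    ("27/10/2011 00:47:33", "lost"),
    ("27/10/2011 00:48:03", "restored"),
    ("27/10/2011 08:19:16", "lost"),
    ("27/10/2011 08:19:46", "restored"),
    ("27/10/2011 08:33:44", "lost"),
    ("27/10/2011 08:34:02", "restored"),
]


def outages(events):
    """Pair each 'lost' event with the next 'restored' one; return (start, seconds)."""
    result = []
    lost_at = None
    for stamp, kind in events:
        when = datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
        if kind == "lost" and lost_at is None:
            lost_at = when
        elif kind == "restored" and lost_at is not None:
            result.append((lost_at, (when - lost_at).total_seconds()))
            lost_at = None
    return result


durations = outages(EVENTS)
for start, seconds in durations:
    print(f"{start:%H:%M:%S}  datastore1 unreachable for {seconds:.0f} s")
```

For these three events the outages come out at 30, 30 and 18 seconds, which suggests short, repeated interruptions rather than one long loss of the volume.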

JHakimi
Enthusiast

Yes. Also, do you have battery-backed cache on your controller? I would get Dell to run their diagnostics and get replacement parts out ASAP. Also, find out about the battery-backed cache for the RAID controller; it's important.

zilog_jones
Contributor

I believe it does not have a battery-backed cache as it is a low-end controller. I'll see what Dell say about it.

KamilAzmer
Hot Shot

It would be better to run fsck on datastore1 and see whether there are any errors. I'd also advise running diagnostics on the RAID controller and the hard drives, because a failing hard drive or RAID controller could affect your host.

@ -- visit my blog at http://www.azmer.my -- @ virtue your mind @ KamilAzmer
zilog_jones
Contributor

How do I run fsck on a datastore?

I ran full Dell Diagnostics on the machine before but it reported no errors, which is why I wasn't originally so sure what the problem could be. I did have a problem with a disk controller on another R210 (running Fedora) a while back but Dell Diagnostics did report a specific error and they replaced the motherboard.

JHakimi
Enthusiast

Is this box considered production? If you can put it in maintenance mode and run the Dell diagnostics, you should be fine. In fact, Dell support will require you to run them, as they most likely aren't going to support you at the VMware level. You would need to contact VMware support for that, and they can bring in the Dell team.

Honestly,  I would try a couple things.

1) Run ESXi from a USB key, just to see whether having the OS on a separate disk helps with the issues.

2) Check with Dell regarding any firmware updates for the controllers and the SATA disks themselves. There might be a bug fix.

3) Run the Dell-recommended diagnostics you previously ran. This would be from their boot disk, so it should not harm your disks.

And don't forget to make sure you have a backup of any important data before you perform any of the steps above!

Regards,

Justin

zilog_jones
Contributor

I've moved all the VMs off this machine now so I'll try diagnostics again and some other things. As said I upgraded the controller firmware (think it may have been the same version anyway), but I didn't think of HDD firmware - I'll try that too, thanks.

bladeslap
Contributor

Did you ever happen to find the problem? I just started having the exact same problem.

I'm also using an LSI SAS controller (Supermicro) with internal storage. It happened once a few months ago, then twice in the last week. I keep getting the error message that the system is unable to connect to the datastore, then 30 seconds later it connects, then disconnects, and so on until I reboot.

The interesting thing is that during POST on reboot (thank you Supermicro for the remote IPMI view, which lets me see the sensor statuses and the console redirection from POST to power on), it reported an unrecoverable ECC DIMM error. I then verified that it was in the log as well.

As a result, I've replaced the memory in that entire bank (all three modules) and the CPU.

I'm going to flash the SAS controller as well.

That's all I can do for now I suppose -

zilog_jones
Contributor

Hi, I never found out what was wrong with my server. However, I upgraded to ESXi 5.0 and the problem disappeared :)

bladeslap
Contributor

Thanks so much for the quick reply ... I appreciate it!

Just a few quick questions:

1. What kind of motherboard(s) were you using?

2. Is the upgrade pretty seamless? We have the 5.0 license as well ... I understand the VMDK structure has not changed, so there will be no need to migrate anything ...

3. How long did the upgrade process take?

4. Does it seem more stable overall than 4.1?

Thanks again -

zilog_jones
Contributor

Hi,

  1. I'm running a Dell PowerEdge R210
  2. I don't remember having much trouble upgrading. You don't have to upgrade anything, but the VMs and VMFS can be upgraded easily enough.
  3. I can't remember; it was a while ago now, but upgrading using the CD didn't take too long, maybe 10 minutes or so.
  4. I haven't seen any stability issues so far.