Solved: Re: ESX host reboots randomly

vuzzini · ‎02-22-2011

Hi Friends,

Two ESX hosts, with different hardware and different storage arrays, reboot randomly about once or twice a week.

Please see the details below:

Cluster 1
=======

ESX 4.0 261974 build

Server Model: Dell PowerEdge R805

Processor: Quad-core AMD Opteron (tm) Processor 2376

Storage Model: Local Storage + NFS mount

System reboots once or twice a week

Feb 21 14:20:01 cfdresx1 crond[8037]: pam_unix(system-auth-generic:session): session closed for user root
Feb 21 14:22:55 cfdresx1 login: pam_unix(system-auth-generic:session): session closed for user root
Feb 21 14:23:16 cfdresx1 sshd[2518]: Received signal 15; terminating.
Feb 21 14:23:17 cfdresx1 sshd[28064]: pam_unix(system-auth-generic:session): session closed for user admin
Feb 21 14:23:17 cfdresx1 sshd[28141]: pam_unix(system-auth-generic:session): session closed for user admin
Feb 21 14:23:17 cfdresx1 sshd[29326]: pam_unix(system-auth-generic:session): session closed for user admin
Feb 21 14:23:18 cfdresx1 sshd[7983]: pam_unix(system-auth-generic:session): session closed for user admin
Feb 21 14:23:18 cfdresx1 su: pam_unix(system-auth-generic:session): session closed for user root
Feb 21 14:23:18 cfdresx1 last message repeated 3 times
Feb 21 14:23:22 cfdresx1 sshd[14542]: pam_unix(system-auth-generic:session): session closed for user admin
Feb 21 14:23:22 cfdresx1 su: pam_unix(system-auth-generic:session): session closed for user root
Feb 21 14:27:20 cfdresx1 sshd[2464]: Server listening on 0.0.0.0 port 22.
Feb 21 14:27:58 cfdresx1 /usr/lib/vmware/bin/vmware-hostd[2700]: pam_per_user: create_subrequest_handle(): doing map lookup for user "root"
Feb 21 14:27:58 cfdresx1 /usr/lib/vmware/bin/vmware-hostd[2700]: pam_per_user: create_subrequest_handle(): creating new subrequest (user="root", service="system-auth-generic")
Feb 21 14:27:58 cfdresx1 /usr/lib/vmware/bin/vmware-hostd[2700]: pam_unix(system-auth-generic:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost= user=root
Feb 21 14:28:25 cfdresx1 /usr/lib/vmware/bin/vmware-hostd[2700]: pam_per_user: create_subrequest_handle(): doing map lookup for user "vpxuser"
Feb 21 14:28:25 cfdresx1 /usr/lib/vmware/bin/vmware-hostd[2700]: pam_per_user: create_subrequest_handle(): creating new subrequest (user="vpxuser", service="system-auth-local")
Feb 21 14:30:01 cfdresx1 crond[6866]: pam_per_user: create_subrequest_handle(): doing map lookup for user "root"
Feb 21 14:30:01 cfdresx1 crond[6866]: pam_per_user: create_subrequest_handle(): creating new subrequest (user="root", service="system-auth-generic")
Feb 21 14:30:01 cfdresx1 crond[6866]: pam_unix(system-auth-generic:session): session opened for user root by (uid=0)

Cluster 2
=======

ESX 4.0 261974 build

Server Model: HP ProLiant DL380 G5

Storage Model: EMC FAS 2020

Processor: Intel Xeon E5310

Fiber Channel drive (Smart Array P800)

System reboots once or twice a week

Feb 17 13:46:49 cfesx02 vmkernel: 3:06:29:42.934 cpu1:6071)VSCSI: 6025: handle 8245(vscsi4:0):Destroying Device for world 6072 (pendCom 0)
Feb 17 13:46:49 cfesx02 vmkernel: 3:06:29:43.057 cpu1:6071)DevFS: 2370: Unable to find device: 7f20c802-NP-5-delta.vmdk
Feb 17 13:46:49 cfesx02 vmkernel: 3:06:29:43.206 cpu1:6071)VSCSI: 6025: handle 8246(vscsi1:0):Destroying Device for world 6072 (pendCom 0)
Feb 17 13:46:50 cfesx02 vmkernel: 3:06:29:44.356 cpu4:6088)VMotionSend: 2921: 1297971987014193 S: Sent all modified pages to destination (network bandwidth ~115.203 MB/s)
Feb 17 13:47:14 cfesx02 vmkernel: 3:06:30:08.548 cpu5:4137)WARNING: NFSLock: 2036: disk is being locked by other consumer
Feb 17 13:47:14 cfesx02 vmkernel: 3:06:30:08.548 cpu5:4137)NFSLock: 2677: failed to get lock on file NP-5-36a6887b.vswp 0x410003234410 on 192.168.48.32 (192.168.48.32): Busy
Feb 17 15:02:28 cfesx02 vmkernel: TSC: 0 cpu0:0)Init: 418: cpu 0: early measured tsc speed 2300092987 Hz
Feb 17 15:02:28 cfesx02 vmkernel: TSC: 10276 cpu0:0)Init: 419: vmkLoadEntry = $[0x390ab9a0]
Feb 17 15:02:28 cfesx02 vmkernel: TSC: 18114 cpu0:0)Cpu: 346: id1.version 100f42
Feb 17 15:02:28 cfesx02 vmkernel: TSC: 25157 cpu0:0)CPUAMD: 214: Detecting xapic on AMD_K8:tcr = 0x4fc820
Feb 17 15:02:28 cfesx02 vmkernel: TSC: 30887 cpu0:0)CPUAMD: 315: effective family = 16
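For anyone reading excerpts like the ones above, the reboot window shows up as a jump in the syslog timestamps (here 13:47:14 to 15:02:28, followed by vmkernel init messages). Below is a rough Python sketch that scans log lines for such jumps. The function name `find_gaps`, the 3-minute threshold, and the hard-coded year are assumptions made for illustration (syslog lines carry no year); it is not a VMware tool, just a quick way to narrow down when the host went quiet.

```python
from datetime import datetime, timedelta

def find_gaps(lines, threshold=timedelta(minutes=3), year=2011):
    """Return (previous_time, next_time) pairs where consecutive
    syslog lines are further apart than `threshold`.

    `threshold` and `year` are assumptions; tune them for your logs.
    """
    gaps = []
    prev = None
    for line in lines:
        # Classic syslog lines start with "Mon DD HH:MM:SS" (15 characters).
        try:
            ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a leading timestamp
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts))
        prev = ts
    return gaps

# Two lines taken from the Cluster 2 excerpt above.
sample = [
    "Feb 17 13:47:14 cfesx02 vmkernel: 3:06:30:08.548 cpu5:4137)WARNING: NFSLock: 2036: disk is being locked by other consumer",
    "Feb 17 15:02:28 cfesx02 vmkernel: TSC: 0 cpu0:0)Init: 418: cpu 0: early measured tsc speed 2300092987 Hz",
]

for before, after in find_gaps(sample):
    print(f"no log entries between {before} and {after} ({after - before})")
```

A gap only tells you when logging stopped and resumed. Whether the host was reset, lost power, or hit a purple screen still has to come from the hardware logs or a support dump.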

Could anyone please let me know why the reboots are happening?

If you found this or any other answer useful please consider the use of the Helpful or Correct buttons to award points. Sandeep Vuzzini Sr. DevOps Engineer

sflanders · ‎02-22-2011

The logs do not show much information. You say cluster; are the hosts configured as clusters in a vCenter Server instance? If so, how many ESX hosts are in each cluster, and how many are experiencing issues? Do you have HA configured?

Hope this helps! === If you find this information useful, please award points for "correct" or "helpful". ===

a_p_ · ‎02-24-2011

Moved to the VMTN forum for VMware ESX.

André

Piggy · ‎02-24-2011

I had a host that rebooted occasionally. It was one of two identical whitebox systems I built at the same time, identical in everything. On a lark, while trying to figure it out, I updated the BIOS. Although both ran identical BIOS versions, this solved the problem. What I don't recall is whether there was a BIOS update from ASUS or I just re-applied the version they were already running to both, but they have not crashed in over a year.

I know updating BIOSes is not something we usually think of, and vendors consider it bad practice, but I tend to check for and apply firmware updates for everything if I'm updating software and notice a newer firmware release is available. I think vendors are afraid to recommend it because applying firmware updates always carries a risk of turning equipment into toasters.

Make sure you have backups and a recovery plan. Take pictures of all firmware setting pages with your iPhone or whatever. Update firmware. Force to defaults, save, reboot and re-apply settings if you're brave enough. I don't usually default/re-apply settings unless I don't trust the vendor in the first place in which case I probably should have spent my money elsewhere. :smileycry:

Good luck.

cody_bunch · ‎02-26-2011

Can you also drop in the hostd logs and/or the information provided from a support dump?

My suspicion is you may have something hardware related going on. Have you checked the hardware section of the VI client or of the HP tools?

-Cody Bunch http://professionalvmware.com
