Hi Friends,
Two ESX hosts, with different hardware and different storage arrays, reboot randomly about once a week.
Please see the details below:
Cluster 1
=======
ESX 4.0 261974 build
Server Model: Dell PowerEdge R805
Processor: Quad-core AMD Opteron (tm) Processor 2376
Storage Model: Local Storage + NFS mount
System reboots once or twice a week
Feb 21 14:20:01 cfdresx1 crond[8037]: pam_unix(system-auth-generic:session): session closed for user root
Feb 21 14:22:55 cfdresx1 login: pam_unix(system-auth-generic:session): session closed for user root
Feb 21 14:23:16 cfdresx1 sshd[2518]: Received signal 15; terminating.
Feb 21 14:23:17 cfdresx1 sshd[28064]: pam_unix(system-auth-generic:session): session closed for user admin
Feb 21 14:23:17 cfdresx1 sshd[28141]: pam_unix(system-auth-generic:session): session closed for user admin
Feb 21 14:23:17 cfdresx1 sshd[29326]: pam_unix(system-auth-generic:session): session closed for user admin
Feb 21 14:23:18 cfdresx1 sshd[7983]: pam_unix(system-auth-generic:session): session closed for user admin
Feb 21 14:23:18 cfdresx1 su: pam_unix(system-auth-generic:session): session closed for user root
Feb 21 14:23:18 cfdresx1 last message repeated 3 times
Feb 21 14:23:22 cfdresx1 sshd[14542]: pam_unix(system-auth-generic:session): session closed for user admin
Feb 21 14:23:22 cfdresx1 su: pam_unix(system-auth-generic:session): session closed for user root
Feb 21 14:27:20 cfdresx1 sshd[2464]: Server listening on 0.0.0.0 port 22.
Feb 21 14:27:58 cfdresx1 /usr/lib/vmware/bin/vmware-hostd[2700]: pam_per_user: create_subrequest_handle(): doing map lookup for user "root"
Feb 21 14:27:58 cfdresx1 /usr/lib/vmware/bin/vmware-hostd[2700]: pam_per_user: create_subrequest_handle(): creating new subrequest (user="root", service="system-auth-generic")
Feb 21 14:27:58 cfdresx1 /usr/lib/vmware/bin/vmware-hostd[2700]: pam_unix(system-auth-generic:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost= user=root
Feb 21 14:28:25 cfdresx1 /usr/lib/vmware/bin/vmware-hostd[2700]: pam_per_user: create_subrequest_handle(): doing map lookup for user "vpxuser"
Feb 21 14:28:25 cfdresx1 /usr/lib/vmware/bin/vmware-hostd[2700]: pam_per_user: create_subrequest_handle(): creating new subrequest (user="vpxuser", service="system-auth-local")
Feb 21 14:30:01 cfdresx1 crond[6866]: pam_per_user: create_subrequest_handle(): doing map lookup for user "root"
Feb 21 14:30:01 cfdresx1 crond[6866]: pam_per_user: create_subrequest_handle(): creating new subrequest (user="root", service="system-auth-generic")
Feb 21 14:30:01 cfdresx1 crond[6866]: pam_unix(system-auth-generic:session): session opened for user root by (uid=0)
Cluster 2
=======
ESX 4.0 261974 build
Server Model: HP Proliant DL380 G5
Storage Model: EMC FAS 2020
Processor: Intel Xeon E5310
Fibre Channel drive (Smart Array P800)
System reboots once or twice a week
Feb 17 13:46:49 cfesx02 vmkernel: 3:06:29:42.934 cpu1:6071)VSCSI: 6025: handle 8245(vscsi4:0):Destroying Device for world 6072 (pendCom 0)
Feb 17 13:46:49 cfesx02 vmkernel: 3:06:29:43.057 cpu1:6071)DevFS: 2370: Unable to find device: 7f20c802-NP-5-delta.vmdk
Feb 17 13:46:49 cfesx02 vmkernel: 3:06:29:43.206 cpu1:6071)VSCSI: 6025: handle 8246(vscsi1:0):Destroying Device for world 6072 (pendCom 0)
Feb 17 13:46:50 cfesx02 vmkernel: 3:06:29:44.356 cpu4:6088)VMotionSend: 2921: 1297971987014193 S: Sent all modified pages to destination (network bandwidth ~115.203 MB/s)
Feb 17 13:47:14 cfesx02 vmkernel: 3:06:30:08.548 cpu5:4137)WARNING: NFSLock: 2036: disk is being locked by other consumer
Feb 17 13:47:14 cfesx02 vmkernel: 3:06:30:08.548 cpu5:4137)NFSLock: 2677: failed to get lock on file NP-5-36a6887b.vswp 0x410003234410 on 192.168.48.32 (192.168.48.32): Busy
Feb 17 15:02:28 cfesx02 vmkernel: TSC: 0 cpu0:0)Init: 418: cpu 0: early measured tsc speed 2300092987 Hz
Feb 17 15:02:28 cfesx02 vmkernel: TSC: 10276 cpu0:0)Init: 419: vmkLoadEntry = $[0x390ab9a0]
Feb 17 15:02:28 cfesx02 vmkernel: TSC: 18114 cpu0:0)Cpu: 346: id1.version 100f42
Feb 17 15:02:28 cfesx02 vmkernel: TSC: 25157 cpu0:0)CPUAMD: 214: Detecting xapic on AMD_K8:tcr = 0x4fc820
Feb 17 15:02:28 cfesx02 vmkernel: TSC: 30887 cpu0:0)CPUAMD: 315: effective family = 16
Could anyone please let me know why the reboots are happening?
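To pin down when each restart happened, the pasted excerpts can be scanned for the first vmkernel boot line after a run of normal messages. The following is only a rough sketch: the function name `find_restarts`, the two regular expressions, and the `year` parameter are illustrative assumptions (the log lines carry no year), and the boot marker is simply the `TSC: ... cpu0:0)Init:` pattern seen in the Cluster 2 excerpt, not a documented VMware format.

```python
import re
from datetime import datetime

# Syslog prefix as it appears in the excerpts: "Feb 17 13:47:14 cfesx02 ..."
LINE_RE = re.compile(r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (\S+) (.*)$")
# Assumed boot marker: the vmkernel line seen right after the gap in Cluster 2.
BOOT_RE = re.compile(r"vmkernel: TSC: \d+ cpu0:0\)Init:")


def find_restarts(lines, year=2000):
    """Return (host, last_time_before_boot, first_boot_time) tuples.

    `year` is a placeholder because the log lines do not include one.
    """
    restarts = []
    last_seen = {}  # host -> (timestamp, was_boot_line)
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip anything that is not a syslog line
        stamp, host, _rest = m.groups()
        ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        is_boot = bool(BOOT_RE.search(line))
        prev = last_seen.get(host)
        # A boot line that follows a non-boot line marks a restart.
        if is_boot and prev is not None and not prev[1]:
            restarts.append((host, prev[0], ts))
        last_seen[host] = (ts, is_boot)
    return restarts


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as fh:
        for host, before, after in find_restarts(fh):
            print(f"{host}: last entry {before:%b %d %H:%M:%S}, "
                  f"boot at {after:%b %d %H:%M:%S}, gap {after - before}")
```

Run against the Cluster 2 excerpt above, this reports the gap between the last entry at 13:47:14 and the boot messages at 15:02:28. It only shows when logging stopped and resumed; it says nothing about the cause.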
Can you also post the hostd logs and/or the information from a support dump?
My suspicion is that you have something hardware-related going on. Have you checked the hardware section of the VI Client or the HP tools?
-Cody
The logs do not show much information. You say "cluster": are they configured as clusters in a vCenter Server instance? If so, how many ESX hosts are in each cluster, and how many are experiencing issues? Do you have HA configured?
Moved to the VMTN forum for VMware ESX.
André
I had a host that rebooted occasionally. It was one of two identical whitebox systems I built at the same time. Identical everything. On a lark, trying to figure it out, I updated the BIOS. Although both ran identical BIOS versions, this solved the problem. What I don't recall is whether there was a BIOS update from ASUS or I just re-applied the version they were already running to both, but they have not crashed in over a year.
I know updating BIOSes is not something we usually think of, and vendors consider it bad practice, but I tend to check for and apply firmware updates for everything whenever I'm updating software and notice a newer firmware release is available. I think vendors are afraid to recommend it because applying firmware updates always carries a risk of turning equipment into toasters.
Make sure you have backups and a recovery plan. Take pictures of all firmware setting pages with your iPhone or whatever. Update firmware. Force to defaults, save, reboot and re-apply settings if you're brave enough. I don't usually default/re-apply settings unless I don't trust the vendor in the first place in which case I probably should have spent my money elsewhere. :smileycry:
Good luck.