Solved: Re: ESX 3.50 Update 2 host just rebooted for no reason

dwchan · ‎08-27-2008

Today, for no apparent reason, one of our HP BL465 G5 servers decided to reboot itself. It is a quad-core AMD system with 32 GB of memory. This is a new server (a couple of months old) and was built with 3.5 Update 2 right out of the gate. We didn't see any errors from the onboard iLO, and no SIM errors to indicate a hardware problem. It just plain rebooted. Yet we have six or seven other BL465 G5s that have been running longer than it without issue. I checked most of the logs in /var/log without finding anything. Does anyone have any more suggestions for where else I can dig?

Here is the log from vmkernel right before and after the reboot:

0017a4770028 mtime 1188213]

Aug 27 12:10:48 wkfuxpvm06 vmkernel: 3:05:53:09.772 cpu3:1079)SCSI: 4826: path vmhba0:0:50: Passing device status RESERVATION_CONFLICT (18) through

Aug 27 13:17:41 wkfuxpvm06 vmkernel: TSC: 0 cpu0:0)Init: 384: cpu 0: early measured tsc speed 2300083475 Hz

Aug 27 13:17:41 wkfuxpvm06 vmkernel: TSC: 23474 cpu0:0)Cpu: 318: id1.version 100f23
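When a host reboots without leaving a dump, one useful first step is to pin down the silent window in the vmkernel log. The following is a minimal Python sketch of that idea, not a VMware tool. It assumes that a fresh boot begins with a `TSC: 0 cpu0:0)` entry, as it does in the excerpt above, and that the year is 2008 (syslog stamps omit it). The names `parse_stamp`, `find_reboots`, and `BOOT_MARKER` are made up for illustration.

```python
from datetime import datetime

# The entries on either side of the reboot, copied from the excerpt above.
LINES = [
    "Aug 27 12:10:48 wkfuxpvm06 vmkernel: 3:05:53:09.772 cpu3:1079)SCSI: 4826: "
    "path vmhba0:0:50: Passing device status RESERVATION_CONFLICT (18) through",
    "Aug 27 13:17:41 wkfuxpvm06 vmkernel: TSC: 0 cpu0:0)Init: 384: cpu 0: "
    "early measured tsc speed 2300083475 Hz",
    "Aug 27 13:17:41 wkfuxpvm06 vmkernel: TSC: 23474 cpu0:0)Cpu: 318: id1.version 100f23",
]

# Assumption: the first vmkernel entry after a boot carries "TSC: 0".
BOOT_MARKER = "vmkernel: TSC: 0 cpu0:0)"


def parse_stamp(line, year=2008):
    """Parse the leading 'Mon DD HH:MM:SS' stamp; the year is supplied by the caller."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def find_reboots(lines):
    """Return (last_entry_time, boot_time, gap) for every boot marker found."""
    reboots = []
    previous = None
    for line in lines:
        stamp = parse_stamp(line)
        if BOOT_MARKER in line and previous is not None:
            reboots.append((previous, stamp, stamp - previous))
        previous = stamp
    return reboots


for last_seen, booted, gap in find_reboots(LINES):
    print(f"last entry {last_seen}, first boot entry {booted}, silent for {gap}")
```

The gap only bounds the window in which the host went down; it says nothing about the cause. It does give a time range to line up against the iLO event log, the IML, and the other logs in /var/log.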

Here is the log from esxboot:

2008-08-27 13:01:01 (7529) INFO : Acquiring lock on file '/etc/vmware/esx.conf'.

2008-08-27 13:01:01 (7529) INFO : Acquiring lock on file '/etc/vmware/esx_checksum.conf'.

2008-08-27 13:01:01 (7529) INFO : Releasing lock on file '/etc/vmware/esx_checksum.conf'.

2008-08-27 13:01:01 (7529) INFO : "/usr/bin/md5sum /etc/vmware/esx.conf 2>/dev/null"

2008-08-27 13:01:01 (7529) INFO : fd34e1986049efd96be7e4f024e82921 /etc/vmware/esx.conf

2008-08-27 13:01:01 (7529) INFO : "/bin/df -P -B M /boot"

2008-08-27 13:01:01 (7529) INFO : Filesystem 1048576-blocks Used Available Capacity Mounted on

2008-08-27 13:01:01 (7529) INFO : /dev/cciss/c0d0p1 193 26 158 15% /boot

2008-08-27 13:01:01 (7529) INFO : Recreating initrds...

2008-08-27 13:01:01 (7529) INFO : "/sbin/vmware-mkinitrd -r -f -v /tmp/vmware.0.tmp '2.4.21-57.ELvmnix'"

2008-08-27 13:01:02 (7529) INFO : Calling prep script: iso-initrd -r /tmp/initrd.mnt.Gx7544 2.4.21-57.ELvmnix

2008-08-27 13:01:02 (7529) INFO : Calling prep script: prepinitrd -r /tmp/initrd.mnt.Gx7544 2.4.21-57.ELvmnix

2008-08-27 13:01:04 (7529) INFO : Acquiring lock on file '/boot/initrd-2.4.21-57.ELvmnix.img'.

2008-08-27 13:01:04 (7529) INFO : "/bin/sync"

2008-08-27 13:01:04 (7529) INFO : Releasing lock on file '/boot/initrd-2.4.21-57.ELvmnix.img'.

2008-08-27 13:01:04 (7529) INFO : Using initrd helper vmware_mkinitrd_refresh.

2008-08-27 13:01:04 (7529) INFO : "/sbin/vmware-mkinitrd -r -f -v /tmp/vmware.1.tmp '2.4.21-57.ELvmnix'"

2008-08-27 13:01:05 (7529) INFO : Calling prep script: iso-initrd -r /tmp/initrd.mnt.UU7583 2.4.21-57.ELvmnix

2008-08-27 13:01:05 (7529) INFO : Calling prep script: prepinitrd -r /tmp/initrd.mnt.UU7583 2.4.21-57.ELvmnix

2008-08-27 13:01:07 (7529) INFO : Acquiring lock on file '/boot/initrd-2.4.21-57.ELvmnix.img-dbg'.

2008-08-27 13:01:07 (7529) INFO : "/bin/sync"

2008-08-27 13:01:07 (7529) INFO : Releasing lock on file '/boot/initrd-2.4.21-57.ELvmnix.img-dbg'.

2008-08-27 13:01:07 (7529) INFO : Writing esx_checksum.conf...

2008-08-27 13:01:07 (7529) INFO : Acquiring lock on file '/etc/vmware/esx_checksum.conf'.

2008-08-27 13:01:07 (7529) INFO : "/usr/bin/md5sum /etc/vmware/esx.conf 2>/dev/null"

2008-08-27 13:01:07 (7529) INFO : fd34e1986049efd96be7e4f024e82921 /etc/vmware/esx.conf

2008-08-27 13:01:07 (7529) INFO : Releasing lock on file '/etc/vmware/esx_checksum.conf'.

2008-08-27 13:01:07 (7529) INFO : Releasing lock on file '/etc/vmware/esx.conf'.

2008-08-27 13:01:07 (7529) INFO : /usr/sbin/esxcfg-boot completed successfully.

2008-08-27 13:13:14 (1103) INFO : Acquiring lock on file '/etc/vmware/esx.conf'.

2008-08-27 13:13:14 (1103) INFO : Acquiring lock on file '/etc/vmware/esx_checksum.conf'.

2008-08-27 13:13:14 (1103) INFO : Releasing lock on file '/etc/vmware/esx_checksum.conf'.

2008-08-27 13:13:14 (1103) INFO : "/usr/bin/md5sum /etc/vmware/esx.conf 2>/dev/null"

2008-08-27 13:13:14 (1103) INFO : 94789812a1e7e199b7fe7dae73d1a8b2 /etc/vmware/esx.conf

2008-08-27 13:13:14 (1103) INFO : Releasing lock on file '/etc/vmware/esx.conf'.

2008-08-27 13:13:14 (1103) INFO : /usr/sbin/esxcfg-boot completed successfully.
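One detail visible in the esxboot log above is that the md5sum logged for /etc/vmware/esx.conf differs between the 13:01 run (pid 7529) and the 13:13 run (pid 1103). A small Python sketch like the one below can flag such changes in a longer log. The regular expression is an assumption derived only from the lines shown here, and `checksum_changes` is a hypothetical helper name.

```python
import re

# Pattern inferred from the md5sum result lines in the log above (an assumption).
MD5_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+)\) INFO : ([0-9a-f]{32})\s+(\S+)$"
)

# The three md5sum result lines from the excerpt above.
LOG = """\
2008-08-27 13:01:01 (7529) INFO : fd34e1986049efd96be7e4f024e82921 /etc/vmware/esx.conf
2008-08-27 13:01:07 (7529) INFO : fd34e1986049efd96be7e4f024e82921 /etc/vmware/esx.conf
2008-08-27 13:13:14 (1103) INFO : 94789812a1e7e199b7fe7dae73d1a8b2 /etc/vmware/esx.conf
"""


def checksum_changes(text):
    """Return (timestamp, pid, path, old_md5, new_md5) wherever a file's logged md5 changes."""
    last_seen = {}
    changes = []
    for line in text.splitlines():
        match = MD5_LINE.match(line.strip())
        if not match:
            continue  # not an md5sum result line
        stamp, pid, digest, path = match.groups()
        if path in last_seen and last_seen[path] != digest:
            changes.append((stamp, int(pid), path, last_seen[path], digest))
        last_seen[path] = digest
    return changes


for stamp, pid, path, old, new in checksum_changes(LOG):
    print(f"{stamp} pid {pid}: {path} changed {old} -> {new}")
```

A changed checksum only shows that the contents of esx.conf differed between the two runs. It does not by itself explain the reboot, but it may be worth mentioning when opening a support case.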

And finally, here is the log from hostd.log:

[2008-08-27 13:04:22.478 'TaskManager' 101522352 info] Task Created : haTask-pool0-vim.ResourcePool.updateConfig-87220

Task Completed : haTask-pool0-vim.ResourcePool.updateConfig-87220

Task Created : haTask-pool1-vim.ResourcePool.updateConfig-87221

Task Completed : haTask-pool1-vim.ResourcePool.updateConfig-87221

Hw info file: /etc/vmware/hostd/hwInfo.xml

Config target info loaded

Task Created : haTask-pool0-vim.ResourcePool.updateConfig-87312

Task Completed : haTask-pool0-vim.ResourcePool.updateConfig-87312

Task Created : haTask-pool1-vim.ResourcePool.updateConfig-87313

Task Completed : haTask-pool1-vim.ResourcePool.updateConfig-87313

Log for VMware ESX Server, pid=3707, version=3.5.0, build=build-110268, option=Release, section=2

Current working directory: /var/log/vmware

HOSTINFO: Seeing AMD CPU, numCoresPerCPU 4 numThreadsPerCore 1.

HOSTINFO: hyperthreading disabled, setting number of threads per core to 1.

HOSTINFO: This machine has 2 physical CPUS, 8 total cores, and 8 logical CPUs.

dwc

sheetsb · ‎10-16-2008

That is correct. No messages in SIM nor on the ILO indicating bad memory. You should also look at this post regarding the same issues: http://communities.vmware.com/message/1073314#1073314

There is a pointer there to this url: http://www.hpfuwu.cn/post/5.html

Bill S.

Schorschi · ‎08-27-2008

Is the APIC table set to FULL? If possible, document the interrupt mapping and compare it to the other servers. You may find an odd stacking of interrupts. HP is usually better at this than Dell or IBM, but that is a place to start. HP ASRs for no reason are rare, so also check the HP IML and the HP agents web page for any alerts or warnings. Make sure HP ASR is set to 30 minutes.

dwchan · ‎08-27-2008

Where do I check the APIC table setting? What interrupt mapping are you talking about? Some BIOS setting? Also, I triple-checked, and the HP IML log is clean.

Texiwill · ‎08-28-2008

Hello,

Yes it is a BIOS setting under Advanced. I would disable ASR so you can also see if anything appears on the console.

Best regards,

Edward L. Haletky

VMware Communities User Moderator

====

CIO Virtualization Blog: http://www.cio.com/blog/index/topic/168354

As well as the Virtualization Wiki at http://www.astroarch.com/wiki/index.php/Virtualization

dwchan · ‎08-28-2008

I have checked every BIOS setting, and they are the same as on the rest of the servers in the cluster. APIC is set to FULL. Any other ideas?

dwc

mike_laspina · ‎08-28-2008

Maybe an IRQ sharing issue?

Look at

cat /proc/vmware/interrupts

cat /proc/vmware/pci
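To compare the interrupt mapping against a host that is not rebooting, one option is to save the output of the two commands above on each host and diff the captures. The following is a minimal Python sketch using the standard `difflib` module. The sample text and host names are placeholders, not the real format of /proc/vmware/interrupts, and `diff_captures` is a hypothetical helper.

```python
import difflib


def diff_captures(name_a, text_a, name_b, text_b):
    """Return unified-diff lines between two saved captures, ignoring trailing whitespace."""
    lines_a = [line.rstrip() for line in text_a.splitlines()]
    lines_b = [line.rstrip() for line in text_b.splitlines()]
    return list(
        difflib.unified_diff(lines_a, lines_b, fromfile=name_a, tofile=name_b, lineterm="")
    )


# Placeholder captures: invented lines standing in for the saved command output.
STABLE_HOST = "device-a vector-1\ndevice-b vector-2\n"
REBOOTING_HOST = "device-a vector-1\ndevice-b vector-1\n"

for line in diff_captures("stable-host", STABLE_HOST, "rebooting-host", REBOOTING_HOST):
    print(line)
```

If the real files include per-CPU interrupt counters, those will differ between any two hosts, so the thing to look for is which devices end up on the same vector rather than the counts themselves.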

http://blog.laspina.ca/ vExpert 2009

sheetsb · ‎09-17-2008

We are seeing exactly the same problem on two out of four systems. It looks like the same configuration--BL465c G5, dual processor, quad core AMD, 32G of RAM. Did you ever determine the problem? I have an SR open with VMware and HP on this now...

Bill S.

dwchan · ‎09-19-2008

So far, HP support has been useless, and VMware is not much better. In short, the only ways we have found to fix the problem (workarounds) are the following:

1. reboot the server

2. Disconnect the server from the cluster, REMOVE, and rejoin the cluster

In either case, it looks as if some configuration file for HA that works with the vpx service got hosed.

dwc

sheetsb · ‎09-19-2008

Do you have a support incident number with HP and VMware that I could provide to them? I have a conference call with both at 1pm today Pacific time, about an hour and a half away.

Bill S.

dwchan · ‎09-22-2008

OK, our case ID with VMware is 1135699321. The only solution (workaround so far) is a reboot. We did reboot the server, and it has seemed stable for about a week now. I can run another vm-support, but I'm not sure the logs are relevant.

dwc

COS · ‎09-22-2008

See Texiwill's post here: http://communities.vmware.com/thread/164803?tstart=0

It looks like HA is having some issues and may be related.

sheetsb · ‎09-22-2008

I looked at the thread, briefly, and must have missed something. How do you feel it may be related to this problem? I didn't see anything about hosts rebooting.

Bill S.

COS · ‎09-22-2008

You mentioned "it looks as if some configuration file for HA that works with the vpx service got hosed".

I'm pulling at straws for you.

sheetsb · ‎09-22-2008

I appreciate every straw I can get. I have an upcoming conference call with VMware and HP support to try to track this down. So far I haven't had any luck. I did get a question from HP support regarding specifics as to the memory installed in the system. Maybe a bad batch?

Bill S.

sheetsb · ‎09-22-2008

I'm not sure what's up. HP support sent this and asked me to compare my DIMMs to it. I told them mine matched exactly and they want to ship me new memory.

You might want to check the ones in your hosts.

Bill S.

COS · ‎09-22-2008

Yeah, HP always has us do this. I would recommend pulling half of the installed DIMMs, then booting back up to see what happens. If you still have issues, swap the remaining set with the ones you pulled and boot again. HP will most likely have you do this anyway.

cau · ‎10-16-2008

I have the same problem with 2 of my BL465 G5 servers as well... they suddenly reboot.

Got any solution for this?

dwchan · ‎10-16-2008

Well, I just opened another case for the same server that had this issue about a month or two ago! Same symptoms: it rebooted for no reason, with no dump and no SIM or hardware error, just a reboot! No purple screen (we disabled ASR to try to capture a dump). The only trace that seems to point to a panic is that /tmp/vmkdump.log was generated, but with no valuable info. We have one other server in our farm that did this as well. All the others seem fine! Does anyone have any more clues related to the problem?

dwc

dwchan · ‎10-16-2008

Did you get any SIM errors that point to a memory problem? Any kernel dump? After swapping the physical memory, did it help?

sheetsb · ‎10-16-2008

We never got any logs of any kind that would help diagnose the error. I sent support logs to VMware and they reviewed them. The SE stated that the logs didn't show anything. It was as if someone had pulled the power cord on the servers. I finally got a case opened with HP and they suggested replacing the memory. We replaced all the memory (32 GB) in both servers and the problem hasn't resurfaced for about three weeks.

I hope that's the end of it.

Bill S.
