ESX 3.0.1 host failures

Oli_L · ‎09-30-2007

Hi,

We have had 5 instances of host failures (2 on one host, 3 on individual hosts) where the host has frozen, outputting some red text on the console welcome screen. We have 9 hosts in our datacentre in London where this is happening, all at the same patch level and all with the same hardware. The server model is an HP DL380 G5 with 16 GB of memory (8 x 2 GB DDR2 SDRAM FB-DIMM), 2 x quad-core Intel Xeon processors @ 2.33 GHz, and 2 x quad-port NC340T NICs plus 2 onboard NICs, running VMware ESX Server 3.0.1 build-44686. All of our VMs run from our DataCore SANsymphony 5 patch 11 SAN.

When the hosts have failed, we have had this red text output on the welcome screen:

(63:00:56:30.472 cpu0:1096)APIC: 1265: Lint1 interrupt on pcpu 0 (port x61 contains 0xb1)

We have had this crash happen 3 times in one day, all on different hosts. HA kicked in and restarted the VMs on the other hosts in the cluster, but this is very concerning...

Has anyone had errors like this, or does anyone have the knowledge to diagnose our problem?

(Working on a Sunday to fix it... !)

Thanks in advance for any info

Oli

RParker · ‎09-30-2007

http://communities.vmware.com/message/641718

I found one post on VMware, but you might try google.com; there are LOTS of hits on this exact error.

Oli_L · ‎09-30-2007

When you say there are LOTS, I couldn't really find much when I searched google.com.

E.g.,

APIC: 1265: Lint1 interrupt on pcpu 0 - returns 5 results on one page, 3 from vmware...including this thread..

&

(63:00:56:30.472 cpu0:1096)APIC: 1265: Lint1 interrupt on pcpu 0 (port x61 contains 0xb1) - returns my thread?

&

Nevertheless, I have done some research and found a few similar threads. We have a call open with VMware, and the best thread I read on this error was this one:

http://communities.vmware.com/message/73671

post from JMills....

We reseated all the memory modules today and we are running memtest86 (from http://www.memtest86.com) on two hosts that had the problem, but so far no reported errors. We also ran the SmartStart 7.9 diagnostics and set a 2-loop test... no errors?!

I logged this call with VMware and the tech guy said he couldn't help, which is not surprising as it looks like a hardware issue, but he did not mention anything about deciphering the error, which wasn't helpful...

I then phoned up again and got someone else who was a little more helpful and told us to reseat the memory modules...
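On deciphering the error: the line does carry one usable detail, the value read from port 0x61. The sketch below assumes the conventional PC-compatible layout of that port (the NMI status and control register), which this thread does not confirm for the DL380 G5; the function names are made up for illustration. Under that assumption, 0xb1 has bit 7 set, which would mean the NMI was raised for a SERR#/parity condition rather than an I/O channel check. That points towards hardware or firmware but does not say which component.

```python
import re

# Conventional bit meanings of port 0x61 on PC-compatible chipsets.
# ASSUMPTION: the host follows this layout; the thread does not confirm it.
PORT61_BITS = {
    7: "SERR# / parity error NMI status",
    6: "IOCHK# NMI status",
    5: "timer 2 output",
    4: "refresh toggle",
    3: "IOCHK# NMI disabled",
    2: "SERR# NMI disabled",
    1: "speaker data enable",
    0: "timer 2 gate enable",
}


def parse_lint1_line(line):
    """Return the port 0x61 value from a Lint1 console line, or None."""
    match = re.search(r"port x61 contains (0x[0-9a-fA-F]+)", line)
    return int(match.group(1), 16) if match else None


def decode_port61(value):
    """List the descriptions of every bit that is set, highest bit first."""
    return [desc for bit, desc in PORT61_BITS.items() if value >> bit & 1]


if __name__ == "__main__":
    console_line = ("(63:00:56:30.472 cpu0:1096)APIC: 1265: Lint1 interrupt "
                    "on pcpu 0 (port x61 contains 0xb1)")
    for description in decode_port61(parse_lint1_line(console_line)):
        print(description)
```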

Does anyone have any other useful information or knowledge that will help me, i.e. known DL380 config issues? I read somewhere that the G4's max config is 6 slots of memory. Does this apply to G5s?

It's one of those errors where people have suggested memory, but we spent £20,000 on memory, so getting it all replaced is not going to happen! We need to find the culprits, but it's happened across 4 hosts now, so it's difficult to pin down...

I can't have hosts failing like this as we run so many important VMs, like our email gateway and Exchange boxes, doc management...!

Any help would be really appreciated... Thanks a mil

Oli

RParker · ‎09-30-2007

Well, perhaps it IS a hardware issue. Did you try calling HP? That's what I was saying about LOTS on Google: it ALL refers to HP. If VMware couldn't help (and in your opinion they weren't helpful because it was a "hardware" error), I think that should tell you the problem may be hardware related, and NOT software.

HP, like Dell, has diagnostic programs you can run. I would suggest that you run one of those on your machines. Firmware, backplane, SCSI/RAID all need updates from time to time, but since it's happening on ALL your machines, that doesn't sound like a VMware issue either.

It must be hardware. If you resolve one machine, I bet a dollar to a donut that it will fix ALL. Start with BIOS / RAID controller updates.

dominic7 · ‎09-30-2007

This actually sounds pretty easy to solve, but it will take some sleuthing on either your part or the part of your hardware vendor, as suggested above. First you want to run vm-support. This will gather up a support file that you can send to VMware or whoever you buy support through. If you untar/gunzip the package, there should be a file at ./<vm-support-systemname-timestamp>/root/vmkernel-zdump-<timestamp>
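If you would rather not unpack the whole bundle by hand, here is a minimal sketch that lists any vmkernel-zdump files inside it. It assumes the vm-support output is an ordinary tar archive (optionally gzip-compressed) and that the dump file names start with "vmkernel-zdump" as in the path above; the function name is just for illustration.

```python
import tarfile


def list_zdump_members(bundle_path):
    """Return archive member names whose file name starts with 'vmkernel-zdump'."""
    # "r:*" lets tarfile detect the compression (plain tar, gzip, ...) itself.
    with tarfile.open(bundle_path, "r:*") as bundle:
        return [
            member.name
            for member in bundle.getmembers()
            if member.isfile()
            and member.name.rsplit("/", 1)[-1].startswith("vmkernel-zdump")
        ]
```

An empty list would match what VMware support reports when no vmkernel dump was written before the crash.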

Use your favorite editor that can deal with a large text file (e.g. Notepad++), and look for the crash. Sometimes it's hard to locate and sometimes it isn't. You can very quickly search for a machine check exception, which almost always points to a hardware failure. Here is an example of a machine check exception from a host with a bad set of RAM.

3:04:56:52.344 cpu3:1156)ALERT: MCE: 147: Machine Check Exception
3:04:56:52.344 cpu3:1156)ALERT: MCE: 160: Machine Check Exception: General Status 0000000000000004
3:04:56:52.344 cpu3:1156)ALERT: MCE: 171: Machine Check Exception: Bank 0, Status 0000000000000000
3:04:56:52.344 cpu3:1156)ALERT: MCE: 171: Machine Check Exception: Bank 1, Status 0000000000000000
3:04:56:52.344 cpu3:1156)ALERT: MCE: 171: Machine Check Exception: Bank 2, Status 0000000000000000
3:04:56:52.344 cpu3:1156)ALERT: MCE: 171: Machine Check Exception: Bank 3, Status 0000000000000000
3:04:56:52.344 cpu3:1156)ALERT: MCE: 171: Machine Check Exception: Bank 4, Status f66da00125080813
3:04:56:52.344 cpu3:1156)ALERT: MCE: 188: Machine Check Exception: Bank 4, Addr 000000070b234600
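For a large file, the search can be scripted. The sketch below is only an illustration: it looks for "Bank N, Status ..." lines in the format shown above, skips banks whose status is all zeros, and decodes the upper flag bits using the standard x86 machine-check status layout. It assumes the host CPU follows that layout (the thread does not say which CPU produced this example), and the function names are made up here.

```python
import re

# Matches the tail of lines such as
# "... MCE: 171: Machine Check Exception: Bank 4, Status f66da00125080813"
BANK_STATUS = re.compile(
    r"Machine Check Exception: Bank (\d+), Status ([0-9a-fA-F]{16})"
)

# Upper flag bits of the x86 MCi_STATUS register (bit number -> short name).
# ASSUMPTION: the CPU that logged the line uses this architectural layout.
STATUS_FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MISC register valid
    58: "ADDRV",  # ADDR register valid
    57: "PCC",    # processor context corrupt
}


def decode_status(status_hex):
    """Split a 64-bit status value into its set flags and low 16-bit error code."""
    value = int(status_hex, 16)
    flags = [name for bit, name in STATUS_FLAGS.items() if value >> bit & 1]
    return {"flags": flags, "mca_error_code": value & 0xFFFF}


def find_mce_banks(lines):
    """Yield (bank, status_hex, decoded) for every bank with a non-zero status."""
    for line in lines:
        match = BANK_STATUS.search(line)
        if match and int(match.group(2), 16) != 0:
            yield int(match.group(1)), match.group(2), decode_status(match.group(2))


if __name__ == "__main__":
    sample = [
        "3:04:56:52.344 cpu3:1156)ALERT: MCE: 171: Machine Check Exception: Bank 3, Status 0000000000000000",
        "3:04:56:52.344 cpu3:1156)ALERT: MCE: 171: Machine Check Exception: Bank 4, Status f66da00125080813",
    ]
    for bank, status, decoded in find_mce_banks(sample):
        print(bank, status, decoded)
```

In the example above only Bank 4 is non-zero, and its ADDRV flag is set, which fits with the separate "Bank 4, Addr" line being logged. To scan a real file, pass an open text file handle instead of the sample list.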

I've personally dealt with more than my fair share of ESX crashes due to faulty hardware. Not all errors are easy to detect but this works at least 70% of the time.

leacho · ‎10-01-2007

Thanks for your response. If I sound desperate then it is because I am...! It's a case of going down as many avenues as possible, getting people's opinions and putting all the evidence together! So thanks to both of you for your posts, really appreciated.

I have a call logged with VMware and have run vm-support and sent them the logs... I have also run HP diagnostics (SmartStart 7.90) and memtest86 3.3 (still running), but we have no reported errors. VMware support said that there was no vmkernel dump before the crash happened, so the logs were inconclusive.

One possibility has sprung to mind. We had an A/C fault and the room varied in temperature by about 3/4°C, which may have caused the chips to expand and contract, hence moving the memory... a possible option but very difficult to diagnose.

The hosts seem stable now, but I am so unsure and have no confidence that they will remain running. All I want is a stable environment where the hosts don't crash, and surely that is possible!

Going to phone VMware again and HP!

Thanks again

Oli

(logged in as my 'other' user!)

zorro_rks · ‎07-06-2008

Hello Oli,

I am facing a similar issue with one of my servers, an IBM System x3650. Can you tell me if this issue was resolved, and of course what the fix was? Thank you.

Oli_L · ‎12-04-2008

We upgraded our hosts to 3.0.2 and that fixed the issue for us, but if you are having the problems in July 2008 then you should by now be on this version.

BUT I did also upgrade the BIOS firmware. Both seem to have fixed the issue, thankfully.

I would log a call, as it sounds different.

Hope this helps!

Texiwill · ‎12-05-2008

Hello,

Moved to ESX 3.0 forum.

Best regards,

Edward L. Haletky

VMware Communities User Moderator

====

Blue Gears and other Blogs: http://www.astroarch.com/wiki/index.php/Blog_Roll
