VMware Cloud Community
sonicsw
Enthusiast

ESX 3.5 CoreDump Help Plz

Hello Community,

I have a really big problem and hope you can help me somehow...

I have two ESX servers here with exactly the same configuration as five others I already have running without problems.

Intel S3000AH with an Adaptec 3405 SATA/SAS RAID controller. I know it's not on the HCL, but we're not that rich in our business...

The Adaptec driver in ESX is called aacraid, and around it I found some errors in the core dump.

Could you maybe help me make sense of it? I changed the whole hardware and put less load on it, but even with only one VM it crashed again.

The parts where I think the error shows up are the ALERT and WARNING lines below...

0:01:40:13.506 cpu3:1070)World: vm 1072: 895: Starting world vmm1:W2k3-Exclaimer with flags 8

0:01:40:25.502 cpu3:1070)VSCSI: 4059: Creating Virtual Device for world 1071 vscsi0:0 (handle 8192)

0:01:40:25.644 cpu3:1070)World: vm 1073: 895: Starting world vmware-vmx with flags 44

0:01:40:25.645 cpu0:1073)World: vm 1074: 895: Starting world vmware-vmx with flags 44

0:01:40:25.645 cpu3:1073)World: vm 1075: 895: Starting world vmware-vmx with flags 44

0:01:40:25.646 cpu3:1071)Init: 1054: Received INIT from world 1071

0:01:40:25.790 cpu3:1073)World: vm 1076: 895: Starting world vmware-vmx with flags 44

0:01:40:25.795 cpu2:1072)Init: 1054: Received INIT from world 1072

0:01:40:25.813 cpu0:1071)Uplink: 2495: Setting capabilities 0x0 for device vmnic0

0:01:40:41.695 cpu2:1072)Uplink: 2495: Setting capabilities 0x0 for device vmnic0

0:01:40:41.760 cpu2:1072)Net: 4203: unicastAddr 00:50:56:80:7d:b1;

0:01:40:41.761 cpu2:1072)Net: 4203: unicastAddr 00:50:56:80:7d:b1;

0:01:40:41.761 cpu2:1072)Net: 4203: unicastAddr 00:50:56:80:7d:b1;

0:01:40:41.761 cpu2:1072)Net: 4203: unicastAddr 00:50:56:80:7d:b1;

0:01:40:41.761 cpu2:1072)Net: 4203: unicastAddr 00:50:56:80:7d:b1;

0:01:40:41.761 cpu2:1072)Net: 4203: unicastAddr 00:50:56:80:7d:b1;

0:01:40:43.081 cpu0:1071)Net: 4203: unicastAddr 00:50:56:80:7d:b1;

0:01:40:43.092 cpu0:1071)Net: 4203: unicastAddr 00:50:56:80:7d:b1;

0:01:40:58.496 cpu2:1072)Net: 4203: unicastAddr 00:50:56:80:7d:b1;

0:02:50:15.149 cpu1:1061)LinSCSI: 3201: Abort failed for cmd with serial=2144, status=bad0001, retval=bad0001

0:02:50:30.205 cpu1:1061)LinSCSI: 3201: Abort failed for cmd with serial=2145, status=bad0001, retval=bad0001

0:02:50:55.151 cpu1:1061)LinSCSI: 3201: Abort failed for cmd with serial=2144, status=bad0001, retval=bad0001

0:02:51:05.285 cpu1:1025)ALERT: Heartbeat: 470: PCPU 0 didn't have a heartbeat for 62 seconds. may be locked up

0:02:51:05.285 cpu0:1064)ALERT: NMI: 1625: Faulting eip:esp

0:02:51:05.286 cpu0:1064)0x3aa3e88:[0x63fa9b]Util_Udelay+0x5a stack: 0x5, 0x89d690, 0x4020

0:02:51:05.286 cpu0:1064)0x3e241088:[0x89d6b4]aacraid_esx30+0x76b3 stack: 0x0, 0x0, 0x0

0:02:51:10.207 cpu1:1061)LinSCSI: 3201: Abort failed for cmd with serial=2145, status=bad0001, retval=bad0001

0:02:51:35.153 cpu1:1061)LinSCSI: 3201: Abort failed for cmd with serial=2144, status=bad0001, retval=bad0001

0:02:51:50.209 cpu1:1061)LinSCSI: 3201: Abort failed for cmd with serial=2145, status=bad0001, retval=bad0001

0:02:52:15.155 cpu1:1061)LinSCSI: 3201: Abort failed for cmd with serial=2144, status=bad0001, retval=bad0001

0:02:52:30.211 cpu1:1061)LinSCSI: 3201: Abort failed for cmd with serial=2145, status=bad0001, retval=bad0001

0:02:52:55.157 cpu1:1061)LinSCSI: 3201: Abort failed for cmd with serial=2144, status=bad0001, retval=bad0001

0:02:53:05.285 cpu1:1025)ALERT: Heartbeat: 470: PCPU 0 didn't have a heartbeat for 182 seconds. may be locked up

0:02:53:05.285 cpu0:1064)ALERT: NMI: 1625: Faulting eip:esp

0:02:53:05.285 cpu0:1064)0x3aa3e88:[0x63fa9b]Util_Udelay+0x5a stack: 0x5, 0x89d6a5, 0x4020

0:02:53:05.285 cpu0:1064)0x3e241088:[0x89d6b4]aacraid_esx30+0x76b3 stack: 0x0, 0x0, 0x0

0:02:53:10.213 cpu1:1061)LinSCSI: 3201: Abort failed for cmd with serial=2145, status=bad0001, retval=bad0001

0:02:53:11.086 cpu0:1064)<3>aacraid: aac_fib_send: first asynchronous command timed out.

Usually a result of a PCI interrupt routing problem;

update mother board BIOS or consider utilizing one of

the SAFE mode kernel options (acpi, apic etc)

0:02:53:11.086 cpu0:1064)WARNING: CpuSched: vm 1064: 8269: excessive time: deltaSec=183.077007

0:02:53:11.086 cpu0:1064)WARNING: CpuSched: vm 1064: 8351: excessive time: chargeSec=183.059347

VMware ESX Server

Exception type 14 in world 1024:console @ 0x89fc1f

frame=0x1402d5c ip=0x89fc1f cr2=0xffc00004 cr3=0x13401000 cr4=0x6f0

es=0x4028 ds=0x1004028 fs=0x0 gs=0x1400000

eax=0x0 ebx=0x3e205798 ecx=0x3e2058c8 edx=0x0

ebp=0x2837700 esi=0x6b1c800 edi=0x0 err=2 eflags=0x10046

*0:1024/console 1:1076/vcpu-1:W2 2:1026/idle2 3:1071/vmm0:W2k3

@BlueScreen: Exception type 14 in world 1024:console @ 0x89fc1f

0x2837700:[0x89fc1f]aacraid_esx30+0x9c1e stack: 0x0, 0x0, 0x0

VMK uptime: 0:02:53:11.087 TSC: 24939409677138

0:02:51:05.285 cpu0:1064)NMI: 1625: Faulting eip:esp

0:02:53:05.285 cpu1:1025)Heartbeat: 470: PCPU 0 didn't have a heartbeat for 182 seconds. may be locked up

0:02:53:05.285 cpu0:1064)NMI: 1625: Faulting eip:esp

Starting coredump to disk... Dumping using slot 1 of 1...

Why would they get a PCI IRQ error? The setup is pretty much the same as all the other servers here.

Thanks for any help or references you can give me,

thx Stefan

0 Kudos
12 Replies
RParker
Immortal

Rather than diagnose issues, it's much easier to reinstall the ESX server.

Just make sure you use the option to preserve the datastore and the VMs it contains.

Then you can reconfigure the host once it's up. It really doesn't take that long.

NTP, rescan for LUNs, remediate, done. Then no more problems.

0 Kudos
sonicsw
Enthusiast

Hi,

I moved all VMs to another ESX host I have and really rebuilt this ESX from scratch.

Day 1: running with one VM, no backup.

0 Kudos
Texiwill
Leadership

Hello,

In general it is best to call in your VMware support specialist to resolve such issues. However, I have found that most faults are due to failing hardware. As RParker has mentioned, you can easily reinstall ESX. If it's an ESX configuration issue, that will solve it. If it's a hardware issue it will take time. To verify the hardware, run your vendor-provided diagnostics for at least 48 solid hours (72 is better). Then run memtest86 for another 48. One of the two will show issues. Also run it within the same environment that failed, if possible. Also make sure you have the firmware required by your vendor for ESX (usually the latest versions) and that you are using the proper BIOS settings for the systems (again provided by your vendor).

We had a very subtle failure problem with some 1U servers; it turned out that the heatsink was not attached to the CPU properly, so it would overheat easily and crash/reboot the system when it got very hot. We could not find this problem on the bench, only within a stacked environment.


Best regards,

Edward L. Haletky

VMware Communities User Moderator

====

Author of the book 'VMware ESX Server in the Enterprise: Planning and Securing Virtualization Servers', Copyright 2008 Pearson Education, as well as the Virtualization Wiki at http://www.astroarch.com/wiki/index.php/Virtualization

0 Kudos
beng
Contributor

G'day,

We also have a similar issue that we have narrowed down to an Adaptec 3805 card.

The symptom is almost identical, with a "Heartbeat Lost".

This is an in house testing system.

History: we had ESX 3.0.1 on an Asus board running VMFS over iSCSI (before the Adaptec 3805). We then upgraded to 3.0.2 with no problems; it was just a bit (very!) slow, so we installed an Adaptec 3405 card and things were going well. A client needed a SAS card in a hurry, so we swapped out the 3405 card and replaced it with the 3805 after a couple of days, and again all seemed OK. But at about the same time (within a couple of days) we upgraded to ESX 3.5 and the problem started: every few days we would get a "heartbeat lost"... We didn't worry too much about it, as I knew we were running on non-HCL hardware. Same issue with ESX 3.5 Update 1.

Then a few days ago, we purchased an Intel SC2500, a Xeon 5310, 8GB of Intel-certified memory and 2x 300GB SAS drives as a replacement, thinking that we could just drop in the 3805, all would be well, and the above problem would go away. (As far as I know it's all on the HCL?)

Nope. Same issue. After spending two days on it, checking the BIOS, removing the RMM card, running memtest checks etc., we are no closer to a solution.

It seems to be related to I/O, but it is hard to pin down as it does not happen every time... mostly when we do anything like suspend a VM with 2GB of memory, or try to install Win2K8.

We had a client's Adaptec 5805 handy, so just to test we popped it in, and the system ran flawlessly for five days (a new internal record! :smileygrin:). So my sights are firmly fixed on the 3805... It has the latest BIOS/firmware and does not report any errors... so we can't return it easily.

I'd rather not submit this to VMware as it is really a "test system", but I'm interested in others' feedback on the Adaptec 3x05 cards. Up until now I had thought they were fairly solid cards.

Rgds Ben.

0 Kudos
sonicsw
Enthusiast

Hello Ben,

Thanks a lot for your post. Nice to see that I'm not the only one. I think I have fixed my ESX now; it has been running for over five weeks with no problems.

The solution was kind of simple, I guess. I checked the IRQ mapping in ESX and saw that the aacraid controller and the COS were using the same IRQ.

So I disabled all unused BIOS features like serial, USB, parallel and ATA, and did a clean reinstallation. Afterwards I rechecked the IRQ mapping and no device was shared with the COS.

I also installed Update 1 for 3.5 and it's looking good.
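In case it helps, here is a minimal sketch of that check from the service console. The exact column layout of /proc/vmware/interrupts differs between ESX 3.x builds, and the grep pattern is only a guess at the device names, so adjust it to whatever the first command actually shows:

# Dump the vmkernel's interrupt table and see which devices/worlds own each IRQ
cat /proc/vmware/interrupts

# Narrow it down: any single IRQ line that lists both the COS (console) and the
# aacraid controller is the shared-interrupt situation described above
grep -iE 'cos|console|aacraid' /proc/vmware/interrupts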

Maybe that works for you too? Let me know.

stef

0 Kudos
beng
Contributor

G'day Stef, very interesting find... I wonder if it is as simple as the IRQ mapping. Well, since it has been stable for us with a 2405 card, I am loath to upset it again... only so many hours in the day. :smileyhappy: But if I get time this weekend I'll swap them over, and I'll let you know if the 2405 changes the IRQs in use...

Thanks for the insight.

Rgds Ben

0 Kudos
dagkl
Contributor

Did you try to swap the cards to see if the IRQ changed? I have a 3405 card that does the same. I can reproduce the error by running a rescan of my storage, so it must be the controller.

0 Kudos
beng
Contributor

G'day. No, we ended up using the 3805 in an Open-E DSS box and replaced it with a 5805, which has worked perfectly (and so did the 2405, but we wanted RAID 50). Only so many hours in the day, and the 5805 is a great card. But I still think it was an IRQ issue, so if it ever happens again, that is where I will be looking first. Other posts/articles have mentioned severe performance issues with shared IRQs (on NICs as well), so it does seem like a culprit... In the meantime we are standardising on the 5xxx and 2xxx range anyway.

The article I found was here:

http://www.tuxyturvy.com/blog/index.php?/archives/37-Troubleshooting-VMware-ESX-network-performance....

What motherboard are you using? Does it allow you to force IRQs?

Rgds Ben.

0 Kudos
dagkl
Contributor

Hi,

I saw that article too, and it helped me troubleshoot the interrupts, because that was what was causing the problems. I have an ASUS Crosshair motherboard with two onboard NVIDIA 590 NICs, but it doesn't seem to let me control interrupts. I had some performance problems with them, and also, when I plugged them into a Cisco ASA I got, the entire ESX froze. I didn't have a console attached at the time, but I assumed there was a driver problem. I bought a new dual-port Intel NIC from the HCL, and then I had a working network. I also bought the RAID card at the same time.

When I started to copy some data onto the disks the same thing happened, and this time, determined to find out what it was, I attached a console. Stupid of me not to do it before, but anyway... I found the same error message that was described in this post, and then that article about interrupts.

Then I did some tests with the cards, enabled/disabled different hardware on the motherboard, and reinstalled ESX three or four times before I finally got a configuration that worked all right. I had to disable USB completely and reinstall ESX; then there was no conflict between the COS, the onboard NICs and the RAID card. Although the RAID card and the dual-port NIC share the same interrupt now, this works OK. They are both PCIe cards, and though I swapped slots it was not possible to get them onto separate interrupts. This could of course have some performance impact, but all in all it is doing fine. No hangups and no core dumps at all. I am getting 7 MB/s on a 100 Mbit/s network, and my guess is that if I change the virtual switch to use the onboard NICs instead of the dual-port card this will increase to about 10 MB/s, which is close to the theoretical limit of a 100 Mbit/s network. However, I will soon buy a managed gigabit switch, and then I will have enough performance for my setup, which is streaming videos to my PS3; that is all I need the performance for.

The conclusion is to check /proc/vmware/interrupts (with cat, for example) and see if the COS shares its interrupt with some hardware. My guess is that with another motherboard I would not have had these issues, and maybe I would also have access to USB, although I have not managed to find out how to make use of that in my VMs yet.
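As a rough sketch of that cross-check, assuming the standard lspci that ships in the service console (the vmkernel's own view is still whatever /proc/vmware/interrupts reports):

# Show each PCI device header together with any line mentioning its IRQ,
# so devices the BIOS has parked on the same interrupt line stand out
lspci -v | grep -iE '^[0-9a-f]|irq'

# Then compare against the vmkernel's table to see which IRQ the COS is on
cat /proc/vmware/interrupts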

Regards Dag

0 Kudos
beng
Contributor

Yep, hindsight is 20/20, so I regret not following the HCL more closely... it is amazing that ESX is so touchy about hardware... gives Linux a bad name. :smileyhappy:

When it really comes down to it, if we value our time and factor it in, it's worth just buying the right stuff in the first place; e.g. an Intel S5000 board, Xeon and memory should only set you back about 12 hours' worth :-)

Good to see you worked it out. Let me know what you get on GbE.

Perhaps VMware should consider making an IRQ check part of the install or part of the VI Client...

Rgds Ben

0 Kudos
tlduong
Enthusiast

Hi,

I was wondering if you were using SATA drives when this happened. I used an Adaptec 5805 with SAS drives without error on an ESX host; however, when I added SATA drives (and made an array from them) into the mix, I started seeing the "Abort failed for cmd" messages. Most of the time the messages come up but are harmless; however, every couple of weeks the system would PSOD.

Tuan

0 Kudos
beng
Contributor

G'day Tuan,

To be honest, I don't recall exactly which drives we had in the box each time. However, one thing we did do was ensure that the drives' internal write cache was disabled; from memory I think it is enabled by default? I don't recall the reference, but it had been linked to issues between the RAID cache and the drive cache. Note this is the drive cache (8MB/16MB, whatever), not the RAID controller cache (battery-backed etc.).
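For what it's worth, here is a purely hypothetical sketch of doing that from the service console with Adaptec's arcconf CLI, if it happens to be installed; I can't vouch that the SETCACHE syntax below matches every arcconf release, so check your version's built-in help before running anything:

# List the physical drives on controller 1 and their current settings (the controller
# number and the channel/device IDs below are placeholders, not taken from this thread)
arcconf getconfig 1 pd

# Switch the drive at channel 0, device 0 to write-through, i.e. disable its write cache
arcconf setcache 1 device 0 0 wt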

We also had an issue when we mixed SAS and SATA on the one controller; it just locked up. (Since this was a system that only had SAS in production, we ignored it, just took out the SATA drives, and never investigated further.)

Rgds Ben.

0 Kudos