Purple Screen Of Death - ESX goes down whenever I do a LUN rescan - Screenshots attached

fusebox · ‎07-17-2008

Hi Folks!

Below are some specifics of the environment in which I am having the above-mentioned issue. The issue is reproducible, and that is what worries me the most.

ESX Version : 3.5 (Update 1)

Platform : Sun Fire X4450

SAN Storage : EMC CLARiiON CX3-40

Issue: We added a new 700 GB shared LUN to this host and then did a LUN rescan from the VI Client. A few seconds later, the VI Client showed the host as not responding; it had gone down. When I went to the box physically, it was sitting at a purple screen of death with some errors and a message that the vmkernel core dump was created successfully. I have the dump file too. Please find the relevant info and screenshots attached for your reference. I hope you can help me identify the issue. I went through the dump and found some messages that look fatal. Below is an excerpt:

0:23:42:45.328 cpu13:1047)SCSI: 861: GetInfo for adapter vmhba3, , max_vports=0, vports_inuse=0, linktype=0, state=0, failreason=0, rv=-25, sts=bad0001
0:23:42:45.328 cpu13:1047)ScsiScan: 395: Path 'vmhba3:C0:T0:L0': Vendor: 'Sun ' Model: 'STK MirrorNT ' Rev: 'V1.0'
0:23:42:45.329 cpu13:1047)ScsiUid: 754: Path 'vmhba3:C0:T0:L0' does not support VPD Serial Id page.
0:23:42:45.329 cpu13:1047)ScsiUid: 781: Path 'vmhba3:C0:T0:L0' does not support VPD Device Id page.
0:23:42:45.329 cpu13:1047)ScsiScan: 516: Path 'vmhba3:C0:T0:L0': No standard UID: Failure
0:23:42:45.330 cpu1:1048)SCSI: 861: GetInfo for adapter vmhba3, , max_vports=0, vports_inuse=0, linktype=0, state=0, failreason=0, rv=-25, sts=bad0001
0:23:42:45.340 cpu13:1047)SCSI: 861: GetInfo for adapter vmhba3, , max_vports=0, vports_inuse=0, linktype=0, state=0, failreason=0, rv=-25, sts=bad0001
VMWARE SCSI Id: Supported VPD pages for vmhba3:C0:T0:L0 : 0x1f 0x0
0:23:42:45.340 cpu13:1047)VMWARE SCSI Id: Could not get disk id for vmhba3:C0:T0:L0
0:23:42:50.210 cpu1:1047)SCSI: 861: GetInfo for adapter vmhba32, , max_vports=0, vports_inuse=0, linktype=0, state=0, failreason=0, rv=-19, sts=bad0001

0:23:42:54.333 cpu6:1049)ScsiScan: 395: Path 'vmhba3:C0:T0:L0': Vendor: 'Sun ' Model: 'STK MirrorNT ' Rev: 'V1.0'
0:23:42:54.333 cpu6:1049)ScsiUid: 754: Path 'vmhba3:C0:T0:L0' does not support VPD Serial Id page.
0:23:42:54.333 cpu6:1049)ScsiUid: 781: Path 'vmhba3:C0:T0:L0' does not support VPD Device Id page.
0:23:42:54.333 cpu6:1049)ScsiScan: 516: Path 'vmhba3:C0:T0:L0': No standard UID: Failure
0:23:42:54.335 cpu6:1049)SCSI: 861: GetInfo for adapter vmhba3, , max_vports=0, vports_inuse=0, linktype=0, state=0, failreason=0, rv=-25, sts=bad0001
0:23:42:54.345 cpu6:1049)SCSI: 861: GetInfo for adapter vmhba3, , max_vports=0, vports_inuse=0, linktype=0, state=0, failreason=0, rv=-25, sts=bad0001
VMWARE SCSI Id: Supported VPD pages for vmhba3:C0:T0:L0 : 0x1f 0x0
0:23:42:54.345 cpu6:1049)VMWARE SCSI Id: Could not get disk id for vmhba3:C0:T0:L0
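
One way to see which device an excerpt like this keeps coming back to is to count the vmhba paths it mentions. The following is only a rough sketch: the line layout (uptime, cpu:world, module, source line number, message) is inferred from the excerpt above rather than from any VMware documentation, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Layout assumed from the excerpt:
#   <uptime> cpu<N>:<world>)<Module>: <source line>: <message>
LINE_RE = re.compile(
    r"^(?P<uptime>\d+:\d+:\d+:\d+\.\d+)\s+"
    r"cpu(?P<cpu>\d+):(?P<world>\d+)\)"
    r"(?P<module>[A-Za-z ]+):\s*(?P<line>\d+):\s*(?P<msg>.*)$"
)

# Path names as they appear in the excerpt, e.g. vmhba3:C0:T0:L0
PATH_RE = re.compile(r"vmhba\d+:C\d+:T\d+:L\d+")


def parse_vmkernel_line(text):
    """Split one log line into its fields, or return None if it does not fit."""
    match = LINE_RE.match(text.strip())
    return match.groupdict() if match else None


def count_paths(lines):
    """Count how often each vmhba path is mentioned across all lines."""
    counts = Counter()
    for raw in lines:
        counts.update(PATH_RE.findall(raw))
    return counts


sample = [
    "0:23:42:45.329 cpu13:1047)ScsiUid: 754: Path 'vmhba3:C0:T0:L0' does not support VPD Serial Id page.",
    "0:23:42:45.329 cpu13:1047)ScsiUid: 781: Path 'vmhba3:C0:T0:L0' does not support VPD Device Id page.",
    "0:23:42:45.329 cpu13:1047)ScsiScan: 516: Path 'vmhba3:C0:T0:L0': No standard UID: Failure",
    "VMWARE SCSI Id: Supported VPD pages for vmhba3:C0:T0:L0 : 0x1f 0x0",
]

print(count_paths(sample))
print(parse_vmkernel_line(sample[2]))
```

In this excerpt every path-level message refers to vmhba3:C0:T0:L0, so a count like this mostly confirms where to look first; it says nothing about the cause.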

I would like to know what exactly the above messages mean, or at least have someone point me in the right direction. I am also contacting VMware support about this, as it is reproducible. I am also attaching the vmkernel dump.

I really appreciate any help on this; just about anything would be a real lifesaver. Thanks in advance to everyone.

Message was edited by: fusebox

lamw · ‎07-17-2008

We saw this recently. It might be your FC HBA, but we had a few hosts fail and found out that a specific LUN that was masked from our filer actually caused the crash; once we unmasked the LUN, it worked. Luckily no running VMs were on these LUNs, which might not be the case for you. I would say try to VMotion the VMs off first. Once that's been handled, depending on your hardware, you can run HP SIM or the equivalent from Dell or some other vendor to run diagnostics and make sure it's not hardware. Usually the cause is either your HBA having a bad port, or something further down at the fabric switch/port/cable, which is highly rare but could happen. I would go down that route with your hardware vendor, as VMware will tell you it's a hardware issue with the FC HBA or your SAN. Once you've replaced that, see whether it still has issues rescanning. If it continues, I would look at the SAN and see whether it logs any errors connecting to the initiators.

fusebox · ‎07-17-2008

If it's a hardware or initiator issue, then I should be able to reproduce it even if I remove the problematic new LUN that I added, right? So, in case I go ahead and delete this new LUN, reboot the ESX host, and try a rescan, do you think I will still face this issue? We actually started having this issue only after the new LUN was added.

fusebox · ‎07-17-2008

OK. Checked with VMware and they confirmed that this is an issue with the Sun Fire X4450 and with an IBM server, both of which have the same model of Adaptec card(s). The fix from VMware's side is to update the BIOS and the Adaptec firmware and apply some VMware-recommended settings in Navisphere. I have to see if this fix actually works.

Dave_Mishchenko · ‎07-17-2008

Which Adaptec controller do you have, and did they mention whether the problem is limited to that model or affects other Adaptec models as well?

fejf · ‎07-17-2008

The screenshots show that vmhba3 is an Adaptec controller. Have you installed ESX Server on the Adaptec controller or the SATA controller? From the error message it looks like the local storage on the Adaptec controller (vmhba3:0:0:8) is the problem.

--

There are 10 types of people. Those who understand binary and the rest. And those who understand gray-code.


fusebox · ‎07-17-2008

I am trying to find out the model number of this Adaptec card. Is there any way I can check this from the service console? Otherwise, I will have to check in the BIOS when I upgrade it. I see the below module loaded for this card.

# esxcfg-module -q|grep raid

aacraid_esx30.o

VMware said that many clients running ESX 3.5 on the Sun Fire X4450 and on another IBM server (he didn't tell me the model, sorry!) with the same Adaptec cards have this problem. Apart from the BIOS and Adaptec firmware updates, VMware also advised updating FLARE to 19-26 and recommended the below settings for the HBA(s):

==================================

These settings are for each path to each HBA:

Failover Mode = 1
Initiator Type = "Clariion Open"
Array Commpath = "Enabled" or 1

===================================

Hope this info helped.
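
If you have several hosts and paths to go through, it may help to compare what each initiator record shows against the three values listed above. This is just a sketch under assumptions: the dictionary format for an initiator record is hypothetical (however you copy the values out of Navisphere is up to you), and only the three setting names and values come from the list above.

```python
# Accepted values per setting, taken from the recommendation quoted above.
RECOMMENDED = {
    "Failover Mode": {"1"},
    "Initiator Type": {"Clariion Open"},
    "Array Commpath": {"Enabled", "1"},
}


def check_initiator(record):
    """Return (setting, actual, accepted) for every value that does not match."""
    problems = []
    for setting, accepted in RECOMMENDED.items():
        actual = str(record.get(setting, "")).strip()
        if actual not in accepted:
            problems.append((setting, actual, sorted(accepted)))
    return problems


# Hypothetical record for one HBA path, typed in by hand.
example = {
    "Failover Mode": 0,
    "Initiator Type": "Clariion Open",
    "Array Commpath": "Enabled",
}

for setting, actual, accepted in check_initiator(example):
    print(f"{setting}: found {actual!r}, expected one of {accepted}")
```

A record that already matches all three settings returns an empty list, so the loop prints nothing for it.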

fusebox · ‎07-17-2008

Here is an exact copy of the log at the time of the ESX server crash. Hope it helps:

===================================

0:23:48:55.689 cpu1:1150)<4>lpfc0:0754:FPe:SCSI timeout Data: x5a3ce78 xe x82d281 xe11
0:23:48:59.081 cpu0:1049)<3>aacraid: aac_fib_send: first asynchronous command timed out.
Usually a result of a PCI interrupt routing problem; update mother board BIOS or consider utilizing one of the SAFE mode kernel options (acpi, apic etc)
0:23:48:59.081 cpu0:1049)<4>aacraid: aac_probe_container query failed.
0:23:48:59.081 cpu0:1049)WARNING: CpuSched: vm 1049: 8269: excessive time: deltaSec=183.061634
0:23:48:59.081 cpu0:1049)WARNING: CpuSched: vm 1049: 8351: excessive time: chargeSec=183.032457
0:23:48:59.082 cpu14:1106)VSCSIFs: 439: fd 4111 status Busy
VMware ESX Server
Exception type 14 in world 1024:console @ 0x909c1f
frame=0x1402d5c ip=0x909c1f cr2=0xffc00004 cr3=0x13401000 cr4=0x6f0
es=0x7014028 ds=0x2824028 fs=0x0 gs=0x0
eax=0x0 ebx=0x660ae38 ecx=0x660af68 edx=0x0
ebp=0x293e700 esi=0x3db3d800 edi=0x0 err=2 eflags=0x10046
*0:1024/console 1:1150/vmm0:nywp 2:1123/vmware-vm 3:1145/mks:webti
4:1028/idle4 5:1122/vmm1:nywp 6:1121/vmm0:nywp 7:1111/Worker#0:
8:1098/vmm0:nywp 9:1134/vmm0:nywp 10:1135/vmm1:nywp 11:1124/mks:nywpa
12:1092/vmm1:nywp 13:1091/vmm0:nywp 14:1106/vmm1:nywp 15:1039/idle15
@BlueScreen: Exception type 14 in world 1024:console @ 0x909c1f
0x293e700:[0x909c1f]aacraid_esx30+0x9c1e stack: 0x0, 0x0, 0x0
VMK uptime: 0:23:48:59.086 TSC: 205780679023770
0:23:46:54.313 cpu0:1049)NMI: 1625: Faulting eip:esp
0:23:48:54.313 cpu1:1150)Heartbeat: 470: PCPU 0 didn't have a heartbeat for 180 seconds. may be locked up
0:23:48:54.313 cpu0:1049)NMI: 1625: Faulting eip:esp
Starting coredump to disk Starting coredump to disk Dumping using slot 1 of 1... using slot 1 of 1... Log

===================================
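
For reading a dump like the one above, the two lines that usually matter first are the exception line and the top backtrace frame. Below is a small sketch that pulls out both; the regular expressions are written against this one log and are an assumption about the format, not a documented layout. Exception type 14 is the x86 page-fault vector, and on a page fault cr2 holds the address that was being accessed.

```python
import re

EXC_RE = re.compile(
    r"Exception type (\d+) in world (\d+):(\S+) @ (0x[0-9a-fA-F]+)"
)
# Backtrace frame as printed above: <stack addr>:[<ip>]<module>+<offset>
FRAME_RE = re.compile(
    r"^(0x[0-9a-fA-F]+):\[(0x[0-9a-fA-F]+)\](\w+)\+(0x[0-9a-fA-F]+)"
)

# A few x86 exception vectors; deliberately not a complete table.
X86_EXCEPTIONS = {
    0: "divide error",
    6: "invalid opcode",
    13: "general protection fault",
    14: "page fault",
}


def summarize_psod(text):
    """Extract the exception and the first backtrace frame from PSOD text."""
    info = {}
    exc = EXC_RE.search(text)
    if exc:
        vector = int(exc.group(1))
        info["exception"] = vector
        info["exception_name"] = X86_EXCEPTIONS.get(vector, "unknown")
        info["world"] = f"{exc.group(2)}:{exc.group(3)}"
        info["ip"] = exc.group(4)
    for line in text.splitlines():
        frame = FRAME_RE.match(line.strip())
        if frame:
            info["top_frame_module"] = frame.group(3)
            info["top_frame_offset"] = frame.group(4)
            break
    return info


psod = """@BlueScreen: Exception type 14 in world 1024:console @ 0x909c1f
0x293e700:[0x909c1f]aacraid_esx30+0x9c1e stack: 0x0, 0x0, 0x0"""

print(summarize_psod(psod))
```

Here the top frame is in aacraid_esx30, the same module that reported the command timeout a few lines earlier in the log.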

fejf · ‎07-17-2008

I am trying to find out what is the model number of this adaptec card.

Try "lspci -v".

--

There are 10 types of people. Those who understand binary and the rest. And those who understand gray-code.


Rubeck · ‎07-17-2008

Hi fusebox..

If you have an issue which is reproducible, I would suggest that you open a support case.

Maybe that could help others with identical HW setups, as PSODs are pure pain for an ESX admin ;-)

/Rubeck

fusebox · ‎07-17-2008

Rubeck,

I have opened a support case with VMware, and they suggested firmware updates for the server and the Adaptec controller, plus some HBA settings. The case is not yet closed, but according to them, once these updates and settings are in place the issue shouldn't crop up again. So I will have to wait and see. I will be updating the firmware in a few days.

And yes, PSODs are a real PIA. They are a real horror. My boss was blazing his cannons when he heard this. One thing to cheer about is that, luckily, this issue surfaced before the environment went live, and we got to know that it's an issue with Adaptec controllers in Sun Fire and some IBM servers (without the latest updates, of course). Imagine if this had happened when all the prod ESX hosts were running the prod VM(s) at full throttle.

A stitch in time saves nine.

Thanks for all the support. I will keep this thread updated on this issue; as you rightly said, it will help others with the same hardware.

fusebox · ‎07-17-2008

I doubt lspci -v or dmesg will show the exact model number of the Adaptec controller. But anyway, I will check and confirm on the thread.

fusebox · ‎07-18-2008

Hi Fej,

Here is the relevant part of the lspci output :

09:00.0 RAID bus controller: Adaptec Adaptec SCSI (rev 09)
	Subsystem: Sun Microsystems Computer Corp.: Unknown device 0286
	Flags: bus master, fast devsel, latency 0, IRQ 19
	Memory at fc600000 (64-bit, non-prefetchable)
	Expansion ROM at fc880000
	Capabilities: Power Management version 2
	Capabilities: #0d
	Capabilities: Message Signalled Interrupts: 64bit+ Queue=0/1 Enable-
	Capabilities: #10
	Capabilities: Vital Product Data

Like I thought, lspci and dmesg don't show the exact model of the Adaptec RAID controller. I am planning to call Sun and ask what the model number of the Adaptec RAID controller in the Sun Fire X4450 is.

Will update the thread with the info given by SUN.
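
For anyone collecting this from several hosts, here is a rough sketch that pulls the device line and the Subsystem line out of saved lspci -v output. The parsing is based only on the output pasted above, and the function name is invented for the example. "Unknown device 0286" just means the PCI ID list on the host has no name for that subsystem ID, which is why the model cannot be read from this output.

```python
import re

# Device header as in the output above, e.g. "09:00.0 RAID bus controller: ..."
DEVICE_RE = re.compile(
    r"^(?P<slot>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+(?P<cls>[^:]+):\s+(?P<name>.+)$"
)


def parse_lspci(text):
    """Return one dict per device with its slot, class, name and subsystem."""
    devices = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        header = DEVICE_RE.match(line)
        if header:
            current = {
                "slot": header.group("slot"),
                "class": header.group("cls"),
                "name": header.group("name").strip(),
                "subsystem": None,
            }
            devices.append(current)
        elif current is not None and line.strip().startswith("Subsystem:"):
            current["subsystem"] = line.split("Subsystem:", 1)[1].strip()
    return devices


output = """09:00.0 RAID bus controller: Adaptec Adaptec SCSI (rev 09)
\tSubsystem: Sun Microsystems Computer Corp.: Unknown device 0286
\tFlags: bus master, fast devsel, latency 0, IRQ 19
\tMemory at fc600000 (64-bit, non-prefetchable)
"""

for dev in parse_lspci(output):
    print(dev["slot"], "|", dev["class"], "|", dev["name"], "|", dev["subsystem"])
```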

fusebox · ‎07-18-2008

The Adaptec RAID controller details are below. It seems to be from Sun StorageTek (STK). But why does it say custom-IBM? So this card's firmware needs to be updated first. I am in the process of downloading the firmware updates for the Sun Fire X4450 from sun.com.

(Adaptec Raid Controller: 1.1-5[d-8930]custom-IBM)
Vendor: Sun Model: STK RAID INT
flags=SAI_READ_CAPACITY_16
kernel: 5.2-0[15583]
monitor: 5.2-0[15583]
bios: (5.2-0[15583])

Message was edited by: fusebox

lamw · ‎07-18-2008

Sun could have rebranded the card and be a reseller of it; that's not rare. We also have an Adaptec card that is Sun branded.

fusebox · ‎07-23-2008

Updated the firmware and BIOS on 4 of the 5 ESX hosts, then added a LUN and did a LUN rescan. The host(s) didn't go down, so I am assuming the firmware update fixed the issue. I still have to update and check one last host; hoping it will be fine too.

So, the Sun Fire X4450 firmware and BIOS need to be updated, and the host rebooted, to fix this PSOD issue after a LUN rescan. The latest firmware can be downloaded as an ISO from Sun's support site. It's a little over 600 MB.

Thanks for all the contributions.
