15 Replies Last post: Nov 26, 2008 4:51 PM by fusebox

Purple Screen Of Death - - ESX goes down whenever I Do A LUN Rescan--Screenshots Attached

Jul 17, 2008 11:15 AM

Hot Shot fusebox 99 posts since Mar 28, 2007
Hi Folks!

Below are some specifics of the environment in which I am having the above-mentioned issue. The issue is reproducible, which is what worries me the most.

ESX Version : 3.5 (Update 1)


Platform : Sun Fire 4450
SAN Storage : EMC CLARiiON CX3-40

Issue: We added a new 700 GB shared LUN to this host and then did a LUN rescan from the VI Client. A few seconds later, the VI Client showed the host as not responding; it had gone down. When I went to the box physically, it was sitting at a purple screen of death with some errors and a message that the vmkernel core dump was created successfully. I have the dump file too. Please find the relevant info and screenshots for your reference. I hope you can help me identify the issue. I went through the dump and saw some specific messages that look fatal. Below is an excerpt:



0:23:42:45.328 cpu13:1047)SCSI: 861: GetInfo for adapter vmhba3, 0x40031100, max_vports=0, vports_inuse=0, linktype=0, state=0, failreason=0, rv=-25, sts=bad0001
0:23:42:45.328 cpu13:1047)ScsiScan: 395: Path 'vmhba3:C0:T0:L0': Vendor: 'Sun ' Model: 'STK MirrorNT ' Rev: 'V1.0'
0:23:42:45.329 cpu13:1047)ScsiUid: 754: Path 'vmhba3:C0:T0:L0' does not support VPD Serial Id page.
0:23:42:45.329 cpu13:1047)ScsiUid: 781: Path 'vmhba3:C0:T0:L0' does not support VPD Device Id page.
0:23:42:45.329 cpu13:1047)ScsiScan: 516: Path 'vmhba3:C0:T0:L0': No standard UID: Failure
0:23:42:45.330 cpu1:1048)SCSI: 861: GetInfo for adapter vmhba3, 0x40031100, max_vports=0, vports_inuse=0, linktype=0, state=0, failreason=0, rv=-25, sts=bad0001
0:23:42:45.340 cpu13:1047)SCSI: 861: GetInfo for adapter vmhba3, 0x40031100, max_vports=0, vports_inuse=0, linktype=0, state=0, failreason=0, rv=-25, sts=bad0001
VMWARE SCSI Id: Supported VPD pages for vmhba3:C0:T0:L0 : 0x1f 0x0
0:23:42:45.340 cpu13:1047)VMWARE SCSI Id: Could not get disk id for vmhba3:C0:T0:L0
0:23:42:50.210 cpu1:1047)SCSI: 861: GetInfo for adapter vmhba32, 0x58fff80, max_vports=0, vports_inuse=0, linktype=0, state=0, failreason=0, rv=-19, sts=bad0001
0:23:42:54.333 cpu6:1049)ScsiScan: 395: Path 'vmhba3:C0:T0:L0': Vendor: 'Sun ' Model: 'STK MirrorNT ' Rev: 'V1.0'
0:23:42:54.333 cpu6:1049)ScsiUid: 754: Path 'vmhba3:C0:T0:L0' does not support VPD Serial Id page.
0:23:42:54.333 cpu6:1049)ScsiUid: 781: Path 'vmhba3:C0:T0:L0' does not support VPD Device Id page.
0:23:42:54.333 cpu6:1049)ScsiScan: 516: Path 'vmhba3:C0:T0:L0': No standard UID: Failure
0:23:42:54.335 cpu6:1049)SCSI: 861: GetInfo for adapter vmhba3, 0x40031100, max_vports=0, vports_inuse=0, linktype=0, state=0, failreason=0, rv=-25, sts=bad0001
0:23:42:54.345 cpu6:1049)SCSI: 861: GetInfo for adapter vmhba3, 0x40031100, max_vports=0, vports_inuse=0, linktype=0, state=0, failreason=0, rv=-25, sts=bad0001
VMWARE SCSI Id: Supported VPD pages for vmhba3:C0:T0:L0 : 0x1f 0x0
0:23:42:54.345 cpu6:1049)VMWARE SCSI Id: Could not get disk id for vmhba3:C0:T0:L0

I would like to know what exactly the above messages mean, or at least have someone point me in the right direction. I am also contacting VMware support about this issue, as it is reproducible. I am also attaching the vmkernel dump.
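
For anyone wading through a long vmkernel log of this kind, a small tally can make the repetition easier to see. The following is only a rough sketch in Python, assuming the wrapped log lines have been rejoined onto one line each; the function name `summarize` and the sample text are illustrative, and the script does not interpret what the `rv` or `sts` values mean, it only counts them.

```python
import re
from collections import Counter

# A few lines taken from the excerpt in this post, with wrapped lines rejoined.
SAMPLE = """\
0:23:42:45.328 cpu13:1047)SCSI: 861: GetInfo for adapter vmhba3, 0x40031100, max_vports=0, vports_inuse=0, linktype=0, state=0, failreason=0, rv=-25, sts=bad0001
0:23:42:45.329 cpu13:1047)ScsiUid: 754: Path 'vmhba3:C0:T0:L0' does not support VPD Serial Id page.
0:23:42:45.329 cpu13:1047)ScsiScan: 516: Path 'vmhba3:C0:T0:L0': No standard UID: Failure
0:23:42:50.210 cpu1:1047)SCSI: 861: GetInfo for adapter vmhba32, 0x58fff80, max_vports=0, vports_inuse=0, linktype=0, state=0, failreason=0, rv=-19, sts=bad0001
"""

# Matches the "GetInfo for adapter ..." lines and captures adapter, rv and sts.
GETINFO = re.compile(r"GetInfo for adapter (vmhba\d+),.*?rv=(-?\d+), sts=(\w+)")
# Matches any line that names a SCSI path in quotes.
PATH = re.compile(r"Path '(vmhba\d+:C\d+:T\d+:L\d+)'")


def summarize(text):
    """Count GetInfo results per (adapter, rv, sts) and mentions per path."""
    getinfo = Counter()
    paths = Counter()
    for line in text.splitlines():
        m = GETINFO.search(line)
        if m:
            getinfo[(m.group(1), int(m.group(2)), m.group(3))] += 1
            continue
        m = PATH.search(line)
        if m:
            paths[m.group(1)] += 1
    return getinfo, paths


if __name__ == "__main__":
    getinfo, paths = summarize(SAMPLE)
    for key, count in sorted(getinfo.items()):
        print("GetInfo", key, "x", count)
    for path, count in sorted(paths.items()):
        print("Path", path, "mentioned", count, "times")
```

Running it against the full log rather than the four sample lines would show whether the messages are confined to one adapter and one path or spread across several.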

I really appreciate any help on this; just about anything would be a real lifesaver. Thanks in advance to everyone.

Message was edited by: fusebox
Re: Purple Screen Of Death - - ESX goes down whenever I Do A LUN Rescan Jul 17, 2008 10:51 AM
Virtuoso lamw 2,074 posts since Nov 27, 2007
We saw this recently. It might be your FC HBA, but in our case a few hosts failed and we found that a specific LUN that was masked from our filer actually caused the crash; once we unmasked the LUN, it worked. Luckily no running VMs were on these LUNs, which might not be the case for you. I would try to VMotion the VMs off first. Once that's been handled, depending on your hardware, you can run HP SIM or the equivalent from Dell or another vendor to run diagnostics and make sure it's not hardware. Usually the cause is either your HBA having a bad port or something further down at the fabric switch, port, or cable, which is rare but can happen. I would go down that route with your hardware vendor, as VMware will tell you it's a hardware issue with the FC HBA or your SAN. Once you've replaced that, see if the host still has issues rescanning. If it continues, I would look at the SAN and see whether it reports any errors connecting to the initiators.
Re: Purple Screen Of Death - - ESX goes down whenever I Do A LUN Rescan Jul 17, 2008 11:31 AM
in response to: lamw
Hot Shot fusebox 99 posts since Mar 28, 2007
If it's a hardware or initiator issue, then I should be able to reproduce it even if I remove this problematic new LUN that I added, right? So, in case I go ahead and delete the new LUN, reboot the ESX host, and try a rescan, do you think I will still face this issue? We actually started having this issue only after the new LUN was added.
Re: Purple Screen Of Death - - ESX goes down whenever I Do A LUN Rescan Jul 17, 2008 12:36 PM
in response to: lamw
Hot Shot fusebox 99 posts since Mar 28, 2007
OK. I checked with VMware, and they confirmed that this is an issue with the Sun Fire 4450 and with an IBM server, both of which have the same model of Adaptec card(s). The fix from VMware's side is to update the BIOS and the Adaptec firmware and to apply some VMware-recommended settings in Navisphere. I have to see if this fix actually works.
Re: Purple Screen Of Death - - ESX goes down whenever I Do A LUN Rescan Jul 17, 2008 12:45 PM
in response to: fusebox
Guru Dave.Mishchenko 8,439 posts since Nov 15, 2005
Moderator
Which Adaptec controller do you have, and did they mention whether the problem was limited to that model or affects other Adaptec models as well?
Re: Purple Screen Of Death - - ESX goes down whenever I Do A LUN Rescan--Screenshots Attached Jul 17, 2008 12:47 PM
Hot Shot fejf 208 posts since May 29, 2007
The screenshots show that vmhba3 is an Adaptec controller. Have you installed ESX Server on the Adaptec controller or the SATA controller? From the error message, it looks like the local storage on the Adaptec controller (vmhba3:0:0:8) is the problem.

--
There are 10 types of people. Those who understand binary and the rest. And those who understand gray-code.
Re: Purple Screen Of Death - - ESX goes down whenever I Do A LUN Rescan Jul 17, 2008 1:07 PM
in response to: Dave.Mishchenko
Hot Shot fusebox 99 posts since Mar 28, 2007

I am trying to find out the model number of this Adaptec card. Is there any way I can check this from the service console (SC)? Otherwise, I will have to check it in the BIOS when I upgrade it. I see the module below loaded for this card.

root@nyvmpesx01 root# esxcfg-module -q|grep raid
aacraid_esx30.o

VMware said that many clients running ESX 3.5 on the Sun Fire 4450 and on another IBM server (he didn't tell me the model, sorry!) that has the same Adaptec cards have this problem. Apart from the BIOS and Adaptec firmware updates, VMware also advised updating FLARE to 19-26 and recommended the settings below for the HBA(s):

==================================
These settings are for each path to each HBA:
Failover Mode = 1
Initiator Type = "Clariion Open"
Array Commpath = "Enabled" or 1

===================================
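
If there are many paths to verify, the three recommended values above can be turned into a small checklist. This is only a sketch in Python under stated assumptions: the values for each path are read off Navisphere by hand and typed into a dictionary, and the function name `check_path` and the dictionary layout are hypothetical, not part of any VMware or EMC tool.

```python
# Recommended per-path values as quoted in this post, lowercased for comparison.
# "Array Commpath" accepts either "Enabled" or 1, as stated above.
RECOMMENDED = {
    "Failover Mode": {"1"},
    "Initiator Type": {"clariion open"},
    "Array Commpath": {"enabled", "1"},
}


def check_path(settings):
    """Return (name, observed) pairs that differ from the recommendation.

    `settings` is a hand-filled dict for one HBA path (an assumption of this
    sketch); a missing key is reported with an observed value of None.
    """
    problems = []
    for name, allowed in RECOMMENDED.items():
        observed = settings.get(name)
        normalized = None if observed is None else str(observed).strip().lower()
        if normalized not in allowed:
            problems.append((name, observed))
    return problems


if __name__ == "__main__":
    path_a = {"Failover Mode": 1, "Initiator Type": "Clariion Open", "Array Commpath": "Enabled"}
    path_b = {"Failover Mode": 0, "Initiator Type": "Clariion Open"}
    print("path_a:", check_path(path_a))
    print("path_b:", check_path(path_b))
```

An empty list means the path matches all three recommended values; anything else lists the settings that still need changing.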

Hope this info helped.


Re: Purple Screen Of Death - - ESX goes down whenever I Do A LUN Rescan--Screenshots Attached Jul 17, 2008 1:11 PM
in response to: fejf
Hot Shot fusebox 99 posts since Mar 28, 2007

Here is an exact copy of the log at the time of the ESX server crash. Hope it helps:

===================================
0:23:48:55.689 cpu1:1150)<4>lpfc0:0754:FPe:SCSI timeout Data: x5a3ce78 xe x82d281 xe11
0:23:48:59.081 cpu0:1049)<3>aacraid: aac_fib_send: first asynchronous command timed out.
Usually a result of a PCI interrupt routing problem;
update mother board BIOS or consider utilizing one of
the SAFE mode kernel options (acpi, apic etc)
0:23:48:59.081 cpu0:1049)<4>aacraid: aac_probe_container query failed.
0:23:48:59.081 cpu0:1049)WARNING: CpuSched: vm 1049: 8269: excessive time: deltaSec=183.061634
0:23:48:59.081 cpu0:1049)WARNING: CpuSched: vm 1049: 8351: excessive time: chargeSec=183.032457
0:23:48:59.082 cpu14:1106)VSCSIFs: 439: fd 4111 status Busy
VMware ESX Server Releasebuild-82663
Exception type 14 in world 1024:console @ 0x909c1f
frame=0x1402d5c ip=0x909c1f cr2=0xffc00004 cr3=0x13401000 cr4=0x6f0
es=0x7014028 ds=0x2824028 fs=0x0 gs=0x0
eax=0x0 ebx=0x660ae38 ecx=0x660af68 edx=0x0
ebp=0x293e700 esi=0x3db3d800 edi=0x0 err=2 eflags=0x10046
*0:1024/console 1:1150/vmm0:nywp 2:1123/vmware-vm 3:1145/mks:webti
4:1028/idle4 5:1122/vmm1:nywp 6:1121/vmm0:nywp 7:1111/Worker#0:
8:1098/vmm0:nywp 9:1134/vmm0:nywp 10:1135/vmm1:nywp 11:1124/mks:nywpa
12:1092/vmm1:nywp 13:1091/vmm0:nywp 14:1106/vmm1:nywp 15:1039/idle15
@BlueScreen: Exception type 14 in world 1024:console @ 0x909c1f
0x293e700:0x909c1faacraid_esx30+0x9c1e stack: 0x0, 0x0, 0x0
VMK uptime: 0:23:48:59.086 TSC: 205780679023770
0:23:46:54.313 cpu0:1049)NMI: 1625: Faulting eip:esp 0x6400fc:0x3a677a4
0:23:48:54.313 cpu1:1150)Heartbeat: 470: PCPU 0 didn't have a heartbeat for 180 seconds. may be locked up
0:23:48:54.313 cpu0:1049)NMI: 1625: Faulting eip:esp 0x6400f5:0x3a677a4
Starting coredump to disk Starting coredump to disk Dumping using slot 1 of 1... using slot 1 of 1... Log
===================================
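
In the pasted backtrace frame, the instruction address and the module name run together (`0x909c1faacraid_esx30+0x9c1e`), and because hex digits include a-f the boundary between them is ambiguous on its own. One way to split it, sketched here in Python, is to take the `ip=` value from the register dump and strip it from the frame line. The function name `faulting_frame` is illustrative, and the optional square brackets in the pattern are an assumption about how the frame might look in other copies of the screen.

```python
import re

# The relevant lines from the crash log in this post.
PSOD = """\
Exception type 14 in world 1024:console @ 0x909c1f
frame=0x1402d5c ip=0x909c1f cr2=0xffc00004 cr3=0x13401000 cr4=0x6f0
@BlueScreen: Exception type 14 in world 1024:console @ 0x909c1f
0x293e700:0x909c1faacraid_esx30+0x9c1e stack: 0x0, 0x0, 0x0
"""


def faulting_frame(text):
    """Return ip, module and offset of the frame that matches the ip register."""
    ip_match = re.search(r"\bip=(0x[0-9a-f]+)", text)
    if ip_match is None:
        return None
    ip = ip_match.group(1)
    # Anchor on the known ip so the module name is not swallowed as hex digits.
    frame = re.search(
        r"^0x[0-9a-f]+:\[?" + re.escape(ip) + r"\]?(\w+)\+(0x[0-9a-f]+)",
        text,
        re.MULTILINE,
    )
    if frame is None:
        return None
    return {"ip": ip, "module": frame.group(1), "offset": frame.group(2)}


if __name__ == "__main__":
    print(faulting_frame(PSOD))
```

For the log above this names `aacraid_esx30` as the module containing the faulting address, which is the same module reported by `esxcfg-module` earlier in the thread.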

Re: Purple Screen Of Death - - ESX goes down whenever I Do A LUN Rescan Jul 17, 2008 1:33 PM
in response to: fusebox
Hot Shot fejf 208 posts since May 29, 2007
I am trying to find out what is the model number of this adaptec card.

Try "lspci -v".

Re: Purple Screen Of Death - - ESX goes down whenever I Do A LUN Rescan--Screenshots Attached Jul 17, 2008 2:27 PM
in response to: fusebox
Master Rubeck 542 posts since Mar 7, 2008
Hi fusebox..

If you have an issue that is reproducible, I would suggest that you open a support case.

Maybe that could help others with identical HW setups, as PSODs are pure pain for an ESX admin ;-)


/Rubeck

Re: Purple Screen Of Death - - ESX goes down whenever I Do A LUN Rescan--Screenshots Attached Jul 17, 2008 4:29 PM
in response to: Rubeck
Hot Shot fusebox 99 posts since Mar 28, 2007

Rubeck,

I have opened a support case with VMware, and they suggested firmware updates for the server and the Adaptec controller, plus some HBA settings. That case is not yet closed. But according to them, once these updates and settings are in place, the issue shouldn't crop up again. So I will have to wait and see. I will be updating the firmware in a few days.

And yes, PSODs are a real PIA ;-) They are a real horror. My boss was blazing his cannons when he heard about this. One thing to cheer about is that, luckily, this issue surfaced before the environment went live. And we got to know that it's an issue with Adaptec controllers in Sun Fire and some IBM servers (minus the latest updates, of course). Imagine if this had happened when all the prod ESX hosts were running the prod VM(s) at full throttle.

A stitch in time saves nine :-)

Thanks for all the support. I will keep this thread updated on this issue; as you rightly said, it will help others with the same hardware.


Re: Purple Screen Of Death - - ESX goes down whenever I Do A LUN Rescan Jul 17, 2008 4:34 PM
in response to: fejf
Hot Shot fusebox 99 posts since Mar 28, 2007
I doubt lspci -v or dmesg will show the exact model number of the Adaptec controller. But anyway, I will check and reconfirm this on the thread.
Re: Purple Screen Of Death - - ESX goes down whenever I Do A LUN Rescan Jul 18, 2008 8:10 AM
in response to: fejf
Hot Shot fusebox 99 posts since Mar 28, 2007

Hi Fej,

Here is the relevant part of the lspci output :

09:00.0 RAID bus controller: Adaptec Adaptec SCSI (rev 09)
    Subsystem: Sun Microsystems Computer Corp.: Unknown device 0286
    Flags: bus master, fast devsel, latency 0, IRQ 19
    Memory at fc600000 (64-bit, non-prefetchable) size=2M
    Expansion ROM at fc880000 disabled size=512K
    Capabilities: 98 Power Management version 2
    Capabilities: b0 #0d 0007
    Capabilities: a0 Message Signalled Interrupts: 64bit+ Queue=0/1 Enable-
    Capabilities: d0 #10 0001
    Capabilities: 90 Vital Product Data

As I thought, lspci and dmesg don't show the exact model of the Adaptec RAID controller. I am planning to call Sun and ask for the model number of the Adaptec RAID controller in the Sun Fire 4450.

I will update the thread with the info given by Sun.

Re: Purple Screen Of Death - - ESX goes down whenever I Do A LUN Rescan Nov 26, 2008 4:51 PM
in response to: fusebox
Hot Shot fusebox 99 posts since Mar 28, 2007
The Adaptec RAID controller details are below. It seems to be from Sun StorageTek (STK), but why does it say custom-IBM? So this one's firmware needs to be updated first. I am in the process of downloading the firmware updates for the Sun Fire X4450 from sun.com.

(Adaptec Raid Controller: 1.1-52415custom-IBM)
Vendor: Sun Model: STK RAID INT
flags=SAI_READ_CAPACITY_16
kernel: 5.2-015583
monitor: 5.2-015583
*bios: (5.2-015583)

Message was edited by: fusebox

Re: Purple Screen Of Death - - ESX goes down whenever I Do A LUN Rescan Jul 18, 2008 11:29 AM
in response to: fusebox
Virtuoso lamw 2,074 posts since Nov 27, 2007
Sun could have rebranded the card as a reseller; that's not rare. We also have an Adaptec card that is Sun-branded.