VMware Cloud Community
SebastianGrugel
Hot Shot

Virtual SAN device is under permanent error

Hi

We have:

  • vCenter 5.5 U2 x 1
  • Two clusters: Compute and Management
  • ESXi 5.5 U2 - 4 hosts in each cluster
  • VSAN

Today in the MGT cluster we hit the issues below:

- Virtual SAN device is under permanent error

- Virtual SAN device has gone offline

Screenshot_7.png

We don't see information about used storage:

Screenshot_9.png

In MANAGE > Virtual SAN > Disk Management the disk group looks healthy:

Screenshot_1.png

but the drives inside the disk groups on this host don't show any health status information:

Screenshot_2.png

How can I troubleshoot this case further, or what can I try to fix it?

vExpert VSAN/NSX/CLOUD | VCAP5-DCA | VCP6-DCV/CMA/NV ==> akademiadatacenter.pl
zdickinson
Expert

It sounds like the SSD in a disk group has failed and needs to be replaced.  This happened to us; we deleted the disk group, replaced the SSD, re-created the disk group, and let everything re-balance.  There might have been some trickiness around deleting the disk group, but I cannot remember.  Thank you, Zach.
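If it helps, the delete/re-create can also be done from the ESXi shell, roughly like this (only a sketch; the device names are placeholders, and if the cluster claims disks automatically the new disk group may be created for you):

esxcli vsan storage remove -s naa.<cache_ssd_of_the_disk_group>    # removes the whole disk group by referencing its cache SSD

esxcli vsan storage add -s naa.<new_ssd> -d naa.<hdd1> -d naa.<hdd2>    # re-creates the disk group after the SSD has been replaced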

SebastianGrugel
Hot Shot

Thanks Zach for the fast reaction.

We have opened an SR with VMware. What we know for now, after the VMware engineer's investigation:

Report after troubleshooting the first (MGT) cluster:

"It appears SSD naa.5001e8200282f398 on host XXXXXXXXXXXX  experienced a hardware issue:

### vmkernel.log ###

2016-04-27T07:47:14.165Z cpu2:32803)NMP: nmp_ThrottleLogForDevice:2349: Cmd 0x1a (0x412e8089cf00, 0) to dev "naa.5001e8200282f398" on path "vmhba0:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0xcd 0x0. Act:NONE

2016-04-27T07:47:14.165Z cpu2:32803)ScsiDeviceIO: 2363: Cmd(0x412e8089cf00) 0x1a, CmdSN 0x107 from world 0 to dev "naa.5001e8200282f398" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0xcd 0x0.

Sense key 0x4 translates to "Hardware Error".

ASC/ASCQ 0xcd 0x0 is not listed on www.t10.org (http://www.t10.org/lists/asc-alph.htm), so I can't say what further information the RAID controller actually supplied here. I assume this value is vendor specific in this case.

Besides that I only see "I/O error" and "Disk naa... not found in healthy state" messages in the logs for the disks.

As the SCSI error with sense data 0x4 0xcd 0x0 was only reported for one of the two SSDs, I'm not sure why the 2nd SSD didn't get mounted either.

It might still be related though. Looking at the used HBAs, there is only 1 RAID controller used, correct? So if there is a hardware issue with the controller itself, and not just with SSD naa.5001e8200282f398, this might have a knock-on effect on the other disks as well.

Hence, my recommendation is to open a ticket with the hardware vendor, Dell, to investigate the hardware error further."
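(For reference, one way to confirm that all the VSAN disks really sit behind that single controller is to check the adapters and paths from the ESXi shell, along these lines; just a sketch, using the SSD from the report as the example device:)

esxcli storage core adapter list    # lists the HBAs/RAID controllers the host sees

esxcli storage core path list -d naa.5001e8200282f398 | grep -i "adapter\|runtime"    # shows which vmhba the device sits behind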

What is interesting is that a day after this we had the same warning in the second, CMP (compute), cluster.

Screenshot_9.png

Screenshot_5_p.png

Screenshot_4.png

We found many entries like these in the logs:

========= vmkernel.log ==============

2016-04-28T09:26:02.364Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412e803f0180) 0x2a, CmdSN 0x19998ae4 from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.

2016-04-28T09:26:02.364Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412f07305fc0) 0x2a, CmdSN 0x19998aec from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2016-04-28T09:26:02.364Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412e8040c940) 0x2a, CmdSN 0x19998aed from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.

2016-04-28T09:26:02.364Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412f22113180) 0x2a, CmdSN 0x19998b10 from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.

2016-04-28T09:26:02.364Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412e8040cbc0) 0x2a, CmdSN 0x19998afc from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.

2016-04-28T09:26:02.364Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412f3ce16bc0) 0x2a, CmdSN 0x19998b07 from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x1.

2016-04-28T09:26:02.364Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412e80426ac0) 0x2a, CmdSN 0x19998afe from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.

2016-04-28T09:26:02.364Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412e803c51c0) 0x2a, CmdSN 0x19998b0d from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.

2016-04-28T09:26:02.365Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412e803dbb00) 0x2a, CmdSN 0x19998b11 from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.

2016-04-28T09:26:02.365Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412e8044ec00) 0x2a, CmdSN 0x19998af9 from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.

2016-04-28T09:26:02.365Z cpu26:20552761)NMP: nmp_ThrottleLogForDevice:2349: Cmd 0x2a (0x412f092e3a00, 0) to dev "naa.5001e8200282656c" on path "vmhba0:C0:T0:L0" Failed: H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0. Act:EVAL

2016-04-28T09:26:02.365Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412f092e3a00) 0x2a, CmdSN 0x19998aeb from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.

2016-04-28T09:26:02.365Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412f0b331d00) 0x2a, CmdSN 0x19998b06 from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.

2016-04-28T09:26:02.365Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412f0a43c440) 0x2a, CmdSN 0x19998ae3 from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.

2016-04-28T09:26:02.365Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412f0afaa5c0) 0x2a, CmdSN 0x19998af8 from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.

2016-04-28T09:26:02.365Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412f3ce14dc0) 0x2a, CmdSN 0x19998b13 from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.

2016-04-28T09:26:02.365Z cpu26:20552761)ScsiDeviceIO: 2363: Cmd(0x412f08f26600) 0x2a, CmdSN 0x19998af1 from world 0 to dev "naa.5001e8200282656c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.

We tried to mount the disk group manually, without success:

esxcli vsan storage diskgroup mount -s naa.5001e8200282656c
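(For reference, the disk-group membership can be checked from the shell before and after such a mount attempt; a sketch:)

vdq -iH    # shows the disk-group mappings: each cache SSD and the capacity disks behind it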


In this case a server reboot helped....


Screenshot_8_p.png


Unfortunately I still don't know why the "Health status" is not showing up for one disk group. Maybe somebody knows?

Screenshot_7_p.png

After this we received a short description from the VMware engineer:

"H:0x5 (Aborts) on the affected host XXXXXXXXXX:

2016-04-28T09:27:31.280Z cpu42:27106261)ScsiDeviceIO: 2363: Cmd(0x412f4b776ac0) 0x28, CmdSN 0x7bcc36e9 from world 0 to dev "naa.5001e82002826808" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.

2016-04-28T09:27:31.280Z cpu48:27106244)ScsiDeviceIO: 2363: Cmd(0x412f4b773280) 0x28, CmdSN 0x7bcc36e2 from world 0 to dev "naa.5001e82002826808" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.

2016-04-28T09:27:31.280Z cpu42:27106261)ScsiDeviceIO: 2363: Cmd(0x412f4b771fc0) 0x28, CmdSN 0x7bcc36e7 from world 0 to dev "naa.5001e82002826808" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.

2016-04-28T09:27:31.280Z cpu29:27106294)ScsiDeviceIO: 2363: Cmd(0x412f4b775a80) 0x28, CmdSN 0x7bcc36e6 from world 0 to dev "naa.5001e82002826808" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.

Prior to that we could see megasas aborts in the vmkernel logs.

This particular issue is described in the following KB article: http://kb.vmware.com/kb/2109665

One of the main steps to resolve this on a long-term basis is to increase the values for /LSOM/diskIoTimeout and /LSOM/diskIoRetryFactor (the exact steps are also described in the mentioned KB article):

esxcfg-advcfg -s 100000 /LSOM/diskIoTimeout

esxcfg-advcfg -s 4 /LSOM/diskIoRetryFactor"
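(After setting them, the values can be read back to confirm they took effect; a sketch:)

esxcfg-advcfg -g /LSOM/diskIoTimeout    # should now report 100000

esxcfg-advcfg -g /LSOM/diskIoRetryFactor    # should now report 4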

-------------------------------------------------------------------------------------------------

The case is still open, because now we will create an additional case with Dell to investigate the hardware in the first cluster, which still has the issue (a reboot doesn't help).

And in the second cluster we still don't see the health state of the disks in one disk group.

I will post updates as the investigation continues.




vExpert VSAN/NSX/CLOUD | VCAP5-DCA | VCP6-DCV/CMA/NV ==> akademiadatacenter.pl
elerium
Hot Shot

Sandisk/Seagate combo - are you by chance using Dell H730 or FD332-PERC RAID controllers? If so, you've run into the LSI 3108 firmware/driver issues that have been plaguing this discussion: VSAN Node Crashed - R730xd - PF Exception 14 in world 33571: Cmpl-vmhba0- IP 0x41802c3abd44 addr 0x5...

The issue is worked around by adding the timeout settings that VMware support provided to you. There were also new firmwares/drivers released yesterday that are supposed to be a permanent fix for the problem, but you may not want to apply those yet, as the VSAN HCL only lists VSAN 6.0+ versions as tested with them so far.
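(Before deciding on the firmware/driver update, it can help to check which driver the controller is currently using on each host; a sketch:)

esxcli storage core adapter list    # shows the controller and the driver module in use (e.g. lsi_mr3 or megaraid_sas)

esxcli software vib list | grep -i -E "lsi|megaraid"    # shows the installed driver VIB versions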

elerium
Hot Shot

Looks like the new H730 firmware/drivers are certified for all 6.* versions now:

VMware Compatibility Guide - vsanio

SebastianGrugel
Hot Shot

We use the PERC H730 Mini (Embedded) controller with SanDisk (SSD) and Seagate (HDD) disks.

I will read your post today. We have a similar issue in another of our locations... Restarting the host is not a solution...

We will think about a proper solution. These timeouts are only some kind of workaround...

vExpert VSAN/NSX/CLOUD | VCAP5-DCA | VCP6-DCV/CMA/NV ==> akademiadatacenter.pl
SebastianGrugel
Hot Shot

Update after the next step:

Here's a summary of what we've done during troubleshooting with the VMware engineer:

1) Executed the mount command for the disk group fronted by SSD naa.5001e82002826808 on host XXXcmp001:

esxcli vsan storage diskgroup mount -s naa.5001e82002826808

Afterwards the Health Status was correctly displayed in the Web Client again.

2) Executed the mount command for both disk groups on host XXXmgt002:

esxcli vsan storage diskgroup mount -s naa.5001e8200282efa0

esxcli vsan storage diskgroup mount -s naa.5001e8200282f398

Again, the Health Status was correctly displayed in the Web Client afterwards.
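(The same can be confirmed from the shell by checking that every disk of the mounted disk groups is back in CMMDS; a sketch:)

esxcli vsan storage list | grep -i "in cmmds"    # every disk of a mounted disk group should report: true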

AFTER - the disks inside the disk group are back to healthy:

Screenshot_2.png

The "Capacity" information is back:

Screenshot_3.png

We will try rebooting those hosts again to check whether the disk groups stay healthy after a reboot.

After the mounting, the capacity came back to our datastores:

Before the manual mount:

Dbefore.png

After the manual mount:

DAfterMount.png
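(The same capacity figures can also be checked on a host from the shell; a sketch:)

esxcli storage filesystem list | grep -i vsan    # the vsanDatastore with its total size and free space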

For now the issue is resolved, but we will check what we can do to avoid a similar situation in the future.

vExpert VSAN/NSX/CLOUD | VCAP5-DCA | VCP6-DCV/CMA/NV ==> akademiadatacenter.pl
SebastianGrugel
Hot Shot

Hi all,

For your information, here is the last update after the VMware troubleshooting:

"I've looked through the syslog files from host XXXmgt002 that you uploaded last week and noticed that it contains the same abort messages that we've seen on host XXXcmp001 before the diskgroup failed:

grep ABORT messages-2016-04-2*

messages-2016-04-26:Apr 26 23:46:06 192.168.110.12 vmkernel: cpu12:33105)megasas: ABORT sn 12997396290 cmd=0x28 retries=0 tmo=0
messages-2016-04-26:Apr 26 23:46:09 192.168.110.12 vmkernel: cpu15:27933954)megasas: ABORT sn 12997396548 cmd=0x28 retries=0 tmo=0
messages-2016-04-26:Apr 26 23:46:11 192.168.110.12 vmkernel: cpu9:27933955)megasas: ABORT sn 12997396506 cmd=0x28 retries=0 tmo=0
messages-2016-04-26:Apr 26 23:46:13 192.168.110.12 vmkernel: cpu27:27933984)megasas: ABORT sn 12997396556 cmd=0x28 retries=0 tmo=0
messages-2016-04-26:Apr 26 23:46:15 192.168.110.12 vmkernel: cpu6:27933985)megasas: ABORT sn 12997396475 cmd=0x28 retries=0 tmo=0
messages-2016-04-26:Apr 26 23:46:17 192.168.110.12 vmkernel: cpu25:27933986)megasas: ABORT sn 12997396507 cmd=0x28 retries=0 tmo=0
messages-2016-04-26:Apr 26 23:46:19 192.168.110.12 vmkernel: cpu19:27933987)megasas: ABORT sn 12997396521 cmd=0x28 retries=0 tmo=0
messages-2016-04-26:Apr 26 23:46:35 192.168.110.12 vmkernel: cpu4:27934057)megasas: ABORT sn 12997396513 cmd=0x28 retries=0 tmo=0

That was on the night from 26/04/2016 to 27/04/2016, so just before the morning when you noticed the issue.

I've checked the mentioned configuration settings from KB article http://kb.vmware.com/kb/2144936 and they haven't been applied on that host either yet:

/config/LSOM/intOpts/> get diskIoTimeout

Vmkernel Config Option {

   Default value:20000

   Min value:100

   Max value:120000

   Current value:20000

   hidden config option:1

   Description:Disk IO timeout in msec

}

/config/LSOM/intOpts/> get diskIoRetryFactor

Vmkernel Config Option {

   Default value:3

   Min value:1

   Max value:100

   Current value:3

   hidden config option:1

   Description:Disk IO retry factor

}

So that KB article has to be applied on that host too (and on any other host that uses that RAID controller and still has those settings at their default values).

Nevertheless, I think it's still a good idea to have Dell check for hardware errors on the RAID controller (due to sense key 0x4 reported on the morning of 27/04/2016), just to be on the safe side."
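(Side note: the output above appears to come from vsish; the same hidden options can also be read non-interactively to check whether a host still has the defaults; a sketch:)

vsish -e get /config/LSOM/intOpts/diskIoTimeout    # current value should be 100000 once the KB has been applied

vsish -e get /config/LSOM/intOpts/diskIoRetryFactor    # current value should be 4 once the KB has been applied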

vExpert VSAN/NSX/CLOUD | VCAP5-DCA | VCP6-DCV/CMA/NV ==> akademiadatacenter.pl
elerium
Hot Shot

Sounds like you were able to remount the disk groups without a restart. If so, then it's not the RAID controller/firmware issue I mentioned (those cases result in a crash where only a host restart corrects the problem; a remount wouldn't work).

You may indeed have a problematic or failing disk in your disk group. You can read more here on why VSAN would unmount a disk group and what the possible options are:

VMware KB:    VMware Virtual SAN 6.1 or 5.5 Update 3 Disk Groups show as Unmounted in the vSphere We...

VSAN 6.1 New Feature - Handling of Problematic Disks - CormacHogan.com

VSAN 6.2 Part 10 - Problematic Disk Handling - CormacHogan.com

elerium
Hot Shot

Actually, I just noticed you're on 5.5 U2, so these links may not apply, as they cover the 5.5 U3 changes to unmount handling.

SebastianGrugel
Hot Shot

I double-checked this, and at the beginning I made a mistake:

According to the information on this page: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=10145...

vCenter build 21821111 - vCenter Server 5.5 Update 2b

ESXi build 3116895 - ESXi 5.5 Update 3a (Express Patch 8)

vExpert VSAN/NSX/CLOUD | VCAP5-DCA | VCP6-DCV/CMA/NV ==> akademiadatacenter.pl