VMware Cloud Community
markotsg80
Enthusiast

VSAN 6.2 and disk failures

Hello all

We recently deployed vSAN in our environment.

5x DL380 G9 per cluster, each host with 3 controllers, each with 7+1 SSD disks (each capacity disk is 1.2TB, with an 800GB SSD).

10GB and 1GB switches.

We had a number of disk failures and had difficulty identifying the failed disk and the affected host.

When a disk fails, we would get the Log Insight alert email, showing the affected host and the NAA ID of the failed disk.

From this point we would have limited time to get on the VC, set the affected host to maint. mode and evacuate the host.

Very shortly the host would go into non responding mode and then disconnect from VC and PSOD.

VMs would still be running and accessible via RDP.

VC > Monitor > vSAN would show the disks as healthy.

We would also see this scenario:

1) HP SIM shows the predictive SSD failure alert

2) iLO for the same host shows this same SSD as healthy

3) VC shows the same SSD as healthy

4) esxcli vsan storage list shows the same SSD as healthy, 20,000 days to failure

5) Ruby console disk info shows the same SSD as healthy

We have deployed firmware 5.52 on all hosts.

Has anyone come across this before?

8 Replies
TheBobkin
Champion

Hello markotsg80​.

What build version of ESXi is installed?

"each host with 3 controllers"

As in one controller per disk-group and 3 disk-groups per host?

What Make and Model of controllers with what driver and firmware?

In RAID0 or Passthrough mode?

"each with 7+1 SSD disks. ( each capacity disk 1.2TB and 800GB SSD)"

What Make and Model of cache and capacity drives and firmware in use?

"we would get the Log Insight alert emai"

Please give some examples of the log messages.

"We had number of disk failures"

How many times? Always the same drive? Always on the same host? Capacity or cache-tier?

"set the affected host to maint. mode and evacuate the host"

By evacuate do you mean vMotion VMs off or evacuate the data off the disk-groups?

"host would go into non responding mode and then disconnect from VC and PSOD."

This is not normal behaviour for a disk-group failure. Please attach vmkernel.log, vobd.log and vmkwarning.log from when this occurred.

Do you have a screenshot of the backtrace of the PSOD?

Bob

markotsg80
Enthusiast

We are still on ESXi 6.0.

Smart Array P440ar, firmware 5.52

1 onboard controller and 2 PCI Smart Array P440ar controllers.

Each host has 3 disk groups, each with 1 controller and 8 disks.

Passthrough mode

Log Insight email alerts:

1) Critical Storage vsan device offline

2) Log Insight found the following events matching the criteria: vSAN object component state changed to degraded

3) vSAN magnetic disk failure

partition table read from device naa.... failed

Also, if I run esxcli ssacli cmd -q "controller slot=0 show config detail", I get a list of disks for each controller. The failed disk is missing from the list, and that way I know which controller and box location the disk is in.

The same would be noticeable from HP SIM but not from iLO; iLO would show the storage disks as healthy.
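
The missing-disk check described above amounts to a set difference between the bays expected on a controller and the bays the controller currently lists. A minimal Python sketch, assuming the bay identifiers have already been pulled out of the controller output by hand; the function name `missing_bays` and the bay strings such as `1I:1:3` are hypothetical placeholders, not parsed from real controller output:

```python
def missing_bays(expected, listed):
    """Return bays that are expected but absent from the controller listing, sorted."""
    return sorted(set(expected) - set(listed))


# Hypothetical example: 8 bays expected on one controller, one no longer listed.
expected = [f"1I:1:{n}" for n in range(1, 9)]
listed = [bay for bay in expected if bay != "1I:1:3"]

print(missing_bays(expected, listed))  # ['1I:1:3']
```

Keeping a saved list of the expected bays per controller from a known-good state would make this comparison possible without relying on memory during an outage.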

markotsg80
Enthusiast

Capacity disks:

HP EG1200JEMDA 1.2TB 10000RPM SAS 12GBPS SFF

I believe this is the SSD:

HP 400 GB 2.5" Internal Solid State Drive - SAS

Manufacturer Part Number: 779168-B21

markotsg80
Enthusiast

Just to add: 5 DL380 G9s, each host on a separate rack, so FD1, FD2, FD3, FD4, FD5.

TheBobkin
Champion

Hello markotsg80​,

Okay so a few things to start:

What is the build number of ESXi installed - not the version (you said that in the title).

You can see this from the Summary tab of the host in the Web Client, via CLI using vmware -vl or on the splash-screen of DCUI.

Are you using an onboard controller for one of the disk-groups?

If so, then do NOT do this unless it is somehow a vSAN-supported controller (which I don't think exists on HPE gear). You can use one P440ar controller for more than one disk-group.

What driver are you using on the P440ar controllers? This matters a lot as some drivers will not play well with some firmware and issues will be frequent.

Firmware version 5.52 is NOT supported for any 6.0 version of vSAN

The latest you should be on is driver: hpsa version 6.0.0.124 and Firmware: 5.04

https://www.vmware.com/resources/compatibility/detail.php?deviceCategory=vsanio&productid=37447
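
To see why 5.52 falls outside that limit, the version strings can be compared numerically, component by component. A minimal Python sketch, assuming the versions are plain dot-separated integers; `parse_version` and `is_within_limit` are illustrative helper names, and the only values used are the ones quoted in this thread. It only checks an upper bound, whereas the compatibility guide lists specific driver and firmware combinations, so treat it as a quick sanity check rather than a substitute for the guide:

```python
def parse_version(text):
    """Split a dotted version such as '5.04' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def is_within_limit(installed, latest_supported):
    """True if the installed version is not newer than the latest supported one."""
    return parse_version(installed) <= parse_version(latest_supported)


# Installed firmware from the original post vs. the latest firmware named above.
print(is_within_limit("5.52", "5.04"))  # False
```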

Bob

markotsg80
Enthusiast

Many thanks Bob

I will find out the build version

We use the onboard and 2x PCI storage controllers. 8 disks per each controller and 3 disk groups per controller.

I will find the driver version as well

Many thanks

TheBobkin
Champion

Hello markotsg80​,

I can't find out whether DL380 G9s had the option of a vSAN-supported 'on-board' controller. You can check what you have on these hosts via the Web Client:

Host > Configure/Manage > Storage Adapters

or via CLI:

# esxcfg-scsidevs -a

# vmkchdev -l |grep vmhba0

# vmkchdev -l |grep vmhba1

# vmkchdev -l |grep vmhba2

To check driver in use:

# vmkload_mod -s hpsa |grep Version

# vmkload_mod -s nhpsa |grep Version

VMware Knowledge Base

"8 disks per each controller and  3 disk groups per controlller"

3 Disk-groups total on one controller? Or 1 Disk-group on each of the 3 controllers?

Bob

markotsg80
Enthusiast

VC Version 6.0.0 Build 5112533

P440 firmware 5.52

P440AR 5.52

ESXi 6.0 Build 5572656 - ESXi 6.0 Update 3a (ESXi 6.0 Patch 5)

vsan 6.0.0-3.69.5568629

scsi-hpsa  6.0.0.124-1OEM.600.0.0.2494585
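
The scsi-hpsa string above combines the driver version with a packaging suffix. A minimal Python sketch for pulling out the driver version so it can be compared with the one named earlier in the thread, assuming the part before the first hyphen is the driver version (an assumption about the string layout, not something confirmed here); `driver_version` is a hypothetical helper name:

```python
def driver_version(vib_version):
    """Return the text before the first hyphen of a package version string."""
    return vib_version.split("-", 1)[0]


reported = "6.0.0.124-1OEM.600.0.0.2494585"

print(driver_version(reported))                 # 6.0.0.124
print(driver_version(reported) == "6.0.0.124")  # True
```

Under that assumption, the driver matches the hpsa version named earlier in the thread, while the firmware (5.52) does not match the 5.04 named alongside it.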
