VMware Cloud Community
johandijkstra
Enthusiast

VSAN instability

We are currently working in an environment of 8 Dell R730 hosts, each with a PERC H730 Mini controller.
Each host has 10 SAS disks (mostly Toshiba 300GB SAS AL13SXB30EN) and Toshiba SAS SSDs (PX02SSF020).

In some hosts, failed disks have already been replaced with Seagate 300GB SAS drives (ST300MP0005).

There is also one host with only Seagate SAS disks and Samsung SSDs.

All hosts run vSphere ESXi 6.0.0, build 4600944.

The PERC H730 controller has the latest firmware and driver version.

The problem is that one or two (Toshiba) disks are now failing per month, and SSDs are failing as well.

First question: why are these disks failing so quickly and in such numbers? I cannot find any information suggesting these disk models are known to be bad.

The disks are currently 1.5 years old.

Second question: can mixing disk vendors cause vSAN issues?

Also, when a failure starts, vSAN reports issues, but Dell IPMI says the disk is OK. Only after the server is rebooted does IPMI report the disk as failing.

So vSAN says the disk is failing, while IPMI and the controller say all is OK.

The Toshiba SAS disks are End Of Life...

The customer's confidence in vSAN is at a low point right now.
We are trying to fix these issues, but at the moment there is a major breakdown or issue in the VSAN cluster every day.

Any help or suggestions are appreciated.

18 Replies
TheBobkin
Champion

Hello Johan,

How long these disks have been in use is not really a good metric for how many reads/writes they have endured; that depends on the workloads over the period, though 1.5 years does seem a bit soon. What is the warranty period on these devices, if any? (IMO a good indication of how long they will likely last, as the manufacturer would otherwise be gambling.)

What kind of applications are these disks backing, and do you have a rough metric of the IOPS they are exposed to on average?

Also consider other possible factors such as regular huge resyncs, Object Policy changes for large quantities of data, frequent snapshot-based backups, etc.

Did you see near-zero physical disk failures until they started failing at 1-2 a month?

If so, this would make sense: if they were all put into disk groups at around the same time, they have likely gone through fairly equal reads/writes and may be reaching a phase of increased failure likelihood.

Regarding the vSAN/ipmi reported failures:

What state are the disks failing in vSAN?

Are they being unmounted as per the dying-disk-handling (DDH) method, or just dropping out of CMMDS?

http://cormachogan.com/2016/03/08/vsan-6-2-part-10-problematic-disk-handling/
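As a quick way to check this on the affected host (a rough sketch, assuming the standard vSAN 6.x esxcli namespaces):

# Per-disk state as seen by the local host; a claimed disk showing "In CMMDS: false" has dropped out of the cluster rather than being DDH-unmounted
esxcli vsan storage list | grep -E 'Device|Is SSD|In CMMDS'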

After rebooting, does IPMI always pick up on the disk being dead, or sometimes only vSAN?

Have you ever had a disk that was not reported as failed by vSAN or IPMI, and only after a reboot did IPMI show it as failed?

Those SSDs failing is a tad concerning as they have relatively good advertised endurance stats (Endurance Class D >=7300 TBW).

Do these recover on reboot, and/or do they only fail when an MD (magnetic disk) in that disk group has failed?

(Sorry for the plethora of questions!)

Bob

ewoudhofman
Contributor

Chiming in here as I'm looking into the same problem:

-     The vSAN is hosting Horizon View 6 RDS hosts; IOPS is fairly low compared to a similarly sized VDI environment, so the disks are not stressed much.

-     No unmounting as per dying-disk-handling is present in the logs.

-     We were not present when the other disks failed, but the disk that dropped out of the cluster yesterday showed as healthy in vSAN/ESXi/IPMI. We pulled the server out of the vSAN cluster because it briefly went into an error state, destabilizing the cluster with unusually high read and write latencies. After isolating it we started checking the server, and only after a reboot did the disk come up as faulty.

-     These disks are still under warranty, but the remaining warranty time has to be checked.

VMware clearly states in the documentation that running with mixed hardware is against best practice (understandable, as you want uniformity for predictability), but does anybody have any real-life experience with this?

johandijkstra
Enthusiast

Well, we are encountering other issues now....


At this moment we see a drive failing (unmounted in vSAN), but IPMI says the disk is OK.

So we do not know which physical disk is failing (in a production cluster), and we don't want to take the cluster offline; on the other hand, vSAN is no longer using this part of the host anyway.

We would like to remove the disk in order to check its health, but we cannot determine which physical disk it is.

In short, we cannot match the physical disk in the host to the device reported by the vSphere host.

fdisk is not working in vSphere 6...

esxcfg-info | less -I does not give the serial numbers of the drives...

esxcfg-scsidevs -l | egrep -i 'display name|vendor' also does not give the information we need.

In vSphere there is a blink option for the disks, but that is also not working.

TheBobkin
Champion

Hello Johan,

Use the identification method outlined here:

http://vsebastian.net/en/vmware-en/check-physical-disk-bay-using-naa-identifier-in-esxi/

Additional device information commands:

https://kb.vmware.com/kb1014953
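A rough sketch of that NAA-based approach from the ESXi shell (the naa.xxx values below are placeholders; on many pass-through PERC setups the target number in the runtime name lines up with the physical slot, but verify against the iDRAC physical disk list before pulling anything):

# NAA IDs of the disks claimed by vSAN on this host
esxcli vsan storage list | grep -E 'Device|Is SSD'

# Runtime name (vmhbaX:C0:TY:L0) for a given NAA ID
esxcli storage core path list -d naa.xxxxxxxxxxxxxxxx | grep 'Runtime Name'

# SMART/health counters for a suspect local device (if exposed through this controller)
esxcli storage core device smart get -d naa.xxxxxxxxxxxxxxxx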

Bob

johandijkstra
Enthusiast

Thanks!

johandijkstra
Enthusiast

Another thing we see in all the hosts' vmkernel.log:

NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x12 (0x43a6ac95e800, 0) to dev "naa.5000039678090dfd" on path "vmhba0:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE

2017-03-23T12:30:49.477Z cpu26:33419)WARNING: Unable to deliver event to user-space - can't alloc event metadata

2017-03-23T12:30:49.477Z cpu26:33419)lsi_mr3: megasas_hotplug_work:256: event code: 0x71.

This is happening for all the disks:

2017-03-23T12:35:47.461Z cpu3:32902)NetPort: 1780: disabled port 0x3000004

2017-03-23T12:35:47.461Z cpu3:32902)<6>i40e 0000:84:00.1: Netqueue features supported: QueuePair  RSS_DYN Latency Dynamic Pre-Emptible

2017-03-23T12:35:47.461Z cpu3:32902)<6>i40e 0000:84:00.1: Supporting next generation VLANMACADDR filter

2017-03-23T12:35:47.461Z cpu3:32902)Uplink: 7317: enabled port 0x3000004 with mac 3c:fd:fe:9c:57:75

2017-03-23T12:35:47.977Z cpu3:33536)WARNING: DVFilter: 1192: Couldn't enable keepalive: Not supported

2017-03-23T12:35:49.463Z cpu26:33419)WARNING: Unable to deliver event to user-space - can't alloc event metadata

2017-03-23T12:35:49.463Z cpu26:33419)lsi_mr3: megasas_hotplug_work:256: event code: 0x71.

2017-03-23T12:35:49.469Z cpu41:33585)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x12 (0x43a6b9908b80, 0) to dev "naa.500003967802c19d" on path "vmhba0:C0:T8:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE

2017-03-23T12:35:49.471Z cpu26:33419)WARNING: Unable to deliver event to user-space - can't alloc event metadata

2017-03-23T12:35:49.471Z cpu26:33419)lsi_mr3: megasas_hotplug_work:256: event code: 0x71.

2017-03-23T12:35:49.477Z cpu41:33585)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x12 (0x43a6b9908b80, 0) to dev "naa.5000039678090dfd" on path "vmhba0:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE

2017-03-23T12:35:49.480Z cpu28:33419)WARNING: Unable to deliver event to user-space - can't alloc event metadata

2017-03-23T12:35:49.480Z cpu28:33419)lsi_mr3: megasas_hotplug_work:256: event code: 0x71.

2017-03-23T12:35:49.486Z cpu41:33585)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x12 (0x43a6b9908b80, 0) to dev "naa.500003967808091d" on path "vmhba0:C0:T2:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE

2017-03-23T12:35:49.488Z cpu28:33419)WARNING: Unable to deliver event to user-space - can't alloc event metadata

2017-03-23T12:35:49.488Z cpu28:33419)lsi_mr3: megasas_hotplug_work:256: event code: 0x71.

2017-03-23T12:35:49.494Z cpu41:33585)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x12 (0x43a6b9908b80, 0) to dev "naa.500003967808ce31" on path "vmhba0:C0:T10:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE

2017-03-23T12:35:49.497Z cpu38:33419)WARNING: Unable to deliver event to user-space - can't alloc event metadata

2017-03-23T12:35:49.497Z cpu38:33419)lsi_mr3: megasas_hotplug_work:256: event code: 0x71.

2017-03-23T12:35:49.503Z cpu41:33585)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x12 (0x43a6b9908b80, 0) to dev "naa.500003967808cd55" on path "vmhba0:C0:T12:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE

2017-03-23T12:35:49.505Z cpu38:33419)WARNING: Unable to deliver event to user-space - can't alloc event metadata

2017-03-23T12:35:49.505Z cpu38:33419)lsi_mr3: megasas_hotplug_work:256: event code: 0x71.

2017-03-23T12:35:49.511Z cpu41:33585)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x12 (0x43a6b9908b80, 0) to dev "naa.5000039678090e2d" on path "vmhba0:C0:T3:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE

2017-03-23T12:35:49.517Z cpu38:33419)WARNING: Unable to deliver event to user-space - can't alloc event metadata

2017-03-23T12:35:49.517Z cpu38:33419)lsi_mr3: megasas_hotplug_work:256: event code: 0x71.

2017-03-23T12:35:49.523Z cpu41:33585)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x12 (0x43a6b9908b80, 0) to dev "naa.500003967808cddd" on path "vmhba0:C0:T4:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE

2017-03-23T12:35:49.530Z cpu38:33419)WARNING: Unable to deliver event to user-space - can't alloc event metadata

2017-03-23T12:35:49.530Z cpu38:33419)lsi_mr3: megasas_hotplug_work:256: event code: 0x71.

2017-03-23T12:35:49.536Z cpu41:33585)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x12 (0x43a6b9908b80, 0) to dev "naa.5000039678090de9" on path "vmhba0:C0:T9:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE

2017-03-23T12:35:53.013Z cpu3:33536)WARNING: DVFilter: 1192: Couldn't enable keepalive: Not supported

2017-03-23T12:35:58.050Z cpu3:33536)WARNING: DVFilter: 1192: Couldn't enable keepalive: Not supported

2017-03-23T12:36:01.817Z cpu12:32843)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x9e (0x439e9072b340, 0) to dev "mpx.vmhba32:C0:T0:L0" on path "vmhba32:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

2017-03-23T12:36:03.086Z cpu3:33536)WARNING: DVFilter: 1192: Couldn't enable keepalive: Not supported

Looking at the "vmhba0:C0:T0:L0" failure: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE

H:0x0 D:0x2 P:0x0

H: Host - no error
D: Device - 0x2 = Check Condition
P: Plugin - no error

0x5 0x24 0x0

0x5: Sense key - ILLEGAL REQUEST
0x24/0x0: ASC/ASCQ - Invalid Field in CDB
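To get an idea of how widespread these are, they can be counted per device with a rough one-liner (adjust the log path as needed):

grep 'Valid sense data: 0x5 0x24' /var/log/vmkernel.log | grep -o 'naa\.[0-9a-f]*' | sort | uniq -c | sort -rn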

According to KB 2144936, we should configure the vSAN I/O timeout settings:

esxcfg-advcfg -s 100000 /LSOM/diskIoTimeout

esxcfg-advcfg -s 4 /LSOM/diskIoRetryFactor

But the same KB notes that these are the default values for vSAN 6.2 in ESXi 6.0 patch ESXi600-201608001 and later: if you are running this patch or later, there is no need to change the vSAN I/O timeout settings using esxcfg-advcfg. For more information about this patch, see VMware ESXi 6.0, Patch Release ESXi600-201608001 (2145663).
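To confirm a host is already on those defaults, the current values can simply be read back from the ESXi shell (quick check, assuming the standard esxcfg-advcfg options shown above):

esxcfg-advcfg -g /LSOM/diskIoTimeout      # expected 100000 on ESXi600-201608001 and later
esxcfg-advcfg -g /LSOM/diskIoRetryFactor  # expected 4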

We are running this patch...

But it looks similar to what KB 2144936 describes... any thoughts?

TheBobkin
Champion

Hello Johan,

No. If you were hitting the issue that those settings resolve (along with fixes in later drivers/firmware), you would be seeing resets (H:0x7) and aborts (H:0x8) reported by the driver. Older H730P driver/firmware builds were notoriously flaky, but if you are using anything on the vSAN HCL that was released in the last 4-5 months you should be fine (update to the latest if not, and it may be worth double-checking that the driver/firmware pair currently in use are suited for each other).
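A quick way to check whether the driver is logging any of those on a host (rough sketch; adjust the log path if your logs are redirected):

grep -E 'H:0x7|H:0x8' /var/log/vmkernel.log | grep -o 'naa\.[0-9a-f]*' | sort | uniq -c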

Useful SCSI Sense Code Translator:

http://www.virten.net/vmware/esxi-scsi-sense-code-decoder/

"Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0"

These Sense Codes can be ignored.

"If a SCSI device does not support a particular VPD page, it may respond to such a request with an error. Specifically, it may return SCSI Status 2 (Check Condition), with sense key 5 (Illegal Request) and additional sense code (ASC) and additional sense code qualifier (ASCQ) set to 0x20/0x0 or 0x24/0x0.

ASC/ASCQ 0x20/0x0 is Invalid Command Operation Code

ASC/ASCQ 0x24/0x0 is Invalid Field in CDB (Command Descriptor Block)"

https://kb.vmware.com/kb/1010244

Do you have any non-vSAN local drives (e.g. for logging, VMFS or boot) attached to these same controllers?

This can cause issues:

https://kb.vmware.com/kb/2129050
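A rough way to cross-check that from the host (sketch; vmhba0 assumed to be the H730 here):

# All devices the controller presents to the host
esxcfg-scsidevs -A | grep vmhba0

# Devices claimed by vSAN; anything on vmhba0 not in this list is a non-vSAN drive sharing the controller
esxcli vsan storage list | grep 'Device:'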

Bob

johandijkstra
Enthusiast

Well, we see:

2017-03-23T11:43:16.223Z cpu46:33585)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x4d (0x43a69e4c9e80, 35869) to dev "naa.500003967808ce31" on path "vmhba0:C0:T10:L0" Failed: H:0x7 D:0xf0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL

2017-03-23T11:43:16.223Z cpu46:33585)ScsiDeviceIO: 2651: Cmd(0x43a69e4c9e80) 0x4d, CmdSN 0x8c8fb from world 35869 to dev "naa.500003967808ce31" failed H:0x7 D:0xf0 P:0x0 Possible sense data: 0x0 0x0 0x0.

We are not sure whether this disk is now failing, but we see these errors as well.

We don't have storage for logging, etc. attached to the controller.

Only 10 SAS disks and 2 SSDs per controller.

The disks are on the HCL, but hosts have mixed Toshiba/Seagate disks (because faulty disks have been replaced with Seagate drives in the past).

At this point there are 8 hosts: 2 hosts have Seagate SAS disks and Samsung SSDs (non-HCL), and the other 6 hosts have Toshiba SAS disks and Toshiba SSDs, with some disks replaced by Seagate drives.

The errors above are very host-specific.

Any thoughts about that?

johandijkstra
Enthusiast

We are now sure: after rebooting the server there are no errors in iDRAC, so the disks in this case are fine (according to iDRAC and vSAN), after we recreated the disk groups that were marked faulty last Friday.


We created new disk groups right away and rebooted the server.

We retrieved the logs from the host and see the following:

2017-03-27T07:55:41.130Z cpu38:32899)ScsiNpiv: 1505: GetInfo for adapter vmhba0, [0x4302ec207100], max_vports=0, vports_inuse=0, linktype=0, state=0, failreason=0, sts=bad0020

The sts=bad0020 raises some questions... maybe a controller issue, then?

johandijkstra
Enthusiast

We have some other issues...

When we restart a host, it takes more than 30 minutes at the "VSAN: Initializing SSD: 5269... Please wait..." stage.

For a SAS disk I can imagine it taking a while, but initializing an SSD... that should not take 30 minutes, right?

johandijkstra
Enthusiast

According to VMware this is normal behavior.

It is rebuilding the data in the disk group from the cache device, and that takes time.

depping
Leadership

Just to be sure, did you file a Support Request? Some of your issues definitely need to be looked at by support.

johandijkstra
Enthusiast

Yes, currently we are working on it with VMware Support, thanks!

depping
Leadership

Can you post the SR number? I would like to follow the SR.

vitz3
Contributor

Heya,

We've run into very similar issues with VSAN.

What driver are you running?

Check on the command line with:

esxcli software vib list | grep lsi-mr3

I assume these servers have an iDRAC as well? Could you check the firmware version of the storage controller too? It should be under Storage > Controller in the left-hand menu.

Basically the solution was to upgrade to:

Firmware: 25.5.0.0018

Driver: 6.904.43.00-1OEM.600.0.0.2768847

Somewhere along the line we needed to do an SSD firmware upgrade too, which might be worth checking out as well. After the upgrades, the timeouts stopped.

For the firmware you can upgrade via the iDrac with the file SAS-RAID_Firmware_2H45F_WN64_25.5.0.0018_A08.EXE

For the driver, there's a VIB in VMware's support archive.
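After the upgrade it's worth confirming what is actually loaded (rough check from the ESXi shell, assuming lsi_mr3 is the driver module in use):

esxcli software vib list | grep lsi-mr3     # installed driver VIB
vmkload_mod -s lsi_mr3 | grep -i version    # version of the loaded module
esxcli storage core adapter list            # driver each vmhba is bound to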

johandijkstra
Enthusiast

Hi guys,

Sorry for the delay; we had a lot to do to fix all the problems, and because of that I did not manage to get back to the forum. Sorry!
But, at this moment, the vSAN is stable!


What we have done, in order:
- Updated all hosts to the same ESXi version (all hosts are now identical)
- Updated firmware/BIOS (iDRAC)
- Updated controller firmware
- Replaced Intel NICs with Dell Intel NICs (where they were mixed)
- Replaced all Toshiba hard drives with Seagates

And several other items.

But now, for more than a week, everything looks stable (from a vSAN perspective; we have other issues at the moment, but they are not vSAN-related).

@vitz3 I will look into that for you as soon as possible and will let you know.

@depping, I will ask the customer if I may share the SR for this case.

I will try to update more often from now on! Thanks for all your help!

johandijkstra
Enthusiast

I have written a post on my own blog about identifying a broken disk in vSAN (reported by vSAN, but not through iDRAC).
There are other posts about this, but for my (and maybe your) reference I will share it here:

https://jadijkstra.nl/2017/03/24/identifying-faulty-disks-in-vmware-vsan/

johandijkstra
Enthusiast

Hi,

We have:

Firmware version: 25.5.0.0018
Driver version: 6.904.43.00


So that's OK!

I will come back on the other items next week.
