VMware Cloud Community
bellocarico
Enthusiast

ESXi 6.5: System disk RAID1 failure = unstable ESXi performance

As summarized in the title I have an Adaptec 6805T with a RAID1 system disk (2x 256GB SSD) where I'm running a standalone ESXi 6.5.

The story goes:

One of the VMs started to behave strangely last week: high memory usage, a filesystem check on every second reboot, slow response. I initially focused on fixing the VM, but after a while I noticed similar problems on a second VM, so I started to investigate ESXi itself.

Without detailing all the steps, I discovered an issue with one of the two disks in the system RAID1. I rebooted the host, entered the controller BIOS and noticed a "Rebuild" action was taking place.

Then, using the controller interface, I checked the health of the disks: one of the two didn't complete the check and stopped with an error.
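(For anyone hitting something similar, the quickest checks from the ESXi shell look roughly like this; SSH/shell access is assumed and the grep pattern is only a rough filter for storage errors:)

grep -iE "aacraid|scsi|abort|sense data" /var/log/vmkernel.log | tail -n 50   # storage errors as seen by the vmkernel
esxcli storage core adapter list    # which HBA/driver ESXi is using for the datastore
esxtop                              # 'd' (adapter) and 'u' (device) views show DAVG/KAVG latency live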

OK, so: disk broken, not a big deal... but can anybody explain why ESXi is suffering from this? Isn't the controller meant to exclude the failed disk and use the second one as the primary? Why the performance degradation?

Also (more an Adaptec question, to be fair): assuming the disk is not completely broken, as I can still see it, does anybody know how I can exclude it from the RAID1 and force a DEGRADED array until I get replacement hardware? Unplugging it would be an option, but I don't have physical access to the box until I go on site (in 10 days), so I only have IPMI access.

Thanks!


Accepted Solutions
gregsn
Enthusiast

>>OK, so: disk broken, not a big deal... but can anybody explain why ESXi is suffering from this?

From personal experience: I've had major performance/reliability issues when using non-Intel brand SSDs on Adaptec controllers.  Sometimes the array would slow to a crawl, and once I had an entire array become corrupt for no apparent reason (no issues reported by the controller other than a handful of aborted commands on one SSD in the array, which should have been harmless).  Since moving to Intel DC S4XXX series SSDs, I've not had any further issues.

>>Isn't the controller meant to exclude the failed disk and use the second one as the primary?

Generally, yes, but drives sometimes fail in strange ways, and the controller isn't always able to drop them from the array automatically.
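If you can get arcconf onto the host (more on that below), it's also worth checking what the controller itself reports rather than trusting the BIOS screen alone; controller number 1 is assumed here, adjust to suit:

arcconf GETCONFIG 1 AD    # controller summary: defunct drive count, failed/degraded logical devices
arcconf GETCONFIG 1 PD    # per-disk state, media errors and S.M.A.R.T. warnings for each member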

>>Why the performance degradation?

This is possibly down to the SSDs themselves; try using Intel DC S4XXX drives.  Also, if you lose one drive in a RAID1 array, you'll lose some read performance, because I believe the Adaptec controller reads from both disks to improve RAID1 read performance (though the difference is probably negligible if you're using good-quality SSDs).

>>Also (more an Adaptec question, to be fair): assuming the disk is not completely broken, as I can still see it, does anybody know how I can exclude it from the RAID1 and force a DEGRADED array until I get replacement hardware?

Normally, yes, you could do this using the "arcconf" command.  You would first need to install the drivers and CIM providers from Adaptec's site, but 6-series controllers lock up if you install those on anything newer than ESXi 6.0.  You can attempt the install, but you'll most likely end up with a crashed controller and a frozen host shortly after reboot:

Adaptec - Adaptec RAID 6805T
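If you do decide to risk it anyway, the install itself is just the standard ESXi offline-bundle procedure; a minimal sketch, with placeholder bundle names rather than the real Adaptec package filenames:

esxcli system maintenanceMode set --enable true                      # sensible before touching storage drivers
esxcli software vib install -d /tmp/arcconf-cim-provider-bundle.zip  # CIM provider / arcconf (placeholder name)
esxcli software vib install -d /tmp/aacraid-driver-bundle.zip        # controller driver (placeholder name)
reboot                                                               # both need a reboot to take effect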

For reference, the arcconf SETSTATE command you would use has the following syntax (a worked example for your case follows below):

Usage: SETSTATE <Controller#> DEVICE <Channel# ID#> <State> [LOGICALDRIVE <LD#> [LD#] ... ] [noprompt] [nologs]
Usage: SETSTATE <Controller#> LOGICALDRIVE <LogicalDrive#> OPTIMAL [ADVANCED option] [noprompt] [nologs]
Usage: SETSTATE <Controller#> DEVICE <Channel# ID#> <State> MAXCACHE

Example: SETSTATE 1 DEVICE 0 0 RDY
Example: SETSTATE 1 LOGICALDRIVE 0 ADVANCED nocheck
Example: SETSTATE 1 DEVICE 0 0 DDD

===================================================================================
Redefine the state of a physical or logical device from its current state to the designated state.

   DEVICE parameters
    Channel# ID#     : The Channel and ID of the device whose state will be altered.
    Physical states  : HSP, Create a hot spare from a ready drive.
                            Optional [LOGICALDRIVE <LD#> [LD#] ...] parameters
                            dedicate the HSP to one or more logical devices.
                     : RDY, Remove a hot spare designation.
                            Optional [LOGICALDRIVE <LD#> [LD#] ...] parameters
                            remove a dedicated HSP from one or more logical devices.
                            Attempts to change a drive from Failed to Ready.
                     : DDD, Force a drive to Failed.

   LOGICALDRIVE parameters
    LogicalDrive#    : Logical device ID to be forced optimal.
    ADVANCED         : Optional parameter indicating an advanced option will be attempted.
    Advanced options : Attention: using advanced options is dangerous and may result in data loss.
       nocheck       : Force optimal without a consistency check.
    MAXCACHE         : Creates a dedicated hot spare for maxCache, provided it's an SSD.
    noprompt         : Don't prompt for confirmation.
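Putting it together, forcing the array into a degraded state would look something like this; controller 1 and channel/ID "0 4" are placeholders, so substitute the values GETCONFIG reports for your failing disk:

arcconf GETCONFIG 1 PD              # note the Channel and Device ID of the failing disk
arcconf SETSTATE 1 DEVICE 0 4 DDD   # force that disk to Failed; the RAID1 keeps running degraded on the good member
arcconf GETCONFIG 1 LD              # the logical device should now report Degraded

Once the replacement disk is in, the controller should pick it up and start rebuilding automatically, or you can trigger the rebuild from the controller BIOS when you're on site.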
