VMware Cloud Community
pineapplehead
Contributor

vSAN disk group failures once a month

We deployed 8 Dell R730xd hosts equipped with H730 Mini controllers at the end of last year. All hosts are running vSphere 6 build 4600944. Each host has 2 vSAN disk groups in a hybrid configuration. Each disk group has 1 Intel SSDSC2BX40 (400GB) and 6 Toshiba AL14SEB060N (600GB) drives.

The firmware of all devices has been on the vSAN HCL since the deployment. Since March of this year, vSAN has been having issues. vSAN reports "Flash drives dead or error" on one of the hosts, and the 2 disk groups on that host drop out of vSAN. iDRAC shows all disks as fine, but its log shows all disks being reset while the issue is occurring. When the issue occurs, we have to reboot the server for the disk groups on that host to work again. However, a day or so later, the same problem occurs on another host, one after another. After a host is rebooted, it works fine for about 1 month, then the same cycle starts again.
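If it helps anyone compare with what we see, a rough way to check a host in this state from the ESXi shell would be something like the following (treat it as a sketch; the grep keywords are just what I would expect to find, not confirmed from our logs):

# List the vSAN-claimed disks on the affected host; the cache SSD typically
# shows "In CMMDS: false" once its disk group has dropped out
esxcli vsan storage list | grep -E "Display Name|Is SSD|In CMMDS"

# Look for device resets or controller errors around the failure time
grep -iE "lsi_mr3|reset|abort" /var/log/vmkernel.log | tail -50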

We have worked with VMware and Dell support for months on this issue. We replaced the controllers and backplanes on some of the hosts at the beginning; that didn't help. Newer firmware came out for the controller (25.5.2.0001) and backplane (3.32), so we did the upgrade. That didn't help either. We have purchased additional drives for each host and planned to expand the vSAN capacity, but support recommended not making any changes since they're still trying to figure out the issue. We have also planned to purchase additional servers to build other vSAN clusters; due to this issue, those projects have been on hold. It feels like we're stuck. The only thing we can do is wait for support to come up with new ideas. Every time they find something (e.g. a new firmware release) and we make a change, we have to wait a month to see whether the problem comes back or not. This has been dragging on for months. Does anyone have any comments or suggestions?

SureshKumarMuth
Commander

The first step is to isolate the issue: determine whether it is due to ESXi, the driver, or the firmware. Mostly this kind of problem is a driver/firmware issue, which the hardware vendor should be working on.

From your update, I understand they have recommended trying different firmware/driver versions as a trial-and-error method. Have they given any update on where the issue lies?

What does the driver/firmware dump say? If the issue occurs every month, is it occurring with some pattern? If so, it could be due to load from scheduled jobs: when more load comes in, the driver/firmware or hardware cannot handle it.

Regards,
Suresh
https://vconnectit.wordpress.com/
pineapplehead
Contributor

Thank you for the reply, Suresh. When we deployed the system at the end of last year, the firmware/driver on all devices (controller, SSD, etc.) were up to date and on the vSAN HCL at that time. As the issue started occurring months ago, new firmware/drivers also came out (e.g. Apr 2017) and the vSAN HCL was updated. What we did was upgrade the firmware/driver to the versions listed on the most current vSAN HCL.
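For anyone wanting to double-check what a host is actually running against the HCL, something like this works from the ESXi shell (a sketch; lsi-mr3/lsi_mr3 is the usual H730 driver name on 6.x, so verify it on your build):

# Controller model and the driver ESXi has bound to it
esxcli storage core adapter list

# Installed driver VIB version
esxcli software vib list | grep -i lsi-mr3

# Version of the loaded driver module
vmkload_mod -s lsi_mr3 | grep -i version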

We have asked a few times to replace the controllers/SSDs with a different model to help isolate the problem, but neither support team is interested in going that route. The latest update I got is from VMware support: the VMware engineering team is working with the Dell team to investigate further, but there is no ETA for this investigation.

I have been keeping track since the end of May, but I can't see a pattern.

Date        Server   vSAN                                  Task Performed
5/24/2017   h        Reported Flash drives dead or error   Replaced controller and backplane
6/7/2017    a        Reported Flash drives dead or error   Replaced controller and backplane
6/11/2017   g        Reported Flash drives dead or error   Rebooted the server
6/14/2017   d        Reported Flash drives dead or error   Rebooted the server
6/15/2017   e        Reported Flash drives dead or error   Rebooted the server
6/20/2017   f        Reported Flash drives dead or error   Rebooted the server
6/25/2017   c        Reported Flash drives dead or error   Rebooted the server
7/6/2017    ALL      -                                     Upgraded SSD firmware to DL2D
7/14/2017   a        -                                     Upgraded backplane firmware to 3.32
7/14/2017   g        -                                     Upgraded backplane firmware to 3.32
8/6/2017    d        Reported Flash drives dead or error   Rebooted the server, upgraded backplane firmware to 3.32
8/7/2017    a        Reported Flash drives dead or error   Rebooted the server
8/10/2017   f        Reported Flash drives dead or error   Rebooted the server, upgraded backplane firmware to 3.32
8/11/2017   e        Reported Flash drives dead or error   Rebooted the server, upgraded backplane firmware to 3.32
8/12/2017   g        Reported Flash drives dead or error   Rebooted the server
8/12/2017   c        Reported Flash drives dead or error   Rebooted the server, upgraded backplane firmware to 3.32

SureshKumarMuth
Commander

Very sad to see the issue is so frequent. It looks like the issue affects almost all the servers, so it is most likely due to driver/firmware; now it is up to the vendors to determine the cause, as most of the options have already been tried apart from completely removing and re-adding the cluster. You have to wait until they come back. However, with some debugging tool they should be able to find the cause; I am not sure whether they have asked you to run a debug build, or to collect data by some other means, so far.

Regards,
Suresh
https://vconnectit.wordpress.com/
TheBobkin
Champion

Hello pineapplehead,

Sorry to hear you're having a bad run with vSAN.

As both disk groups go at the same time, it is almost certainly a controller driver/firmware issue. Both disk groups are on a single controller, yes? Any non-vSAN disks (boot OR log/dump etc.) on the same controller?

If VMware Engineering is engaged, then any common issue or simple explanation has likely been ruled out - this may be a case of a perfect storm of components and scenario that happens to trigger the issue.

Can you share vmkernel.log and vobd.log from a host that is currently running? And, if possible, the same logs from a period when the issue occurred. Also, can you share or PM me the SR number? (No promises that I can look or assist, for various reasons, but I do want to read any related PRs.)
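If it saves time, grabbing them from the host can be as simple as something like the following (standard ESXi 6.x log locations; the datastore path is just a placeholder):

# Live logs - rotated copies from the failure window sit under /var/run/log
ls -lh /var/log/vmkernel.log /var/log/vobd.log /var/run/log/vmkernel* /var/run/log/vobd*

# Or generate a full support bundle to attach to the SR
vm-support -w /vmfs/volumes/<some-datastore>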

Bob

pineapplehead
Contributor

Hi Bob.

I was in the vSphere client twice right when the issue occurred. What I have seen is that one disk group fails first; then, ~15 minutes later, the second disk group fails.

Yes, both disk groups are on the same controller; there is only one controller in each host. There are no non-vSAN disks on the controller. The boot disk is on its own dedicated SD flash media.
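In case it is useful, the disk-to-controller layout can be confirmed from the host itself with something like this (a sketch; I may be misremembering the exact vdq flags):

# Disk-to-disk-group mapping as the host sees it
vdq -i -H

# Confirm every vSAN disk sits behind the single H730 (vmhbaX)
esxcli storage core path list | grep -E "Device Display Name|Adapter:"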

I have attached the logs. The SR# is 17488918406. Any input is appreciated.

Thank you.

timalexanderINV
Enthusiast

We have been experiencing a very similar issue on vSAN 6.2 with Dell FX2 (the FD332 is essentially a PERC H730). Dell eventually seemed to understand that it is a firmware bug, and their suggested workaround was to swap the controller out for the PERC330 (which you can't do with the FX2 platform). One thing that has helped is increasing the timeout value (suggested by Dell support, but not strictly recommended by VMware from what I understand):

esxcfg-advcfg -s 10000 /VMFS3/OptLockReadTimeout

We could ascertain that the fault was the controller by looking at the log stream from the DCUI. You may well see similar entries along the lines of "lsi_mr3: controller firmware in critical state" and multiple naa disk resets. My understanding is that the latest driver does not have the same issue, even on the current HCL firmware, but it is still not certified and therefore not on the HCL yet. We are currently limping along waiting for the HCL update/certification so we can hopefully get some stability back.
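For anyone wanting to check the same things on their own hosts, something along these lines should work (the advanced option path is the one Dell gave us, so verify it before changing anything):

# Check the current value before/after applying Dell's suggested change
esxcfg-advcfg -g /VMFS3/OptLockReadTimeout

# Watch for the controller falling over in the live log
tail -f /var/log/vmkernel.log | grep -iE "lsi_mr3|critical state|reset"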

pineapplehead
Contributor

Hi timalexanderINV,

Thank you for sharing the information. It's good to know that we're not alone here. Since my last post, 2 more hosts have experienced this issue. VMware support did mention changing a setting on the ESXi side at one time, but later decided not to do it until further investigation. I'm not sure if it's the timeout value that you mentioned.

I will keep an eye on the vSAN HCL. I will also keep you posted if I get any updates from support.

mhaberman
Contributor

Any resolution? We are having the same issue.

mckenp
Contributor

We had exactly the same issue across our H730p-equipped R730s.

Despite all firmware being fully up to date and compliant with the HCL, we could almost guarantee that a disk group would crap out at some point soon (in the end we built the architecture in such a way as to withstand these failures, which is painful!).

VMware tickets were helpful insofar as checking that the values in this article were correct (VMware Knowledge Base), and they assisted in checking data parity across the member nodes.

Beyond that, we ended up rebuilding the VSAN cluster several times in horror.

I am disappointed that VMware has not discovered a resolution to this, which existed in 6.0 and persists today afaik. I have since moved on from that infrastructure.

srodenburg
Expert

This has all the symptoms of controllers and controller firmware. We run Supermicro hosts with vSAN 6.6.1 hybrid and two LSI 9207-8i cards in each. If I put the HCL-recommended Firmware 20 on those cards, we are totally screwed, just like you.

We went back to Firmware 19 (downgraded all cards) and we have had 0 issues since then. So much for HCL firmware recommendations, eh.

pineapplehead
Contributor (Accepted Solution)

After working with Dell and going through a few firmware updates for the H730, month after month we still had the exact same issue. We ended up replacing all the H730 controllers with HBA330 controllers. The issue has been resolved since then; the servers have been running for ~3 months now without issue.

alsmk2
Hot Shot

We have a site running Cisco UCS rackmount servers that displays a very similar issue to this on a regular basis with the VIC cards; however, it appears to be completely cosmetic and no storage/host is ever lost.

The controller drivers all show up on the HCL as fully supported and there are no errors in any logs that indicate an issue. I probably should get it logged with VMware, but for now we've just learnt to ignore it.

TheBobkin
Champion

Hello alsmk2,

"We have a site running Cisco UCS rackmount servers that displays a very similar issue to this on a regular basis with the VIC cards"

VICs are used for connecting either network or networked storage - not locally-attached vSAN storage - so maybe you are referring to a 12G SAS RAID controller, which is commonly configured in such servers?

"However, it appears to be completely cosmetic and no storage /host is ever lost."

How are you discerning that storage-availability is never lost?

VMs running with FTT=1 on vSAN can stay up when one of their data mirrors is lost (e.g. a disk or disk group becomes unavailable due to failure), so uptime of VMs is not a good measure of this.

"The controller drivers all show up on the HCL as fully supported and there are no errors in any logs that indicate an issue. "

Please ensure you are referring to the vSAN-specific HCL for the required configuration (e.g. controller-cache disabled or set to 100% Read) and drivers/firmware of: controllers, cache-tier SSD/NVMe and capacity-tier HDD/SSD devices.

"I probably should get it logged with VMware, but for now we've just learnt to ignore it."

'Alarm fatigue' can be a dangerous thing; I would advise at the very least aiming to get some clarification on what is causing the alerts - it may be something benign and ignorable as you suggested, or it may be something more serious that resolves itself and that vSAN chugs on through without the VM/Guest-OS layer ever noticing.
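One relatively quick way to check this is from RVC on the vCenter Server during or after one of these events, along the lines of the following (command names from memory, so double-check the exact paths/arguments):

# Reports any inaccessible or orphaned vSAN objects
vsan.check_state <path-to-cluster>

# Shows per-disk health and component counts
vsan.disks_stats <path-to-cluster>

# Shows whether components are resyncing after an outage
vsan.resync_dashboard <path-to-cluster>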

Bob

kmcd03
Contributor

We're experiencing this same problem, but with FC640 blades using FD332 storage sleds with the FD332-PERC (Dual ROC) controller.

In the last two weeks, I've had vSAN mark SSDs as permanently disabled (PDL) on two different hosts.

I had the same problem eight months ago (May): three different hosts had the same problem, and I opened tickets with GSS and Dell support. The recommendation was to update the controller firmware from version 25.5.5.0004 to 25.5.5.0005 and the lsi_mr3 driver from version 7.703.18.00 to 7.703.20. Both firmware/driver version combinations are on the VCG for vSAN.

I have had new tickets open with both GSS and Dell support for over ten days now. I was asked to upgrade the firmware to 25.5.6.0009, which was released in Sept 2019. I think there continues to be a problem with the H730 family of controllers and vSAN 6.x, and that replacing the current controllers with HBA330 cards is the fix.

How much effort is needed to replace/change controllers with vSAN (configured for encryption and dedupe)? Is it as simple as putting the host in maintenance mode with the "ensure accessibility" option and then replacing the controller? Will vSAN see the SSDs and disk groups as unchanged? The controller is in pass-through mode, so I'm wondering whether, as long as the drivers load, there is no change to the SSDs, disk IDs/signatures, and disk groups. (It's a 14+1 stretched cluster, so there is capacity for the applied storage policies.)

Or do I have to evacuate all stored components from the host, delete the disk groups, swap the controller, and then recreate the disk groups?
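If the simple route is viable, the per-host steps I'm picturing would be roughly the following (standard esxcli options as far as I know; not something I've validated on this cluster yet):

# Enter maintenance mode, keeping vSAN objects accessible
esxcli system maintenanceMode set -e true -m ensureObjectAccessibility

# (power off, swap the H730 for the HBA330, boot back up)

# Confirm the disks and disk groups were re-claimed unchanged
esxcli vsan storage list
vdq -i -H

# Exit maintenance mode
esxcli system maintenanceMode set -e false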
