justinbennett
Enthusiast

Disk failure - but all SSDs also fail?

Is it typical for the PERC H730p controller and VSAN that a single disk failure cascades into all SSDs being marked as failed too? We're running the latest version of ESXi 6.0.0 build 3247720, driver lsi-mr3 version 6.606.12.00-1OEM, controller firmware 25.3.0.0016, and Toshiba Phoenix M2+ SSD firmware A4AF. We noticed the disk was resetting just prior to the failure.

Also, we are replacing the failing magnetic disk.
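
For reference, here's roughly how the ESXi-side version details above can be pulled from the host itself (a quick sketch; the controller and SSD firmware numbers aren't covered by these commands):

    vmware -vl                              # ESXi version and build number
    esxcli software vib list | grep -i lsi  # installed lsi-mr3 driver VIB and its version
    esxcli storage core adapter list        # which driver each HBA is actually bound to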

(Screenshots of the disk and disk group status attached.)

Thank you all in advance!

14 Replies
srodenburg
Expert

Read these two blog posts, including the comments below them. All your questions will be answered there.

How VSAN handles a disk or host failure

What happens in a VSAN cluster in the case of a SSD failure?

justinbennett
Enthusiast

According to VMware Virtual SAN Operations: Replacing Disk Devices - Virtual Blocks - VMware Blogs, replacing a single failed magnetic disk should only require the following:

"...

vSphere Web Client Procedure (Pass-through Mode)

1. Login to the vSphere Web Client

2. Navigate to the Hosts and Clusters view and select the Virtual SAN enabled cluster

3. Go to the manage tab and select Disk management under the Virtual SAN section

4. Select the disk group with the failed magnetic device

5. Select the failed magnetic device and click the delete button

..."

As you can see in the attached 55610_55610.png, all of the SSDs in my disk groups were additionally marked as "Permanent Disk Failure", along with the one "Absent Disk", which is our failed magnetic disk.


If the VSAN node had a single disk group, I wouldn't have thought twice about this issue - but we don't have just 1 SSD and 1 disk group.

We have 4 disk groups per VSAN node - why would every disk group on my node be tripped into failure? This is my ultimate question. Driver issue? Firmware issue? Configuration issue? Evil ghost from a Scooby Doo episode?

Thank you.

ScoobySet-09.jpg

srodenburg
Expert

Aha. I'm sorry. I interpreted your initial post incorrectly. To this, I don't have an answer. I'm afraid you'll have to contact VMware Support in such an extreme case. This goes against everything I know about "how it should work"...

One thought: it could be a driver failure. If you have multiple controllers in a host that use the same driver, and one controller card goes bananas, the driver can lock up or crash, dragging the other cards down with it. The same goes for NICs: if you have 8 NIC ports in a system, all using the same driver because the chips are from the same product family, an electrical failure in one chip/ASIC can cause the driver to fall over and all 8 NIC ports become unusable.
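
A quick way to check whether several controllers in a host are sharing one driver module is something like this (just a sketch; output format varies a bit between builds):

    # each HBA and the driver module it is bound to
    esxcli storage core adapter list
    # driver modules currently loaded in the vmkernel
    vmkload_mod -l | grep -i mr3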

larstr
Champion

Hi,

Did you find a solution or workaround to this problem? We have a newly installed environment here and we're seeing problems that look very much like the ones you've described here.

We have a very similar setup with R730XD servers with 4 SSDs and 20 SAS drives. We have set up a 6-node vSAN cluster, and in the past 12 hours we've had two occurrences where one of the nodes suddenly had an all-disk failure. A reboot solves the problem, but we're worried about moving production load to this system if it continues like this.

Lars

justinbennett
Enthusiast

Actually, Dell Tech. Support reached out to us yesterday and sent us a replacement backplane, PERC, and cables.

I'd suggest opening a case with Dell Support. Wish I had a better answer.

You may also want to check KB 2109665 and look around this thread.

Best of luck.

Justin

srodenburg
Expert

Just to add: I've seen SAS backplanes cause random failures on many drives on the backplane even when only one drive was really broken and the rest were fine. Basically, the broken drive dragged the other disks down with it.

This should not be possible with real SAS drives, since SAS is a point-to-point bus and not a chain like FC, but it happened regardless. In all cases, the backplane manufacturer sent updates for the backplane to solve the issue.

Summary: a broken SAS drive can cause mass death of other drives (sometimes all of them, sometimes only a few, often in a random fashion), but it's always a problem with the backplane or the controller (at least in my experience).

Real SATA drives on SAS backplanes are trickier because the SATA traffic is tunneled over STP (SATA Tunneling Protocol), and I've seen lots of problems with SATA drives in SAS backplanes screwing things up. I'm always a bit careful when manufacturers say "you can mix and match SAS and SATA in our backplanes".

justinbennett
Enthusiast

Thank you for the information. After all the parts were replaced, we just had another repeat of this issue. We're on the phone with Dell Tech Support now. I'll follow up with the outcome.

justinbennett
Enthusiast

We've had this issue recur once more on this same host, even after all the replacement parts. We've also had the issue occur on a 2nd host, most recently with a PSOD. After a reboot, the VSAN node appears to recover with no intervention, as if nothing has happened. I doubt any hardware error is occurring - other than the backplane reset that Dell pointed out in the controller logs.

elerium
Hot Shot

I've had VSAN nodes using H730 Minis for the past 9 months or so, and they experience what you describe too (loss of all disk groups on a host). The symptom seems to be continual disk resets that show up in the Lifecycle Controller log, followed by the RAID controller not responding. I haven't seen a case where disk resets from the RAID controller weren't in the logs prior to the problem occurring.

Sometimes the host still stays up after the RAID controller fails, and sometimes it just PSODs. The RAID controller being non-responsive is what causes the loss of the disk groups. I have also seen that rebooting the problem node fixes everything (following a resync).
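
For anyone trying to catch this in the act, besides the Lifecycle Controller log I'd also grep the ESXi vmkernel log on the affected host for resets/aborts leading up to the event. Something like this (a sketch; log paths and rotation can differ by build):

    grep -iE "reset|abort" /var/log/vmkernel.log | tail -n 50
    zcat /var/run/log/vmkernel.*.gz 2>/dev/null | grep -iE "reset|abort"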

At this point my best guess is some driver or firmware issue with the H730 controller, and until Dell and/or VMware supply a fixed driver/firmware combo for this, we'll continue to see it occur.

jackchentoronto
Enthusiast

This is so scary. We have experienced two incidents of "permanent disk failure" in the last three months, which happened on two Micron SSD cards on two different hosts (almost brand-new hardware, in compliance with vSAN's HCL). Micron's application shows the disks are healthy. So far, the only explanation I've gotten from VMware is that it "appears to have been a temporary hiccup". This really shakes our confidence in vSAN.

justinbennett
Enthusiast

Just a quick update:

We're working with VMware Technical Support. They're having us use a newer driver that's not yet on the HCL. I'll let you know what comes of it.
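
For anyone following along, the general shape of swapping in a driver VIB from support is roughly this (the filename is a placeholder, and I'd only do it under support's direction):

    esxcli software vib list | grep -i lsi   # confirm the currently installed driver version
    # put the host in maintenance mode, then install the VIB support provided
    esxcli software vib install -v /tmp/lsi-mr3-<new-version>.vib
    reboot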

Thank you,

Justin

elerium
Hot Shot

Out of curiosity, what driver version are they having you try?

vishalchand
Enthusiast

We had the same issue and found it's important to use qualified hardware, but equally important to make sure the correct firmware and drivers are installed. The best way to check against the HCL is to install the vSAN health check plugin and run the health check. Also, we are using a MegaRAID controller; by default the ESXi host was using the vSphere native lsi_mr3 driver. After loading the megaraid VIB, you will need to disable the default driver so the new driver is loaded after a reboot (see the sketch below).
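
Roughly what that looks like from the command line (a sketch; double-check the exact module names on your build):

    # see which of the two driver modules is present/loaded
    esxcli system module list | grep -iE "lsi_mr3|megaraid"
    # disable the native driver so the megaraid driver can claim the controller at next boot
    esxcli system module set --enabled=false --module=lsi_mr3
    reboot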

The health check should show all Passed....

Capture.PNG

elerium
Hot Shot

I think this issue is specific to H730-based RAID controllers (which are LSI-based but run Dell firmware). For H730-based cards, the VMware HCL shows using the lsi_mr3 driver and not megaraid, but loading the one from the HCL (6.606.12.00-1OEM) and following all firmware recommendations still results in sporadic disk group failures or PSODs. I know VMware is aware of the issue, and it's pending possible updated drivers.

Like the author, I experience these problems with the Dell H730 RAID card, and my health check passes. All parts are on the HCL, and all firmware/drivers follow the recommended KBs + HCL.
