VMware Cloud Community
WhiskyTangoFoxt
Enthusiast

Hosts randomly drop off VSAN cluster. Anyone have advanced troubleshooting steps?

We have a problem of VSAN hosts dropping off our cluster, and I'm not getting anywhere with the regular troubleshooting methods. I've had a ticket open with GSS and the VSAN team since July, but we've gotten nowhere on the cause.

We have a 4-host VSAN cluster that was perfect for the first year, until we upgraded VSAN from 6.1 to 6.2. In July, host #3 dropped off the cluster. The VMs were still running on it, but the clients briefly lost their connection to those VMs. The host showed as disconnected from the cluster. Attempts to reconnect it to the cluster failed with timeouts. Attempts to connect to it with a web browser failed, as did attempts with the original vSphere client.

I am able to connect to it via SSH, which proves VMkernel IP availability, yet attempts to restart the services hang. I looked at the vpxa logs from the host from around the failure time, as well as the vpxd logs from vCenter, but I was unable to find anything unusual.

All hosts are R730XDs and have the latest firmware/drivers from the VMware HCL.

In August, host #4 had the exact same issue, and 3 days ago host #2 did as well.

The only way to get the cluster up and working again was to RDP into the guests, do a graceful shutdown of each, bring them up on a different host, then reboot the failed host, rejoin it to the cluster, and vMotion the guests back onto it.

Even the shutdown cannot happen gracefully as attempting to shut down via SSH or DCUI hangs as well.

Hosts are connected by Brocade VDX 10G switches, which show no errors on any ports, only discards on the receiving ends. There are two distributed switches set up, one for VSAN (dedicated ports) and one for VM/vMotion traffic. The VMware management network is on a separate VLAN and separate physical switches, using regular vSwitches. The fact that the VSAN and management networks dropped off at the same time on only one host seems to point to the host.

I think it is a bug in the hosts' OS. The build number is 3825889.

Anyone experienced this? Troubleshooting route?

Thanks,

B

10 Replies
WhiskyTangoFoxt
Enthusiast

I'll add that esxtop doesn't show high CPU utilization. vmkping works to vCenter and other hosts on all three networks. The vpxa and hostd services are reported as running.

When I try to pull up a list of the running VMs with 'esxcli vm process list', the system just hangs there. Other hosts in the cluster display the running VMs just fine.
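
For anyone else chasing this kind of hang: running each diagnostic command under a timeout makes it possible to script the checks without losing the SSH session to a stuck call. Below is a minimal Python sketch of that idea, not a VMware tool. The helper name `run_with_timeout` and the 30-second default are assumptions made for illustration, and it assumes Python 3.5 or newer wherever it is run; only `esxcli vm process list` comes from the post.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=30.0):
    """Run cmd (a list of arguments) and classify the outcome.

    Returns (status, output), where status is one of:
      'ok'     - the command exited with return code 0
      'failed' - the command exited non-zero or could not be started
      'hung'   - the command did not finish within timeout_s seconds
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,  # decode output as text
            timeout=timeout_s,        # hypothetical default, tune as needed
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return "hung", ""
    except OSError as exc:
        # For example, the binary does not exist on this machine.
        return "failed", str(exc)
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stdout
```

Calling `run_with_timeout(["esxcli", "vm", "process", "list"])` on a healthy host should come back with `ok` plus the VM list, while a host in the state described above would presumably report `hung`. One caveat: if the command is blocked inside the kernel, the kill issued after the timeout may itself not return.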

I know a reboot will fix the issue after the VMs have been manually started on other hosts, but I need to get to the root of the problem.

B

thibaudpeter
Enthusiast

Hi there,

When you say "drop off", do you mean from the vCenter view or from the VSAN cluster?

Do you have your SR number?

Best Regards.

WhiskyTangoFoxt
Enthusiast

Peter,

Drop off was referring to no longer participating in the VSAN. VMs would show as being disconnected. Attempts to reconnect the host from VCenter would fail with timeout.

SR# 16184570707.


Last night I was looking through the logs from a good host and comparing them to the failed host. I noticed that the good host had eight vmkernel.log files covering an eight-day span. The failed host also had eight vmkernel logs, but covering a period of 16 minutes. Looking into the logs, I saw a lot of "WARNING: lsi_mr3: fusionReset:2779: megaraid_sas: Hardware critical error, returning FAILED", which is exactly what the current firmware 25.4.0.0017 was supposed to prevent.
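
The comparison above (how much wall-clock time a set of logs covers, and how many controller warnings they contain) can be scripted. This is a rough Python sketch under stated assumptions: the function name `summarize` is made up for illustration, and the regular expression assumes each log line begins with an ISO-8601 UTC timestamp, which may not match every log. The search string is the warning text quoted above.

```python
import re
from datetime import datetime

# Assumption: lines start with a timestamp like 2016-10-05T14:23:01.123Z.
# Adjust TS_RE if the logs on your hosts use a different prefix.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?Z")

# Warning text as quoted in the post.
NEEDLE = "lsi_mr3: fusionReset"


def summarize(lines, needle=NEEDLE):
    """Return (total_lines, matching_lines, span_seconds) for an iterable of log lines.

    span_seconds is the time between the earliest and latest timestamp seen,
    or 0.0 if no line carried a recognizable timestamp.
    """
    total = hits = 0
    first = last = None
    for line in lines:
        total += 1
        if needle in line:
            hits += 1
        match = TS_RE.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
            if first is None or ts < first:
                first = ts
            if last is None or ts > last:
                last = ts
    span = (last - first).total_seconds() if first is not None else 0.0
    return total, hits, span
```

It can be fed an open file, for example `summarize(open(path, errors="replace"))`; for compressed rotated logs, `gzip.open(path, "rt")` can be substituted. A host whose logs cover minutes rather than days, with a large match count, is the pattern described above.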


Then I started to look deeper into the H730 hardware logs (first incident, from July):

[Image: pastedImage_3.png]


Only now do I see a bulletin from a week ago, "Best practices for VSAN implementations using Dell PERC H730 or FD332-PERC storage controllers (2109...", indicating that there is yet another firmware release (25.4.1.0004) to fix this problem.


I'm not 100% sure that this is going to do it, as the release notes from Dell indicate that the newest firmware update disables T10 correction between the controller and drives. The drives that we are using are not T10 capable to begin with:

[Image: pastedImage_2.png]

The latest update from VMware increases the timeout and retry count on the drive controller, which seems like a band-aid to me.

From August 2015 to May 2016 we were running the factory default firmware that came pre-installed on the R730XD without incident. It wasn't until we updated to VSAN 6.2, when we were told that we needed to use the firmware from the VSAN HCL (25.4.0.0017), that we started having problems.


B

thibaudpeter
Enthusiast

Hi,

You are right not to upgrade to that firmware, as it's not supported for VSAN at the moment.

Dell PERC H730/H730P/H830 Mini/Adapter /FD33xS/FD33xD RAID Controllers firmware version 25.4.1.0004 ...

Unless you are affected by the crashes daily, I cannot recommend upgrading to the next firmware version.

It's up to you.

Check with your engineer so he can add your ticket to the Dell bug.

chgonzaleze
Contributor

Hi WhiskyTangoFoxtrot, I'm working on a similar case where a random host just isolates from the vSAN cluster for no apparent reason and hangs until it's rebooted from iDRAC. They're also R730XD hosts. Question: what kind of disks are you using? 512e by any chance?

WhiskyTangoFoxt
Enthusiast

We're using the following disks:

SSDs TOSHIBA PX02SSF020

SAS SEAGATE ST600MM0088

Are you seeing this storage event under the "Storage" section of your iDRAC?

Disk 5 in Backplane 1 of Integrated RAID Controller 1 was reset
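
If the iDRAC log is exported as text, tallying these events per disk may help show whether the resets cluster on one drive or spread across the backplane. The Python sketch below is only an illustration: the function name `count_resets` is hypothetical, and the pattern is built from the single message quoted above, so other wordings would not be matched.

```python
import re
from collections import Counter

# Pattern derived from the one event message quoted in the post.
RESET_RE = re.compile(
    r"Disk (\d+) in Backplane (\d+) of Integrated RAID Controller (\d+) was reset"
)


def count_resets(messages):
    """Count reset events per (controller, backplane, disk) tuple."""
    counts = Counter()
    for msg in messages:
        match = RESET_RE.search(msg)
        if match:
            disk, backplane, controller = (int(g) for g in match.groups())
            counts[(controller, backplane, disk)] += 1
    return counts
```

`count_resets(lines).most_common()` then lists the disks with the most resets first.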

WhiskyTangoFoxt
Enthusiast

So, it turns out that there is a stability fix in PERC firmware v25.4.1.0004, which is now on the VSAN HCL: Best practices for VSAN implementations using Dell PERC H730 or FD332-PERC storage controllers (2109...

There are also firmware updates to both the Dell (Toshiba) SSDs and the Dell (Seagate) SAS drives.

I found the easiest way to update all firmware on the system was to use Dell's bootable ISO: R730XD - Box

I was able to easily mount it as a virtual CD/DVD drive on the iDRAC and reboot. Everything autoruns from there.

Here's the weird thing, though. After running the August 2016 VMware update, the driver reverts to a version that is not VSAN 6.2 compatible, causing the health check to go crazy. I had to re-install the 6.2-compatible driver separately:

[root@ESXi-3:~] esxcli software vib install -d /vmfs/volumes/Updates/ESXi600-201608001.zip
Installation Result
   Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.
   Reboot Required: true
   VIBs Installed: VMware_bootbank_esx-base_6.0.0-2.43.4192238, VMware_bootbank_esx-ui_1.4.0-3959074, VMware_bootbank_lsi-mr3_6.605.08.00-7vmw.600.1.17.3029758, VMware_bootbank_misc-drivers_6.0.0-2.43.4192238, VMware_bootbank_net-vmxnet3_1.1.3.0-3vmw.600.2.43.4192238, VMware_bootbank_vsan_6.0.0-2.43.4097166, VMware_bootbank_vsanhealth_6.0.0-3000000.3.0.2.43.4064824, VMware_locker_tools-light_6.0.0-2.43.4192238
   VIBs Removed: Avago_bootbank_lsi-mr3_6.903.85.00-1OEM.600.0.0.2768847, VMware_bootbank_esx-base_6.0.0-2.37.3825889…


I re-installed the driver with this:

[root@VHHQESXi-3:~] esxcli software vib update -d "/vmfs/volumes/Updates/VMW-ESX-6.0.0-lsi_mr3-6.903.85.00_MR-offline_bundle-3818071.zip"
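
In the install output above, the Avago lsi-mr3 VIB shows up under "VIBs Removed" while a VMware-branded lsi-mr3 VIB shows up under "VIBs Installed". A quick way to spot that kind of swap after any patch run is to compare the two lists by VIB name. The Python sketch below is a hedged illustration: the function names are hypothetical, and it assumes VIB IDs have the form `Vendor_acceptance_name_version` with no underscores inside the name, which matches the IDs shown here but is not guaranteed in general.

```python
def parse_vib(vib_id):
    """Split 'Vendor_bootbank_name_version' into (vendor, name, version)."""
    vendor, _bank, name, version = vib_id.strip().split("_", 3)
    return vendor, name, version


def vendor_changes(installed, removed):
    """Find VIBs that were replaced by a build from a different vendor.

    installed and removed are the comma-separated lists printed by esxcli.
    Returns {name: ((old_vendor, old_version), (new_vendor, new_version))}.
    """
    def index(csv):
        table = {}
        for item in csv.split(","):
            item = item.strip()
            if item.count("_") >= 3:  # skip empty or malformed entries
                vendor, name, version = parse_vib(item)
                table[name] = (vendor, version)
        return table

    old, new = index(removed), index(installed)
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name][0] != new[name][0]
    }
```

Fed the two lists from the output above, this would flag `lsi-mr3` as having moved from the Avago 6.903.85.00 build to the VMware 6.605.08.00 build, which is the revert described in the post.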



So far so good. It's been 5 days. Time will tell...


chgonzaleze
Contributor

Yes, I have those events as well, at intervals of about 2 or 3 days. In our case those disks are outside vSAN; we are using them just to store logs.

Disks we have:

SSD TOSHIBA PX02SSF040

SAS SEAGATE ST2000NX0273

As of now we are in compliance with the recommended driver/firmware combinations mentioned in Best practices for VSAN implementations using Dell PERC H730 or FD332-PERC storage controllers (2109...

Thanks for all the info; let's see how this goes.

chgonzaleze
Contributor

Hey WhiskyTangoFoxtrot

Any updates on your hosts?

So far everything has been stable on my end.

SupportExperts
Contributor

Hi guys,

Be aware that the VMware Compatibility Guide now lists the latest PERC firmware, 25.5.3.0005, and driver for vSAN All-Flash:

PERC H730 Mini

https://www.vmware.com/resources/compatibility/detail.php?deviceCategory=vsanio&productid=34860&devi...

Best regards,

Nattakun Ch.
