VMware Cloud Community
LR_Analyst
Contributor

VSAN 6.0 - VMs freeze after host failure

Hi everyone!

We are facing a problem: when a host in a VSAN cluster fails, the virtual machines simply freeze. Linux guests normally remount their filesystems read-only, and Windows guests stop responding.

Below is some information about the environment:

  • 2 x Clusters (1 x 10 hosts / 2 x 20 hosts);
  • Host: HP DL380 Gen 9;
  • Storage Controller: HP Smart HBA H240ar;
  • 4 x SSD Disk 400GB eMLC + 20 x SAS Disk 1.2 TB, per host;
  • 2 x 10GB (LACP + MTU 9000 + Multicast) to VSAN + 2 x 10GB (LACP) to LAN, per host;
  • VMware ESXi, 6.0.0, 2715440

Any help?

Thanks,

CH

5 Replies
elerium
Hot Shot

If you are talking about VMs that were running on the failed host freezing, that is to be expected on any host that fails.

If VMs that are running on a different host than the failed one are stalling/stopping, that would be an issue. I'm also assuming you're using FTT=1 or the default policy; if you are on FTT=0, the VMs using FTT=0 disks may stop responding.
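To make the FTT point concrete, here is a rough sketch of the arithmetic for mirrored (RAID-1) objects: FTT=n keeps n+1 copies of the data and needs at least 2n+1 hosts so that a majority of components survives. The function names are my own and are not part of any VMware API; treat it as an illustration only.

```python
# Sketch of vSAN RAID-1 mirroring arithmetic for a given FTT
# (failures to tolerate). Function names are illustrative, not a VMware API.

def replicas_needed(ftt: int) -> int:
    """Number of full data copies kept for a mirrored object."""
    return ftt + 1


def min_hosts_needed(ftt: int) -> int:
    """Minimum hosts (replicas plus witnesses) so a majority survives ftt failures."""
    return 2 * ftt + 1


def survives(ftt: int, failed_hosts: int) -> bool:
    """Worst case: True if the object stays accessible after failed_hosts hosts fail."""
    return failed_hosts <= ftt


for ftt in (0, 1):
    print(f"FTT={ftt}: replicas={replicas_needed(ftt)}, "
          f"min hosts={min_hosts_needed(ftt)}, "
          f"survives one host failure={survives(ftt, 1)}")
```

With FTT=1, a single host failure should leave a full copy reachable, which is why VMs stalling on the surviving hosts would point to a problem elsewhere.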

LR_Analyst
Contributor

Hi, thanks for the help! I forgot some information:

FTT=1

VMs are freezing on different hosts too (hosts where the VMs' disks are stored on VSAN)

This recurs often because of an issue with the H240ar: the controller stops working and the host loses access to its VSAN disks, but the VMs running on that host keep running. Because VSAN distributes the storage workload across different hosts, when one host fails (controller failure), its VMs freeze and so do VMs on other hosts.

Thanks!

CH

elerium
Hot Shot

My best guess is that the disk resync load after the host failure may be causing VSAN to stall due to high latency. If it occurs again, you would need to use VSAN Observer or esxtop on your hosts to check disk/network latency and confirm this. If you open a support case, VMware might be able to figure out what happened as well from your support log bundle.
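If you capture esxtop in batch mode (`esxtop -b` writes CSV), something like the following could help spot latency spikes afterwards. This is only a sketch: the column substring, the 30 ms threshold, and the sample data are my assumptions, so check the header of your own capture for the exact counter names.

```python
import csv
import io


def high_latency_samples(csv_text, column_hint, threshold_ms):
    """Return (timestamp, column, value) for samples above threshold_ms.

    column_hint is a substring matched against the CSV header names.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # The first column of the batch CSV holds the sample timestamp.
    wanted = [i for i, name in enumerate(header) if i > 0 and column_hint in name]
    hits = []
    for row in reader:
        for i in wanted:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # skip blank or malformed cells
            if value > threshold_ms:
                hits.append((row[0], header[i], value))
    return hits


# Synthetic two-sample capture, invented for illustration only.
SAMPLE = (
    '"(PDH-CSV 4.0) (UTC)(0)",'
    '"\\\\esx01\\Physical Disk Adapter(vmhba0)\\Average Guest MilliSec/Command"\n'
    '"06/01/2015 10:00:00","1.20"\n'
    '"06/01/2015 10:00:05","85.40"\n'
)

# 30.0 ms is an arbitrary example threshold, not a VMware recommendation.
for ts, col, val in high_latency_samples(SAMPLE, "MilliSec/Command", 30.0):
    print(ts, col, val)
```

Running it against captures from each host around the time of the failure would show whether latency climbs during the resync.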

velocity08
Contributor

Hi CH

We have the same setup: HPE DL360s, same HBA, and have been seeing this behaviour for years with still no fix from VMware.

Host that’s stuck needs a hard power cycle to release the VMs so they can vMotion to other hosts.

We are on the latest 6.5, and the issue has been the same across ESXi versions.

Did you ever get an answer on this issue or a fix?

Cheers

G

TheBobkin
Champion

Hello velocity08,

Firstly, I wouldn't advise necroing a 5-year-old thread that doesn't really appear to have any real technical information or insightful analysis.

You say "We have the same setup: HPE DL360s, same HBA, and have been seeing this behaviour for years with still no fix from VMware.", but if your controller is broken/dropping disks/driver and/or firmware losing the plot then how/why would you expect this to get fixed from the VMware side?

Starting point should be to provide some useful information:

- Exact controller model in use (e.g. IDs not just name).

- Current driver and firmware configured for this controller.

- Model of Cache and Capacity-tier disks and the firmware installed on them.

- Whether you have other potentially problematic elements configured (e.g. logging to vsanDatastore, mixing VMFS used for VMs and vSAN disks on the same controller, a mix of RAID0 and passthrough devices on the vSAN controller).

- Your findings and analysis so far from logs prior to/during the issue occurring - merely stating "Host that’s stuck" is unlikely to get you anywhere.
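As a rough sketch of how most of the above could be gathered in one go from the ESXi shell: the commands below are standard ESXi tools, but pairing each one with a checklist item is my own choice, output varies by build, and the script assumes a Python 3 interpreter is available on the host. Controller firmware may still need to be read from the vendor tooling.

```python
import shutil
import subprocess

# Checklist item -> command. These are standard ESXi shell commands, but
# which command answers which item is an assumption made for this sketch.
COMMANDS = {
    "controller PCI IDs": ["vmkchdev", "-l"],
    "storage adapters and drivers": ["esxcli", "storage", "core", "adapter", "list"],
    "installed driver VIBs": ["esxcli", "software", "vib", "list"],
    "vSAN cache/capacity disks": ["esxcli", "vsan", "storage", "list"],
    "disk models and revisions": ["esxcli", "storage", "core", "device", "list"],
    "syslog location": ["esxcli", "system", "syslog", "config", "get"],
}


def collect(commands=COMMANDS):
    """Run each command if its binary exists; return {item: output or note}."""
    results = {}
    for item, argv in commands.items():
        if shutil.which(argv[0]) is None:
            # Not on an ESXi host (or the tool is missing): record and move on.
            results[item] = "skipped: %s not found" % argv[0]
            continue
        proc = subprocess.run(argv, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT,
                              universal_newlines=True)
        results[item] = proc.stdout
    return results


if __name__ == "__main__":
    for item, output in collect().items():
        print("== %s ==" % item)
        print(output)
```

Pasting that output along with the relevant vmkernel log excerpts from around the time of the failure would give people something concrete to work with.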

Bob
