Hi everyone!
We are facing a problem: when a host in a VSAN cluster fails, the virtual machines simply freeze. Linux guests normally remount their filesystems read-only, and Windows guests stop responding.
Below is some information about the environment:
Any help?
Thanks,
CH
If you are talking about VMs that were running on the failed host freezing, that is to be expected on any host that fails.
If VMs that are running on a different host than the failed one are stalling or stopping, that would be an issue. I'm also assuming you're using FTT=1 (the default). If you are on FTT=0, the VMs using FTT=0 disks may stop responding.
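To make the FTT point concrete, here is a minimal sketch of the arithmetic for RAID-1 mirroring policies, where an object gets FTT+1 replicas and needs 2*FTT+1 hosts once witnesses are counted. The function names are made up for illustration, and the model ignores resync, latency, and erasure coding entirely.

```python
def mirroring_requirements(ftt: int) -> dict:
    """Replica and host counts for a RAID-1 (mirroring) policy.

    Hypothetical helper: FTT+1 replicas, 2*FTT+1 hosts (replicas plus witnesses).
    """
    if ftt < 0:
        raise ValueError("FTT cannot be negative")
    return {"replicas": ftt + 1, "min_hosts": 2 * ftt + 1}


def object_stays_accessible(ftt: int, failed_hosts_with_components: int) -> bool:
    """An object tolerates at most FTT failed hosts that hold its components."""
    return failed_hosts_with_components <= ftt


if __name__ == "__main__":
    print(mirroring_requirements(1))
    print(object_stays_accessible(ftt=1, failed_hosts_with_components=1))
    print(object_stays_accessible(ftt=0, failed_hosts_with_components=1))
```

With FTT=1, losing one host that holds a component should leave the object accessible, so a freeze on VMs running elsewhere points at something other than the policy itself.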
Hi, thanks for the help! I forgot some information:
FTT=1
VMs are freezing on different hosts too (the hosts where the VMs' disks reside on VSAN)
This recurs often because there is an issue with the H240ar: the controller stops working and loses access to the VSAN disks, but the VMs running on this host keep running. Because VSAN distributes the storage workload across different hosts, when one host fails (controller failure), its VMs freeze, and so do VMs on other hosts.
Thanks!
CH
My best guess is that the disk resync load after the host failure may be causing VSAN to stall due to high latency. If it occurs again, you would need to use VSAN Observer or esxtop on your hosts to check disk and network latency and confirm this. If you open a support case, VMware might also be able to figure out what happened from your support log bundle.
Hi CH
we have the same setup HPE dl360’s, same hba and have been seeing this behaviour for years with still no fix from VMware.
Host that’s stuck needs a hard power cycle to release the VMs so they can vMotion to other hosts.
We are on the latest 6.5, and the issue has been the same across ESXi versions.
Did you ever get an answer on this issue or a fix?
Cheers
G
Hello velocity08,
Firstly, I wouldn't advise necroing a 5-year-old thread that doesn't appear to have any real technical information or insightful analysis.
You say "we have the same setup HPE dl360’s, same hba and have been seeing this behaviour for years with still no fix from VMware.", but if your controller is broken/dropping disks/driver and/or firmware losing the plot then how/why would you expect this to get fixed from the VMware side?
Starting point should be to provide some useful information:
- Exact controller model in use (e.g. IDs not just name).
- Current driver and firmware configured for this controller.
- Model of Cache and Capacity-tier disks and the firmware installed on them.
- Whether you have other potentially problematic elements configured (e.g. logging to vsanDatastore, mixing VMFS used for VMs and vSAN disks on the same controller, a mix of RAID0 and passthrough devices on the vSAN controller).
- Your findings and analysis so far from logs prior to/during the issue occurring - merely stating "Host that’s stuck" is unlikely to get you anywhere.
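As a rough starting point for gathering the items above, the sketch below maps each one to ESXi shell commands that usually expose it. The mapping and the helper name are my own assumptions, not an official procedure. The script only prints the commands and runs nothing on a host. Controller firmware is left out because it is normally read from the vendor's tooling or iLO rather than from esxcli.

```python
# Hypothetical mapping from checklist items to ESXi shell commands.
# Verify each command on your own ESXi build before relying on its output.
CHECKLIST = {
    "Controller PCI IDs (VID:DID SVID:SSID)": [
        "vmkchdev -l | grep vmhba",
    ],
    "Controller driver in use and installed driver VIBs": [
        "esxcli storage core adapter list",
        "esxcli software vib list",
    ],
    "Disk models, firmware revisions and vSAN tier": [
        "esxcli storage core device list",
        "esxcli vsan storage list",
    ],
    "Log location (check it is not on vsanDatastore)": [
        "esxcli system syslog config get",
    ],
}


def render_checklist(checklist=CHECKLIST) -> str:
    """Return the checklist as commented shell lines, one item per block."""
    lines = []
    for item, commands in checklist.items():
        lines.append(f"# {item}")
        lines.extend(commands)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_checklist())
```

Pasting the output of these commands into the thread, together with log excerpts from around the time of the failure, gives people something concrete to analyse.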
Bob