Our vSAN cluster has been stuck for several days showing the warning "vSphere HA detected an HA cluster state version inconsistency in cluster DEV in datacenter X". A task "Edit vSAN iSCSI target service in cluster" has been stuck at 50% completion for two days. I am unable to view any vSAN status or configuration info in the vSphere client (all selections hang), and eventually an error pops up: "The query execution timed out because of a back-end property provider 'com.vmware.vsphere.client.vsan.health.VsanHealthPropertyProvider' which took more than 120 seconds". vCenter does not seem to be able to contact one of the ESXi hosts, and I can't put that host in maintenance mode or manually migrate any VMs off it.
One weird thing I am seeing: /locker/var/core (which is very small, only 256 MB) is filling up with core files named python-zdump.000, python-zdump.001 (there is only room for two). Any ideas what those are?
The VMs in the cluster seem to be running OK. How do I get control of vSAN back?
This is vSphere version 188.8.131.5200 build 6671409, with ESXi 6.5.0 build 5969303.
Are all the hosts on the same build version (5969303)?
Test whether restarting vsanmgmtd on all hosts has any effect on web client info availability (this has no negative effects and does not break communication with vCenter):
# /etc/init.d/vsanmgmtd restart
Check whether you have the same visibility situation when going via the HTML client:
There is actually a Health check for a vSAN cluster available via direct connection to a single host via the HTML client in 6.6:
Host > Datastores > vsanDatastore > Monitor > Health (or something similar - no lab access at present to check, sorry)
If restarting vsanmgmtd has no effect, then (depending on other factors) you could try rebooting vCenter (or restarting all of its services).
SSH to the host that is disconnected from vCenter; the following will have no impact on the activity of a host that is disconnected from vCenter:
# ps | grep hostd            (check for open hostd processes)
# /etc/init.d/hostd status   (check hostd status, e.g. 'running')
# /etc/init.d/hostd stop     (stop hostd)
# ps | grep hostd            (if hostd processes are still open, hostd is hung and you need to find out why)
You can try killing these hung processes, but this will likely not work; rebooting the host may be the only option if hostd cannot be remediated.
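As a minimal sketch of that last check (runnable anywhere; on the ESXi host it would follow the '/etc/init.d/hostd stop' step above), counting leftover hostd processes:

```shell
# Sketch: after stopping hostd, any hostd processes still listed by 'ps'
# indicate a hung hostd. The '[h]' pattern stops grep matching itself.
LEFTOVER=$(ps | grep '[h]ostd' | wc -l)
if [ "$LEFTOVER" -gt 0 ]; then
    echo "hostd appears hung: $LEFTOVER process(es) still running"
else
    echo "hostd fully stopped"
fi
```

On a machine with no hostd at all this simply reports "hostd fully stopped"; the point is the pattern, not the output.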
If the host cannot be reconnected to vCenter (due to hostd issues) then unfortunately you cannot vMotion anything off it. However, if the VMs are still up and functional (more likely here than with non-vSAN storage), you can shut down the VMs from the CLI or via the HTML client at a time that suits best, put the host into Maintenance Mode with Ensure Accessibility, and reboot it (either via CLI or HTML client, depending on which is accessible).
The above points are not everything that is possible in this situation - just the 'Cliff's Notes'. If deeper analysis is required and available, then when in doubt open a Support Request with GSS.
Just a short comment based on you updating your question:
Could those python-zdumps have been present before updating to 6.5 U1?
"If the disk is damaged, you can not activate vSAN or add ESXi host to vSAN cluster
If there is a corrupted storage device on the host, executing vSAN or adding a host to the vSAN cluster may cause the operation to fail. After doing this operation, Python zdump will be present on the host, the vdq -qcommand will fail and a core dump will be created on the relevant host.
This issue has been fixed in this release."
Check the time stamps on them:
# ls -lah /locker/var/core
The dumps are safe to move to another location if you wish to retain them and they are not active - though if they are not doing anything untoward, like filling up a RAMdisk, then moving them is unlikely to change the situation.
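If you do want to retain them, something along these lines works (a sketch using temp directories so it can run anywhere; on the host, SRC would be /locker/var/core and DEST a datastore path with free space - the /vmfs/volumes/<datastore>/coredumps destination is an assumption, substitute your own):

```shell
# Sketch: move the python-zdump core files out of the small
# /locker/var/core partition. Temp dirs stand in for the real paths
# so this snippet runs anywhere.
SRC=$(mktemp -d)    # on the host: SRC=/locker/var/core
DEST=$(mktemp -d)   # on the host: DEST=/vmfs/volumes/<datastore>/coredumps
touch "$SRC/python-zdump.000" "$SRC/python-zdump.001"   # simulate the dumps
for f in "$SRC"/python-zdump.*; do
    [ -e "$f" ] && mv "$f" "$DEST"/
done
ls "$DEST"
```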
Attempting to restart vsanmgmtd caused all kinds of havoc:
[root] /etc/init.d/vsanmgmtd restart
watchdog-vsanperfsvc: Terminating watchdog process with PID 68918
stopping timed out
Failed to clear memory reservation to resource pool 'vsanperfsvc'
vsanperfsvc is running
vSAN health alarm 'vSphere cluster members do not match vSAN cluster members'
The core dumps are brand new and get recreated each time I stop and restart HA.
Restarting that service may be allowing retrieval of information that was not being communicated until now (a potential cause of the alerts).
I think it is necessary to establish whether there is a vSAN-cluster issue or a vCenter-to-host communication issue:
Check cluster membership to confirm that all nodes are participating in the cluster:
# esxcli vsan cluster get
Check the reported decom state for all nodes (from any host):
# cmmds-tool find -t NODE_DECOM_STATE -f json
Check that the clomd service is running on all hosts (it should not be running on the Witness if this is a stretched cluster):
# /etc/init.d/clomd status
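To sift the decom-state output from the cmmds-tool command above, a check like this can help (a sketch: the JSON field names are an assumption about the cmmds-tool output shape, and the sample is inlined so the snippet runs anywhere; on a host you would pipe the live command into the grep instead):

```shell
# Sketch: a decomState of 0 means the node is not decommissioning
# (i.e. not entering maintenance mode). The sample JSON is an assumed
# shape; on a host, replace the echo with:
#   cmmds-tool find -t NODE_DECOM_STATE -f json
SAMPLE='{"entries":[{"uuid":"593f27d9-a240-e79a-1701-0cc47ad3f8ca","content":{"decomState":0}}]}'
echo "$SAMPLE" | grep -o '"decomState":[0-9]*'
```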
Are the alarms you are seeing for the node that was disconnected from vCenter or another node?
Time to call support. The host has disappeared from vCenter entirely and is partitioned off from the rest of the cluster, although all the VMs on it seem to be up...
[root] vim-cmd vmsvc/getallvms
Skipping invalid VM '1'
Skipping invalid VM '10'
Skipping invalid VM '11'
Skipping invalid VM '12'
Skipping invalid VM '2'
Skipping invalid VM '3'
Skipping invalid VM '37'
Skipping invalid VM '38'
Skipping invalid VM '4'
Skipping invalid VM '5'
Skipping invalid VM '6'
Skipping invalid VM '7'
Skipping invalid VM '9'
Vmid Name File Guest OS Version Annotation
[root] esxcli vsan cluster get
Current Local Time: 2017-11-09T16:49:06Z
Local Node UUID: 593f27d9-a240-e79a-1701-0cc47ad3f8ca
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 593f27d9-a240-e79a-1701-0cc47ad3f8ca
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 52aac3f6-0daf-b3df-2ab7-f444ee7a223a
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 593f27d9-a240-e79a-1701-0cc47ad3f8ca
Sub-Cluster Membership UUID: 95c4035a-a0ef-5297-cd96-0cc47ad3f8ca
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: c4f3fb52-9373-4039-932e-a62dce8cc020 1 2017-11-08T23:49:15.643
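The "Sub-Cluster Member Count: 1" line above is the tell-tale: this node is alone in its network partition. A quick check along these lines flags it (a sketch; the expected node count of 4 is an assumption - substitute your real cluster size - and on the host you would pipe live 'esxcli vsan cluster get' output in instead of the sample line):

```shell
# Sketch: compare the sub-cluster member count against the expected
# cluster size; fewer members means this node is network-partitioned.
EXPECTED=4   # assumption: your real number of vSAN nodes
COUNT=$(awk -F': *' '/Sub-Cluster Member Count/ {print $2}' <<'EOF'
Sub-Cluster Member Count: 1
EOF
)
if [ "$COUNT" -lt "$EXPECTED" ]; then
    echo "Partitioned: only $COUNT of $EXPECTED node(s) in this sub-cluster"
fi
```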