VMware Cloud Community
wsanders11
Enthusiast

VSAN stuck in inconsistent state

Our VSAN cluster has been stuck for several days showing a warning "vSphere HA detected an HA cluster state version inconsistency in cluster DEV in datacenter X". A task "Edit VSAN iSCSI target service in cluster" has been stuck at 50% completion for two days. I am unable to view any VSAN status or configuration info in the vSphere client (all selections hang), and eventually an error pops up: "The query execution timed out because of a back-end property provider 'com.vmware.vsphere.client.vsan.health.VsanHealthPropertyProvider' which took more than 120 seconds". vCenter does not seem to be able to contact one of the ESXi hosts, and I can't put it in maintenance mode or manually migrate any VMs off it.

One weird thing I am seeing is that /locker/var/core (which is very small, only 256MB) is filling up with core files named python-zdump.000, .001 (there is only room for two). Any ideas what those are?

The VMs in the cluster seem to be running OK. How do I get control of VSAN back?

This is vSphere version 6.5.0.10000 build 6671409, ESXi 6.5.0 build 5969303.

TheBobkin
Champion

Hello wsanders11,

Are all the hosts on the same build version (5969303)?

Test if restarting vsanmgmtd on all hosts has any effect on web client info availability (this has no negative effects and does not break communication with vCenter):

# /etc/init.d/vsanmgmtd restart

Check whether the same visibility problem occurs when going via the HTML client:

https://vCenterIP/ui

There is actually a Health check for the vSAN cluster available in 6.6 via a direct HTML-client connection to a single host:

https://HostIP/ui

Host > Datastores > vsanDatastore > Monitor > Health (or something similar - no lab access at present to check, sorry)

If restarting vsanmgmtd has no effect then (depending on other factors), you could try rebooting vCenter (or restarting all of its services):

https://kb.vmware.com/s/article/2109881
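Restarting all vCenter services from the shell (on the VCSA, for example) would be roughly the following sketch; the KB above covers the full procedure:

# service-control --stop --all  (stops all vCenter services)

# service-control --start --all  (starts them again; allow several minutes for everything to come back up)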

SSH to the host that is disconnected from vCenter; the following commands will have no impact on the activity of that host:

# ps | grep hostd  (checks for open hostd processes)

# /etc/init.d/hostd status  (checks hostd status, e.g. 'running')

# /etc/init.d/hostd stop  (stops hostd)

# ps | grep hostd  (if hostd processes are still open then hostd is hung and you need to find out why)

You can try killing these hung processes, but this will likely not work; rebooting the host may be the only option if hostd cannot be remediated.
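If you do want to attempt that before resorting to a reboot, it would look something like this (the PID comes from the ps output above, not a fixed value):

# ps | grep hostd  (note the PID of the hung hostd process)

# kill <PID>  (ask the process to terminate)

# kill -9 <PID>  (force-kill it if it is still listed after a minute or so)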

If the host cannot be reconnected to vCenter (due to hostd issues) then unfortunately you cannot vMotion anything off it. However, if the VMs are still up and functional (more likely than with non-vSAN storage), you can shut down the VMs from the CLI or via the HTML client at a time that suits, put the host in MM with Ensure Accessibility, and reboot (either via CLI or HTML client, depending on which is accessible).
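As a rough sketch of that CLI route (the VM IDs come from vim-cmd, and 'ensureObjectAccessibility' is the CLI equivalent of Ensure Accessibility; adjust to your environment):

# vim-cmd vmsvc/getallvms  (lists the VM IDs registered on the host)

# vim-cmd vmsvc/power.shutdown <vmid>  (guest shutdown; use power.off if VMware Tools is not responding)

# esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility  (enters MM with Ensure Accessibility)

# reboot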

The above points are not everything possible to do in this situation, just the 'Cliff's Notes'; if deeper analysis is required and available, or if in doubt, open a Support Request with GSS.

Bob

TheBobkin
Champion

Hello wsanders11,

Just a short comment based on your updated question:

Could those python-zdumps have been present before updating to 6.5 U1?

"If the disk is damaged, you can not activate vSAN or add ESXi host to vSAN cluster

If there is a corrupted storage device on the host, executing vSAN or adding a host to the vSAN cluster may cause the operation to fail. After doing this operation, Python zdump will be present on the host, the vdq -q command will fail and a core dump will be created on the relevant host.

This issue has been fixed in this release."

https://docs.vmware.com/en/VMware-vSphere/6.5/rn/vsphere-esxi-651-release-notes.html
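Per that note, it is also worth checking whether vdq itself still responds on the affected host; if it fails or shows devices in an unexpected state, that lines up with the issue described above:

# vdq -q  (queries the vSAN eligibility/state of the storage devices on the host)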

Check the time stamps on them:

# ls -lah /locker/var/core

Dumps are safe to move to another location if you wish to retain them and they are no longer active; though if they are not doing anything untoward, like filling up a RAMdisk, moving them is unlikely to change the situation.
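If you do want to free up /locker/var/core, moving them out would look something like this (the destination datastore path is just an example; use whatever has space):

# mv /locker/var/core/python-zdump.* /vmfs/volumes/<your-datastore>/  (relocates the dumps to a datastore with more room)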

Bob

wsanders11
Enthusiast

Attempting to restart vsanmgmtd caused all kinds of havoc:

[root] /etc/init.d/vsanmgmtd restart

watchdog-vsanperfsvc: Terminating watchdog process with PID 68918

stopping timed out

Failed to clear memory reservation to resource pool 'vsanperfsvc'

vsanperfsvc is running

vSAN health alarm 'Hosts with connectivity issues'

vSAN health alarm 'Physical disk health retrieval issues'

vSAN health alarm 'vSphere cluster members do not match vSAN cluster members'

vSAN health alarm 'vSAN CLOMD liveness'

The core dumps are brand new and get recreated each time I stop and restart HA.

TheBobkin
Champion

Hello wsanders11,

Restarting that service may be allowing retrieval of information that was not being communicated until now (a potential cause of the alerts).

I think it is necessary to establish whether there is a vSAN-cluster issue or a vCenter-to-host communication issue:

Check cluster membership to confirm that all nodes are participating in the cluster:

# esxcli vsan cluster get

Check the reported decom state for all nodes (from any host):

# cmmds-tool find -t NODE_DECOM_STATE -f json

Check that the clomd service is running on all hosts (it should not be running on the Witness if this is a stretched cluster):

# /etc/init.d/clomd status
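It is also worth confirming that the affected host can still reach the other nodes on the vSAN network (the vmk interface and peer IP below are placeholders; check which vmk carries vSAN traffic first):

# esxcli vsan network list  (shows which VMkernel interface is tagged for vSAN traffic)

# vmkping -I vmk1 <other-node-vSAN-IP>  (tests reachability to another node over that interface)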

Are the alarms you are seeing for the node that was disconnected from vCenter or another node?

Bob

wsanders11
Enthusiast

Time to call support. The host has disappeared from vCenter entirely and is partitioned off from the rest of the cluster, although all the VMs on it seem to be up...

[root] vim-cmd vmsvc/getallvms

Skipping invalid VM '1'

Skipping invalid VM '10'

Skipping invalid VM '11'

Skipping invalid VM '12'

Skipping invalid VM '2'

Skipping invalid VM '3'

Skipping invalid VM '37'

Skipping invalid VM '38'

Skipping invalid VM '4'

Skipping invalid VM '5'

Skipping invalid VM '6'

Skipping invalid VM '7'

Skipping invalid VM '9'

Vmid   Name   File   Guest OS   Version   Annotation

[root] esxcli vsan cluster get

Cluster Information

   Enabled: true

   Current Local Time: 2017-11-09T16:49:06Z

   Local Node UUID: 593f27d9-a240-e79a-1701-0cc47ad3f8ca

   Local Node Type: NORMAL

   Local Node State: MASTER

   Local Node Health State: HEALTHY

   Sub-Cluster Master UUID: 593f27d9-a240-e79a-1701-0cc47ad3f8ca

   Sub-Cluster Backup UUID:

   Sub-Cluster UUID: 52aac3f6-0daf-b3df-2ab7-f444ee7a223a

   Sub-Cluster Membership Entry Revision: 0

   Sub-Cluster Member Count: 1

   Sub-Cluster Member UUIDs: 593f27d9-a240-e79a-1701-0cc47ad3f8ca

   Sub-Cluster Membership UUID: 95c4035a-a0ef-5297-cd96-0cc47ad3f8ca

   Unicast Mode Enabled: true

   Maintenance Mode State: OFF

   Config Generation: c4f3fb52-9373-4039-932e-a62dce8cc020 1 2017-11-08T23:49:15.643
