wsanders11
Enthusiast

VSAN stuck in inconsistent state

Our vSAN cluster has been stuck for several days showing the warning "vSphere HA detected an HA cluster state version inconsistency in cluster DEV in datacenter X". A task "Edit vSAN iSCSI target service in cluster" has been stuck at 50% completion for two days. I am unable to view any vSAN status or configuration info in the vSphere client (every selection hangs), and eventually an error pops up: "The query execution timed out because of a back-end property provider 'com.vmware.vsphere.client.vsan.health.VsanHealthPropertyProvider' which took more than 120 seconds". vCenter does not seem to be able to contact one of the ESXi hosts, and I can't put that host in maintenance mode or manually migrate any VMs off it.

One weird thing I am seeing is that /locker/var/core (which is very small, only 256MB) is filling up with core files named python-zdump.000, .001 (there is only room for two). Any ideas what those are?

The VMs in the cluster seem to be running OK. How do I get control of vSAN back?

This is vSphere version 6.5.0.10000 build 6671409, ESXi 6.5.0 build 5969303.

5 Replies
TheBobkin
VMware Employee

Hello wsanders11,

Are all the hosts on the same build version (5969303)?

Test if restarting vsanmgmtd on all hosts has any effect on web client info availability (this has no negative effects and does not break communication with vCenter):

# /etc/init.d/vsanmgmtd restart

Check whether you see the same visibility situation when going via the HTML client:

https://vCenterIP/ui

There is actually a Health check for the vSAN cluster available in 6.6 via a direct HTML-client connection to a single host:

https://HostIP/ui

Host > Datastores > vsanDatastore > Monitor > Health (or something similar - no lab access at present to check, sorry)

If restarting vsanmgmtd has no effect then, depending on other factors, you could try rebooting vCenter (or restarting all of its services):

https://kb.vmware.com/s/article/2109881

SSH to the host that is disconnected from vCenter; the following commands will have no impact on the activity of a disconnected host:

# ps | grep hostd  (this checks open hostd processes)

# /etc/init.d/hostd status (checks hostd status, e.g. 'running')

# /etc/init.d/hostd stop (stops hostd)

# ps | grep hostd  (if hostd processes are still open then hostd is hung and you need to find out why)

You can try killing these hung processes, but this will likely not work; rebooting the host may be the only option if hostd cannot be remediated.
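To make that hung-hostd check concrete, here is a rough sketch of the logic - sample `ps` output is embedded in place of a live host and the PIDs are made up; on a real host you would pipe `ps` directly:

```shell
# Sketch only: the hung-hostd check above, with sample `ps` output embedded
# in place of a live host (the PIDs are made up). On a real host you would
# pipe `ps` directly instead of using the sample variable.
sample_ps='68901 68901 hostd          /bin/hostd
68905 68901 hostd-worker   /bin/hostd
12345 12345 sh             grep hostd'

# Count hostd processes, excluding the grep itself.
hung=$(printf '%s\n' "$sample_ps" | grep hostd | grep -v grep | wc -l)

if [ "$hung" -gt 0 ]; then
    echo "hostd appears hung: $hung process(es) still running"
    # On a real host you could then try: kill <PID>  (kill -9 as a last resort)
else
    echo "hostd stopped cleanly"
fi
```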

If the host cannot be reconnected to vCenter (due to hostd issues), then unfortunately you cannot vMotion anything off it. However, if the VMs are still up and functional (more likely than with non-vSAN storage), you can shut down the VMs from the CLI or via the HTML client at a time that suits, put the host in MM with Ensure Accessibility, and reboot it (via CLI or HTML client, whichever is accessible).
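As a rough illustration of that shutdown-then-Maintenance-Mode sequence, here is a dry-run sketch - the `getallvms` output is a made-up sample and the commands are echoed rather than executed, so verify the exact flags on your own build:

```shell
# Sketch only: shut down VMs, then enter Maintenance Mode with Ensure
# Accessibility. Sample `vim-cmd vmsvc/getallvms` output stands in for a
# live host, and commands are echoed (dry run) rather than executed.
sample_vms='Vmid   Name    File                              Guest OS          Version   Annotation
1      app01   [vsanDatastore] app01/app01.vmx   centos7_64Guest   vmx-13
2      db01    [vsanDatastore] db01/db01.vmx     centos7_64Guest   vmx-13'

# Extract the Vmid column, skipping the header row.
vmids=$(printf '%s\n' "$sample_vms" | awk 'NR > 1 { print $1 }')

for id in $vmids; do
    # Guest-OS shutdown; power.off would be the fallback if tools are down.
    echo "vim-cmd vmsvc/power.shutdown $id"
done

# Once the VMs are down, enter MM with Ensure Accessibility:
echo "esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility"
```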

The above points are not everything possible in this situation - just the 'Cliff's Notes'. If deeper analysis is required, or if in doubt, open a Support Request with GSS.

Bob

TheBobkin
VMware Employee

Hello wsanders11,

Just a short comment based on your updated question:

Could those python-zdumps have been present before updating to 6.5 U1?

"If the disk is damaged, you can not activate vSAN or add ESXi host to vSAN cluster

If there is a corrupted storage device on the host, executing vSAN or adding a host to the vSAN cluster may cause the operation to fail. After doing this operation, Python zdump will be present on the host, the vdq -q command will fail and a core dump will be created on the relevant host.

This issue has been fixed in this release."

https://docs.vmware.com/en/VMware-vSphere/6.5/rn/vsphere-esxi-651-release-notes.html

Check the time stamps on them:

# ls -lah /locker/var/core

Dumps that are not active are safe to move to another location if you wish to retain them - though if they are not doing anything untoward, like filling up a RAM disk, then moving them is unlikely to change the situation.
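For illustration, a rough sketch of the check-then-move steps - temp directories stand in for /locker/var/core and a roomier datastore directory (the real destination, e.g. somewhere under /vmfs/volumes/, is your choice):

```shell
# Sketch only: check timestamps, then relocate the zdumps. Temp directories
# stand in for /locker/var/core and a datastore directory with more room.
src=$(mktemp -d)   # stands in for /locker/var/core
dst=$(mktemp -d)   # stands in for the datastore directory

# Fake the two zdump files the small ramdisk has room for.
touch "$src/python-zdump.000" "$src/python-zdump.001"

# Check the timestamps first, as above, then move the dumps aside.
ls -lah "$src"
mv "$src"/python-zdump.* "$dst"/

echo "left in source: $(ls "$src" | wc -l), moved: $(ls "$dst" | wc -l)"
```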

Bob

wsanders11
Enthusiast

Attempting to restart vsanmgmtd caused all kinds of havoc:

[root] /etc/init.d/vsanmgmtd restart

watchdog-vsanperfsvc: Terminating watchdog process with PID 68918

stopping timed out

Failed to clear memory reservation to resource pool 'vsanperfsvc'

vsanperfsvc is running

vSAN health alarm 'Hosts with connectivity issues'

vSAN health alarm 'Physical disk health retrieval issues'

vSAN health alarm 'vSphere cluster members do not match vSAN cluster members'

vSAN health alarm 'vSAN CLOMD liveness'

The core dumps are brand new and get recreated each time I stop and restart HA.

TheBobkin
VMware Employee

Hello wsanders11,

Restarting that service may be allowing retrieval of information that was not being communicated until now (potential cause of alerts).

I think it is necessary to establish whether there is a vSAN-cluster issue or a vCenter-to-host communication issue:

Check cluster membership to confirm that all nodes are participating in the cluster:

# esxcli vsan cluster get

Check the reported decom state for all nodes (from any host):

# cmmds-tool find -t NODE_DECOM_STATE -f json

Check that the clomd service is running on all hosts (it should not be running on the Witness if this is a stretched cluster):

# /etc/init.d/clomd status

Are the alarms you are seeing for the node that was disconnected from vCenter or another node?

Bob

wsanders11
Enthusiast

Time to call support. The host has disappeared from vCenter entirely and is partitioned off from the rest of the cluster, although all the VMs on it seem to be up...

[root] vim-cmd vmsvc/getallvms

Skipping invalid VM '1'

Skipping invalid VM '10'

Skipping invalid VM '11'

Skipping invalid VM '12'

Skipping invalid VM '2'

Skipping invalid VM '3'

Skipping invalid VM '37'

Skipping invalid VM '38'

Skipping invalid VM '4'

Skipping invalid VM '5'

Skipping invalid VM '6'

Skipping invalid VM '7'

Skipping invalid VM '9'

Vmid   Name   File   Guest OS   Version   Annotation

[root] esxcli vsan cluster get

Cluster Information

   Enabled: true

   Current Local Time: 2017-11-09T16:49:06Z

   Local Node UUID: 593f27d9-a240-e79a-1701-0cc47ad3f8ca

   Local Node Type: NORMAL

   Local Node State: MASTER

   Local Node Health State: HEALTHY

   Sub-Cluster Master UUID: 593f27d9-a240-e79a-1701-0cc47ad3f8ca

   Sub-Cluster Backup UUID:

   Sub-Cluster UUID: 52aac3f6-0daf-b3df-2ab7-f444ee7a223a

   Sub-Cluster Membership Entry Revision: 0

   Sub-Cluster Member Count: 1

   Sub-Cluster Member UUIDs: 593f27d9-a240-e79a-1701-0cc47ad3f8ca

   Sub-Cluster Membership UUID: 95c4035a-a0ef-5297-cd96-0cc47ad3f8ca

   Unicast Mode Enabled: true

   Maintenance Mode State: OFF

   Config Generation: c4f3fb52-9373-4039-932e-a62dce8cc020 1 2017-11-08T23:49:15.643
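The partition is visible right in that output: Sub-Cluster Member Count is 1, and this node only lists its own UUID. A rough sketch of checking for this programmatically - the relevant lines are embedded as sample text, and the expected node count of 3 is an assumption (use your real cluster size):

```shell
# Sketch only: spotting a partition in `esxcli vsan cluster get` output.
# The relevant lines from the capture are embedded as sample text; the
# expected node count (3) is an assumption, not from the thread.
EXPECTED_NODES=3
sample='   Sub-Cluster Member Count: 1
   Sub-Cluster Member UUIDs: 593f27d9-a240-e79a-1701-0cc47ad3f8ca'

# Pull the member count off its line.
members=$(printf '%s\n' "$sample" | awk -F': ' '/Sub-Cluster Member Count/ { print $2 }')

if [ "$members" -lt "$EXPECTED_NODES" ]; then
    echo "partitioned: this node sees $members of $EXPECTED_NODES members"
else
    echo "membership looks complete"
fi
```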
