VMware Cloud Community
JeremeyWise
Enthusiast

vSAN - Cluster reboot 0% Free /vmfs/volumes/vsanDatastore

 

 

Cluster reboot. Three nodes. All three nodes agree on the cluster UUID and on the UUIDs of the three nodes that SHOULD be in the cluster, yet the logs report errors about 0 bytes free on the vSAN datastore.

 

## ODIN

[root@odin:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2021-09-14T17:23:43Z
Local Node UUID: 60f58b67-5dc7-eada-b179-a0423f35e8ee
Local Node Type: NORMAL
Local Node State: BACKUP
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 60f591d6-36a6-1390-8715-98be9459fea0
Sub-Cluster Backup UUID: 60f58b67-5dc7-eada-b179-a0423f35e8ee
Sub-Cluster UUID: 52f2089b-1819-0833-66e0-5c9b09f7312b
Sub-Cluster Membership Entry Revision: 2
Sub-Cluster Member Count: 3
Sub-Cluster Member UUIDs: 60f591d6-36a6-1390-8715-98be9459fea0, 60f58b67-5dc7-eada-b179-a0423f35e8ee, 60f584a0-1d04-3c42-154b-a0423f377a7e
Sub-Cluster Member HostNames: medusa.penguinpages.local, odin.penguinpages.local, thor.penguinpages.local
Sub-Cluster Membership UUID: 5ad84061-5bea-ca79-54f5-98be9459fea0
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: 60f58b67-5dc7-eada-b179-a0423f35e8ee 2 2021-08-31T20:34:19.0
Mode: REGULAR
[root@odin:~] esxcli vsan cluster unicastagent list
NodeUuid IsWitness Supports Unicast IP Address Port Iface Name Cert Thumbprint SubClusterUuid
------------------------------------ --------- ---------------- -------------- ----- ---------- --------------- --------------
60f584a0-1d04-3c42-154b-a0423f377a7e 0 true 172.16.101.101 12321 52f2089b-1819-0833-66e0-5c9b09f7312b
60f591d6-36a6-1390-8715-98be9459fea0 0 true 172.16.101.103 12321 52f2089b-1819-0833-66e0-5c9b09f7312b
[root@odin:~] /etc/init.d/vsanmgmtd restart
Terminating watchdog process with PID 2098739
vsanperfsvc stopped.
vsanperfsvc started.
[root@odin:~] tail /var/log/syslog.log
2021-09-14T17:30:00Z root: There are 1 /usr/lib/vmware/vsan/bin/vsanObserver.sh running ...
2021-09-14T17:30:00Z root: Cannot parse UUID from /vsantraces
2021-09-14T17:30:00Z root: Failed to get freeMB from UUID. Roll back.
2021-09-14T17:30:01Z root: Calc for ramdisk mounted on /, freeMB:28
2021-09-14T17:30:01Z root: Calc for ramdisk mounted on /vsantraces, freeMB:281
2021-09-14T17:30:01Z root: CalcFreeSpace sizeKB: 112, freeMB: 281
2021-09-14T17:30:37Z /etc/init.d/vsanmgmtd: Terminating watchdog process with PID 2098739
2021-09-14T17:30:37Z watchdog-vsanperfsvc: [2098739] Signal received: exiting the watchdog
2021-09-14T17:30:40Z watchdog-vsanperfsvc: [2102925] Begin 'vsanmgmtd -c /etc/vmware/vsan/vsanmgmt-config.xml', min-uptime = 60, max-quick-failures = 1, max-total-failures = 1000000, bg_pid_file = '', reboot-flag = '0'
2021-09-14T17:30:40Z watchdog-vsanperfsvc: Executing 'vsanmgmtd -c /etc/vmware/vsan/vsanmgmt-config.xml'
[root@odin:~] df -h
Filesystem Size Used Available Use% Mounted on
VMFS-6 348.8G 125.3G 223.4G 36% /vmfs/volumes/local_vmfs_odin
VMFS-L 6.2G 1.4G 4.8G 22% /vmfs/volumes/LOCKER-60f58c48-f9584be0-2761-a0423f35e8ee
VFFS 127.8G 1.7G 126.1G 1% /vmfs/volumes/OSDATA-611be14d-ff327e68-c05d-a0423f35e8ee
vfat 499.7M 204.6M 295.1M 41% /vmfs/volumes/BOOTBANK1
vfat 499.7M 204.6M 295.1M 41% /vmfs/volumes/BOOTBANK2
vsan 0.0B 0.0B 0.0B 0% /vmfs/volumes/vsanDatastore
[root@odin:~] vim-cmd vmsvc/getallvms
Skipping invalid VM '119'
Skipping invalid VM '121'
Skipping invalid VM '95'
Skipping invalid VM '97'
Vmid Name File Guest OS Version Annotation
106 os01-n97w2-master-1 [local_vmfs_odin] os01-n97w2-master-1/os01-n97w2-master-1.vmx rhel8_64Guest vmx-19
109 os01-n97w2-worker-8jb9r [local_vmfs_odin] os01-n97w2-worker-8jb9r/os01-n97w2-worker-8jb9r.vmx rhel8_64Guest vmx-19 os01-n97w2-worker-8jb9r
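For reference, one hedged first check when vsanDatastore mounts at 0 bytes is whether the local disks are actually claimed and contributing to CMMDS. A sketch, assuming ESXi 7.x esxcli namespaces; the exact field names can vary by build:

# List vSAN-claimed devices; "In CMMDS: true" means the disk is contributing capacity
esxcli vsan storage list | grep -E "Device:|Is SSD:|In CMMDS:|VSAN Disk Group UUID:"

# Confirm which vmknic carries vSAN traffic
esxcli vsan network list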

## THOR 

[root@thor:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2021-09-14T17:23:45Z
Local Node UUID: 60f584a0-1d04-3c42-154b-a0423f377a7e
Local Node Type: NORMAL
Local Node State: AGENT
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 60f591d6-36a6-1390-8715-98be9459fea0
Sub-Cluster Backup UUID: 60f58b67-5dc7-eada-b179-a0423f35e8ee
Sub-Cluster UUID: 52f2089b-1819-0833-66e0-5c9b09f7312b
Sub-Cluster Membership Entry Revision: 2
Sub-Cluster Member Count: 3
Sub-Cluster Member UUIDs: 60f591d6-36a6-1390-8715-98be9459fea0, 60f58b67-5dc7-eada-b179-a0423f35e8ee, 60f584a0-1d04-3c42-154b-a0423f377a7e
Sub-Cluster Member HostNames: medusa.penguinpages.local, odin.penguinpages.local, thor.penguinpages.local
Sub-Cluster Membership UUID: 5ad84061-5bea-ca79-54f5-98be9459fea0
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: 60f584a0-1d04-3c42-154b-a0423f377a7e 2 2021-08-31T20:55:13.0
Mode: REGULAR
[root@thor:~] esxcli vsan cluster unicastagent list
NodeUuid IsWitness Supports Unicast IP Address Port Iface Name Cert Thumbprint SubClusterUuid
------------------------------------ --------- ---------------- -------------- ----- ---------- --------------- --------------
60f58b67-5dc7-eada-b179-a0423f35e8ee 0 true 172.16.101.102 12321 52f2089b-1819-0833-66e0-5c9b09f7312b
60f591d6-36a6-1390-8715-98be9459fea0 0 true 172.16.101.103 12321 52f2089b-1819-0833-66e0-5c9b09f7312b

[root@thor:~] vim-cmd vmsvc/getallvms

Skipping invalid VM '465'
Skipping invalid VM '481'
Skipping invalid VM '482'
Skipping invalid VM '483'
Skipping invalid VM '495'
Skipping invalid VM '498'
Vmid Name File Guest OS Version Annotation
476 os01-n97w2-master-2 [local_vmfs_thor] os01-n97w2-master-2/os01-n97w2-master-2.vmx rhel8_64Guest vmx-19
479 os01-n97w2-worker-pdpws [local_vmfs_thor] os01-n97w2-worker-pdpws/os01-n97w2-worker-pdpws.vmx rhel8_64Guest vmx-19 os01-n97w2-worker-pdpws
492 ns01 [local_vmfs_thor] ns01/ns01.vmx rhel8_64Guest vmx-19
[root@thor:~]

 

## Medusa


[root@medusa:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2021-09-14T17:23:38Z
Local Node UUID: 60f591d6-36a6-1390-8715-98be9459fea0
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 60f591d6-36a6-1390-8715-98be9459fea0
Sub-Cluster Backup UUID: 60f58b67-5dc7-eada-b179-a0423f35e8ee
Sub-Cluster UUID: 52f2089b-1819-0833-66e0-5c9b09f7312b
Sub-Cluster Membership Entry Revision: 2
Sub-Cluster Member Count: 3
Sub-Cluster Member UUIDs: 60f591d6-36a6-1390-8715-98be9459fea0, 60f58b67-5dc7-eada-b179-a0423f35e8ee, 60f584a0-1d04-3c42-154b-a0423f377a7e
Sub-Cluster Member HostNames: medusa.penguinpages.local, odin.penguinpages.local, thor.penguinpages.local
Sub-Cluster Membership UUID: 5ad84061-5bea-ca79-54f5-98be9459fea0
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: 60f591d6-36a6-1390-8715-98be9459fea0 2 2021-08-31T20:34:35.0
Mode: REGULAR
[root@medusa:~] /etc/init.d/vsanmgmtd restart
Terminating watchdog process with PID 1050247
vsanperfsvc stopped.
vsanperfsvc started.
[root@medusa:~] esxcli vsan cluster unicastagent list
NodeUuid IsWitness Supports Unicast IP Address Port Iface Name Cert Thumbprint SubClusterUuid
------------------------------------ --------- ---------------- -------------- ----- ---------- --------------- --------------
60f584a0-1d04-3c42-154b-a0423f377a7e 0 true 172.16.101.101 12321 52f2089b-1819-0833-66e0-5c9b09f7312b
60f58b67-5dc7-eada-b179-a0423f35e8ee 0 true 172.16.101.102 12321 52f2089b-1819-0833-66e0-5c9b09f7312b
[root@medusa:~] vim-cmd vmsvc/getallvms
Skipping invalid VM '69'
Skipping invalid VM '70'
Vmid Name File Guest OS Version Annotation
74 os01-n97w2-master-0 [local_vmfs_medusa] os01-n97w2-master-0/os01-n97w2-master-0.vmx rhel8_64Guest vmx-19
77 os01-n97w2-worker-jzq9s [local_vmfs_medusa] os01-n97w2-worker-jzq9s/os01-n97w2-worker-jzq9s.vmx rhel8_64Guest vmx-19 os01-n97w2-worker-jzq9s

 

I ran ping tests between every pair of hosts on the dedicated 10Gb vSAN interfaces, from each node to the others, and they all work fine.
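Since plain ICMP alone does not prove the vSAN ports are reachable, a hedged follow-up sketch (vmk1 and the 8972-byte payload are assumptions based on this setup; port 12321 comes from the unicastagent output above, and 2233 is assumed to be the RDT data port):

# Jumbo-frame ping with don't-fragment set, per vSAN vmknic
vmkping -I vmk1 172.16.101.102 -d -s 8972

# Check whether unicast CMMDS/RDT sessions are established
esxcli network ip connection list | grep -E "12321|2233"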

 


Nerd needing coffee
7 Replies
JeremeyWise
Enthusiast

<< Update >>

 

It has now been a bit over an hour, and I have seen the VM list on each host change. It is almost as if vSAN is catching up on a replication backlog, or a repair job is working to bring things back fully up.

 

## THOR

[root@thor:~] vim-cmd vmsvc/getallvms
Skipping invalid VM '465'
Skipping invalid VM '481'
Vmid Name File Guest OS Version Annotation
476 os01-n97w2-master-2 [local_vmfs_thor] os01-n97w2-master-2/os01-n97w2-master-2.vmx rhel8_64Guest vmx-19
479 os01-n97w2-worker-pdpws [local_vmfs_thor] os01-n97w2-worker-pdpws/os01-n97w2-worker-pdpws.vmx rhel8_64Guest vmx-19 os01-n97w2-worker-pdpws
482 rhel8 [vsanDatastore] c5c62e61-a0ca-04f5-4210-a0423f35e8ee/rhel8.vmx rhel8_64Guest vmx-19
483 web02 [vsanDatastore] b5653f61-2773-11da-c7ca-a0423f35e8ee/web02.vmx rhel8_64Guest vmx-19
492 ns01 [local_vmfs_thor] ns01/ns01.vmx rhel8_64Guest vmx-19
495 ansible00 [vsanDatastore] 82b04061-1404-cdfc-a0bc-a0423f377a7e/ansible00.vmx rhel8_64Guest vmx-19
498 apc01 [vsanDatastore] 82b04061-e01d-b7fc-9613-a0423f377a7e/apc01.vmx centos64Guest vmx-19
499 vCLS (2741) [vsanDatastore] e8c52e61-76e2-f22e-b467-98be9459fea0/vCLS (2741).vmx other3xLinux64Guest vmx-11 vSphere Cluster Service VM is deployed from an OVA with a minimal installed profile of PhotonOS. vSphere Cluster Service manages the resources, power state and availability of these VMs. vSphere Cluster Service VMs are required for maintaining the health and availability of vSphere Cluster Service. Any impact on the power state or resources of these VMs might degrade the health of the vSphere Cluster Service and cause vSphere DRS to cease operation for the cluster.
500 ns02 [vsanDatastore] 66b04061-6c5e-6792-b7ac-a0423f35e8ee/ns02.vmx rhel8_64Guest vmx-19
[root@thor:~]

## ODIN

[root@odin:~] vim-cmd vmsvc/getallvms
Skipping invalid VM '119'
Skipping invalid VM '121'
Skipping invalid VM '97'
Vmid Name File Guest OS Version Annotation
106 os01-n97w2-master-1 [local_vmfs_odin] os01-n97w2-master-1/os01-n97w2-master-1.vmx rhel8_64Guest vmx-19
109 os01-n97w2-worker-8jb9r [local_vmfs_odin] os01-n97w2-worker-8jb9r/os01-n97w2-worker-8jb9r.vmx rhel8_64Guest vmx-19 os01-n97w2-worker-8jb9r
122 web01 [vsanDatastore] 9f653f61-94e5-e76d-99be-98be9459fea0/web01.vmx rhel8_64Guest vmx-19
95 vCLS (2742) [vsanDatastore] b3d62e61-13ec-1442-cbbc-a0423f35e8ee/vCLS (2742).vmx other3xLinux64Guest vmx-11 vSphere Cluster Service VM is deployed from an OVA with a minimal installed profile of PhotonOS. vSphere Cluster Service manages the resources, power state and availability of these VMs. vSphere Cluster Service VMs are required for maintaining the health and availability of vSphere Cluster Service. Any impact on the power state or resources of these VMs might degrade the health of the vSphere Cluster Service and cause vSphere DRS to cease operation for the cluster.
[root@odin:~]

 

## Medusa

[root@medusa:~] vim-cmd vmsvc/getallvms
Skipping invalid VM '69'
Vmid Name File Guest OS Version Annotation
70 ados [vsanDatastore] a2092561-d3f4-12de-e030-98be9459fea0/ados.vmx windows2019srvNext_64Guest vmx-19
74 os01-n97w2-master-0 [local_vmfs_medusa] os01-n97w2-master-0/os01-n97w2-master-0.vmx rhel8_64Guest vmx-19
77 os01-n97w2-worker-jzq9s [local_vmfs_medusa] os01-n97w2-worker-jzq9s/os01-n97w2-worker-jzq9s.vmx rhel8_64Guest vmx-19 os01-n97w2-worker-jzq9s
78 vCLS (2740) [vsanDatastore] b1c52e61-1433-5c43-5a3c-a0423f377a7e/vCLS (2740).vmx other3xLinux64Guest vmx-11 vSphere Cluster Service VM is deployed from an OVA with a minimal installed profile of PhotonOS. vSphere Cluster Service manages the resources, power state and availability of these VMs. vSphere Cluster Service VMs are required for maintaining the health and availability of vSphere Cluster Service. Any impact on the power state or resources of these VMs might degrade the health of the vSphere Cluster Service and cause vSphere DRS to cease operation for the cluster.
[root@medusa:~]

 

 

Without the vCenter VM being able to start, I am not sure of everything that is going on, but here is my ask:

 

 

[root@medusa:~] tail /var/log/syslog.log
2021-09-14T18:06:01Z root: There are 1 /usr/lib/vmware/vsan/bin/vsanObserver.sh running ...
2021-09-14T18:06:01Z root: Cannot parse UUID from /vsantraces
2021-09-14T18:06:01Z root: Failed to get freeMB from UUID. Roll back.
2021-09-14T18:06:01Z root: Calc for ramdisk mounted on /, freeMB:28
2021-09-14T18:06:01Z root: Calc for ramdisk mounted on /vsantraces, freeMB:139
2021-09-14T18:06:01Z root: CalcFreeSpace sizeKB: 448, freeMB: 139
2021-09-14T18:06:02Z lsud: [info] fd5:SETLEDSTATE dev name = t10.ATA_____WDC__WDS100T2B0B2D00YS70_________________19106A800900________, state = 2, duration = 250
2021-09-14T18:06:02Z lsud: [info] fd6:SETLEDSTATE dev name = t10.ATA_____MTFDDAV256TBN2D1AR15ABHA_______________________UGXVR01N4AM2B6, state = 2, duration = 250
2021-09-14T18:06:58Z lsud: [info] fd5:SETLEDSTATE dev name = t10.ATA_____WDC__WDS100T2B0B2D00YS70_________________19106A800900________, state = 2, duration = 250
2021-09-14T18:06:58Z lsud: [info] fd6:SETLEDSTATE dev name = t10.ATA_____WDC__WDS100T2B0B2D00YS70_________________19106A800900________, state = 2, duration = 250
[root@medusa:~]

[root@medusa:~] tail -10 /var/log/vsanEsxcli.log
2021-09-02T01:14:03.892Z INFO esxcli [VsanHealthUtil::RunCmd] Cmd: ['/usr/lib/vmware/osfs/bin/objtool', 'getAttr', '-u', 'e8c52e61-2b8d-ae85-c3d2-98be9459fea0', '-z', 'json']
2021-09-02T01:14:03.892Z INFO esxcli [runcommand::runcommand] runcommand called with: args = '['/usr/lib/vmware/osfs/bin/objtool', '++group=host/vim/tmp', 'getAttr', '-u', 'e8c52e61-2b8d-ae85-c3d2-98be9459fea0', '-z', 'json']', outfile = 'None', returnoutput = 'True', timeout = '0'.
2021-09-02T01:14:03.940Z INFO esxcli [VsanHealthUtil::RunCmd] Err: []
2021-09-02T01:14:03.940Z INFO esxcli [VsanHealthUtil::RunCmd] Err: []
2021-09-02T01:14:03.940Z INFO esxcli [VsanHealthUtil::RunCmd] Got: [{ "UUID": "7c1ff760-3a9e-1ac6-4149-a0423f377a7e", "Object type": "vsan", "Object size": "10737418240", "User friendly name": "(null)", "HA metadata": "(null)", "Allocation type": "Thin", "Policy": "((\"stripeWidth\" i1) (\"cacheReservation\" i0) (\"proportionalCapacity\" i0) (\"hostFailuresToTolerate\" i1) (\"forceProvisioning\" i0) (\"spbmProfileId\" \"aa6d5a82-1c88-45da-85d3-3d74b91a5bad\") (\"spbmProfileGenerationNumber\" l+0) (\"spbmProfileName\" \"vSAN Default Storage Policy\"))", "Object class": "vdisk", "Object capabilities": "NONE", "Object path": "/vmfs/volumes/vsan:52f2089b18190833-66e05c9b09f7312b/7a1ff760-e489-6665-f841-a0423f377a7e/vcenter01_7.vmdk", "Group uuid": "7a1ff760-e489-6665-f841-a0423f377a7e", "Container uuid": "(null)", "IsSparse": "0", "Filesystem Type": "vmfs5d" } ]
2021-09-02T01:14:03.940Z INFO esxcli [VsanHealthUtil::RunCmd] Got: [{ "UUID": "7c1ff760-bcaa-18cc-0bcd-a0423f377a7e", "Object type": "vsan", "Object size": "16106127360", "User friendly name": "(null)", "HA metadata": "(null)", "Allocation type": "Thin", "Policy": "((\"stripeWidth\" i1) (\"cacheReservation\" i0) (\"proportionalCapacity\" i0) (\"hostFailuresToTolerate\" i1) (\"forceProvisioning\" i0) (\"spbmProfileId\" \"aa6d5a82-1c88-45da-85d3-3d74b91a5bad\") (\"spbmProfileGenerationNumber\" l+0) (\"spbmProfileName\" \"vSAN Default Storage Policy\"))", "Object class": "vdisk", "Object capabilities": "NONE", "Object path": "/vmfs/volumes/vsan:52f2089b18190833-66e05c9b09f7312b/7a1ff760-e489-6665-f841-a0423f377a7e/vcenter01_6.vmdk", "Group uuid": "7a1ff760-e489-6665-f841-a0423f377a7e", "Container uuid": "(null)", "IsSparse": "0", "Filesystem Type": "vmfs5d" } ]
2021-09-02T01:14:03.952Z INFO esxcli [VsanHealthUtil::RunCmd] Err: []
2021-09-02T01:14:03.953Z INFO esxcli [VsanHealthUtil::RunCmd] Got: [{ "UUID": "6c272a61-c840-a1ec-7281-a0423f377a7e", "Object type": "vsan", "Object size": "12884901888", "User friendly name": "(null)", "HA metadata": "(null)", "Allocation type": "Thin", "Policy": "((\"stripeWidth\" i1) (\"proportionalCapacity\" (i0 i100)) (\"hostFailuresToTolerate\" i1) (\"forceProvisioning\" i1))", "Object class": "vmswap", "Object capabilities": "NONE", "Object path": "/vmfs/volumes/vsan:527f51ca791241e1-ff9ba4482942519d/51da2861-4a8d-ceea-c0d5-a0423f377a7e/vcenter02-ff0dd788.vswp", "Group uuid": "51da2861-4a8d-ceea-c0d5-a0423f377a7e", "Container uuid": "(null)", "IsSparse": "0", "Filesystem Type": "vmfs5d" } ]
2021-09-02T01:14:03.965Z INFO esxcli [VsanHealthUtil::RunCmd] Err: []
2021-09-02T01:14:03.965Z INFO esxcli [VsanHealthUtil::RunCmd] Got: [{ "UUID": "e8c52e61-2b8d-ae85-c3d2-98be9459fea0", "Object type": "vsan", "Object size": "2147483648", "User friendly name": "(null)", "HA metadata": "(null)", "Allocation type": "Thin", "Policy": "((\"stripeWidth\" i1) (\"cacheReservation\" i0) (\"proportionalCapacity\" i0) (\"hostFailuresToTolerate\" i1) (\"forceProvisioning\" i0) (\"spbmProfileId\" \"aa6d5a82-1c88-45da-85d3-3d74b91a5bad\") (\"spbmProfileGenerationNumber\" l+0) (\"spbmProfileName\" \"vSAN Default Storage Policy\"))", "Object class": "vdisk", "Object capabilities": "NONE", "Object path": "/vmfs/volumes/vsan:52f2089b18190833-66e05c9b09f7312b/e8c52e61-76e2-f22e-b467-98be9459fea0/vCLS (2741).vmdk", "Group uuid": "e8c52e61-76e2-f22e-b467-98be9459fea0", "Container uuid": "(null)", "IsSparse": "0", "Filesystem Type": "vmfs5d" } ]
[root@medusa:~]

1) How do I get a listing of vSAN repair / recovery jobs that are "in progress" (besides randomly watching timestamps in /var/log/vsanEsxcli.log or other log files)?

2) How do I check whether a service is stuck? Is there a particular log that shows a pending repair state?

I am OK with "wait... I am working on it", but I need to be able to tell that apart from "I am stuck and need someone to do something or make a change/decision". (See the command sketch below.)
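For reference, the closest thing to a "jobs in progress" listing from the host CLI is the resync debug namespace (a sketch, assuming vSAN 6.7 or later; the summary form also shows up later in this thread):

# Summary of objects still resyncing and bytes left
esxcli vsan debug resync summary get

# Per-object resync detail
esxcli vsan debug resync list

# Rollup of object health (healthy / reduced availability / inaccessible counts)
esxcli vsan debug object health summary get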

 


Nerd needing coffee
JeremeyWise
Enthusiast

<< More updates >>

 

vSAN device 521b0b8f-2b11-dd1d-fcc6-87b1849bb0f1 is being repaired due to I/O failures, and will be out of service until the repair is complete. If the device is part of a dedup disk group, the entire disk group will be out of service until the repair is complete.

 

I poked around for a while, watching fairly unhelpful events scroll by in the logs:

# Main system logs

tail -10 /var/log/syslog.log

# Various vSAN logs

[root@medusa:~] tail -10 /var/log/vsan
vsanEsxcli.log vsandevicemonitord.log vsanfs.mgmt.log vsanmgmt-tools-1.log vsanmgmt.log vsantraces/
vsananalyticsevents.log vsanfs.configdump.log vsanfs.vdfsop.log vsanmgmt-tools.log vsansystem.log

 

But what was helpful was the GUI, where I found an image conveying what is happening:

vSANRepair_no_ETA.png

 

#odin
vSAN device 521b0b8f-2b11-dd1d-fcc6-87b1849bb0f1 is under propagated permanent error.
vSAN device 521b0b8f-2b11-dd1d-fcc6-87b1849bb0f1 is being repaired due to I/O failures, and will be out of service until the repair is complete. If the device is part of a dedup disk group, the entire disk group will be out of service until the repair is complete.

# thor
vSAN device 52847854-a96b-c757-e27c-bb0a0f9cae92 is under permanent failure.
vSAN device 52d950ac-93f1-4654-6dad-02431ade7a32 is under propagated permanent error.
vSAN device 52d950ac-93f1-4654-6dad-02431ade7a32 is being repaired due to I/O failures, and will be out of service until the repair is complete. If the device is part of a dedup disk group, the entire disk group will be out of service until the repair is complete.

#medusa
vSAN device 52847854-a96b-c757-e27c-bb0a0f9cae92 is under permanent failure.
vSAN device 52b32fd7-2a2a-503c-dda0-e5f420513552 is under propagated permanent error.
vSAN device 52d950ac-93f1-4654-6dad-02431ade7a32 is being repaired due to I/O failures, and will be out of service until the repair is complete. If the device is part of a dedup disk group, the entire disk group will be out of service until the repair is complete.

vSANRepairStuck_PropogatinoError.png

What I now lack is a way to:

1) Know that it is "going and not stuck"...

2) Get any form of ETA, even a rough one.

It is hard to stand around and say "don't touch it" when you have no point of reference that things are not stuck. We just need some means of estimating when the data will be back online. Any idea how to get that (number of blocks / chunklets out of sync, % of the file being replicated, etc.)? See the sketch below.
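There is no built-in ETA, but one crude way to confirm the repair is moving rather than stuck (a sketch; it simply samples the same resync counters referenced above once a minute):

# Watch the bytes-left counter; if it never decreases, the resync is likely stuck
while true; do date; esxcli vsan debug resync summary get; sleep 60; done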


Nerd needing coffee
JeremeyWise
Enthusiast

<poke>

 

Any ideas on next steps?

 

Two of the servers just seem to keep cycling through the same events:

2021-09-16T14:20:04.498Z info vsansystem[1050129] [vSAN@6876 sub=Libs] VsanInfoImpl: Loading 1 dit subclusters from config store on normal node
2021-09-16T14:20:04.498Z info vsansystem[1050129] [vSAN@6876 sub=Libs] VsanConfigStore: Get DatastoreName
2021-09-16T14:20:04.498Z info vsansystem[1050129] [vSAN@6876 sub=Libs] VsanConfigStore: Found datastore name - vsanDatastore
2021-09-16T14:20:09.506Z info vsansystem[1050129] [vSAN@6876 sub=Libs] VsanInfoImpl: Refresh config generation
2021-09-16T14:20:09.515Z info vsansystem[1050129] [vSAN@6876 sub=Libs] VsanInfoImpl: vSan mode is set to : Mode_None
2021-09-16T14:20:09.516Z info vsansystem[1050129] [vSAN@6876 sub=Libs] VsanConfigStore: Get subCluster config count=1
2021-09-16T14:20:09.516Z info vsansystem[1050129] [vSAN@6876 sub=Libs] VsanConfigStore: Get subCluster config count=1
2021-09-16T14:20:09.516Z info vsansystem[1050129] [vSAN@6876 sub=Libs] VsanInfoImpl: Loading 1 dit subclusters from config store on normal node
2021-09-16T14:20:09.516Z info vsansystem[1050129] [vSAN@6876 sub=Libs] VsanConfigStore: Get DatastoreName
2021-09-16T14:20:09.516Z info vsansystem[1050129] [vSAN@6876 sub=Libs] VsanConfigStore: Found datastore name - vsanDatastore

 

whereas the medusa server has some odd notes about disks during the boot process, though all the disks I would expect to see do show up and are noted as "ok":

2021-09-16T14:20:09.516Z info vsansystem[1050129] [vSAN@6876 sub=Libs] VsanConfigStore: Found datastore name - vsanDatastore
2021-09-16T14:20:14.129Z info vsansystem[1050136] [vSAN@6876 sub=VsanSystem opId=03535038]
--> CmmdsUpdateHistoBuckets(100us) <1 <2 <4 <8 <16 <32 <64 <128 <256 <512 <1024 <2048 <4096 <8192 <16384 <infinite
--> disk ok 10 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
--> disk not found 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
--> diskStatus ok 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
--> diskStatus not found 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
--> diskUsage ok 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
--> diskUsage not found 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
--> healthStatus ok 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
--> healthStatus not found 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
--> nodeDecomState ok 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-->
--> EoverflowCount: 0
-->
--> PublishHostCapacity: 8
-->
--> DiskEntryCount: 0 DiskGroupEntryCount: 0
-->
--> DiskCacheCount: 0 DiskGroupCacheCount: 0
-->
2021-09-16T14:20:14.524Z info vsansystem[1050129] [vSAN@6876 sub=Libs] VsanInfoImpl: Refresh config generation
2021-09-16T14:20:14.534Z info vsansystem[1050129] [vSAN@6876 sub=Libs] VsanInfoImpl: vSan mode is set to : Mode_None
2021-09-16T14:20:14.535Z info vsansystem[1050129] [vSAN@6876 sub=Libs] VsanConfigStore: Get subCluster config count=1
2021-09-16T14:20:14.535Z info vsansystem[1050129] [vSAN@6876 sub=Libs] VsanConfigStore: Get subCluster config count=1
2021-09-16T14:20:14.535Z info vsansystem[1050129] [vSAN@6876 sub=Libs] VsanInfoImpl: Loading 1 dit subclusters from config store on normal node
2021-09-16T14:20:14.535Z info vsansystem[1050129] [vSAN@6876 sub=Libs] VsanConfigStore: Get DatastoreName

 

The difference with this vSAN build is that I enabled dedupe and compression on the vSAN volume.

Questions:

1) Is there some health-check / replication service whose logs I can view to see where the repair is stuck?

2) Because I only have three nodes, each in its own replication group (the goal being HA more than capacity), are there steps to break the cluster apart and bring the disks online solo, so I can extract the VMs to my local VMFS datastores?

3) I was hoping to use dedupe and compression to save on cost, but I am wondering whether that was such a good idea. Is there a means to check the logs to see if it played a part in this failure? (A command sketch follows below.)
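For questions 1 and 3, a hedged sketch of per-object and per-disk inspection (objtool is the same binary that appears in vsanEsxcli.log above; the object UUID below is a placeholder):

# Overall object health rollup
esxcli vsan debug object health summary get

# Inspect a single object's attributes and policy (replace the placeholder UUID)
/usr/lib/vmware/osfs/bin/objtool getAttr -u <object-uuid> -z json

# Confirm whether dedupe/compression is enabled on the claimed disks
esxcli vsan storage list | grep -iE "deduplication|compression"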

 

 


Nerd needing coffee
JeremeyWise
Enthusiast

< Update > 

 

Found this site: Tips and Tricks for vSAN troubleshooting – virtualFrog (wordpress.com)

 

Two issues.  Disk and Network. 

[root@odin:/var/log] esxcli vsan health cluster list
Health Test Name Status
-------------------------------------------------- ------
Overall health red (Network misconfiguration)
Network red
Hosts with connectivity issues red
vSAN cluster partition green
All hosts have a vSAN vmknic configured green
vSAN: Basic (unicast) connectivity check green
vSAN: MTU check (ping with large packet size) green
vMotion: Basic (unicast) connectivity check green
vMotion: MTU check (ping with large packet size) green
Network latency check green
Multicast assessment based on other checks green
All hosts have matching multicast settings green
Physical disk red
Operation health red
Disk capacity green
Congestion green
Component limit health green
Component metadata health green
Memory pools (heaps) green
Memory pools (slabs) green
Performance service yellow
Performance service status yellow
Data green
vSAN object health green
vSAN object format health green
Cluster green
Advanced vSAN configuration in sync green
vSAN daemon liveness green
vSAN Disk Balance green
Resync operations throttling green
Software version compatibility green
Disk format version green
Capacity utilization green
Storage space green
Read cache reservations green
Component green
What if the most consumed host fails green

[root@thor:/var/log] esxcli vsan debug resync summary get
Total Number Of Resyncing Objects: 0
Total Bytes Left To Resync: 0
Total GB Left To Resync: 0.00

I have run pings between each and every host, with the jumbo-frame MTU test, and it all works.

Example: from thor, 172.16.101.101 (10Gb NIC dedicated to vSAN replication on standard vSwitch "vswitch1" with the vmk1 interface):

[root@thor:/var/log] vmkping -I vmk1 172.16.101.101 -d -s 8972
PING 172.16.101.101 (172.16.101.101): 8972 data bytes
8980 bytes from 172.16.101.101: icmp_seq=0 ttl=64 time=0.087 ms
8980 bytes from 172.16.101.101: icmp_seq=1 ttl=64 time=0.066 ms

--- 172.16.101.101 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.066/0.076/0.087 ms

[root@thor:/var/log] vmkping -I vmk1 172.16.101.102 -d -s 8972
PING 172.16.101.102 (172.16.101.102): 8972 data bytes
8980 bytes from 172.16.101.102: icmp_seq=0 ttl=64 time=0.387 ms
8980 bytes from 172.16.101.102: icmp_seq=1 ttl=64 time=0.361 ms

--- 172.16.101.102 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.361/0.374/0.387 ms

[root@thor:/var/log] vmkping -I vmk1 172.16.101.103 -d -s 8972
PING 172.16.101.103 (172.16.101.103): 8972 data bytes
8980 bytes from 172.16.101.103: icmp_seq=0 ttl=64 time=0.242 ms
8980 bytes from 172.16.101.103: icmp_seq=1 ttl=64 time=0.219 ms

--- 172.16.101.103 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.219/0.230/0.242 ms

 

So I am not sure what else to go through.


Nerd needing coffee
JeremeyWise
Enthusiast

 

 

Clawing forward.. 

 

esxcli vsan health cluster list -w   ## this gives you the health check list with the short names you can use to get more detail on the health failures

[root@thor:~] esxcli vsan health cluster list -w |grep red
Overall health red (Network misconfiguration)
Network red
Hosts with connectivity issues (hostconnectivity) red
All hosts have a vSAN vmknic configured (vsanvmknic) green
Physical disk red
Operation health (physdiskoverall) red
Component metadata health (componentmetadata) red

 

# Now look at each red event in detail (well, that is the hope)

[root@thor:~] esxcli vsan health cluster list -w |grep red
Overall health red (Network misconfiguration)
Network red
Hosts with connectivity issues (hostconnectivity) red
All hosts have a vSAN vmknic configured (vsanvmknic) green
Physical disk red
Operation health (physdiskoverall) red
Component metadata health (componentmetadata) red
[root@thor:~] esxcli vsan health cluster get -t hostconnectivity
Hosts with connectivity issues red

Checks if API calls from VC to a host are failing while the host is in connected state.
Ask VMware: http://www.vmware.com/esx/support/askvmware/index.php?eventtype=com.vmware.vsan.health.test.hostconn...

Hosts with communication issues
Host
-------------------
172.16.101.102
172.16.101.103


[root@thor:~] esxcli vsan health cluster get -t componentmetadata
Component metadata health red

Checks whether vSAN has encountered an integrity issue of the metadata of a component on this disk.
Ask VMware: http://www.vmware.com/esx/support/askvmware/index.php?eventtype=com.vmware.vsan.health.test.componen...

Components with issues
Host Component Health Notes
-----------------------------------------------------------------------------------------
172.16.101.101 7eb04061-67b4-b70b-1995-a0423f35e8ee red Invalid state
172.16.101.101 a5b84061-a38c-0943-9acb-a0423f35e8ee red Invalid state


[root@thor:~] esxcli vsan health cluster get -t physdiskoverall
Operation health red

Checks the operation health of the physical disks for all hosts in the vSAN cluster.
Ask VMware: http://www.vmware.com/esx/support/askvmware/index.php?eventtype=com.vmware.vsan.health.test.physdisk...

Disks with issues
Host Disk Overall health Metadata health Operational health In CMMDS/VSI Operational State Description Recommendation UUID
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
172.16.101.101 Local ATA Disk (t10.ATA_____KINGSTON_SA400S37120G___________________50026B77838F133D____) red green red Yes/Yes Propagated permanent disk failure in disk group Please unmount the disk group 52d950ac-93f1-4654-6dad-02431ade7a32
172.16.101.101 Local ATA Disk (t10.ATA_____WDC__WDS100T2B0B2D00YS70_________________192490801828________) red green red No/Yes Permanent disk failure Please remove the disk 52847854-a96b-c757-e27c-bb0a0f9cae92

 

 

Which is what I see in the GUI, but it is not much help.
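The same per-disk state is also visible from the CLI debug namespace (a sketch, assuming vSAN 6.6 or later; output columns vary by build):

# CMMDS view of each claimed disk, including operational state and error hints
esxcli vsan debug disk list

# Rolled-up counts per disk state
esxcli vsan debug disk summary get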

 

I already know I can ping with the jumbo-frame (8972-byte) payload. The switch has not changed, but I am testing multicast to see whether that has somehow changed and is blocking things:

[root@thor:~] tcpdump-uw -c 10 -n
tcpdump-uw: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmk0, link-type EN10MB (Ethernet), capture size 262144 bytes
20:32:39.742910 IP 172.16.100.101.22 > 172.16.100.144.1766: Flags [P.], seq 1849270085:1849270293, ack 1990220337, win 128, length 208
20:32:39.743016 IP 172.16.100.101.22 > 172.16.100.144.1766: Flags [P.], seq 208:400, ack 1, win 128, length 192
20:32:39.743057 IP 172.16.100.101.22 > 172.16.100.144.1766: Flags [P.], seq 400:576, ack 1, win 128, length 176
20:32:39.743095 IP 172.16.100.101.22 > 172.16.100.144.1766: Flags [P.], seq 576:752, ack 1, win 128, length 176
20:32:39.743131 IP 172.16.100.101.22 > 172.16.100.144.1766: Flags [P.], seq 752:928, ack 1, win 128, length 176
20:32:39.743167 IP 172.16.100.101.22 > 172.16.100.144.1766: Flags [P.], seq 928:1104, ack 1, win 128, length 176
20:32:39.743203 IP 172.16.100.101.22 > 172.16.100.144.1766: Flags [P.], seq 1104:1280, ack 1, win 128, length 176
20:32:39.743239 IP 172.16.100.101.22 > 172.16.100.144.1766: Flags [P.], seq 1280:1456, ack 1, win 128, length 176
20:32:39.743275 IP 172.16.100.101.22 > 172.16.100.144.1766: Flags [P.], seq 1456:1632, ack 1, win 128, length 176
20:32:39.743310 IP 172.16.100.101.22 > 172.16.100.144.1766: Flags [P.], seq 1632:1808, ack 1, win 128, length 176
10 packets captured
10 packets received by filter
0 packets dropped by kernel
[root@thor:~]
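The capture above is on vmk0 (management) and mostly shows the SSH session itself; a hedged sketch for watching the vSAN vmknic instead (vmk1 and the port filter are assumptions based on the unicastagent output earlier):

# Capture vSAN unicast CMMDS traffic on the vSAN interface
tcpdump-uw -i vmk1 -n -c 20 udp port 12321

# Or capture on the vmk directly with pktcap-uw
pktcap-uw --vmk vmk1 -c 20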

 

Any other logs I can look at where these two failures are noted? The disks show online and OK, and ping (and I think multicast) are working, so I am not sure what else to check to clear those two errors.

 


Nerd needing coffee
TheBobkin
Champion

@JeremeyWise 

"Any other logs I can look at where these two failures are noted.. but disk shows online and ok... and ping and I think multicast are working .. so not sure what else to check to clear those two errors"

 

It would appear you have checked a lot of not-particularly-helpful logs for the issue at hand (data loss after 2 of 3 DGs failed). The best log to start with is almost always vmkernel.log; if you are unsure whether an event is covered in vmkernel.log, validate with vobd.log (though vobd only records specific events, i.e. nothing that isn't a VOB).
If vmkernel.log doesn't cover the time in question and you can't replicate the event (e.g. by attempting to mount a broken DG), then vmkwarning.log tells you some things but not the specifics (e.g. it will tell you a disk was dropped or became PDL, but not the SCSI sense codes that explain why).

vsandevicemonitord.log will only log specific kinds of events (e.g. prolonged latency on a device).
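For example, something along these lines pulls the relevant events out of those logs around the failure window (a sketch; the grep patterns are illustrative, not exhaustive):

# Disk/DG failure events usually land here, with SCSI sense data
grep -iE "vsan|permanent|PDL|medium error" /var/log/vmkernel.log | less

# VOB-level summary of the same events
grep -i vsan /var/log/vobd.log | less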

 

The devices you are using here appear to be some form of WD Blue M.2 and probably the cheapest Kingston SSD/NVMe you can find. Running vSAN on these (and especially anything with somewhat complex cache-to-capacity interplay in both directions, e.g. deduplication) is just asking for trouble. Enterprise devices (e.g. ones on the vSAN HCL) are not relatively expensive for no reason: they have sophisticated error-checking and failsafe mechanisms that mean they don't mangle I/Os during things like a power outage. If you are going to continue using this cluster for vSAN, then either get some form of supported devices or at the very least use the least complex configuration (e.g. no dedupe, no compression, no encryption, RAID1 throughout, non-stretched).
It is quite possible that these DGs are corrupt because what is stored/mapped from cache data and/or metadata doesn't match what is on the capacity disks (particularly the dedupe mappings). The only thing I would try is cold power-cycling the nodes (i.e. all the way to off) and then seeing whether they are able to mount the DGs, or just fail due to something like the above.
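After such a power cycle, a hedged way to see whether the disk groups came back (a sketch; same storage-list fields as earlier in the thread):

# Check whether each disk rejoined CMMDS after the reboot
esxcli vsan storage list | grep -E "Device:|In CMMDS:"

# And whether any objects remain inaccessible
esxcli vsan debug object health summary get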

 

JeremeyWise
Enthusiast

I think the drives were mangled beyond repair by the power issue. As you noted, with cheaper parts I will avoid the complexity of dedupe and compression on volumes where data loss matters.

 

Thanks for the response, as always.


Nerd needing coffee