VMware Cloud Community
wfjiang
Contributor

One host could not enter maintenance mode with error "Operation already in progress"

The vSAN cluster is version 6.5.

I tried to put one host into maintenance mode because one of its disks failed, but the host could not enter maintenance mode; it fails with the error "Operation already in progress".

I have no idea what I should do to check further.

Thanks for any tips.

 

13 Replies
bryanvaneeden
Hot Shot

If the task is still running at this time, you could try restarting the management agents on the host that has the "stuck" task.

Also, I would check the hostd.log and vpxa.log, and if needed the vSAN logs in the clomd.log, osfsd.log and vsanvpd.log. You can also check the logs on the vCenter in the vpxd.log to try and find a specific reason there is already a task running.

If all else fails, I suggest creating a case with VMware GSS.

Visit my blog at https://vcloudvision.com!
TheBobkin
Champion

@wfjiang Can you share the output of this command run on the host in question? (Feel free to obscure/change hostnames for privacy etc.)

# cmmds-tool find -t NODE_DECOM_STATE

wfjiang
Contributor

Here are the outputs:

owner=5dee3230-0d93-9310-a5b0-941882eb1e70(Health: Healthy) uuid=5dee3230-0d93-9310-a5b0-941882eb1e70 type=NODE_DECOM_STATE rev=6 minHostVer=0 [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0)], errorStr=(null)

owner=5b179190-d6da-ff10-7b31-94f128c44ce0(Health: Healthy) uuid=5b179190-d6da-ff10-7b31-94f128c44ce0 type=NODE_DECOM_STATE rev=6 minHostVer=0 [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0)], errorStr=(null)

owner=5ad47297-0bd3-4230-67d5-70106fec0e48(Health: Healthy) uuid=5ad47297-0bd3-4230-67d5-70106fec0e48 type=NODE_DECOM_STATE rev=7 minHostVer=0 [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0)], errorStr=(null)

owner=5addd128-b5da-6848-d7ae-94f128c543a0(Health: Healthy) uuid=5addd128-b5da-6848-d7ae-94f128c543a0 type=NODE_DECOM_STATE rev=9 minHostVer=0 [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0)], errorStr=(null)

owner=5addce4a-0bd9-cb60-c516-941882eb3ec0(Health: Healthy) uuid=5addce4a-0bd9-cb60-c516-941882eb3ec0 type=NODE_DECOM_STATE rev=7 minHostVer=0 [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0)], errorStr=(null)

owner=5bc00a98-a917-cb78-d34e-70106feb6bb0(Health: Healthy) uuid=5bc00a98-a917-cb78-d34e-70106feb6bb0 type=NODE_DECOM_STATE rev=7 minHostVer=0 [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0)], errorStr=(null)

owner=5def57d3-1ae7-cd8c-97aa-ecebb883f100(Health: Healthy) uuid=5def57d3-1ae7-cd8c-97aa-ecebb883f100 type=NODE_DECOM_STATE rev=7 minHostVer=0 [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0)], errorStr=(null)

owner=5b23591e-c72a-4598-65f4-70106fec1e18(Health: Healthy) uuid=5b23591e-c72a-4598-65f4-70106fec1e18 type=NODE_DECOM_STATE rev=7134 minHostVer=0 [content = (i4 i2 c6a74bdc-a3b8-7262-e9f8-febad71caa66 i83 [ 6cd38a5d-48ec-5bb1-9d15-94f128c543a0] i0 i1 i5)], errorStr=(null)

owner=5ad49331-abbc-20b8-d7f8-941882eb1eb0(Health: Healthy) uuid=5ad49331-abbc-20b8-d7f8-941882eb1eb0 type=NODE_DECOM_STATE rev=1 minHostVer=0 [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0)], errorStr=(null)

owner=5ad492dd-058e-b9ec-6b6f-941882eb2ef0(Health: Healthy) uuid=5ad492dd-058e-b9ec-6b6f-941882eb2ef0 type=NODE_DECOM_STATE rev=4 minHostVer=0 [content = (i0 i0 UUID_NULL i0 [ ] i0 i0 i0)], errorStr=(null)

TheBobkin
Champion

@wfjiang 

owner=5b23591e-c72a-4598-65f4-70106fec1e18(Health: Healthy) uuid=5b23591e-c72a-4598-65f4-70106fec1e18 type=NODE_DECOM_STATE rev=7134 minHostVer=0 [content = (i4 i2 c6a74bdc-a3b8-7262-e9f8-febad71caa66 i83 [ 6cd38a5d-48ec-5bb1-9d15-94f128c543a0] i0 i1 i5)], errorStr=(null)

 

A node in this cluster is currently entering (i4) MM with 'Full Data Migration' option (i2) selected - you can confirm which node this is with:
# cmmds-tool find -t HOSTNAME -u 5b23591e-c72a-4598-65f4-70106fec1e18
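
For reference, the fields in that content tuple can also be decoded programmatically. Below is a minimal Python sketch; the field order follows the JSON form of the same entry, but the code-to-meaning mappings are assumptions inferred from this thread, not an official reference:

```python
# Hypothetical decoder for the NODE_DECOM_STATE content tuple shown above.
# Field names follow the JSON form of the same entry; the code-to-meaning
# mappings below are assumptions based on the values discussed in this thread.
DECOM_STATE = {0: "no decom activity", 4: "entering maintenance mode (decom in progress)"}
DECOM_JOB_TYPE = {0: "none", 1: "ensure accessibility", 2: "full data migration"}

def describe_decom(content):
    """content: tuple in the order (decomState, decomJobType, decomJobUuid,
    progress, affObjList, errorCode, updateNum, majorVersion)."""
    state, job, job_uuid, progress, aff_objs, err, _update, _ver = content
    return (f"state={DECOM_STATE.get(state, state)}, "
            f"job={DECOM_JOB_TYPE.get(job, job)} ({job_uuid}), "
            f"progress={progress}%, affected objects={len(aff_objs)}, "
            f"errorCode={err}")

# The entry flagged above: (i4 i2 c6a74bdc-... i83 [ 6cd38a5d-... ] i0 i1 i5)
print(describe_decom((4, 2, "c6a74bdc-a3b8-7262-e9f8-febad71caa66", 83,
                      ["6cd38a5d-48ec-5bb1-9d15-94f128c543a0"], 0, 1, 5)))
```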

 

If you use the JSON format, it gives more detail in human-readable form, e.g. which Objects it is resyncing as part of the evacuation, bytes left to sync etc.:
# cmmds-tool find -t NODE_DECOM_STATE -u 5b23591e-c72a-4598-65f4-70106fec1e18 -f json
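
If you want to pull just the interesting fields out of that JSON, a short Python sketch like the following works. The sample entry is a completed version of the kind pasted in this thread (closing braces added so it parses); in practice you would feed in the real command output instead:

```python
import json

# Sketch: summarize `cmmds-tool find -t NODE_DECOM_STATE -f json` output.
# The sample below mirrors (and completes) the kind of entry seen in this thread.
sample = '''{
  "entries": [
    {
      "uuid": "5b23591e-c72a-4598-65f4-70106fec1e18",
      "owner": "5b23591e-c72a-4598-65f4-70106fec1e18",
      "health": "Healthy",
      "type": "NODE_DECOM_STATE",
      "content": {"decomState": 4, "decomJobType": 2,
                  "decomJobUuid": "c6a74bdc-a3b8-7262-e9f8-febad71caa66",
                  "progress": 83,
                  "affObjList": ["6cd38a5d-48ec-5bb1-9d15-94f128c543a0"]}
    }
  ]
}'''

for entry in json.loads(sample)["entries"]:
    c = entry["content"]
    # A non-zero decomState means a decommission (MM evacuation) job is active.
    if c.get("decomState"):
        print(f"{entry['uuid']}: job {c['decomJobUuid']} at {c['progress']}%, "
              f"objects still affected: {c['affObjList']}")
```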

wfjiang
Contributor

Hi, bro

I tried restarting the hostd and vpxa services, but the result was the same: "Operation already in progress".

The following output is cut from hostd.log from when I put the host into maintenance mode. It shows some errors, but I can't understand what happened.

2021-12-22T11:43:51.524Z verbose hostd[2ED81B70] [Originator@6876 sub=PropertyProvider opID=6efaffdc-631c-11ec-6bf7 user=dcui:vsanmgmtd] RecordOp ASSIGN: info, haTask-ha-host-vim.Task.setState-126761269. Applied change to temp map.
2021-12-22T11:43:51.524Z info hostd[2ED81B70] [Originator@6876 sub=Vimsvc.TaskManager opID=6efaffdc-631c-11ec-6bf7 user=dcui:vsanmgmtd] Task Completed : vmodlTask-ha-host-126761265 Status error
2021-12-22T11:43:51.524Z verbose hostd[2ED81B70] [Originator@6876 sub=PropertyProvider opID=6efaffdc-631c-11ec-6bf7 user=dcui:vsanmgmtd] RecordOp ASSIGN: info, vmodlTask-ha-host-126761265. Applied change to temp map.
2021-12-22T11:43:51.524Z info hostd[2E340B70] [Originator@6876 sub=Hostsvc.ModeMgr opID=6efaffdc-631c-11ec-6bf7 user=dcui:vsanmgmtd] Task failed
2021-12-22T11:43:51.524Z verbose hostd[2ED81B70] [Originator@6876 sub=PropertyProvider opID=6efaffdc-631c-11ec-6bf7 user=dcui:vsanmgmtd] RecordOp ASSIGN: info, haTask-ha-host-vim.Task.setState-126761269. Applied change to temp map.
2021-12-22T11:43:51.524Z verbose hostd[2EE44B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: disabledMethod, ha-root-pool. Sent notification immediately.
2021-12-22T11:43:51.524Z verbose hostd[2EE44B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: disabledMethod, ha-folder-vm. Sent notification immediately.
2021-12-22T11:43:51.525Z info hostd[2EE85B70] [Originator@6876 sub=SysCommandPosix] ForkExec(/usr/lib/vmware/vob/bin/addvob) 3588574
2021-12-22T11:43:51.531Z info hostd[2E340B70] [Originator@6876 sub=Hostsvc.VmkVprobSource] VmkVprobSource::Post event: (vim.event.EventEx) {
--> createdTime = "1970-01-01T00:00:00Z",
--> userName = "",
--> datacenter = (vim.event.DatacenterEventArgument) null,
--> computeResource = (vim.event.ComputeResourceEventArgument) null,
--> host = (vim.event.HostEventArgument) {
--> name = "node02",
--> host = 'vim.HostSystem:ha-host'
--> },
--> vm = (vim.event.VmEventArgument) null,
--> ds = (vim.event.DatastoreEventArgument) null,
--> net = (vim.event.NetworkEventArgument) null,
--> dvs = (vim.event.DvsEventArgument) null,
--> fullFormattedMessage = <unset>,
--> changeTag = <unset>,
--> eventTypeId = "esx.audit.maintenancemode.failed",
--> severity = <unset>,
--> message = <unset>,
--> arguments = <unset>,
--> objectId = "ha-host",
--> objectType = "vim.HostSystem",
--> objectName = <unset>,
--> fault = (vmodl.MethodFault) null
--> }
2021-12-22T11:43:51.532Z verbose hostd[2E340B70] [Originator@6876 sub=PropertyProvider] RecordOp REMOVE: latestPage[11], session[526d0735-9095-5133-7040-aded64b1b707]527f0842-8a50-3604-d0e6-9f46ee0ac65a. Applied change to temp map.
2021-12-22T11:43:51.532Z verbose hostd[2E340B70] [Originator@6876 sub=PropertyProvider] RecordOp ADD: latestPage[21], session[526d0735-9095-5133-7040-aded64b1b707]527f0842-8a50-3604-d0e6-9f46ee0ac65a. Applied change to temp map.
2021-12-22T11:43:51.532Z verbose hostd[2E340B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: latestEvent, ha-eventmgr. Applied change to temp map.
2021-12-22T11:43:51.532Z info hostd[2E340B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 21 : The host has failed entering maintenance mode.

wfjiang
Contributor

Hi, TheBobkin

It is the same node that encountered this problem. I put the node into maintenance mode with Full Data Migration, but it spent almost 20 hours and got stuck at 89%, so I cancelled the task manually.

 

[root@node02:/var/log] cmmds-tool find -t HOSTNAME -u 5b23591e-c72a-4598-65f4-70106fec1e18

owner=5b23591e-c72a-4598-65f4-70106fec1e18(Health: Healthy) uuid=5b23591e-c72a-4598-65f4-70106fec1e18 type=HOSTNAME rev=0 minHostVer=0 [content = ("node02")], errorStr=(null)

[root@shvsannode02:/var/log] cmmds-tool find -t NODE_DECOM_STATE -u 5b23591e-c72a-4598-65f4-70106fec1e18 -f json
{
"entries":
[
{
"uuid": "5b23591e-c72a-4598-65f4-70106fec1e18",
"owner": "5b23591e-c72a-4598-65f4-70106fec1e18",
"health": "Healthy",
"revision": "7164",
"type": "NODE_DECOM_STATE",
"flag": "2",
"minHostVersion": "0",
"md5sum": "0b7d9be94e12e41e4248285b21ef43da",
"valueLen": "96",
"content": {"decomState": 4, "decomJobType": 2, "decomJobUuid": "c6a74bdc-a3b8-7262-e9f8-febad71caa66", "progress": 83, "affObjList": [ "6cd38a5d-48ec-5bb1-9d15-94f128c543a0"], "errorCode": 0, "updateNum": 1, "majorVersion": 5},
"errorStr": "(null)"

TheBobkin
Champion

@wfjiang If there is some issue with that node (e.g. disk issue you mentioned) then doing MM with FDM option mightn't be the best idea (e.g. reading from components on disks that are in a problematic state isn't necessarily going to work or may take a long time).


From the output you provided, that vSAN decom task is still running regardless of whether the vSphere-side MM task has been cancelled. If you want to cancel it (e.g. to attempt placing the node in MM with the Ensure Accessibility option), this should be possible using:
# localcli vsan maintenancemode cancel

 

Are there any other red health alerts here e.g. for data or disk (congestion etc.)?
Cluster > Monitor > vSAN > Skyline/vSAN Health > Retest

wfjiang
Contributor

Hi, Bob

I've cancelled the hidden MM task with the command you provided. Thanks.

I then started another task to put this node into MM with Ensure Accessibility. It progressed normally and reached 87% in 5 minutes, but 2 hours have passed and the progress is still hanging at 87%.

The following is the output from cmmds:

[root@node02:~] cmmds-tool find -t NODE_DECOM_STATE         ### I cut the content of the other nodes, which are in a normal state

owner=5b23591e-c72a-4598-65f4-70106fec1e18(Health: Healthy) uuid=5b23591e-c72a-4598-65f4-70106fec1e18 type=NODE_DECOM_STATE rev=8095 minHostVer=0 [content = (i4 i1 821c163b-abcf-86b7-a9bd-50e2d4bdea01 i83 [ 6cd38a5d-48ec-5bb1-9d15-94f128c543a0] i0 i1 i5)], errorStr=(null)

[root@node02:~] cmmds-tool find -t HOSTNAME -u 5b23591e-c72a-4598-65f4-70106fec1e18

owner=5b23591e-c72a-4598-65f4-70106fec1e18(Health: Healthy) uuid=5b23591e-c72a-4598-65f4-70106fec1e18 type=HOSTNAME rev=0 minHostVer=0 [content = ("node02")], errorStr=(null)

[root@node02:~] cmmds-tool find -t NODE_DECOM_STATE -u 5b23591e-c72a-4598-65f4-70106fec1e18 -f json
{
"entries":
[
{
"uuid": "5b23591e-c72a-4598-65f4-70106fec1e18",
"owner": "5b23591e-c72a-4598-65f4-70106fec1e18",
"health": "Healthy",
"revision": "8097",
"type": "NODE_DECOM_STATE",
"flag": "2",
"minHostVersion": "0",
"md5sum": "f42b9638d874b0a9374b6df67f97236c",
"valueLen": "96",
"content": {"decomState": 4, "decomJobType": 1, "decomJobUuid": "821c163b-abcf-86b7-a9bd-50e2d4bdea01", "progress": 83, "affObjList": [ "6cd38a5d-48ec-5bb1-9d15-94f128c543a0"], "errorCode": 0, "updateNum": 1, "majorVersion": 5},
"errorStr": "(null)"

 

Looking forward to your great help. Thanks.

wfjiang
Contributor

In the Skyline monitor, there's a red alarm for 'Absent vSAN disk' from node02.

And by the way, there are two VM objects from other hosts that are resyncing forever; the remaining resync size goes from large to small and then turns large again. It seems like the repeated resyncing is the root problem?

wfjiang
Contributor
Contributor

I tried searching for the UUID (6cd38a5d-48ec-5bb1-9d15-94f128c543a0) of one warning disk.

And I got these messages:

clomd.log:2021-12-23T08:55:42.252Z 67124 (30395205920)(opID:0)CLOMDecomAffObjCb: Obj 6cd38a5d-48ec-5bb1-9d15-94f128c543a0 is not ready. configCSN:1310675 policyCSN:1310676 stateCSN:1310676 configSCSN: 1226555 statsSCSN: 1226555

cmmdsTimeMachineDump.log:1640249966.029458,5b23591e-c72a-4598-65f4-70106fec1e18,24,8422,5b23591e-c72a-4598-65f4-70106fec1e18,2,{"decomState": 4, "decomJobType": 1, "decomJobUuid": "821c163b-abcf-86b7-a9bd-50e2d4bdea01", "progress": 83, "affObjList": [ "6cd38a5d-48ec-5bb1-9d15-94f128c543a0"], "errorCode": 0, "updateNum": 1, "majorVersion": 5}\q

TheBobkin
Champion


@wfjiang  That's not a good sign - 'looping' resync is typically indicative of unrecoverable checksum errors on some components of the objects in question e.g. you have missing components in one data-replica due to the disk issue and the other data-replica has bad blocks resulting in the metadata not matching the data and hence checksum errors.
This can result in enter MM not functioning as expected.
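
To make the "looping" pattern concrete, here is a purely illustrative Python sketch (not a vSAN tool) for spotting it from periodic samples of an object's bytes-left-to-resync, e.g. values noted from the Resyncing Objects view. A healthy resync trends toward zero; a looping one shrinks and then climbs back up:

```python
# Illustrative sketch: detect a "looping" resync from chronological samples
# of bytes-left-to-resync for one object. The sample values are made up.
def looks_like_looping(samples):
    """samples: chronological bytes-left readings for one object."""
    # Count how often bytes-left *increased* between consecutive readings.
    rebounds = sum(1 for prev, cur in zip(samples, samples[1:]) if cur > prev)
    return rebounds >= 2  # shrank and grew again repeatedly => likely looping

healthy = [900, 600, 300, 50, 0]       # steadily converging resync
looping = [900, 300, 850, 200, 780]    # keeps restarting
print(looks_like_looping(healthy), looks_like_looping(looping))  # False True
```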

 

I would advise starting with identifying the objects that are looping and figuring out what to do with them (e.g. backup/restore) - you can probably see them in the resyncing objects tab in the vSphere client but otherwise can just check the one that was in "affObjList" from any node:
# /usr/lib/vmware/osfs/bin/objtool getAttr -u 6cd38a5d-48ec-5bb1-9d15-94f128c543a0

 

Note that Objects in such a state may have other challenges, e.g. they can't take/consolidate a snapshot, can't be cloned, and can't SvMotion, for the same reason they can't complete resync (some data blocks are bad and cannot be read properly). This can be worked around by disabling checksum on the storage policy of just this vmdk object (e.g. clone the current policy, disable checksum on the new policy, and then apply the new policy to just that vmdk). Note this won't fix the impairment of the data; it merely tells vSAN not to check metadata vs data checksums (i.e. ignore any possible corruption and just faithfully use whatever data is in it), and thus you should use something within the Guest OS of this VM to validate whether some files/folders are unusable.

wfjiang
Contributor

Hi, Bob

Thank you very much for your help, and Merry Christmas.

You are totally right: I failed to clone those VMs. Following your suggestion, I applied a cloned policy without checksum to the disk objects; right now those objects' state is re-configuration.

I have no privileges to log in to the guest OS, so I don't know whether the data is accessible or not. Luckily, the OS owner hasn't reported any data loss to me yet.

TheBobkin
Champion
Champion

@wfjiang A Merry Christmas to you also!

 

If for some reason they continue to loop in resync, then you should validate whether snapshot+backup or clone of the VM now works. If it does, work with the VM owner on the feasibility of doing a backup+restore of the VM, or on whether they are okay with switching to the cloned VM. If they are, then once the original VM/vmdk is no longer needed, delete the problematic original vmdk objects. Entering MM and anything else should then be fine, either now (if they completed resync) or after doing that, as these objects were/are likely the only thing preventing it.

 

With regard to the data, while the data-impact of this can of course vary widely, feedback from customers who have encountered similar situations has generally indicated not much impairment (e.g. a few small files/folders unable to be accessed), and I have had others state no evidence of impairment (e.g. it could be bad blocks in something no longer used/deleted in the Guest OS). Either way, it is well worth validating this in depth.
