VMware Cloud Community
ShaunBryant
Contributor

Issue after multiple host failure with RVC

One of my clients had 3 of their 10 hosts fail with a purple screen after applying the latest Dell 5.5 update.

We have recovered the systems and, as you would guess, vSAN is doing a major rebuild (around 122 TB). Five hours into that rebuild, RVC stopped working for a set of commands (see below). Any idea how to bring these back to life? On a side note, the resync is still progressing, as I can see it for individual guests.

Thanks

*** command>vsan.resync_dashboard XXXX

2018-05-22 16:39:44 +0000: Querying all VMs on VSAN ...

2018-05-22 16:39:44 +0000: Querying all objects in the system from XXXXXX ...

vsan.resync_dashboard XXXX hit an error.

Server failed to query syncing objects: JSON = ''

*** command>vsan.disk_object_info MileHigh, disk_uuids

vsan.disk_object_info XXXX, disk_uuids hit an error

SystemError: A general system error occurred: Runtime fault

Total time taken - 141.331096748 seconds

>

4 Replies
TheBobkin
Champion

Hello ShaunBryant,

Welcome to Communities! Some useful info on participating here:

https://communities.vmware.com/docs/DOC-12286

Is there a very good reason that this cluster is still running 5.5?

You are really missing out on a ton of improvements to performance, stability, resiliency and manageability that exist in more recent versions.

Have you tried logging out of RVC, closing the SSH session (if vCSA), opening a new one and trying via RVC again?

Do other commands succeed? e.g. vsan.disks_stats <pathToCluster>

I can't say whether it will work on 5.5 (as a lot of syntax has changed) but you can check from a host what the current resync is:

(run from /tmp/)

# while true;do echo "" > ./resyncStats.txt ;cmmds-tool find -t DOM_OBJECT -f json |grep uuid |awk -F \" '{print $4}' |while read i;do pendingResync=$(cmmds-tool find -t DOM_OBJECT -f json -u $i|grep -o "\"bytesToSync\": [0-9]*,"|awk -F " |," '{sum+=$2} END{print sum / 1024 / 1024 / 1024;}');echo "$i: $pendingResync GiB";done |tee -a ./resyncStats.txt;total=$(cat resyncStats.txt |awk '{sum+=$2} END{print sum}');echo "Total: $total GiB" |tee -a ./resyncStats.txt;total=$(cat ./resyncStats.txt |grep Total);totalObj=$(cat ./resyncStats.txt|grep -vE " 0 GiB|Total"|wc -l);echo "`date +%Y-%m-%dT%H:%M:%SZ` $total ($totalObj objects)" >> ./totalHistory.txt; sleep 120;done

You should be able to also check how many Objects are still resyncing (state: 15):

# cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep -o 'state\\\":\ [0-9]*' | sort | uniq -c
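If it helps to turn the totalHistory.txt file written by the loop above into a rate and a rough time remaining, here is a minimal Python sketch to run on a copy of that file. It assumes each line looks like `2018-05-22T16:39:44Z Total: 1234.5 GiB (321 objects)`, which is what the `echo` in the loop should produce; the function names `parse_history` and `estimate` are made up for illustration, and the estimate simply compares the first and last samples, so treat it as a ballpark figure only.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the echo in the shell loop:
#   <timestamp> Total: <number> GiB (<count> objects)
LINE_RE = re.compile(r"^(\S+)\s+Total:\s+(\S+)\s+GiB")


def parse_history(lines):
    """Return (timestamp, pending_gib) tuples, skipping lines that do not parse."""
    samples = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        try:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")
            pending = float(match.group(2))
        except ValueError:
            continue  # malformed timestamp or number
        samples.append((stamp, pending))
    return samples


def estimate(samples):
    """Return (gib_per_minute, minutes_remaining) from first and last sample.

    Returns None if there are fewer than two samples or the pending
    amount is not shrinking.
    """
    if len(samples) < 2:
        return None
    (t0, g0), (t1, g1) = samples[0], samples[-1]
    minutes = (t1 - t0).total_seconds() / 60
    if minutes <= 0:
        return None
    rate = (g0 - g1) / minutes
    if rate <= 0:
        return None
    return rate, g1 / rate


if __name__ == "__main__":
    # Hypothetical path: a copy of the file the loop writes in /tmp/
    with open("totalHistory.txt") as handle:
        result = estimate(parse_history(handle))
    if result is None:
        print("Not enough usable samples yet")
    else:
        print("%.1f GiB/min, about %.1f hours left" % (result[0], result[1] / 60))
```

New resyncs can start while others finish, so the pending total does not always fall steadily; a longer history gives a more trustworthy number.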

Bob

ShaunBryant
Contributor

The customer has a very large issue with moving past what they already "know" so they are stuck in the 5.5 world.

In order:

I have tried logging out and restarting RVC.

I have even restarted the vCenter appliance.

Disk stats works fine; it is just those two commands that no longer work. The "script" you gave worked to get numbers, by the way. This really seems like some process has died somewhere.

TheBobkin
Champion

Hello ShaunBryant,

"The customer has a very large issue with moving past what they already "know" so they are stuck in the 5.5 world."

That hurts to hear! Let me know if you need any points to help sell the benefits (though I'm betting you need no help there) - 5.5 is End of support soon.

Try giving vsanvpd and clomd a restart on all the nodes (this shouldn't cause any negative impact):

# /etc/init.d/vsanvpd restart

Restarting clomd can cause any VM provisioning tasks that are currently running to fail, so check that no such tasks are in progress first (kb.vmware.com/s/article/2075456):

# /etc/init.d/clomd restart

Glad to hear that script got the info for now anyway - is the resync still progressing at an expected rate?

Bob

ShaunBryant
Contributor

I already tried restarting them earlier. It seems like a strange issue.

It is moving at 45 GB a minute, which is "ok". All of the servers are F2S5, so there is a lot of resyncing to do: the hosts that failed were down for ~4 hours, as we had to revert both the update and the BIOS to stop them from purple screening under load.
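As a rough sanity check on that rate, here is the arithmetic as a short Python sketch. It assumes the full 122 TB figure from the start of the rebuild, decimal units (1 TB = 1000 GB) and a constant 45 GB a minute, so it is an upper bound rather than a real forecast; `eta_hours` is just an illustrative name.

```python
def eta_hours(remaining_tb, rate_gb_per_min):
    """Hours needed to move remaining_tb at a steady rate_gb_per_min."""
    remaining_gb = remaining_tb * 1000  # assumption: decimal TB to GB
    return remaining_gb / rate_gb_per_min / 60


if __name__ == "__main__":
    print(round(eta_hours(122, 45), 1))  # prints 45.2
```

So the whole rebuild is on the order of two days at this pace, less whatever has already synced.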

Thank you for the help, it sounds like after the sync is complete we need to do a rolling reboot of the hosts to fix whatever is wedged.

Thank you again. 
