Hello everybody.
We have a cluster of 4 ESXi hosts (build 201704001-5310538) and VCSA 6.5.0-5705665.
At some point (we do not know why) the clomd liveness check started failing on two of the hosts. Running "/etc/init.d/clomd start" starts clomd, but after a minute or two it reports "is not running" again. We tried rebooting the hosts and the VCSA, but that did not help.
Any help?
Hello malefik,
My best bet would be that you are hitting a known issue (fixed in 6.5 U1) where trying to repair zero-size Swap Objects crashes clomd:
kb.vmware.com/kb/2149968
Restart clomd and look in the clomd log for the problematic Object UUID when it crashes.
You can then identify this Object using:
# cmmds-tool find -f json -t DOM_OBJECT -u <UUID From Logs>
Look for the Active component UUID (State: 5) and then from the host that is the DOM_OWNER of this Object run:
# /usr/lib/vmware/osfs/bin/objtool getAttr --bypassDom -u <Active Component UUID> -c
This will tell you which VM's .vswp it is (e.g. VMName.vswp).
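For anyone who would rather not eyeball the JSON, the "look for the Active component (State: 5)" step could be sketched roughly as below. This is only a sketch: the key names `attributes`, `componentState` and `componentUuid` are my assumption about how the cmmds-tool JSON output is laid out, and the function name is made up for illustration, so check them against your own output before trusting the result.

```python
ACTIVE_STATE = 5  # "State: 5" is the Active state referred to above


def find_active_components(node, found=None):
    """Walk parsed JSON and collect the UUIDs of Active components.

    Assumes (not verified here) that each component is a dict with a
    "componentUuid" key and an "attributes" dict holding "componentState".
    """
    if found is None:
        found = []
    if isinstance(node, dict):
        attrs = node.get("attributes")
        if (
            isinstance(attrs, dict)
            and attrs.get("componentState") == ACTIVE_STATE
            and "componentUuid" in node
        ):
            found.append(node["componentUuid"])
        # Components are nested under RAID/config nodes, so keep descending.
        for value in node.values():
            find_active_components(value, found)
    elif isinstance(node, list):
        for item in node:
            find_active_components(item, found)
    return found
```

Save the cmmds-tool output to a file, load it with `json.load`, and pass the result to `find_active_components`. The UUIDs it returns are candidates for the objtool getAttr command above.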
Then you have some choices:
- Power-cycling the VM *should* remove the existing .vswp and create a new one.
- Applying FTT=0 to the Object via objtool. (triple-check that it is just a .vswp UUID that is being changed)
- Power-down the VM and delete the Object. (triple-check that it is just a .vswp UUID that is being deleted)
Also adjust the VM's memory allocation/reservation as per the KB so that this cannot recur when the VM is power-cycled:
"To avoid this issue, ensure that the VM memory allocation is always higher than the VM memory reservation so the swap objects created will always be non-zero sized."
Bob
Thank you very much for your help, Bob!
What if I just update to 6.5 U1 - will this help? Or do I still have to solve the problem manually?
Hello malefik,
Updating will stop it from occurring again, but I am not sure whether it can repair an Object that was already broken before the update, so it is safer to manually remove/remediate it either way.
As long as no VM has a memory reservation equal to (or higher than) its memory allocation, its swap Object can never be zero-sized.
In addition to the memory reservation/allocation condition, a data component needs to become degraded or stale for this issue to occur (which is why you didn't encounter this issue sooner).
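The arithmetic behind that rule is simple: the .vswp is sized as allocation minus reservation, so it is zero when the two are equal. A minimal sketch of the check follows; the function names and the VM names in the usage note are purely illustrative, and you would have to fill in the inventory from your own environment.

```python
def swap_size_mb(allocated_mb, reserved_mb):
    """Size of the VM swap object: configured memory minus reserved memory."""
    return max(allocated_mb - reserved_mb, 0)


def vms_with_zero_swap(vms):
    """Return the names of VMs whose swap object would be zero-sized.

    `vms` maps a VM name to a (allocated_mb, reserved_mb) tuple.
    """
    return [
        name
        for name, (allocated_mb, reserved_mb) in vms.items()
        if swap_size_mb(allocated_mb, reserved_mb) == 0
    ]
```

For example, `vms_with_zero_swap({"vm-a": (8192, 8192), "vm-b": (8192, 4096)})` flags only `vm-a`, since its memory is fully reserved.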
Can you upload/attach the clomd.log (/var/log/clomd.log) from when clomd is crashing so that I can verify that this is indeed the cause?
(There *can* be other causes but from the symptoms you described and the build version this is most likely)
Bob
Dear Bob, thank you so much for the help - the problem was solved exactly the way you wrote it!
Hello malefik,
Good to hear that was the issue and fix.
A note for anyone reading this in future who thinks they are hitting this issue: while rebooting a vSAN node can often resolve issues with controllers/disk-groups/drivers/firmware, one should always look at the logs and try to figure out what is causing the issue first. Always try to put hosts in Maintenance Mode before rebooting. If 'Ensure Accessibility' is not possible due to Objects with issues, then at least 'No Action' should shut things down more cleanly (though any FTT=0 Objects on this node will become inaccessible until it comes back up). Don't reboot hosts without checking Object availability and resync status first.
If anyone is running vSAN 6.5 GA (build: 4564106) or any later build before 6.5 U1 (build: 5969303) and hits this issue, and the steps/options in my previous comments are not clear or you are not comfortable checking this yourself, then please open an SR with us to take a look.
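As a quick sanity check on whether a host falls in that build range, something like the following would do. It is a sketch using only the two build numbers quoted above; the helper name is made up, and being in range only means the issue is possible, not that it is the cause.

```python
VSAN_65_GA_BUILD = 4564106  # 6.5 GA, first build in the affected range
VSAN_65_U1_BUILD = 5969303  # 6.5 U1, first build with the fix


def in_affected_range(build):
    """True if the build is 6.5 GA or later but earlier than 6.5 U1.

    Accepts an int or a string such as "201704001-5310538"; for the
    latter, only the part after the last "-" is treated as the build.
    """
    build_number = int(str(build).rsplit("-", 1)[-1])
    return VSAN_65_GA_BUILD <= build_number < VSAN_65_U1_BUILD
```

The build from the original post, `in_affected_range("201704001-5310538")`, comes back as in range, which matches the symptoms described.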
(Note: other factors such as unsupported hardware can cause clomd issues so try to verify via vmkernel.log that this is the cause, PM me if in doubt).
Bob