VMware Cloud Community
Knowledgable
Enthusiast

CLOMD Service issue

Hi All,

I'm simulating a deployment of ESXi 6.5 U1 on our dev servers before attempting it on the new production units.

I've tried installing basic vSAN from the vCenter Server in a 2-node setup. However, the installation is only partially successful: it fails at the "Virtual SAN CLOMD liveness" check, which, as I've read, is critical for object creation. The VMware KB says the service should be running and, if so, restarted at the least. After several attempts at restarting the service, rebooting the host servers, and decommissioning and redoing the vSAN install, I can't seem to get past this issue.

The host servers always report the CLOMD status as running, which is what is even more confusing. It's as if the vCenter Server is unable to communicate with the process, though they all (vCenter and the 2 nodes) sit on the same subnet with no firewall between them. I've attached some screenshots for clarity.


Grateful if anyone can chime in with suggestions.

6 Replies
TheBobkin
Champion

Hello Knowledgable,

"I've tried installing basic vsan from the vcenter server in a 2 node setup"

Just in case you are not aware: despite the name, a 3rd node is required for a '2-node' cluster, namely a Witness Appliance - this requires no license and can be installed as a VM running on any host that is not part of a vSAN cluster.

docs.vmware.com/en/VMware-vSphere/6.5/com.vmware.vsphere.virtualsan.doc/GUID-05C1737A-5FBA-4AEE-BDB8-3BF5DE569E0A.html

Network:

'All hosts have a Virtual SAN vmknic configured'

- Are all hosts that are intended to be in the vSAN cluster in a vCenter cluster object with no other nodes under this object?

- Do all hosts have a separate vmknic interface (in the same subnet as the other cluster members) with vSAN traffic enabled on it? (Note: having anything else, such as vMotion, enabled on this vmk can cause serious issues.)

# esxcfg-vmknic -l

# esxcli vsan network list

- Check MTU size on all relevant components - this should be consistent end-to-end and in general 1500 is less hassle than jumbo-frames.

Can these communicate with one another?

# ping -I vmkX <IP Address of vSAN VMK on other host as per the above output>
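If jumbo frames are in play, a quick way to confirm MTU consistency end-to-end is a don't-fragment ping at near-MTU size - a sketch, assuming vmk1 is the vSAN interface and a 9000-byte MTU (8972 accounts for the IP/ICMP headers):

```shell
# List vmkernel interfaces and their MTU (run on each host)
esxcfg-vmknic -l

# Test the path with don't-fragment set at near-jumbo payload size;
# if this fails but a plain ping works, something in the path is not
# passing 9000-byte frames end-to-end
vmkping -I vmk1 -d -s 8972 <IP of vSAN vmk on other host>

# For a standard 1500 MTU the equivalent payload size is 1472
vmkping -I vmk1 -d -s 1472 <IP of vSAN vmk on other host>
```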

'Hosts with Virtual SAN disabled'

- Do all hosts have vSAN enabled? Also refer to the first point regarding other hosts under the same vCenter cluster object that are not in the vSAN cluster.

# esxcli vsan cluster get

Cluster:

'Disk Format Version'

- Likely indicates you have hosts with mixed builds or versions - this can also cause issues with regard to Multicast vs. Unicast being chosen as the cluster-communication method.

'Software version compatibility'

- Likely indicates you have hosts with mixed builds or versions - this is far from ideal and, if there is enough of a codebase difference, they may not function together or may be erratic - make sure all hosts are on the same ESXi build.
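A quick way to compare builds, assuming shell access to each host - `esxcli system version get` and `vmware -vl` are both standard on ESXi:

```shell
# Run on each host and compare the 'Build' lines -
# they should be identical across the cluster
esxcli system version get

# Shorter equivalent
vmware -vl
```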

'Virtual SAN CLOMD liveness'

- Likely a result of a combination of some of the above points, though daemons can be hung and still report as running, so test this by:

# ps | grep clomd   (take note of the active PIDs)

# /etc/init.d/clomd stop

# ps | grep clomd  (they should all be gone; if not, then likely unkillable processes - reboot the host)

If all stopped fine and # /etc/init.d/clomd status shows as not running, restart the service:

# /etc/init.d/clomd start
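It can also be worth checking clomd's own log for crash or hang clues - a sketch, assuming the default log location on ESXi (/var/log/clomd.log):

```shell
# Tail the clomd log for recent activity
tail -n 50 /var/log/clomd.log

# Or filter for obvious trouble
grep -iE 'error|panic|fail' /var/log/clomd.log | tail -n 20
```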

Hope this helps as a start to fixing this; feel free to post the output from any of the above commands if it is not making sense.

Bob

Knowledgable
Enthusiast

Thanks for the follow up.

I did use the same install medium across both servers, so based on your input I'm not sure why it's still iffy. Please see the attached output from the recommended tests. A paste had it all over the place.

Let me know your thoughts.

TheBobkin
Champion

Hello Knowledgable,

That cluster looks okay - I don't think it could be clustered if vSAN networking were not configured, and a single membership update count means it is not flapping either.

The true test of clomd liveness is Object creation; as you only have 2 nodes here, this won't be possible until you set up a 3rd node or Witness. (I could have SWORN FTT=0 or 'Force provisioned' Object creation was still possible with 2 nodes, but I tested in a HOL 6.6.1 lab and it fails; objtool create is also failing, but likely for other reasons.)

What is your plan regarding a 3rd host or Witness in this cluster? I would advise sorting this out before working out what is not functioning as intended.

Note that some have seen issues with clomd liveness resolved by removing and recreating the vSAN vmk interfaces (though it doesn't look like clomd is crashing here, so this may not be applicable):

https://lab-rat.com.au/2017/05/26/vsan-6-5-to-6-6-upgrade-issues-clomd-liveness/

Are you using substantially different vCenter and host build versions here? This can potentially cause multiple aspects of the Health check to either report incorrectly or cease to function entirely.

What do the lower panes of each health check alert/warning say? (For most checks these give much more verbose information, including the element(s) that are affected.)

Bob

Knowledgable
Enthusiast

As per your observation, I have realized that I'm running 'VMware-vCenter-all-6.5.0-4602587' vs. 'VMware-VMvisor-Installer-6.5.0.update01-5969303.x86_64' for the host installs. I have set up the Witness server to add into the mix to see if that also works out.

At present the lower panes for health are blank and greyed out. Going to install the updated vCenter and report back.

Thanks for the assistance so far.

Knowledgable
Enthusiast

Hi Bobkin,

Returned yesterday and rolled out the more recent vCenter server. As you suggested, the difference between the vCenter and ESXi versions was the issue.

Thanks much.

TheBobkin
Champion

Hello Knowledgable,

Cheers for updating - it makes it easier for others to find info on what resolves this or similar issues that don't initially make sense (X is reporting a problem, but Y appears fine).

Any significant codebase-level difference between closely knit software (vCenter and ESXi/vSAN here) is always a good place to start troubleshooting, especially if they are reporting different information or states - when in doubt, trust the component that has the fewest layers between it and what you are testing (ESXi with the vSAN layers here).

Bob
