VMware Cloud Community
timalexanderINV
Enthusiast

Compliance Check for VMs in cluster - Unknown

We have a vSAN stretched cluster running on ESXi 6.5u1e and vSAN 6.6.1.

Currently, as we move VMs across to it, the script we use is taking 20+ minutes to apply the storage policy.  Looking into it further, none of the VMs in the cluster have a valid SPBM compliance state; they all show "Unknown".  I checked the VASA storage providers and found that one node had an IOFILTER provider marked as "Offline".  I removed it because I could not get it to rescan, then restarted the node, and after the reboot the provider registered correctly with an "Online" state.

This seems to have cleared the errors in sps.log on the VCSA for this node, but I still cannot refresh compliance information for any VM in this cluster.  I get the following stream of errors in sps.log:

[code]
2018-03-12T13:32:30.553Z [pool-32-thread-3] DEBUG opId=CheckVmRollupComplianceResolver-applyOnMultiEntity-1715716-ngc:70015178 com.vmware.vim.sms.provider.ProviderFactory - [getActiveProvider] Retry for the 24th time...
2018-03-12T13:32:30.578Z [pool-32-thread-3] ERROR opId=CheckVmRollupComplianceResolver-applyOnMultiEntity-1715716-ngc:70015178 com.vmware.vim.sms.provider.ProviderCache - [getProvider] No provider exists with uid: b8b50430-5aac-47ce-a1eb-20ffff99f497
2018-03-12T13:32:30.579Z [pool-32-thread-3] ERROR opId=CheckVmRollupComplianceResolver-applyOnMultiEntity-1715716-ngc:70015178 com.vmware.vim.sms.provider.ProviderFactory - [getActiveProvider] Failed to retrieve the provider!
(vim.fault.NotFound) {
   faultCause = null,
   faultMessage = null
}
        at com.vmware.vim.sms.provider.ProviderFactory.getProvider(ProviderFactory.java:348)
        at com.vmware.vim.sms.provider.ProviderFactory.getActiveProvider(ProviderFactory.java:390)
        at com.vmware.vim.sms.provider.ProviderFactory.getVasaClientRetryProxy(ProviderFactory.java:365)
        at com.vmware.vim.sms.policy.PolicyManagerImpl.queryComplianceResult(PolicyManagerImpl.java:174)
        at com.vmware.sps.pbm.impl.LocalSMSServiceImpl.queryComplianceResult(LocalSMSServiceImpl.java:68)
        at com.vmware.sps.pbm.compliance.ObjectStorageComplianceTask.run(ObjectStorageComplianceTask.java:160)
        at com.vmware.vim.storage.common.task.opctx.RunnableOpCtxDecorator.run(RunnableOpCtxDecorator.java:38)
        at com.vmware.vim.storage.common.task.opctx.RunnableOpCtxDecorator.run(RunnableOpCtxDecorator.java:38)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2018-03-12T13:32:30.992Z [pool-32-thread-1] DEBUG opId=CheckVmRollupComplianceResolver-applyOnMultiEntity-1715505-ngc:70015176 com.vmware.vim.sms.provider.ProviderFactory - [getActiveProvider] Retry for the 26th time...
2018-03-12T13:32:31.027Z [pool-32-thread-1] ERROR opId=CheckVmRollupComplianceResolver-applyOnMultiEntity-1715505-ngc:70015176 com.vmware.vim.sms.provider.ProviderCache - [getProvider] No provider exists with uid: b8b50430-5aac-47ce-a1eb-20ffff99f497
2018-03-12T13:32:31.028Z [pool-32-thread-1] ERROR opId=CheckVmRollupComplianceResolver-applyOnMultiEntity-1715505-ngc:70015176 com.vmware.vim.sms.provider.ProviderFactory - [getActiveProvider] Failed to retrieve the provider!
(vim.fault.NotFound) {
   faultCause = null,
   faultMessage = null
}
        at com.vmware.vim.sms.provider.ProviderFactory.getProvider(ProviderFactory.java:348)
        at com.vmware.vim.sms.provider.ProviderFactory.getActiveProvider(ProviderFactory.java:390)
        at com.vmware.vim.sms.provider.ProviderFactory.getVasaClientRetryProxy(ProviderFactory.java:365)
        at com.vmware.vim.sms.policy.PolicyManagerImpl.queryComplianceResult(PolicyManagerImpl.java:174)
        at com.vmware.sps.pbm.impl.LocalSMSServiceImpl.queryComplianceResult(LocalSMSServiceImpl.java:68)
        at com.vmware.sps.pbm.compliance.ObjectStorageComplianceTask.run(ObjectStorageComplianceTask.java:160)
        at com.vmware.vim.storage.common.task.opctx.RunnableOpCtxDecorator.run(RunnableOpCtxDecorator.java:38)
        at com.vmware.vim.storage.common.task.opctx.RunnableOpCtxDecorator.run(RunnableOpCtxDecorator.java:38)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
[/code]

It eventually reaches 60 attempts and then the compliance check fails in the web client.  Has anyone seen this behaviour before, or does anyone know a different angle I can take in troubleshooting it?
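For anyone triaging a similar log, here is a minimal Python sketch that condenses an sps.log excerpt into the provider UIDs reported as missing and the highest retry count seen per opId. The function name `summarize_sps_log` and the regular expressions are my own assumptions, written against the message format in the excerpt above rather than any documented log schema.

```python
import re
from collections import defaultdict

# Patterns are assumptions derived from the log excerpt, not a documented format.
RETRY_RE = re.compile(
    r"opId=(\S+) .*\[getActiveProvider\] Retry for the (\d+)(?:st|nd|rd|th) time"
)
MISSING_RE = re.compile(r"No provider exists with uid: ([0-9a-fA-F-]{36})")


def summarize_sps_log(lines):
    """Return (missing_uid_counts, max_retry_by_opid) for an iterable of log lines."""
    missing = defaultdict(int)
    retries = {}
    for line in lines:
        found = MISSING_RE.search(line)
        if found:
            missing[found.group(1)] += 1
        found = RETRY_RE.search(line)
        if found:
            op_id, attempt = found.group(1), int(found.group(2))
            retries[op_id] = max(retries.get(op_id, 0), attempt)
    return dict(missing), retries


# Four lines copied from the excerpt, used only to exercise the function.
SAMPLE_LOG = """\
2018-03-12T13:32:30.553Z [pool-32-thread-3] DEBUG opId=CheckVmRollupComplianceResolver-applyOnMultiEntity-1715716-ngc:70015178 com.vmware.vim.sms.provider.ProviderFactory - [getActiveProvider] Retry for the 24th time...
2018-03-12T13:32:30.578Z [pool-32-thread-3] ERROR opId=CheckVmRollupComplianceResolver-applyOnMultiEntity-1715716-ngc:70015178 com.vmware.vim.sms.provider.ProviderCache - [getProvider] No provider exists with uid: b8b50430-5aac-47ce-a1eb-20ffff99f497
2018-03-12T13:32:30.992Z [pool-32-thread-1] DEBUG opId=CheckVmRollupComplianceResolver-applyOnMultiEntity-1715505-ngc:70015176 com.vmware.vim.sms.provider.ProviderFactory - [getActiveProvider] Retry for the 26th time...
2018-03-12T13:32:31.027Z [pool-32-thread-1] ERROR opId=CheckVmRollupComplianceResolver-applyOnMultiEntity-1715505-ngc:70015176 com.vmware.vim.sms.provider.ProviderCache - [getProvider] No provider exists with uid: b8b50430-5aac-47ce-a1eb-20ffff99f497
"""

if __name__ == "__main__":
    missing_uids, retry_counts = summarize_sps_log(SAMPLE_LOG.splitlines())
    print(missing_uids)
    print(retry_counts)
```

If every failing opId points at the same UID, as it does in this excerpt, that single UID is the thing to look for among the registered storage providers.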

4 Replies
vpradeep01
VMware Employee

Hello Alex,

Could you please let us know whether the storage providers under "vCenter Server top-level object -> Manage/Configure -> Storage Providers" are listed as online for ALL the ESXi hosts in the vSAN cluster?

> If any of the hosts are missing or listed as offline, please follow the steps below:

1. Remove the storage providers for all the affected ESXi hosts from this tab.

2. Once they are removed, restart the following services on all the vSAN-enabled ESXi hosts:

/etc/init.d/vsanvpd restart

/etc/init.d/vsanmgmtd restart

3. Now restart the SPS service on the vCenter Server.

On the appliance, run these commands:

service-control --list

service-control --stop vmware-sps

service-control --start vmware-sps

On a Windows vCenter Server, please restart the vpxd service, which may restart all the other services as well.

4. Under Storage Providers, refresh and make sure the storage providers for all the hosts are listed as "Online", including the IO Filter providers.

Note:

An "Unknown" status on all the VMs is mostly caused by the policy named "Default policy" being picked up while cloning, Storage vMotioning, or deploying VMs from a template.


Steps to correct the policy:

1. Pick a VM, right-click it, select VM Storage policy, edit the policy, and select "vSAN Default Storage Policy".

Note:
Do make sure the component placements actually match the vSAN Default Storage Policy. This policy usually places components in a RAID 1 configuration with a stripe width of 1. You can click VM -> Monitor -> Policies -> Physical disk placement to see how many components were created. Please attach a screenshot if you are not sure.

Thanks

Pradeep

timalexanderINV
Enthusiast

Thanks for the response.  I can see them all listed (2 per host).  However, some have no "LastSyncTime" value or have one that is months old:

Name                 Status       VasaVersion LastSyncTime           Namespace            Url
----                 ------       ----------- ------------           ---------            ---
VSAN Provider inv... online       1.5         12/03/2018 13:09:50    VSAN                 https://i
IOFILTER Provider... online       1.5         12/03/2018 13:10:08    IOFILTERS            https://i
VSAN Provider inv... online       1.5         13/02/2018 16:20:32    VSAN                 https://i
IOFilter Provider... online       1.5         12/03/2018 13:09:51    IOFILTERS            https://i
VSAN Provider inv... online       1.5                                VSAN                 https://i
IOFilter Provider... online       1.5         12/03/2018 13:10:00    IOFILTERS            https://i
VSAN Provider inv... online       1.5         01/02/2018 17:35:36    VSAN                 https://i
...

Would it be best to restart vsanvpd and vsanmgmtd on these boxes?  I am guessing that restarting these services is non-disruptive?

None of these VMs have the "vSAN Default Storage Policy".  We have created a number of policies for each vCenter and have changed the default policy for the datastore.  The script we use also stamps the required storage policy on the VM after it has been migrated; this is the step that takes 20+ minutes, because it cannot check the VM's compliance.
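To make the stale entries in the listing above easier to spot, here is a rough Python sketch that flags providers whose LastSyncTime is empty or older than a threshold. The helper name `stale_providers`, the one-day threshold, and the reference time are all assumptions on my part; the rows are copied from the listing (names truncated as shown), and the dates are read as day/month/year. A provider flagged here is only a candidate for a closer look, not proof of a fault.

```python
from datetime import datetime, timedelta

# (Name, LastSyncTime) pairs copied from the provider listing; names are truncated there.
ROWS = [
    ("VSAN Provider inv...", "12/03/2018 13:09:50"),
    ("IOFILTER Provider...", "12/03/2018 13:10:08"),
    ("VSAN Provider inv...", "13/02/2018 16:20:32"),
    ("IOFilter Provider...", "12/03/2018 13:09:51"),
    ("VSAN Provider inv...", ""),
    ("IOFilter Provider...", "12/03/2018 13:10:00"),
    ("VSAN Provider inv...", "01/02/2018 17:35:36"),
]


def stale_providers(rows, now, max_age=timedelta(days=1)):
    """Return (name, reason) for providers never synced or synced longer ago than max_age."""
    stale = []
    for name, last_sync in rows:
        if not last_sync.strip():
            stale.append((name, "never synced"))
            continue
        # Assumed format: day/month/year, as in the listing.
        synced = datetime.strptime(last_sync, "%d/%m/%Y %H:%M:%S")
        if now - synced > max_age:
            stale.append((name, f"last synced {synced:%Y-%m-%d}"))
    return stale


if __name__ == "__main__":
    # Assumed reference time, roughly when the listing was captured.
    reference = datetime(2018, 3, 12, 13, 30, 0)
    for name, reason in stale_providers(ROWS, reference):
        print(f"{name}: {reason}")
```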

vpradeep01
VMware Employee

Yes, you can restart the vsanvpd and vsanmgmtd services. This will not cause any harm.

In this case, restarting is not going to solve the problem, as I see that the storage providers on all three hosts are online.

Please change/edit the storage policy for any VM from "Default policy" to "vSAN Default Storage Policy". The compliance check should then work!

Thanks

Pradeep

timalexanderINV
Enthusiast

So it turns out that there was a stale storage provider record in the VCDB that was causing the whole process to fall over.  A colleague worked through the issue with support to get the record removed, and now everything is responding to compliance checks.
