VMware Cloud Community
billdossett

VSAN 6.5 Storage Providers all offline

Got a strange one here...

I noticed some error messages during my backup and they pointed to no Storage Providers.

I checked my vCenter -> Configure -> Storage Providers and the vSAN providers on all my hosts are offline.  I've googled this and VMware says it's expired certificates.  Mine don't expire until 2026 and are all valid, so that's not the problem.

I think this started when I was testing a script to shut down the vSAN... maybe.  It didn't run to completion, but I brought the cluster back online and it seemed fine... until I saw this.  Now I realize I can't provision new VMs, clone, or do a whole lot.

So today I shut the whole vSAN down cleanly and brought it back up hoping it would clear, but no luck.  Still all offline.  Everything started back up fine.

I've tried the synchronize button and that does nothing.

vsanvpd is running on all four hosts...

At this point I am pretty much out of ideas.  I have backups of all but one VM which I can lose if need be.

I can't seem to find anything else on Google.  I was going to upgrade to 6.7 this weekend.  If I can't figure anything else out by then, I might just go ahead and see what happens.  If I have to rebuild it, I guess I have to rebuild and restore, but I am kind of annoyed that I can't seem to fix this.

Thanks for reading and any ideas you might be able to offer.

Bill

Bill Dossett
1 Solution

Accepted Solutions
TheBobkin

Hello Bill,

"So there is a filter and a provider, do i delete both for one host and then what, rescan for datastores?"

Remove both for one host, then press the rescan button (datastore icon with a green line under it) and the resync storage providers button (datastore icon with orange circular arrows). If the provider for the host you removed is not added back, then it is likely a certificate issue or (less likely) a port issue (e.g. if you can access the version.xml of the hosts from the vCSA then I can't see anything other than authentication blocking here).
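
A minimal sketch of the version.xml reachability check mentioned above, assuming curl is available on the vCSA. The function names and the example hostname are hypothetical; ports 8080 and 9080 are simply the ones that show up in the sps.log entries in this thread. Because -k skips certificate validation, this only tells you whether the port answers, not whether the certificate is trusted.

```shell
# Hypothetical helper: build the provider URL in the form seen in sps.log.
# $1 = ESXi host FQDN, $2 = port
version_xml_url() {
  printf 'https://%s:%s/version.xml\n' "$1" "$2"
}

# Hypothetical helper: print the HTTP status code returned for each port.
# curl reports 000 when it cannot connect at all.
check_version_xml() {
  host=$1
  for port in 8080 9080; do
    url=$(version_xml_url "$host" "$port")
    code=$(curl -k -s -o /dev/null --connect-timeout 5 -w '%{http_code}' "$url")
    echo "$url -> HTTP $code"
  done
}
```

Something like `check_version_xml esxi01.example.local` run from the vCSA shell would then show whether each port responds.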

If they don't appear then try manually registering them. Are the hosts registered with FQDN or just IP?

"I just exported my last VM as an ovf in case I destroy it, but the aim here is to be able to fix it.  I've started to realize how durable the vsan is now. "

As I said before, vSAN shouldn't care if vCenter and even VASA are unavailable - you can confirm the integrity of the data via the health check:

Cluster > Monitor > Health > Data (if you have anything inaccessible then do confirm what the objects are; assuming this is a lab there may be stuff that has never been cleaned up, e.g. .vswp)

You can also check the state of the data via cmmds-tool from any host in the cluster (all State: 7 is all good):

# cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep -o 'state\\\":\ [0-9]*' | sort | uniq -c
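
If you want a yes/no answer rather than eyeballing the counts, the output of that one-liner can be piped into a small check. This is only a sketch: the function name is made up, and it assumes each output line is a count followed by the matched state text, which is what `uniq -c` produces for the pattern above.

```shell
# Hypothetical helper: read "<count> state\": <n>" lines on stdin and
# fail (non-zero exit) if any entry is in a state other than 7,
# or if there was no input at all.
all_state_7() {
  awk '{ total += $1; if ($NF != 7) bad += $1 }
       END {
         printf "%d entries, %d not in state 7\n", total, bad
         exit (bad > 0 || total == 0)
       }'
}
```

Append `| all_state_7` to the cmmds-tool pipeline and check the exit status, e.g. `... | all_state_7 || echo "check the health UI"`.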

Worst case scenario (assuming this is a lab) you could spin up a new vCSA and attach the cluster to that.

"First few iterations of it wound up a mess"

Anything specific that caused issues? And again, is this a lab or test/dev?

"but now I want to know how to manage it down to the nitty gritty.  "

Read the documentation, play with it, use HOL to make and break clusters in mere minutes and see how this works. Get familiar with good tools for monitoring environments such as vSphere Performance graphs (in 6.6 especially), vSAN Observer, RVC.

Bob

10 Replies
TheBobkin

Hello Bill,

"I checked my vCenter -> Configure -> Storage Providers and the vSAN providers on all my hosts are offline."

Are you able to manually register a new one? Is it just vSAN VASA providers or others too?

"I've googled this and VMware says it's expired certificates.  Mine don't expire until 2026 and are all valid, so that's not the problem."

Certs on the hosts or certs on the vCenter? Check the SMS certs on the vCenter if not checked already:

https://kb.vmware.com/s/article/2126810

https://kb.vmware.com/s/article/2078070

https://kb.vmware.com/s/article/2120105

"vsanvpd is running on all four hosts..."

Has the vCenter been restarted? If not, then try restarting the vmware-sps service on the vCenter.

https://kb.vmware.com/s/article/2109881

"I have backups of all but one VM which I can lose if need be."

More than most of our customers unfortunately! Anyone reading this that doesn't have backups (regardless of the platform) please take a page from Bill's book here.

Object resync and repair shouldn't be an issue here - if this was the case you would probably have all of your data out of sync following a shut down and none of it would be healthy - vSAN doesn't really rely on vCenter for this.

Anything funky in the sps.log from vCenter and vsanvpd.log on the hosts? The above KB articles have some easy finds to look for, and if you attach the logs here or PM them I will aim to take a look.

Bob

billdossett

Thanks for all the info Bob!  The vCenter has been restarted.  This is vSAN only, not sure what other storage provider I would add?  The certs that say they are good till 2026 I see on the vcenter->configure->storage providers page at the far right in each storage provider, is that correct or should I be looking elsewhere?  I will do some more digging when I get back on that network later today and check the sps logs, any other logs I should be looking in?  I've been using VCSA for a while now, but have not dug into the logs on it like I used to on the _old_ 5.5 vcenter windows server, so not completely up to speed with where everything is.  Thanks again, much appreciated.

Bill Dossett
TheBobkin

Hello Bill,

"Thanks for all the info Bob!"

Happy to help - troubleshooting and informing on vSAN in my off-hours is apparently my hobby for the last year :)

"page at the far right in each storage provider, is that correct or should I be looking elsewhere"

Check the vCenter SMS certificate as described in the articles I noted (assuming vCSA; if Windows, check the KB):

# cd /usr/lib/vmware-vmafd/bin

# ./vecs-cli entry list --store SMS --text

"check the sps logs, any other logs I should be looking in? "

Well, your issue looks to be with the Storage Provider Service, so sps.log on the vCenter and the corresponding logs on the ESXi hosts (vsanvpd.log) are the best place to start.

"but have not dug into the logs on it like I used to on the _old_ 5.5 vcenter windows server, so not completely up to speed with where everything is"

They are in /var/log/vmware/ and then split as per the below article:

https://kb.vmware.com/s/article/2110014

Bob

billdossett

OK, the sps logs are full of:

2018-05-25T12:44:34.239-06:00 [pool-12-thread-2] ERROR opId=sps-Main-182338-781 com.vmware.vim.sms.provider.vasa.alarm.AlarmDispatcher - Error: org.apache.axis2.AxisFault: self signed certificate occured as provider: https://terrapin-esxi04.terrapin.local:9080/version.xml is offline

2018-05-25T12:44:34.239-06:00 [pool-12-thread-5] ERROR opId=sps-Main-182338-781 com.vmware.vim.sms.provider.vasa.alarm.AlarmDispatcher - Error: org.apache.axis2.AxisFault: Transport error: 405 Error: Method Not Allowed occured as provider: https://terrapin-esxi04.terrapin.local:8080/version.xml is offline

2018-05-25T12:44:34.241-06:00 [pool-12-thread-3] ERROR opId=sps-Main-182338-781 com.vmware.vim.sms.provider.vasa.alarm.AlarmDispatcher - Error: org.apache.axis2.AxisFault: self signed certificate occured as provider: https://terrapin-esxi02.terrapin.local:9080/version.xml is offline

2018-05-25T12:44:34.241-06:00 [pool-12-thread-1] ERROR opId=sps-Main-182338-781 com.vmware.vim.sms.provider.vasa.alarm.AlarmDispatcher - Error: org.apache.axis2.AxisFault: self signed certificate occured as provider: https://terrapin-esxi01.terrapin.local:9080/version.xml is offline

So I can connect to ports 8080 and 9080 from my workstation, but I don't have the cert so I have to bypass the security warning.

I am not getting errors in the SPS log like those in any of the KBs you suggested.  I don't have the disconnected error, but you can see it says the providers are offline.
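
For anyone wading through a log like this, a rough way to boil the "is offline" entries down to one line per host, port and reason. This is a sketch only: the function name is made up, the pattern is written against the exact line format shown above (including the log's own spelling "occured"), and the sps.log path in the usage line is an assumption to adjust for your vCSA.

```shell
# Hypothetical helper: read sps.log lines on stdin and print
# "<count> <host:port> <reason>" for every provider reported offline.
offline_providers() {
  sed -n 's|.*AxisFault: \(.*\) occured as provider: https://\([^/]*\)/version\.xml is offline.*|\2 \1|p' |
    sort | uniq -c
}

# Usage on the vCSA (path is an assumption, not run here):
#   offline_providers < /var/log/vmware/vmware-sps/sps.log
```

With the four lines above this separates the "self signed certificate" failures on 9080 from the "405 Method Not Allowed" response on 8080.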

I listed the cert on the vcsa

./vecs-cli entry list --store SMS

but I don't see any expiry date doing that.  I am loath to delete and regenerate certs until I see something that says the cert is out of date.

Still looking anyway...

Bill Dossett
TheBobkin

Hello Bill

You have to include '--text' or it will not show the validity dates, e.g.:

root@vCenterName [ /usr/lib/vmware-vmafd/bin ]# ./vecs-cli entry list --store SMS --text
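
The --text output is long, so a small filter helps to pull out just the dates. A sketch, assuming the output follows the usual openssl-style text layout with "Alias", "Not Before" and "Not After" lines; the function name is made up for illustration.

```shell
# Hypothetical helper: keep only the alias and validity lines from the
# `vecs-cli entry list --store SMS --text` output read on stdin.
sms_cert_dates() {
  grep -E 'Alias|Not Before|Not After'
}

# Usage on the vCSA (not run here):
#   /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store SMS --text | sms_cert_dates
```

If the "Not After" date is in the past, that points at the expired SMS certificate case from the KB articles.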

Bob

TheBobkin

Hello Bill,

There shouldn't be any issues with removing and recreating this cert (by restarting vmware-sps). If you have cert generation issues then there is something deeper afoot.

Can you test removing one of the offline VASA providers and check whether it comes back online when you rescan?

Asking as I managed to get my lab into a similar situation to yours just now: after recreating the cert, the VASA providers were offline until I manually removed and resynced them.

Bob

billdossett

well that was my next step.. but I wasn't too sure about re-adding them...  so I held off till your sage advice!  So there is a filter and a provider, do i delete both for one host and then what, rescan for datastores?  Sorry I haven't had to do this with a vsan yet, any pointers on how to do this would be appreciated.  I just exported my last VM as an ovf in case I destroy it, but the aim here is to be able to fix it.  I've started to realize how durable the vsan is now.  First few iterations of it wound up a mess, but now I want to know how to manage it down to the nitty gritty.  Thanks

Bill

Bill Dossett
billdossett

Thanks Bob, removing the storage providers and filters and resyncing brought them all back online!  Excellent, very good experience and good to know for the future.

Just finished writing/testing a PS script that shuts the whole shebang down cleanly if I have a power fail.  UPS monitoring VM connected via USB to the first ESXi host runs the whole script and then dies as the host it is running on shuts down.  Seems to be working pretty good.

Again, thanks for the help, I hadn't had to do this before and now I've got another tool in the belt!

Bill

Bill Dossett
TheBobkin

Hello Bill,

Happy to hear we got it fixed.

"Excellent, very good experience and good to know for the future."

VASA/SPBM issues are pretty much about troubleshooting the data and connections between the vCenter and the hosts, where it isn't always clear which element is problematic (compared to something like a vSAN network partition, which can be narrowed down fast by testing the connection between vmks over specific ports (both ways), MTU, etc.).

"Just finished writing/testing a PS script that shuts the whole shebang down cleanly if I have a power fail.  UPS monitoring VM connected via USB to the first ESXi host runs the whole script and then dies as the host it is running on shuts down.  Seems to be working pretty good."

The two main points when going down are 1) get the data (or as much of it as possible) cold, i.e. VMs powered off, and 2) get the hosts into Maintenance Mode (MM) if possible.

These are pretty easily achieved using something simple/dirty like: for each VM listed by vim-cmd vmsvc/getallvms, if vim-cmd vmsvc/power.getstate reports it as powered on, run vim-cmd vmsvc/power.off, followed by localcli/esxcli system maintenanceMode set -e true -m noAction. You could also script something to check that all the hosts have completed boot before taking any of them out of MM for cleanliness, and to run vsan.fix_renamed_vms via RVC if needed (or something more drastic if the cluster went down hard, such as the vsan.check_state option that unregisters and re-registers every VM in inventory).
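
A rough shell rendering of the per-host steps described above, assuming it is run in the ESXi shell on each host. The function names are made up for illustration and this is not the PS script mentioned in this thread. Note that power.off is a hard power-off, so anything that needs a clean guest shutdown (including a vCSA running on the cluster) should be handled before this point.

```shell
# Hypothetical helper: pull the numeric VM IDs out of
# `vim-cmd vmsvc/getallvms` output read on stdin (skips the header row).
# Caveat: multi-line annotations add extra rows, and a continuation row
# that happens to start with a number would be picked up as well.
list_vm_ids() {
  awk '$1 ~ /^[0-9]+$/ { print $1 }'
}

# Hard power-off every VM on this host that reports "Powered on".
power_off_running_vms() {
  vim-cmd vmsvc/getallvms | list_vm_ids | while read -r vmid; do
    if vim-cmd vmsvc/power.getstate "$vmid" | grep -q 'Powered on'; then
      vim-cmd vmsvc/power.off "$vmid"
    fi
  done
}

# Enter maintenance mode without moving any vSAN data.
enter_maintenance_mode() {
  esxcli system maintenanceMode set -e true -m noAction
}
```

Calling power_off_running_vms and then enter_maintenance_mode on each host covers the two points above; ordering across hosts and the boot-time checks are left out of this sketch.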

Bob
