Hello VMware friends,
We are facing a very strange issue which appeared out of nowhere.
The back story is that one of our ESXi 6.7 U1 hosts crashed and the OS was no longer usable, so we installed ESXi 6.7 U3 from scratch. Everything went fine; we reconfigured it and all was good.
vSAN, however, kept spitting out an error that our ESXi host did not have a valid vSAN traffic NIC enabled (it did; we presumed the vSAN config had somehow become corrupted).
So we transferred all of our data to our other datastore and everything is working as normal. We turned off vSAN and removed the storage with the command:
esxcli vsan storage remove --uuid <UUID>
When we run esxcli vsan storage list, it shows no more vSAN storage.
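For anyone following along, the removal sequence can be sketched like this; the device name, UUID, and output layout below are made up for illustration (on a live host you would pipe the real esxcli output instead of the heredoc):

```shell
# Hedged sketch: parse the VSAN UUIDs out of `esxcli vsan storage list`
# so each one can be fed to `esxcli vsan storage remove --uuid <UUID>`.
# The sample below is fabricated; on a live host, pipe the real
# esxcli output instead of this heredoc.
sample_output=$(cat <<'EOF'
mpx.vmhba0:C0:T0:L0
   Device: mpx.vmhba0:C0:T0:L0
   VSAN UUID: 52a1b2c3-d4e5-f607-1829-3a4b5c6d7e8f
   Is SSD: true
   Is Capacity Tier: true
EOF
)

# Field 3 of every "VSAN UUID:" line is the UUID itself.
uuids=$(echo "$sample_output" | awk '/VSAN UUID:/ {print $3}')
echo "$uuids"
```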
When we try to turn vSAN on again and get to the claim-disks step, we see all our disks present, but we are unable to claim them as capacity disks, only as cache tier.
When we run vdq -q we get this as output:
"Name" : "mpx.bla.bla.bla",
"VSANUUID" : "",
"State" : "Eligible for use by VSAN",
"Reason" : "None",
"IsSSD" : "1",
"IsPDL" : "0",
"Size(MB)" : "953869",
"FormatType" : "512n",
Anyone have any idea what could be causing this?
We are somewhat at a loss. We tried the command esxcli vsan storage tag add -d mpx.bla.bla -t capacityFlash, but it doesn't change anything.
We rebooted the vCenter and it did nothing.
Is the vCenter on 6.7 U3 or still 6.7 U1? If it is still on U1, then this is a very common issue: a vCenter on a lower update release is not supported for managing nodes on a higher update version.
Upgrade the vCenter to 6.7 U3, or if that is not currently possible, reinstall the host with 6.7 U1 (or use Shift+R at boot if 6.7 U1 is still in the altbootbank, verifiable from the /altbootbank/boot.cfg file).
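If you want to script the altbootbank check, something like this works; the boot.cfg contents and build string below are illustrative, not taken from your host:

```shell
# Hedged sketch: pull the build string out of /altbootbank/boot.cfg to
# see which image Shift+R would roll back to. This sample boot.cfg is
# fabricated for illustration.
bootcfg=$(cat <<'EOF'
bootstate=0
title=Loading VMware ESXi
timeout=5
build=6.7.0-1.28.10302608
EOF
)

# The value after "build=" identifies the alternate boot image.
altbuild=$(echo "$bootcfg" | awk -F= '/^build=/ {print $2}')
echo "$altbuild"
```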
You can validate that my assumption is correct here by creating the Disk-Group via the CLI:
# esxcli vsan storage add -s <Cache-tier naa> -d <Capacity-tier naa>
I checked and the vCenter was actually on the U3 server.
I put it on the U1 server and I still got the same issue (I know you suggested otherwise, but I tried it anyway).
Also, in your opinion, would it be wise to upgrade the other hosts to U3 and have everything on the same version?
I upgraded the vCenter to the latest version via the automatic update in the vCenter console.
Which host the vCenter is registered on has no bearing on this. I was saying that if your vCenter currently has a lower build installed (e.g. 6.7 GA/U1/U2) than the newly re-installed host, then you may encounter the issue I mentioned.
"Also, in your opinion, would it be wise to upgrade the other hosts to U3 and have everything on the same version?"
Yes, you should always have all vSAN nodes on the same build (including Witnesses if using stretched clusters, which a lot of folks tend to neglect).
Do note that there is actually a major version change of virsto (one of the vSAN partition types) from v7 to v10 between 6.7 U1 and 6.7 U3 hosts, so creating a Disk-Group on this node is actually not a good idea before updating the other hosts (and vCenter if not already on 6.7 U3).
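You can check the on-disk format version of an existing Disk-Group from esxcli vsan storage list; comparing that value across hosts shows whether a mixed cluster can still read every Disk-Group. The fragment below is a made-up example of what that output looks like:

```shell
# Hedged sketch: read the format version from a fabricated
# `esxcli vsan storage list` fragment; the naa and UUID are invented.
dg_sample=$(cat <<'EOF'
naa.600508b1001c16aabbccddeeff001122
   Device: naa.600508b1001c16aabbccddeeff001122
   VSAN Disk Group UUID: 52aabbcc-ddee-ff00-1122-334455667788
   Format Version: 7
EOF
)

# Field 3 of the "Format Version:" line is the on-disk format number.
fmt=$(echo "$dg_sample" | awk '/Format Version:/ {print $3}')
echo "$fmt"
```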
Ok, I see. What I'm going to do is move a few VMs around in the environment, install the latest ESXi OS (U3) on all the hosts, apply the relevant service packs, get everything to the latest level, and try again.
I find it weird that everything worked perfectly fine up until one of the hosts crashed and I shut off the vSAN. I'll report back later tonight with some updates. Also, for reference, in # esxcli vsan storage add -s <Cache-tier naa> -d <Capacity-tier naa>, the <Cache-tier naa> and <Capacity-tier naa> parts are values I need to replace; what would be the relevant command to get those values? I assume vdq -q?
My apologies, I'm not connected to the hosts at the moment, so I don't have the SSH sessions in front of me.
Please confirm the build number of the vCenter in use before proceeding with this (check 'About vSphere' in one of the top-right tabs in the vSphere Client, check the vC directly, or run tail -n 2 /var/log/vmware/vpxd/vpxd.log on the vCSA via CLI).
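If you prefer to script that check, something along these lines works; the log line and build number below are invented for illustration, and the real vpxd.log content will differ:

```shell
# Hedged sketch: grab the build number from a (fabricated) vpxd.log
# line so it can be compared against the ESXi hosts' builds.
logtail=$(cat <<'EOF'
2019-10-01T10:00:00.000Z info vpxd[01234] [Originator@6876 sub=Default] Starting VMware VirtualCenter 6.7.0 build-14368073
EOF
)

# Extract the "build-<number>" token from the log line.
vcbuild=$(echo "$logtail" | grep -o 'build-[0-9]*')
echo "$vcbuild"
```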
Creating a DG via the CLI is based on the naa of the cache-tier device and however many capacity-tier devices, like so:
# esxcli vsan storage add -s naa.xxxxxxxxxxxxxxxx -d naa.xxxxxxxxxxxxxxxx -d naa.xxxxxxxxxxxxxxxx
If you don't know the naa's, you can figure them out from vdq -q and/or esxcli storage core device list and/or by browsing /dev/disks.
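For example, device identifiers sit in column 1 of esxcli storage core device list output while the details are indented, so the non-indented naa.* lines are the IDs to feed to esxcli vsan storage add. The listing below is a fabricated sample, not from your hosts:

```shell
# Hedged sketch: pull naa identifiers out of a fabricated
# `esxcli storage core device list` excerpt; on a live host, pipe the
# real command instead of this heredoc.
devlist=$(cat <<'EOF'
naa.600508b1001c16aabbccddeeff001122
   Display Name: Local Disk (naa.600508b1001c16aabbccddeeff001122)
   Is SSD: true
naa.600508b1001c16aabbccddeeff003344
   Display Name: Local Disk (naa.600508b1001c16aabbccddeeff003344)
   Is SSD: false
EOF
)

# Only the non-indented lines are device identifiers.
naas=$(echo "$devlist" | grep '^naa\.')
echo "$naas"
```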
But as I said, please hold back on this until you have updated all hosts (unless you are in a reduced-availability data state, in which case I would advise creating these on a format the 6.7 U1 nodes can also use by *temporarily* changing the virsto legacy format to v7; see the VMware Knowledge Base for the procedure).
"I find it weird that everything worked perfectly fine up until one of the hosts crashed and I shut off the vSAN."
From what you wrote above, it would appear you didn't reconfigure the vSAN networking correctly following the re-install and/or the original node references were not cleaned up. Installing a later build (with a lot of changes between these particular builds) only adds to the potential issues.
"Shut off the vSAN" is not the appropriate response here; next time please just call me and my colleagues in GSS to advise on next steps, we are here to help.
Good evening Bob. Sorry I have not reported back however there are some developments.
As I was away, we had another tech go in and update the hosts to U3. They were unsuccessful in reconnecting the datastore, so they left it as is.
When I revisited the site, I checked the vCenter and it was on version .3200. After an unsuccessful update to .4100, I simply downloaded the .4100 ISO and reinstalled vCenter from scratch. I am happy to say that everything seems to have come back to normal.
My theory on what happened is that the host died, the vCenter that lived on it became corrupted, and for reasons I cannot understand it was unable to reconnect the vSAN datastore. Additionally, it seems that the newer vSAN builds require the disks to be formatted at version 10, whereas in my case they were previously at version 7.
I cannot see any other explanation as to what happened. In any case, everything is working after a complete rebuild.