VMware Cloud Community
vitaprimo
Enthusiast

vSAN VMkernels fail to be recognized on hosts

I have rebuilt the cluster several times: rewriting blank GPT tables on already-claimed disks, fully wiping all hosts, reusing a dSwitch, and setting up a new dSwitch altogether, complete with a host config wipe and partition table wipe. I keep getting stuck in the same place, trying to figure out why hosts won't identify their VMkernels for vSAN. I also tried unchecking and rechecking the vSAN checkbox on each host.

Screen_Shot_2020-04-18_at_13_32_40-2.png
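
For anyone who wants to check the tagging from the host side rather than from the vSphere Client, here is a minimal Python sketch that parses text in the general shape of `esxcli vsan network list` output and reports which VMkernel adapters are tagged for vSAN traffic. The sample text, its field names (`VmkNic Name`, `Traffic Type`), and the function name `vsan_vmknics` are assumptions for illustration, not output captured from this cluster.

```python
# Illustrative text in the general shape of `esxcli vsan network list` output.
# Field names and values are assumptions, not data from this thread.
SAMPLE_OUTPUT = """\
Interface
   VmkNic Name: vmk1
   IP Protocol: IP
   Traffic Type: vsan

Interface
   VmkNic Name: vmk2
   IP Protocol: IP
   Traffic Type: witness
"""


def vsan_vmknics(esxcli_output):
    """Return the VMkernel adapter names whose 'Traffic Type' is 'vsan'."""
    tagged = []
    current = None  # most recently seen VmkNic name
    for line in esxcli_output.splitlines():
        # Split on the first colon only, so values containing colons survive.
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # section headers and blank lines have no "key: value"
        key, value = key.strip(), value.strip()
        if key == "VmkNic Name":
            current = value
        elif key == "Traffic Type" and value == "vsan" and current:
            tagged.append(current)
    return tagged


if __name__ == "__main__":
    names = vsan_vmknics(SAMPLE_OUTPUT)
    print(names or "no VMkernel adapter is tagged for vSAN on this host")
```

An empty result on a host would point at the host-side tagging; a non-empty result while the client still shows nothing would point more toward the management side.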

It gets a little more confusing. The version field says the disks are on version 10. I remediated the cluster last night, and again this morning, to the current 6.x image: 6.7.0, build 15160138, AKA Update 3. From what I read in the release notes, the matching vSAN version is 7, not 10. I don't think they pulled a Microsoft and jumped straight to version 10. Even so, I have neither downloaded nor installed any component from vSphere 7. It doesn't make any sense.

Screen Shot 2020-04-19 at 00.28.52.png
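
One possible explanation, offered as an assumption and not something confirmed in this thread: the number shown may be the vSAN on-disk format version, which is numbered separately from the vSphere release. The lookup below is a sketch based on a recollection of VMware's on-disk format documentation; the table values and the helper names are illustrative and should be verified against the official list before relying on them.

```python
# vSAN release -> on-disk format version.
# ASSUMPTION: values recalled from VMware's on-disk format documentation;
# verify against the official list before relying on them.
ON_DISK_FORMAT_BY_RELEASE = {
    "6.7": 6,
    "6.7 U1": 7,
    "6.7 U3": 10,
}


def explain(release, reported_version):
    """Compare a reported on-disk format version with the table above."""
    expected = ON_DISK_FORMAT_BY_RELEASE.get(release)
    if expected is None:
        return f"no entry for vSAN {release}; check the documentation"
    if expected == reported_version:
        return f"version {reported_version} matches vSAN {release}"
    return f"vSAN {release} should show version {expected}, not {reported_version}"


if __name__ == "__main__":
    print(explain("6.7 U3", 10))
```

If the table is right, version 10 on 6.7 Update 3 hosts would be expected and would not imply that anything from vSphere 7 was installed.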

It gets a little more confusing still. Right there in the cluster details, all disks appear okay-ish, if you ignore the connectivity warning on hyperserver1 (shot below), and the vSAN datastore capacity actually diminished and then grew back a little when I added the last host, hyperserver3.

I have to do it in stages because vCenter is hosted in the cluster itself. To make it smoother and error-free, I set up the dSwitches and VMkernel adapters beforehand, manually unregister/register vCenter on the hosts, then preempt the final host by dropping it into a disposable cluster with the same settings. With EVC and other settings matching, moving VMs and the hosts themselves across clusters goes without errors.

Screen Shot 2020-04-19 at 00.32.05.png

The vSAN datastore went from some terabyte number down to around 60 gigabytes, then a few minutes later grew to around half a terabyte, and it has stayed at that size. The amount needed for our VMs is minuscule, probably less than 400GB even with future-proofing, but the raw capacity is higher than that, so I'm not sure whether it set itself at the minimum capacity all hosts can provide (like standard RAID arrays), whether it's stalled, or something else. I read in the documentation that certain tasks invoke "a rolling reformat of every disk group in the cluster", so I figured that was what it was doing and why it grew earlier. If it's really doing that, it's definitely stalled.

I have changed disks, host versions and configs, and distributed network switches, and turned things on and off hoping to trigger reactions. As I'm writing this I keep trying things, (1) to avoid bothering anyone and (2) for the screenshots. The only progress I got was when I realized one of the hosts had mixed NIC speeds (1G vs 10G) on the dSwitch. That was hs1; removing the offending NIC left the other two hosts, previously reported with no vSAN VMkernels of their own, sort of fine (no warnings in one place, see shot above, but still missing VMkernels in the other place, see way up in the first shot). And capacity is still missing.
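
The mixed-speed check described above can be sketched in a few lines of Python. The uplink inventory is hypothetical: only the 1G vs 10G mix on the first host comes from the post, while the `vmnic` names, the second host's name, and the other speeds are placeholders for whatever the hosts actually report.

```python
def hosts_with_mixed_speeds(uplinks_by_host):
    """Return {host: sorted distinct speeds} for hosts whose uplinks differ in speed."""
    mixed = {}
    for host, uplinks in uplinks_by_host.items():
        speeds = sorted(set(uplinks.values()))
        if len(speeds) > 1:
            mixed[host] = speeds
    return mixed


# HYPOTHETICAL inventory: uplink name -> link speed in Mbps, per host.
UPLINKS = {
    "hyperserver1": {"vmnic0": 1000, "vmnic1": 10000},
    "hyperserver2": {"vmnic0": 10000, "vmnic1": 10000},
    "hyperserver3": {"vmnic0": 10000, "vmnic1": 10000},
}

if __name__ == "__main__":
    for host, speeds in hosts_with_mixed_speeds(UPLINKS).items():
        print(f"{host}: mixed uplink speeds {speeds} Mbps")
```

Running a check like this before attaching hosts to the dSwitch would flag the offending uplink up front instead of after the vSAN health warnings appear.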

The only thing I have kept constant throughout all this is vCenter. Is it possible that vCenter has [selectively] stale data? How could I fix that without deploying everything again? Setting up makeshift DNS servers and other bits of the network takes forever each time. :( Even wiping the partition tables is a physical chore: I forgot the proper method, so I'm booting the hosts one at a time with a GParted Live flash drive to wipe them faster. vSphere viciously fights against accessing the disks; at least that's reassuring if any data was actually in there.

I'd appreciate any advice or help you can give me. All disks are empty; I don't want to, but I can wipe again if necessary. All VMs (including vCenter) are backed up in central storage.

Thanks.

Accepted Solutions
vitaprimo
Enthusiast

I was going to delete this out of embarrassment, but hopefully my stupidity serves as a reminder.

I checked everything and updated everything except for vCenter Server itself. I was running vCenter Server Appliance 6.7, but a very old one. I don't remember the exact build; it started with 8xxxxx. More digits have been added to the build number since then, which shows how old it was.

As soon as the newer version booted up, all errors were gone. Setting up vSAN clusters is much more flexible too, with no specific step order at all. It just works. The current version, a few days old, is 6.7.0 build 15976728…and of course there is vCSA 7, if you can justify the upgrade. We can't just yet. :(

Anyway… Happy weekend!

vitaprimo
Enthusiast

Dyslexia made me misread the decimal point, I guess. The vSAN datastore capacity is 68.32GB, not half a terabyte. At least I know it's not stalled, though.
