JeremeyWise
Enthusiast

Upgrade ESXi 7.0.3 to 8 - hostd failing


Doing a rolling upgrade of the cluster from 7 to 8. On the third of five nodes. They are MOSTLY the same (CPU core counts, DIMM sizes, etc. differ).

My upgrade path (since I know I will have to accept that the CPU / TPM is not compliant with 8) is to do a fresh installation of vSphere 8 on a USB key, boot that, set vSwitch0 to use the NIC that is connected, log in remotely and validate that all the hardware shows up and things overall look clean. Then reboot, boot the USB with the ESXi 8 installer, and run "upgrade".

This worked fine three other times but failed this time.

 

Symptom: 

The web UI returns "503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http16LocalServiceSpecE:0x00000039a3339a70] _serverNamespace = / action = Allow authenticationParams = _port = 8309)"

SSH to host:

Noticed that the hostname is "localhost", so I ran a command to reset it:

```

[root@localhost:/vmfs/volumes] esxcli system hostname set --fqdn=esxi.acme.local
Connection failed
[root@localhost:/vmfs/volumes] /etc/init.d/hostd status
hostd is not running.

```

Collecting logs on one console:

```
[root@localhost:/vmfs/volumes] tail -f /var/log/hostd.log
```

 

 

Restarting the service on another console:

```
[root@localhost:~] /etc/init.d/hostd start
error [ConfigStore:837df91c80] Schema not found: esx/advanced_options/user_var_definitions
error [ConfigStore:837df91c80] [1037] Schema object not found: comp esx grp advanced_options key user_var_definitions id HostdStatsstoreRamdiskSize
error [ConfigStore:127d171140] Schema not found: esx/advanced_options/user_var_definitions
error [ConfigStore:127d171140] [1037] Schema object not found: comp esx grp advanced_options key user_var_definitions id ESXiVPsDisabledProtocols
Exception occured: Schema object not found: comp esx grp advanced_options key user_var_definitions id ESXiVPsDisabledProtocols
hostd started.
[root@localhost:~] /etc/init.d/hostd status
hostd is running.
[root@localhost:~]
```

 

Logs attached, but nothing is jumping out.

[root@localhost:/vmfs/volumes] tail -f /var/log/hostd.log
2023-01-20T19:14:06.906Z In(166) Hostd[2100161]: [Originator@6876 sub=Default] Ignoring ConfigStoreException[1046] Not a non-singleton object: key hbrsvc
2023-01-20T19:14:06.906Z Er(163) Hostd[2100161]: [Originator@6876 sub=Libs] error [ConfigStore:1baa07cd80] [1046] Not a non-singleton object: key hostspecsvc
2023-01-20T19:14:06.906Z In(166) Hostd[2100161]: [Originator@6876 sub=Libs] info [ConfigStore:1baa07cd80] ConfigStoreException: [context]zKq7AVICAgAAAEkBOQEOY29uZmlnc3RvcmUAABceBGxpYmNvbmZpZ3N0b3JlLnNvAACXkgIAmK0FAQHEAWxpYnZpbWNvbmZpZ3N0b3JlLnNvAAGw4gEBDnEBApYNXGhvc3RkAANX2RxsaWJ2bWFjb3JlLnNvAAJhF1wCDfdbAm8kXALya1AEXR0CbGliYy5zby42AAKFw1I=[/context]
2023-01-20T19:14:06.906Z In(166) Hostd[2100161]: [Originator@6876 sub=Default] Ignoring ConfigStoreException[1046] Not a non-singleton object: key hostspecsvc
2023-01-20T19:14:06.906Z Er(163) Hostd[2100161]: [Originator@6876 sub=Libs] error [ConfigStore:1baa07cd80] [1046] Not a non-singleton object: key hostsvc
2023-01-20T19:14:06.907Z In(166) Hostd[2100161]: [Originator@6876 sub=Libs] info [ConfigStore:1baa07cd80] ConfigStoreException: [context]zKq7AVICAgAAAEkBOQEOY29uZmlnc3RvcmUAABceBGxpYmNvbmZpZ3N0b3JlLnNvAACXkgIAmK0FAQHEAWxpYnZpbWNvbmZpZ3N0b3JlLnNvAAGw4gEBDnEBApYNXGhvc3RkAANX2RxsaWJ2bWFjb3JlLnNvAAJhF1wCDfdbAm8kXALya1AEXR0CbGliYy5zby42AAKFw1I=[/context]
2023-01-20T19:14:06.907Z In(166) Hostd[2100161]: [Originator@6876 sub=Default] Ignoring ConfigStoreException[1046] Not a non-singleton object: key hostsvc
2023-01-20T19:14:06.907Z Er(163) Hostd[2100161]: [Originator@6876 sub=Libs] error [ConfigStore:1baa07cd80] [1046] Not a non-singleton object: key httpnfcsvc
2023-01-20T19:14:06.907Z In(166) Hostd[2100161]: [Originator@6876 sub=Libs] info [ConfigStore:1baa07cd80] ConfigStoreException: [context]zKq7AVICAgAAAEkBOQEOY29uZmlnc3RvcmUAABceBGxpYmNvbmZpZ3N0b3JlLnNvAACXkgIAmK0FAQHEAWxpYnZpbWNvbmZpZ3N0b3JlLnNvAAGw4gEBDnEBApYNXGhvc3RkAANX2RxsaWJ2bWFjb3JlLnNvAAJhF1wCDfdbAm8kXALya1AEXR0CbGliYy5zby42AAKFw1I=[/context]
2023-01-20T19:14:06.907Z In(166) Hostd[2100161]: [Originator@6876
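As an aside, the [context] blobs in those ConfigStoreException lines are base64-encoded backtraces. Decoding one and keeping the printable runs recovers the embedded module names, which at least confirms the exception is being thrown from inside the configstore libraries. A quick sketch using only generic POSIX tools (nothing ESXi-specific, so it can run on any Linux box):

```shell
# Decode one [context] blob from hostd.log and keep printable runs of
# four or more characters -- the module names embedded in the backtrace.
blob='zKq7AVICAgAAAEkBOQEOY29uZmlnc3RvcmUAABceBGxpYmNvbmZpZ3N0b3JlLnNvAACXkgIAmK0FAQHEAWxpYnZpbWNvbmZpZ3N0b3JlLnNvAAGw4gEBDnEBApYNXGhvc3RkAANX2RxsaWJ2bWFjb3JlLnNvAAJhF1wCDfdbAm8kXALya1AEXR0CbGliYy5zby42AAKFw1I='
printf '%s' "$blob" | base64 -d | tr -c '[:print:]' '\n' | grep -E '.{4,}'
```

Among some binary noise, the output of this particular blob includes strings like configstore, libconfigstore.so, libvimconfigstore.so, hostd, libvmacore.so and libc.so.6, i.e. the call chain that raised the exception.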

 

One more tidbit: it is a difference from the other, working nodes.

```

2023-01-20T23:59:22.777Z ntpd stopped.
Errors:
Invalid operation requested: This ruleset is required and cannot be disabled
2023-01-20T23:59:25.904Z executing start plugin: SSH

[root@localhost:/scratch/log] /etc/init.d/ntpd restart
Stopping ntpd
watchdog-ntpd[2150795]: Terminating watchdog process with PID 2147241
Starting ntpd
[root@localhost:/scratch/log]

```

 

 

 


Nerd needing coffee
JeremeyWise
Enthusiast

 

 

 

Curious:

```
[root@localhost:/scratch/log] vdf -h
...<snip>...
state.tgz 136K 134K
-----
Ramdisk Size Used Available Use% Mounted on
root 32M 32M 0B 100% --
etc 28M 1016K 27M 3% --
opt 32M 0B 32M 0% --
```
 

But that does not align with the disks:

```
[root@localhost:/scratch/log] df -h
Filesystem Size Used Available Use% Mounted on
NFS 7.0T 4.3T 2.7T 61% /vmfs/volumes/nas_md0_vms
VMFS-6 348.8G 55.5G 293.3G 16% /vmfs/volumes/local_vmfs_esxi1
VFFS 127.8G 1.8G 126.0G 1% /vmfs/volumes/OSDATA-611be152-3a16ce54-0410-a0423f377a7e
VFFS 66.2G 10.6G 55.6G 16% /vmfs/volumes/OSDATA-62071ba9-5751bd90-6e1b-a0423f377a7e
vfat 4.0G 294.8M 3.7G 7% /vmfs/volumes/BOOTBANK1
vfat 4.0G 213.2M 3.8G 5% /vmfs/volumes/BOOTBANK2
[root@localhost:/scratch/log]
```

 

```
[root@localhost:/tmp] dd if=/dev/random of=/tmp/foo.test bs=512 count=100
100+0 records in
100+0 records out
[root@localhost:/tmp] ls -alh
total 64
drwxrwxrwt 1 root root 512 Jan 21 00:18 .
drwxr-xr-x 1 root root 512 Jan 20 23:59 ..
-rw-r--r-- 1 root root 50.0K Jan 21 00:19 foo.test
drwx------ 1 root root 512 Jan 20 20:10 vmware-uid_0
```

 

If root were out of space for inodes or blocks, that write would fail.
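For poking at the ramdisk side specifically, ESXi exposes the in-memory visorfs ramdisks through the CLI as well. A sketch from memory (command namespaces per my recollection of the 7.x/8.x CLI, so verify on your build):

```shell
# List the in-memory (visorfs) ramdisks with reservation and usage
# figures -- the same data "vdf -h" summarizes. localcli bypasses
# hostd, so it should work even while hostd is down.
localcli system visorfs ramdisk list

# vdf also lists the loaded tardisks (state.tgz, imgdb.tgz, ...), which
# is usually where a 100%-full "root" ramdisk points.
vdf -h
```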

 

But my other servers show:

```
imgdb.tgz 2M 2M
-----
Ramdisk Size Used Available Use% Mounted on
root 32M 4M 27M 14% --
etc 28M 1M 26M 6% --
```

 

Someone is lying like a rug.

 

 

 


Nerd needing coffee
JeremeyWise
Enthusiast

 

Another thread ... it looks like there were supposed to be kernel parameters or options cleaned up, but they were not.

 

Broken system:

```
[root@localhost:~] /etc/init.d/hostd start
error [ConfigStore:3014f1ac80] Schema not found: esx/advanced_options/user_var_definitions
error [ConfigStore:3014f1ac80] [1037] Schema object not found: comp esx grp advanced_options key user_var_definitions id HostdStatsstoreRamdiskSize
error [ConfigStore:c60c7ff140] Schema not found: esx/advanced_options/user_var_definitions
error [ConfigStore:c60c7ff140] [1037] Schema object not found: comp esx grp advanced_options key user_var_definitions id ESXiVPsDisabledProtocols
Exception occured: Schema object not found: comp esx grp advanced_options key user_var_definitions id ESXiVPsDisabledProtocols
hostd started.
[root@localhost:~] esxcfg-advcfg -l |grep user_var_definitions
error [ConfigStore:5ba122d140] Schema not found: esx/advanced_options/user_var_definitions
error [ConfigStore:5ba122d140] [1037] Schema object not found: comp esx grp advanced_options key user_var_definitions
[root@localhost:~]
```
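Since the errors all reference the ConfigStore, it may be worth querying it directly with configstorecli, which bypasses hostd. A hedged sketch: the component/group/key names are lifted straight from the error text, and the on-disk path is from memory, so adjust for your build:

```shell
# Ask the config store for the key hostd is failing on; on a healthy
# host this returns data (or a clean "not found"), not a schema error.
configstorecli config current get -c esx -g advanced_options -k user_var_definitions

# The backing database lives under /etc/vmware/configstore/ -- cheap
# to back up before experimenting with it.
cp /etc/vmware/configstore/current-store-1 /tmp/current-store-1.bak
```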

 

 

vs. the working system:

 

```

[root@esxi2:~] esxcfg-advcfg -l |grep user_var_definitions
[root@esxi2:~]

```


Nerd needing coffee
maksym007
Hot Shot

Tell us your vCenter build, and for the ESXi hosts, the version from which and to which you are updating.

0 Kudos
JeremeyWise
Enthusiast
Enthusiast

Thanks for the response:

 

vCenter

Version: 8.0.0
Build: 20920323
Last Updated: Jan 15, 2023, 6:25 AM

 

ESXi host

Image Profile (Updated): ESXi-8.0.0-20513097-standard

Nerd needing coffee
Kinnison
Expert

Hi,


What intrigues me on the malfunctioning node is this:

VFFS 127.8G 1.8G 126.0G 1% /vmfs/volumes/OSDATA-611be152-3a16ce54-0410-a0423f377a7e
VFFS 66.2G 10.6G 55.6G 16% /vmfs/volumes/OSDATA-62071ba9-5751bd90-6e1b-a0423f377a7e


And also what the one contains rather than the other.


Regards,
Ferdinando

JeremeyWise
Enthusiast

 

Not sure if this is of help but

 

### Physical drive listing ###

[root@localhost:~] esxcfg-scsidevs -c
Device UID Device Type Console Device Size Multipath PluginDisplay Name
t10.ATA_____KINGSTON_SA400S37120G___________________50026B77838F133D____ Direct-Access /vmfs/devices/disks/t10.ATA_____KINGSTON_SA400S37120G___________________50026B77838F133D____ 114473MB HPP Local ATA Disk (t10.ATA_____KINGSTON_SA400S37120G___________________50026B77838F133D____)
t10.ATA_____SSDSC1NB080G4I_______00AJ041_00AJ044IBM___BTWL432104E1080GGN Direct-Access /vmfs/devices/disks/t10.ATA_____SSDSC1NB080G4I_______00AJ041_00AJ044IBM___BTWL432104E1080GGN 76319MB HPP Local ATA Disk (t10.ATA_____SSDSC1NB080G4I_______00AJ041_00AJ044IBM___BTWL432104E1080GGN)
t10.ATA_____Samsung_SSD_850_PRO_512GB_______________S250NXAGA15787L_____ Direct-Access /vmfs/devices/disks/t10.ATA_____Samsung_SSD_850_PRO_512GB_______________S250NXAGA15787L_____ 488386MB HPP Local ATA Disk (t10.ATA_____Samsung_SSD_850_PRO_512GB_______________S250NXAGA15787L_____)
t10.ATA_____WDC__WDS100T2B0B2D00YS70_________________19106A802926________ Direct-Access /vmfs/devices/disks/t10.ATA_____WDC__WDS100T2B0B2D00YS70_________________19106A802926________ 953869MB HPP Local ATA Disk (t10.ATA_____WDC__WDS100T2B0B2D00YS70_________________19106A802926________)
t10.ATA_____WDC__WDS100T2B0B2D00YS70_________________192490801828________ Direct-Access /vmfs/devices/disks/t10.ATA_____WDC__WDS100T2B0B2D00YS70_________________192490801828________ 953869MB HPP Local ATA Disk (t10.ATA_____WDC__WDS100T2B0B2D00YS70_________________192490801828________)
[root@localhost:~]

####

BTWL432104E1080GGN is the OS boot / install drive. It's an 80GB SSD ... I have two other servers that use the same drive, for what that is worth.

 

I attached details on those two folders / paths  

Trying to find some kind of log to indicate WHY the hostd service fails. What I see is that it is not able to read in keys/values:

```
[root@localhost:~] esxcli system settings advanced list
Connection failed
```

Which is just another symptom of there being no ?REST? service for esxcli or other tools to use for local lookups.

###

[root@localhost:~] localcli network ip connection list
Proto Recv Q Send Q Local Address Foreign Address State World ID CC Algo World Name
----- ------ ------ ------------------ ------------------- ----------- -------- ------- ----------
tcp 0 0 172.16.100.101:443 172.16.100.22:53243 TIME_WAIT 0
tcp 0 0 172.16.100.101:443 172.16.100.22:53242 TIME_WAIT 0
tcp 0 0 172.16.100.101:443 172.16.100.22:53241 TIME_WAIT 0
tcp 0 0 127.0.0.1:443 127.0.0.1:17954 TIME_WAIT 0
tcp 0 0 172.16.100.101:443 172.16.100.31:36858 TIME_WAIT 0
tcp 0 0 172.16.100.101:443 172.16.100.22:53231 TIME_WAIT 0
tcp 0 0 172.16.100.101:443 172.16.100.22:53230 TIME_WAIT 0
tcp 0 0 172.16.100.101:443 172.16.100.22:53229 TIME_WAIT 0
tcp 0 0 127.0.0.1:443 127.0.0.1:53974 TIME_WAIT 0
tcp 0 0 127.0.0.1:80 127.0.0.1:23568 TIME_WAIT 0
tcp 0 0 127.0.0.1:443 127.0.0.1:62016 TIME_WAIT 0
tcp 0 0 127.0.0.1:8089 0.0.0.0:0 LISTEN 2904933 newreno vpxa
tcp 0 0 [::]:8083 [::]:0 LISTEN 2148832 newreno settingsd
tcp 0 0 0.0.0.0:8083 0.0.0.0:0 LISTEN 2148832 newreno settingsd
tcp 0 0 [::1]:8098 [::]:0 LISTEN 2148280 newreno apiForwarder
tcp 0 0 127.0.0.1:8098 0.0.0.0:0 LISTEN 2148280 newreno apiForwarder
tcp 0 0 0.0.0.0:443 0.0.0.0:0 LISTEN 2148151 newreno rhttpproxy
tcp 0 0 [::]:443 [::]:0 LISTEN 2148151 newreno rhttpproxy
tcp 0 0 0.0.0.0:80 0.0.0.0:0 LISTEN 2148151 newreno rhttpproxy
tcp 0 0 [::]:80 [::]:0 LISTEN 2148151 newreno rhttpproxy
tcp 0 0 [::1]:8303 [::]:0 LISTEN 2147967 newreno hostdCgiServer
tcp 0 0 127.0.0.1:8303 0.0.0.0:0 LISTEN 2147967 newreno hostdCgiServer
tcp 0 0 [::1]:9999 [::]:0 LISTEN 2147323 newreno esxgdpd
tcp 0 0 127.0.0.1:9999 0.0.0.0:0 LISTEN 2147323 newreno esxgdpd
tcp 0 0 [::]:22 [::]:0 LISTEN 2097856 newreno busybox
tcp 0 0 0.0.0.0:22 0.0.0.0:0 LISTEN 2097856 newreno busybox
tcp 0 0 172.16.100.101:22 172.16.100.32:2476 ESTABLISHED 2097856 newreno busybox
tcp 0 0 172.16.100.101:22 172.16.100.32:2468 ESTABLISHED 2097856 newreno busybox
tcp 0 0 172.16.101.101:931 172.16.101.110:2049 ESTABLISHED 2097864 newreno NFSv3-ServerMonitor
tcp 0 0 [::]:8000 [::]:0 LISTEN 2097567 newreno
tcp 0 0 [::1]:2233 [::]:0 LISTEN 2097572 newreno
tcp 0 0 127.0.0.1:2233 0.0.0.0:0 LISTEN 2097572 newreno
tcp 0 0 [::1]:7890 [::]:0 LISTEN 2097890 newreno kmxa
tcp 0 0 127.0.0.1:7890 0.0.0.0:0 LISTEN 2097890 newreno kmxa
tcp 0 0 [::]:902 [::]:0 LISTEN 2097856 newreno busybox
tcp 0 0 0.0.0.0:902 0.0.0.0:0 LISTEN 2097856 newreno busybox
tcp 0 0 [::]:8300 [::]:0 LISTEN 2097571 newreno
udp 0 0 127.0.0.1:40455 127.0.0.1:6831 2904933 vpxa
udp 0 0 [fe80:1::1]:123 [::]:0 2150826 ntpd
udp 0 0 [::1]:123 [::]:0 2150826 ntpd
udp 0 0 172.16.101.101:123 0.0.0.0:0 2150826 ntpd
udp 0 0 172.16.100.101:123 0.0.0.0:0 2150826 ntpd
udp 0 0 127.0.0.1:123 0.0.0.0:0 2150826 ntpd
udp 0 0 0.0.0.0:123 0.0.0.0:0 2150826 ntpd
udp 0 0 [::]:123 [::]:0 2150826 ntpd
[root@localhost:~]

####

I can run only "localcli" commands, but I am not yet seeing a reason for the failing services.
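That split (esxcli says "Connection failed" while localcli works) matches how the two tools differ: esxcli is routed through hostd, while localcli talks to the lower layers directly and skips hostd entirely, so a dead or half-initialized hostd takes esxcli down with it. A sketch of how to confirm what is actually up, assuming the usual init scripts and CLI namespaces (verify on your build):

```shell
# localcli works without hostd, so it can report on hostd itself
localcli system process list | grep -i hostd

# The init scripts report status independently of esxcli as well
for svc in hostd rhttpproxy vpxa; do
    /etc/init.d/$svc status
done
```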

 


Nerd needing coffee
Kinnison
Expert

Hi,


Your disk sizes lead me to understand that there is an "OS-DATA" partition of just over 60 GB on one disk drive, and you end up with another "OS-DATA" partition located on a different disk drive with twice the space occupied. And from this point of view, what do the systems that actually work look like?
Honestly, it's the first time I've seen something like this. I don't know what you do with those different disk drives of various sizes and brands, but I would simplify somehow.


Regards,
Ferdinando

JeremeyWise
Enthusiast

 

 

I am not sure I am following.

 

I have one disk for the "OS": an 80GB SSD (the rest are vSAN disks or local VMFS for VMs with application-level HA).

t10.ATA_____SSDSC1NB080G4I_______00AJ041_00AJ044IBM___BTWL432104E1080GGN Direct-Access /vmfs/devices/disks/t10.ATA_____SSDSC1NB080G4I_______00AJ041_00AJ044IBM___BTWL432104E1080GGN 76319MB HPP Local ATA Disk (t10.ATA_____SSDSC1NB080G4I_______00AJ041_00AJ044IBM___BTWL432104E1080GGN)

 

The installation of 7 was a "simple installation" using defaults. The upgrade was to just select the "boot drive" (denoted by *) and tell it to upgrade.

 

How and what the "OS-DATA" partitions are is just how the ESXi installer chose to slice up the one disk.

 

I am also a bit confused about why two partitions list "OS-DATA":

```
[root@localhost:~] df -h
Filesystem Size Used Available Use% Mounted on
NFS 7.0T 4.2T 2.7T 61% /vmfs/volumes/pandora_md0_vms
VMFS-6 348.8G 55.5G 293.3G 16% /vmfs/volumes/local_vmfs_thor
VFFS 127.8G 1.8G 126.0G 1% /vmfs/volumes/OSDATA-611be152-3a16ce54-0410-a0423f377a7e
VFFS 66.2G 10.6G 55.6G 16% /vmfs/volumes/OSDATA-62071ba9-5751bd90-6e1b-a0423f377a7e
```

One is on the ~60GB SSD and the other is 128GB, which implies it is putting data it classifies as "OSDATA" on another drive.

The 128GB SSD is the dedicated vSAN cache drive...

JeremeyWise_1-1674506981139.png

 

Looking at the working node(s), I see a similar pattern:

```
[root@esxi3:/vmfs/volumes/4e5242b9-afbbf328] df -h
Filesystem Size Used Available Use% Mounted on
NFS 7.0T 4.2T 2.7T 61% /vmfs/volumes/nas_md0_vms
VMFS-6 348.8G 230.0G 118.8G 66% /vmfs/volumes/local_vmfs_odin
VFFS 127.8G 6.9G 120.9G 5% /vmfs/volumes/OSDATA-611be14d-ff327e68-c05d-a0423f35e8ee
VFFS 66.2G 9.2G 57.1G 14% /vmfs/volumes/OSDATA-6206b042-64e7e4e0-8e7b-a0423f35e8ee
vfat 4.0G 294.9M 3.7G 7% /vmfs/volumes/BOOTBANK1
vfat 4.0G 213.2M 3.8G 5% /vmfs/volumes/BOOTBANK2
vsan 3.8T 2.1T 1.7T 56% /vmfs/volumes/vsanDatastore
[root@esxi3:/vmfs/volumes/4e5242b9-afbbf328]
```

* OS boot: 80GB SSD

** Next disk: small SSD for vSAN cache

 

I think this could be related to the issue. But to me it is a clue that:

1) The OS upgrade did not populate /etc/hosts, and so the host boots as localhost

2) Using anything but localcli commands fails with "unable to connect"

3) Services start, but what I need to chase down is the target service for 'esxcli'; that is symptomatic of the errors hostd is getting about not being able to look up keys/values

```
[root@localhost:~] esxcli
Connection failed
[root@localhost:~]
```

 

The above may all have a common cause in a disk issue (noted in the first posting, where "root" is 100% full), yet in my file-system test I can create files, etc., so I am not sure what that "root" really represents (a RAM disk for some OS function?).. but I have no way to debug that.

 

Broken host:

```
[root@localhost:/scratch/log] vdf -h
...<snip>...
state.tgz 136K 134K
-----
Ramdisk Size Used Available Use% Mounted on
root 32M 32M 0B 100% --
etc 28M 1016K 27M 3% --
```

 

Not-broken host:

```
[root@odin:~] vdf -h
<snip>
imgdb.tgz 2M 2M
-----
Ramdisk Size Used Available Use% Mounted on
root 32M 4M 27M 14% --
etc 28M 1M 26M 6% --
opt 32M 0B 32M 0% --
var 48M 872K 47M 1% --
tmp 256M 12K 255M 0% --
iofilters 32M 0B 32M 0% --
shm 1024M 0B 1024M 0% --
crx 1024M 0B 1024M 0% --
```

 

That is the one that just gives me an "oh.. that is bad" feeling, as running anything noted as root to 100% = evil.

 

But maybe that is a red herring.

 


Nerd needing coffee
Kinnison
Expert

Hi,


That something went wrong is now established, but with a little "shop boy" math, in my opinion those 128 gigabytes ended up on your 512-gigabyte Samsung disk; if they had ended up on the disk used as "vSAN cache", I doubt the thing would "stand on its own feet".

 

Regards,
Ferdinando

JeremeyWise
Enthusiast

 

I think what you're saying is that the 80GB SSD to which ESXi 8 was installed was NOT the only disk to which "OSDATA" was written, and so if either drive fails, the OS will not boot.

 

One thing I do set is the dump/crash partition, onto one of the disks set to "vmfs_local_<hostname>". But I am not sure if that is what you are seeing in this.

##  Broken node

[root@localhost:/vmfs/volumes] ls -alh
total 3592
drwxr-xr-x 1 root root 512 Jan 25 13:31 .
drwxr-xr-x 1 root root 512 Jan 24 11:39 ..
drwxr-xr-x 17 root root 4.0K Jan 23 13:31 4e5242b9-afbbf328
drwxr-xr-t 1 root root 76.0K Feb 11 2022 611be152-3a16ce54-0410-a0423f377a7e
drwxr-xr-t 1 root root 76.0K Jan 20 16:36 611bec6e-48e0ca22-11e8-a0423f377a7e
drwxr-xr-t 1 root root 76.0K Jan 20 17:50 62071ba9-5751bd90-6e1b-a0423f377a7e
drwxr-xr-x 1 root root 8 Jan 1 1970 810e6f29-b44e0187-06bb-4b35e97a8d68
drwxr-xr-x 1 root root 8 Jan 1 1970 86238060-a64b110e-a78c-ab7675c53a90
lrwxr-xr-x 1 root root 35 Jan 25 13:31 BOOTBANK1 -> 86238060-a64b110e-a78c-ab7675c53a90
lrwxr-xr-x 1 root root 35 Jan 25 13:31 BOOTBANK2 -> 810e6f29-b44e0187-06bb-4b35e97a8d68
lrwxr-xr-x 1 root root 35 Jan 25 13:31 OSDATA-611be152-3a16ce54-0410-a0423f377a7e -> 611be152-3a16ce54-0410-a0423f377a7e
lrwxr-xr-x 1 root root 35 Jan 25 13:31 OSDATA-62071ba9-5751bd90-6e1b-a0423f377a7e -> 62071ba9-5751bd90-6e1b-a0423f377a7e
lrwxr-xr-x 1 root root 35 Jan 25 13:31 local_vmfs_thor -> 611bec6e-48e0ca22-11e8-a0423f377a7e
lrwxr-xr-x 1 root root 17 Jan 25 13:31 pandora_md0_vms -> 4e5242b9-afbbf328
[root@localhost:/vmfs/volumes] esxcli system version get
Connection failed
[root@localhost:/vmfs/volumes]

 

 

# Working v8 upgrade

[root@odin:/vmfs/volumes] ls -alh
total 3592
drwxr-xr-x 1 root root 512 Jan 25 13:32 .
drwxr-xr-x 1 root root 512 Jan 24 11:39 ..
drwxr-xr-x 17 root root 4.0K Jan 23 13:31 4e5242b9-afbbf328
drwxr-xr-t 1 root root 76.0K Feb 11 2022 611be14d-ff327e68-c05d-a0423f35e8ee
drwxr-xr-t 1 root root 80.0K Jan 25 13:06 611bec8a-098f5dc0-2aea-a0423f35e8ee
drwxr-xr-t 1 root root 76.0K Jan 25 13:09 6206b042-64e7e4e0-8e7b-a0423f35e8ee
drwxr-xr-x 1 root root 8 Jan 1 1970 9699ef92-15464886-aead-3d3b3be4b602
lrwxr-xr-x 1 root root 35 Jan 25 13:32 BOOTBANK1 -> b53bb946-4b45aaa7-00b2-bd374c8e99cb
lrwxr-xr-x 1 root root 35 Jan 25 13:32 BOOTBANK2 -> 9699ef92-15464886-aead-3d3b3be4b602
lrwxr-xr-x 1 root root 35 Jan 25 13:32 OSDATA-611be14d-ff327e68-c05d-a0423f35e8ee -> 611be14d-ff327e68-c05d-a0423f35e8ee
lrwxr-xr-x 1 root root 35 Jan 25 13:32 OSDATA-6206b042-64e7e4e0-8e7b-a0423f35e8ee -> 6206b042-64e7e4e0-8e7b-a0423f35e8ee
drwxr-xr-x 1 root root 8 Jan 1 1970 b53bb946-4b45aaa7-00b2-bd374c8e99cb
lrwxr-xr-x 1 root root 35 Jan 25 13:32 local_vmfs_odin -> 611bec8a-098f5dc0-2aea-a0423f35e8ee
lrwxr-xr-x 1 root root 17 Jan 25 13:32 pandora_md0_vms -> 4e5242b9-afbbf328
drwxr-xr-x 1 root root 512 Jan 25 13:32 vsan:5258d3c3eacaf6a7-22de3661545c65bb
lrwxr-xr-x 1 root root 38 Jan 25 13:32 vsanDatastore -> vsan:5258d3c3eacaf6a7-22de3661545c65bb
[root@odin:/vmfs/volumes] esxcli system version get
Product: VMware ESXi
Version: 8.0.0
Build: Releasebuild-20513097
Update: 0
Patch: 0

 

#  working version 7 node in cluster waiting for upgrade

[root@varda:/vmfs/volumes] ls -alh
total 2568
drwxr-xr-x 1 root root 512 Jan 25 13:30 .
drwxr-xr-x 1 root root 512 Jan 24 13:22 ..
drwxr-xr-x 1 root root 8 Jan 1 1970 40ccffbd-de4c52d4-d0bc-46e8fed8d4c6
drwxr-xr-x 17 root root 4.0K Jan 23 13:31 4e5242b9-afbbf328
drwxr-xr-t 1 root root 76.0K Jan 25 13:09 6182eae6-617577d2-1cdb-0090fa79a4d2
drwxr-xr-t 1 root root 76.0K Jan 25 13:14 63039d9b-cf2ff096-b2fb-0090fa95ef99
lrwxr-xr-x 1 root root 35 Jan 25 13:30 BOOTBANK1 -> d3e7a4bd-d44cd171-79a8-c89630b1d9ec
lrwxr-xr-x 1 root root 35 Jan 25 13:30 BOOTBANK2 -> 40ccffbd-de4c52d4-d0bc-46e8fed8d4c6
lrwxr-xr-x 1 root root 35 Jan 25 13:30 OSDATA-6182eae6-617577d2-1cdb-0090fa79a4d2 -> 6182eae6-617577d2-1cdb-0090fa79a4d2
drwxr-xr-x 1 root root 8 Jan 1 1970 d3e7a4bd-d44cd171-79a8-c89630b1d9ec
lrwxr-xr-x 1 root root 35 Jan 25 13:30 local_vmfs_varda -> 63039d9b-cf2ff096-b2fb-0090fa95ef99
lrwxr-xr-x 1 root root 17 Jan 25 13:30 pandora_md0_vms -> 4e5242b9-afbbf328
drwxr-xr-x 1 root root 512 Jan 25 13:30 vsan:5258d3c3eacaf6a7-22de3661545c65bb
lrwxr-xr-x 1 root root 38 Jan 25 13:30 vsanDatastore -> vsan:5258d3c3eacaf6a7-22de3661545c65bb
[root@varda:/vmfs/volumes] esxcli system version get
Product: VMware ESXi
Version: 7.0.3
Build: Releasebuild-20036589
Update: 3
Patch: 50

 

Suggestions for next step?

 

 


Nerd needing coffee
Kinnison
Expert

Hi,


To be honest, I'm wondering, just out of curiosity, whether yours is a so-called "homelab" or a production context.
From what you write, one of your HOSTs still running ESXi version 7 has only one "OS-DATA" partition while the others have two "OS-DATA" partitions; I think you should try to establish how they are used.


As I told you I have never seen such a thing.


Regards,
Ferdinando

JeremeyWise
Enthusiast

 

 

Yes, this is a "home lab", but I use it as dev to build out test-first processes and deployment logic to then deploy to our more official presales lab. Hence the upgrade to vSphere 8 and vSAN 3 testing for Tanzu. Keeping ahead of what customer services ask 🙂

 

And you also noticed that the vSphere 7 nodes list only one OS-DATA volume, while all the version 8 nodes list two, on different disk IDs.

What I was trying to do is map this:

```
[root@localhost:/vmfs/volumes] ls -alh
total 3592
drwxr-xr-x 1 root root 512 Jan 25 13:31 .
drwxr-xr-x 1 root root 512 Jan 24 11:39 ..
drwxr-xr-x 17 root root 4.0K Jan 23 13:31 4e5242b9-afbbf328
drwxr-xr-t 1 root root 76.0K Feb 11 2022 611be152-3a16ce54-0410-a0423f377a7e
drwxr-xr-t 1 root root 76.0K Jan 20 16:36 611bec6e-48e0ca22-11e8-a0423f377a7e
drwxr-xr-t 1 root root 76.0K Jan 20 17:50 62071ba9-5751bd90-6e1b-a0423f377a7e
drwxr-xr-x 1 root root 8 Jan 1 1970 810e6f29-b44e0187-06bb-4b35e97a8d68
drwxr-xr-x 1 root root 8 Jan 1 1970 86238060-a64b110e-a78c-ab7675c53a90
lrwxr-xr-x 1 root root 35 Jan 25 13:31 BOOTBANK1 -> 86238060-a64b110e-a78c-ab7675c53a90
lrwxr-xr-x 1 root root 35 Jan 25 13:31 BOOTBANK2 -> 810e6f29-b44e0187-06bb-4b35e97a8d68
lrwxr-xr-x 1 root root 35 Jan 25 13:31 OSDATA-611be152-3a16ce54-0410-a0423f377a7e -> 611be152-3a16ce54-0410-a0423f377a7e
lrwxr-xr-x 1 root root 35 Jan 25 13:31 OSDATA-62071ba9-5751bd90-6e1b-a0423f377a7e -> 62071ba9-5751bd90-6e1b-a0423f377a7e
lrwxr-xr-x 1 root root 35 Jan 25 13:31 local_vmfs_thor -> 611bec6e-48e0ca22-11e8-a0423f377a7e
lrwxr-xr-x 1 root root 17 Jan 25 13:31 pandora_md0_vms -> 4e5242b9-afbbf328
```

 

How does this map back to a physical disk (or is it just a RAM disk and I am barking up the wrong tree)? See attached for the output of physical disks and IDs. I tried to build a mapping based on the highlighted "OSDATA" disks, but nothing lines up.

But I can say that both fresh installations of 8 and upgrades from 7 show both "OSDATA" disks. Maybe someone else with 8 can check whether they see the same thing, and what these are mapped to (physical disk or RAM disk).
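On the UUID-to-disk question, a few commands should answer it directly. These are sketched from memory (double-check the exact flags on your build):

```shell
# vmkfstools -P prints the partition(s) backing a mounted volume,
# including the device name (t10.ATA_..., with a partition number)
vmkfstools -P /vmfs/volumes/OSDATA-62071ba9-5751bd90-6e1b-a0423f377a7e

# List every mounted filesystem with UUID, label, and type in one
# table (use localcli while hostd is down)
localcli storage filesystem list

# Map VMFS volumes to device:partition pairs
esxcfg-scsidevs -m
```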

 

 

 


Nerd needing coffee
Kinnison
Expert

Hi,


This intrigues me because the systems I've worked on so far, including mine (recently updated to version 8.0a), all have only one "OS-DATA" partition; but as boot devices they use larger-capacity disk drives, and in any case not ones managed by SATA controllers integrated on the motherboard.


My guess (so I could be wrong) is that somehow with the setup of your systems the installation or update process reacts differently than usual (at least as I'm used to seeing).


Regards,
Ferdinando

JeremeyWise
Enthusiast

I think I am at a state where recovery of this node is not possible. I will eject it from vSAN and the cluster and do a new installation. I wish I could open the server up and disconnect all the drives but the boot drive, but that would take hours (long story.. 1U servers at home = water-cooled to be able to remove the fans and keep them <30 dB 🙂).

I will try to be very explicit when I install ESXi to select the one boot disk.

 

Can someone with a fresh installation of 8 on a SATA SSD post their disk output?

Can someone explain that output table where "root" was 100%?

Can someone help where you can map OSDATA volumes UUIDs / GUIDS back to which physical volume those reside on?

 

Those seem like they would be good debugging tasks for building knowledge of the system.

 


Nerd needing coffee
Kinnison
Expert
(Accepted Solution)

Hi,


For that matter, I've never even seen a 1U "home" liquid-cooled system, just a few "big" mainframes (many years ago) and a few HEPCs.
Anyway, I too would consider reinstalling ESXi.


Let us know how it ends. 😊


Good things,
Ferdinando

JeremeyWise
Enthusiast

vSAN_orphans_needing_homes.png

Sorry for the lag on this post.. I swear I clicked post and it was done.

 

The cluster is back "online", in the sense that I did a reinstallation of the node.. and ... tapped the rock a second time.. and it worked 🙂

So... then vSAN went offline.. because two of three nodes were on different versions of vSAN... so I had to roll the dice.. and upgrade the last node... which was.. VIB-conflict fun.. but after purging a few OEM VIBs.... the upgrade went fine.

vSAN then came back online... but one of my vSAN disks was showing yet more orphaned chunklets.. my eternally growing pile of orphans.

Curiously, vSAN-hosted VMs would not start (HA error of nodes not compatible). So I shaved the yak of the vSAN license, which had to be changed / upgraded.. and then ejected nodes from the cluster and brought them back into vCenter.. THEN.. after a few rounds of that.. vSAN cleared the "HA node not compatible" error, so the VMs would power on.

Now I am down to a rolling per-node disk removal to clear out my orphaned chunklets. All the VMs are fine and booting.. just concerned that more and more orphans will create issues ...

Thanks for the postings and for looking at this.

 

   


Nerd needing coffee