camealy
Contributor

vSAN Disk upgrade from 7 to 10 failing

I have a 5 host vSphere/vSAN cluster with 1 flash and 2 spinning disks per host.

Every time I try to run the Upgrade disk format it fails saying it couldn't gather status for various hosts.  Seems like each time I try the hosts that "don't respond" are different.  Usually only 1-3 of them.  I just upgraded the VCSA and all the Hosts to the latest 6.7 version to see if that would help.  I have also tried this upgrade with all VMs shut down except obviously for the VCSA.

"General vSAN error.  Failed to gather statuses from host(s) host1.domain.tld, host4.domain.tld"

I don't really have any issues with the cluster but I would like to get the warning to go away and get up to disk format v10.

Thanks in advance for any help you can offer!

4 Replies
TheBobkin
VMware Employee

Hello camealy​,

Welcome (back) to Communities.

Are they entering a 'Not responding' state? If hosts become non-responsive to vCenter, then cluster-level tasks such as the on-disk format upgrade (or anything else that requires communication between multiple hosts and vCenter, e.g. vMotion) will time out or fail. Not being able to upgrade the ODF is just a side-effect; the root cause should be identified and resolved as a priority. Typical causes are poor or misconfigured management networks or issues with vCenter itself.

If the hostd aspect is fine (i.e. hosts are NOT periodically entering a 'Not responding' state), then the other likely problematic component is vsanmgmtd, which runs on each host and is responsible for many of the vSAN-vCenter operations (e.g. vSAN Health, Disk Management and vSAN performance stats). The most common issues with this service are not with the service itself but with 3rd-party applications (particularly monitoring applications such as Veeam ONE) which, depending on a number of factors, can overburden this daemon and/or other services that it depends on. A typical symptom is the daemon frequently running out of system memory, which may be indicated by 'MEMORY PRESSURE' messages being logged in vsanmgmtd.log.
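A quick way to check for this is to count the memory-pressure messages in the log. The snippet below is just an illustration against a made-up sample log; on a real host you would point the grep at /var/log/vsanmgmtd.log (or /var/run/log/ once rotated):

```shell
# Illustration only: the log file and its contents below are fabricated
# samples. On an ESXi host, grep /var/log/vsanmgmtd.log instead.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2019-06-01T10:00:01Z vsanmgmtd: INFO  periodic health check complete
2019-06-01T10:00:05Z vsanmgmtd: WARN  MEMORY PRESSURE: resource pool near limit
2019-06-01T10:00:09Z vsanmgmtd: WARN  MEMORY PRESSURE: resource pool near limit
EOF
# Count memory-pressure events; a steadily growing count over time
# suggests the daemon is being overburdened.
COUNT=$(grep -c 'MEMORY PRESSURE' "$LOG")
echo "memory-pressure events: $COUNT"
rm -f "$LOG"
```

If the count keeps climbing between checks, that points at vsanmgmtd (or something leaning on it) as the bottleneck.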

As this is a potentially broad topic (i.e. there is no single possible root cause) and it can be symptomatic of other misconfigurations/issues in the environment, I would advise opening a Support Request with my colleagues in GSS vSAN if you have a valid support account that can be used.

Bob

camealy
Contributor

The hosts stay responsive and there are no other warnings or issues; if I never attempt the disk format upgrade, everything hums right along. Even during the upgrade attempt nothing is adversely affected. I just get the error message mentioned above and the process doesn't continue.

TheBobkin
VMware Employee

Thanks for clarifying.

Does the ODF upgrade pre-check complete without issues?

Is everything else in vSAN Health currently green?

One won't necessarily see alerts relating to all possible vsanmgmtd issues in the UIs.

Can you PM me or attach vsanmgmtd.log, vmkernel.log and vsansystem.log (all located in /var/log/ or /var/run/log/ if older/rolled-over) from one of the hosts, covering the time period when this last failed (and indicate the time in UTC if you didn't just reproduce it before pulling the logs)?

Note that I can't make any promises that I will be able to find something (and that this is not an official VMware GSS support channel) but I would like to take a look.

Bob

camealy
Contributor

The pre-check completed fine. Your recommendation to pull those logs helped me fix the issue: the vmkernel log on each host was showing SCSI sense warnings about that host's cache tier drive. From what I can tell they had always been there, logging something it didn't like about 5-10 times a second. The drives themselves appeared fine, but I removed the disk groups one host at a time and replaced the cache tier drives with different ones. I verified the sense warnings were gone, and the version 7-to-10 upgrade then completed successfully.
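For anyone hitting something similar, a grep like the one below can show which device is generating the noise. This is a sketch against a fabricated sample log; the device identifier and sense data shown are made up, and on a real host you would grep /var/log/vmkernel.log:

```shell
# Illustration only: the log contents and naa.* identifier below are
# fabricated. On an ESXi host, grep /var/log/vmkernel.log instead.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2019-06-01T10:00:00Z cpu4: ScsiDeviceIO: Cmd(0x459b40) 0x28 to dev "naa.5000c500aaaaaaaa" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0
2019-06-01T10:00:00Z cpu4: ScsiDeviceIO: Cmd(0x459b41) 0x28 to dev "naa.5000c500aaaaaaaa" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0
2019-06-01T10:00:01Z cpu2: Vol3: some unrelated message
EOF
# Group sense warnings by device to see which disk is noisiest.
DEVICE=$(grep 'sense data' "$LOG" | grep -o 'naa\.[0-9a-f]*' \
  | sort | uniq -c | sort -rn | head -1 | awk '{print $2}')
echo "noisiest device: $DEVICE"
rm -f "$LOG"
```

If one naa.* identifier dominates the output, that is the drive worth investigating first.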

My guess is that whatever service manages those drives was so busy writing out those SCSI sense issues that it would sporadically time out other commands. Surprising that it only came up during this process, since these drives had been in place for years.

Thanks!
