VMware Cloud Community
asg2ki
Enthusiast

vSphere ESXi 6.7 - PSOD with Distributed Switch upgrade to v6.6

All,

I just upgraded one of my environments from vSphere 6.5 to vSphere 6.7, and I'm experiencing a PSOD right after initiating an upgrade of the assigned dvSwitch to the latest version. So far I'm getting the following PSOD output, and I'm more than confident it's not related to faulty hardware:

[PSOD screenshot: pastedImage_0.png]

I have a single cluster with 3 hosts in my environment. The first time I noticed the above PSOD was right after I initiated the dvSwitch upgrade via vCenter, which resulted in all 3 hosts getting a PSOD and rendered the whole cluster completely down. After I rebooted all 3 hosts they seemed to have stabilized, but I then decided to reboot the third host to check a few side things related to the new "quick boot" feature, and since then that host gets a PSOD almost immediately after connecting to the vCenter instance. When I checked the status of the dvSwitch, I noticed it complains about this very same host not yet being upgraded on the dvSwitch side, so I assume something is still not quite right with the latest vSphere 6.7 binaries.

It seems I'll have to physically disconnect the host from the network to get it back online, and probably remove the dvSwitch assignment from its configuration manually. In any case, I'm posting this as a heads-up to anyone who might be planning a similar upgrade: for the moment, be extremely careful with dvSwitch upgrades.
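For anyone who ends up in the same spot but still has vCenter access, one way to drop the affected host from the dvSwitch is through the API instead of the UI. This is only an illustrative pyVmomi sketch, not the exact procedure I ran; the vCenter, host, and dvSwitch names are placeholders, and the host must have no vmkernel ports or VMs left on the dvSwitch or the reconfigure task will fail:

```python
# Illustrative sketch only -- names/credentials are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only: skip certificate validation
si = SmartConnect(host='vcsa.lab.local', user='administrator@vsphere.local',
                  pwd='***', sslContext=ctx)
content = si.RetrieveContent()

def find_by_name(vimtype, name):
    """Return the first managed object of the given type with a matching name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    try:
        return next(obj for obj in view.view if obj.name == name)
    finally:
        view.Destroy()

dvs = find_by_name(vim.DistributedVirtualSwitch, 'dvSwitch01')
host = find_by_name(vim.HostSystem, 'esxi03.lab.local')

spec = vim.DVSConfigSpec(configVersion=dvs.config.configVersion)
spec.host = [vim.dvs.HostMember.ConfigSpec(operation='remove', host=host)]
dvs.ReconfigureDvs_Task(spec)  # drops the host from the dvSwitch membership
Disconnect(si)
```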

So far I haven't seen any indication in the release notes, nor anywhere else on the web, of similar behavior. Perhaps VMware should include this problem in the following KB article just in case:

VMware Knowledge Base

Should anyone else experience similar problems, feel free to share your experience and a solution if you have one.

vijayrana968
Virtuoso

Check if there is network flapping; see the vSphere 6.7 Release Notes.

asg2ki
Enthusiast

I did that already, but unfortunately it doesn't seem to be the case. Also, the release notes state that the issue is related to the qfle3f driver, and my NICs are all Intel Gb based.
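For what it's worth, here is a quick pyVmomi sketch (connection details are placeholders, not from my environment) that anyone can use to double-check which driver each physical NIC is actually loaded with, just to rule out qfle3f:

```python
# Illustrative sketch only -- names/credentials are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only: skip certificate validation
si = SmartConnect(host='vcsa.lab.local', user='administrator@vsphere.local',
                  pwd='***', sslContext=ctx)
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
for esx in view.view:
    for pnic in esx.config.network.pnic:
        # Prints e.g. "esxi01.lab.local vmnic0 igbn" -- an Intel driver rather than qfle3f
        print(esx.name, pnic.device, pnic.driver)
view.Destroy()
Disconnect(si)
```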

Thanks for the suggestion anyway.

vijayrana968
Virtuoso

Since this is a brand new release, you should contact VMware technical support.

asg2ki
Enthusiast

I would have gone that way if I had an active support contract, but since I don't, I'll stick to the forum and hope someone posts some additional technical thoughts. In any case, I'll try some manual troubleshooting steps a bit later when I get a spare moment.

kheopslinking
Contributor

Hello,

It's a well-known issue, and technical support asked me to wait 6 months (yes, 6 months) until the 6.7U1 release.

golddiggie
Champion

That's why most people wait at least a couple of months before they apply the latest version to a production environment. I've even had VMware engineers (support and otherwise) recommend waiting until a 'U1' release comes out before moving to a fresh major version.

I have to ask: are ALL your host network connections on distributed switches, or did you keep the host management connections on standard vSwitches?

asg2ki
Enthusiast

I agree with you. A major upgrade on a production environment should always be preceded by a test-lab upgrade in order to avoid such situations. In my case I had all my hosts configured with Distributed vSwitches, including the management section, but keeping the management part on a dedicated standard vSwitch wouldn't have helped anyway: when I forcefully configured one via the console while the host was disconnected from vCenter, the moment that same host came back on the network and started synchronizing its configuration with the vCenter appliance, vCenter apparently tried to continue the Distributed Switch upgrade procedure, and this effectively broke the host again.

I suspect the situation in my case was due to the fact that I hadn't upgraded NSX to the appropriate 6.4 version, so the NSX binaries may have caused the problems when they were initialized during the regular boot procedure. At the time NSX 6.4 wasn't available yet, so I took the risk of upgrading the ESXi hosts despite the incompatibility; it was just a test lab, and I only needed a Load Balancer VM from the NSX side, so no VXLAN was involved at that point. I assumed it would be safe to make the upgrade, but I was probably wrong about that. I can't confirm the problem really was due to the NSX incompatibility, but when I fixed the issue on all my hosts, the latest available NSX version was still 6.3.5, which afterwards complained that it couldn't install the necessary NSX binaries. All in all, that didn't bother me.

Anyway, I resolved my issue by resetting each ESXi host's configuration to factory defaults and then applying the appropriate host profile, including all Distributed vSwitch configuration settings. Since then my ESXi hosts have been absolutely stable, and the only problem I still have is a cosmetic message about the Distributed vSwitch still being upgraded, even though it already is. I assume this little issue is just a stuck event record somewhere in the vCenter database.
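In case someone wants to verify the real upgrade state instead of trusting that message, here is a minimal pyVmomi sketch (again, connection details are placeholders) that prints the dvSwitch version reported on the vCenter side and the version each attached host member reports:

```python
# Illustrative sketch only -- names/credentials are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only: skip certificate validation
si = SmartConnect(host='vcsa.lab.local', user='administrator@vsphere.local',
                  pwd='***', sslContext=ctx)
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.DistributedVirtualSwitch], True)
for dvs in view.view:
    print(dvs.name, '-> switch version:', dvs.config.productInfo.version)
    for member in dvs.config.host:  # one entry per attached host
        member_version = member.productInfo.version if member.productInfo else 'unknown'
        print('   ', member.config.host.name,
              'member version:', member_version, 'status:', member.status)
view.Destroy()
Disconnect(si)
```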
