VMware Cloud Community
HCBV
Enthusiast

ESXi 7.0.2 becomes uncontrollable

Hello all,

I have found myself in a situation where one of my two ESXi hosts becomes completely uncontrollable. It is still pingable, reachable on its web interface, and also reachable through VCSA, but it won't do anything: it won't power off VMs, it won't show status, nothing at all.

The story is quite long, but I'll try to give as much background info as possible that might be related. I hope you can follow my thoughts.

When the host is in this state it still allows SSH logins, but even through the CLI the VMs won't power off and cannot be killed.
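To be clear, these are the kind of commands I mean from the SSH session (the world ID below is just an example; the real one comes from the list output):

# List the world IDs of running VMs, then attempt a forced kill
esxcli vm process list
esxcli vm process kill --type=force --world-id=123456

Even the forced kill has no effect while the host is in this state.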

Rebooting does not work either: it just keeps saying that "a restart is in progress" and never continues to actually restart. To be fair, I waited 15 minutes or longer before I decided to pull the plug.

The only thing that works is to do a hard reset through IPMI.

This situation has happened 3 times in a row over the course of 6 days.

 

My setup is as follows:

2 ESXi hosts, both on ESXi 7.0.2 (the hardware vendor supports this version)

1 VCSA on version 7.0.2 U2

2 iSCSI hosts on separate VLANs, both using MPIO and round robin (a quick way to verify the path policy is sketched after this list).

The ESXi hosts have 2 NICs: both NICs have a VMkernel for iSCSI, and the first NIC also has a VMkernel for management/vMotion.

Both ESXi hosts also have internal NVMe SSDs containing VMs, and these VMs are just as uncontrollable: they can't be powered off, rebooted, or shut down. One of them even showed a blue screen while its host was uncontrollable.
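This is roughly how I check the multipathing setup from the host shell; the commands are standard esxcli, nothing specific to my environment:

# Show the path selection policy per device (round robin should show up as VMW_PSP_RR)
esxcli storage nmp device list
# List all storage paths, including those behind the software iSCSI adapter
esxcli storage core path list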

 

There have been two changes in the last week, so I am unsure which of them might be the cause, or whether I am just unlucky and experiencing a bug of some sort.

1) I updated from 7.0.1 (both ESXi and VCSA) to 7.0.2

2) I implemented more VLANs. The management VMkernel (vMotion and management) and most of the VMs are now bound to a DPortGroup on VLAN 10 instead of the default VLAN 1. I created this DPortGroup on a DSwitch that both hosts were already attached to (the checks I did afterwards are sketched below).
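For reference, this is roughly how I verified the VMkernel interfaces after the VLAN change; vmk0 and the target IP are just examples from my setup:

# List the VMkernel interfaces and the port groups they are attached to
esxcli network ip interface list
esxcli network ip interface ipv4 get
# Test reachability from a specific VMkernel interface
vmkping -I vmk0 192.168.10.1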

After implementing the updates and the VLANs, everything worked fine for about 48 hours. Then most VMs became unresponsive/crashed, and one Windows domain controller even showed a blue screen: "HAL INITIALIZATION FAILED".

I had to hard reset the ESXi hosts to make them function again. After everything was up and running, the same thing happened again about 48 hours later. I then started doubting myself and the VLAN configuration I had made. So I started over: host B had been hosting almost all VMs since the first incident, so I quickly moved all VMs to host A and completely reinstalled host B. I did not reinstall host A at that point, since it hadn't frozen or become uncontrollable and was now hosting all VMs. About 48 hours later, host A was the one that became uncontrollable and also needed a hard reset to make it function again. After that host was up, I migrated all VMs to host B and reinstalled host A as well.

Both hosts are now reinstalled; VCSA is still as it was (upgraded from 6.x to 7.x and now to 7.0.2 U2).

I did run into a lot of trouble while updating both ESXi and VCSA through Lifecycle Manager. Ultimately both needed a manual update using the ISO files instead of the usual update procedure; I had never had to do it this way before.

I also found that after upgrading VCSA to 7.0.2 U2 I needed to upgrade the DSwitch, and somehow it kept saying "the update is still in progress" on one of the ESXi hosts. Perhaps something went wrong in this phase that might explain the problem I am having?

I am still worried that this problem might recur, and I have no clue what might be wrong.

It might be my own mistake in some configuration that I am not aware of. It might be some network/VLAN setting. Or perhaps there is an issue with ESXi/VCSA 7.0.2?

Is there perhaps anyone with more experience and more insight into these problems? I also started thinking about installing a new VCSA and migrating to it, but that didn't work because the version is the same. The reason I am considering reinstalling VCSA is that ever since upgrading VCSA from 6.x to 7.x it always shows a warning in vSphere health about ESXi host connectivity.

I have verified that there is no such connectivity problem: I can see all UDP heartbeats on port 902 coming in and going out exactly every 10 seconds. I have also had a frequent/permanent low RAM situation on VCSA since updating to VCSA 7.x, so I added another 2 GB of RAM. I use the tiny deployment size and have now added 2 GB of RAM for the second time, so it has a total of 14 GB instead of the 10 GB I always used to dedicate to VCSA.
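For what it's worth, I checked the heartbeats directly on the host with something along these lines (vmk0 is my management VMkernel interface; adjust as needed):

# Capture the vCenter heartbeat traffic on the management interface
tcpdump-uw -i vmk0 -n udp port 902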

To make a long story short: I am lost. Am I doing something wrong, or is it likely that there is a bug, a broken driver, or something else going wrong?

53 Replies
plannue
Contributor

How you managed to get the patch straight away is surprising to me. I just opened a case and the agent had no idea what I was talking about. He provided the few commands for the 'documented workaround', which is to rescan vmhba32 and then restart hostd and vpxa. I understand that's the official workaround, but having to build this out into a script, SSH to the hosts, upload it, and then just wait for a host to die? Not a fan.
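For anyone following along, the workaround boils down to something like this from the host shell (vmhba32 is the USB storage adapter on my hosts; yours may differ):

# Rescan the USB storage adapter, then restart the management agents
esxcli storage core adapter rescan --adapter=vmhba32
/etc/init.d/hostd restart
/etc/init.d/vpxa restart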

When I pressed him for the actual fixed VIB, he asked me to point him to this thread and had to check with his team on the process.

Now he says he has to take care of some things on the backend before he can send me an authorization email, which basically just acknowledges that hot patches are not guaranteed, are outside the normal build cycle, etc. Once I reply acknowledging those facts, ONLY THEN can I have the patch. Who knows how long it will take before I actually get it.

journeym
Contributor

The same thing is now happening regularly on 2 servers that have USB RAID SD cards.

I have opened a case, and the agent just said to reboot the host; he didn't mention any other workaround.

I mentioned this forum thread and he didn't comment on it, only said that if there is a workaround in the topic I should use it. Well, that's really the worst VMware support experience I've had so far.

I'm also waiting on some sort of patch and trying to push VMware to provide me with one, but the agent keeps saying there is no workaround, only the official article on rescanning storage, and that there is a USB timeout. I suppose he thinks it is a hardware issue, not taking into account that I have two completely different servers with USB cards showing the same problem, and that there were no problems before upgrading to 7.

 

 

plannue
Contributor

There are methods you can use to climb the support chain. The phrase "I want an escalation, please" should get you to a TAC manager, and you can explain that there is a hot patch that engineering has and that you want it. Otherwise, support pressed me hard on why I can't simply downgrade or 'just run those workaround commands' whenever the issue arises.

One thing is for sure: these USB/SD card timeouts are likely NOT hardware related. If I were you, I would push back on that fairly hard.

This has also been the worst VMware support experience I've had to date. The fact that they still haven't pulled the bad VIB from the depot makes it all the more upsetting to me.

actyler555
Enthusiast

Well, hi VMware employee @pkvmw, it turns out you guys do actually read these posts. You realize how wildly unacceptable this stunt by VMware has been, right? Time to extend vSphere 6.7 support through to 2025 at least. No way in "H-E" double hockey sticks do I recommend anyone touch vSphere 7 in production anytime soon.

Love that both product development quality and support at VMware have plummeted. At least we didn't build our entire environment on top of VMware platforms, right? Oh wait...

Regards,
Adam Tyler

pkvmw
VMware Employee

Hi @actyler555. From a personal perspective I can totally understand your frustration, I really do. But being basically a small cog in GS, there's not really much I can do or change; I leave that up to other people here at VMware. I'm just reading and writing in the forums here and there, trying to help in my free time when possible.

Feel free to communicate your thoughts to your TAM or other appropriate channels you might have.

Can't be of more help in this regard - sorry!

Regards,
Patrik

HCBV
Enthusiast

@pkvmw thank you for sharing the links and info.

I have just patched one of my hosts to see if the issue is resolved.

Side note: I am on ESXi 7.0.2, so I could only get the update by staging/remediating it within VCSA.

pkvmw
VMware Employee

You're welcome, @HCBV!

Glad you mentioned it. There's unfortunately no ESXi 7.0 U2c ISO available for direct download. I'm not aware of the reasoning behind this, but I do know there are ongoing discussions.

You can patch via VCSA, or download the Offline Bundle for 7.0 U2c here: https://customerconnect.vmware.com/patch/.
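If you go the offline bundle route, the patch can also be applied from the host shell, roughly like this (the datastore path is just an example, and the exact profile name should be taken from the list command output):

# List the image profiles contained in the downloaded depot
esxcli software sources profile list -d /vmfs/volumes/datastore1/VMware-ESXi-7.0U2c-18426014-depot.zip
# Apply the standard profile ("profile update" only replaces older VIBs, unlike "profile install")
esxcli software profile update -d /vmfs/volumes/datastore1/VMware-ESXi-7.0U2c-18426014-depot.zip -p ESXi-7.0U2c-18426014-standard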

Using the latest available ESXi ISO and the offline bundle you can also create your own custom U2c ISO if required - see more details about Image Builder here: https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.esxi.install.doc/GUID-48AC6D6A-B936-4585-87... or via vLCM: https://williamlam.com/2021/03/easily-create-custom-esxi-images-from-patch-releases-using-vsphere-im... 

Regards,
Patrik

HCBV
Enthusiast

Thank you for the information @pkvmw 

Perhaps I am a bit confused by the numbering system.

Just to clarify: when I go to the product patches page and select ESXi as the product, there is no version 7.0.2 to choose, only 7.0, 7.0.0 and 7.0.1. The update is found under version 7.0 and is named VMware-ESXi-7.0U2c-18426014-depot.

Is this version suitable for installation on VMware ESXi 7.0.2 build 17867351, or will it downgrade my current 7.0.2 to 7.0 U2c?

vThomasF
VMware Employee

Hi,

This site has a handy list of the build numbers: https://www.virten.net/vmware/esxi-release-build-number-history/

A higher build number is a newer patch; 18xxxxx will contain everything that is in 17xxxxx.
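You can compare that against what the host is currently running, e.g. from the host shell:

# Show the installed ESXi version and build number
vmware -vl
esxcli system version get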

tonyanshe
Enthusiast

Currently running ESXi 7.0 U2d build 18538813 and experiencing the exact same symptoms. It has happened to 3 hosts so far. VMware confirmed it is not the SD card issue, as that was resolved in ESXi 7.0 U2c, and the logs do not contain the related errors. Has anyone else experienced this behaviour in this version?

actyler555
Enthusiast

I know it isn't that helpful, but we gave up and are moving to hard drives instead of SD cards...

-Adam

vThomasF
VMware Employee

Hi Tony,

do you have a case number?

tonyanshe
Enthusiast

Hi Thomas, Yes SR#21279910611 
