HCBV
Enthusiast

ESXi 7.0.2 becomes uncontrollable


Hello all,

I have found myself in a situation where one of my two ESXi hosts becomes completely uncontrollable. It is still pingable, reachable on its web interface, and also reachable through VCSA. But it won't do anything: it won't power off VMs, it won't show status. Nothing at all.

The story is quite long, but I will try to give as much background info as possible that might be related. I hope you can follow my thoughts.

When the host is uncontrollable it still allows SSH logins, but even through the CLI the VMs won't power off and cannot be killed.

Rebooting does not work either; it just keeps saying that "a restart is in progress" and never actually restarts. To be fair: I waited 15 minutes or longer before I decided to pull the plug.

The only thing that works is to do a hard reset through IPMI.

This situation has happened 3 times in a row over the course of 6 days.

 

My setup is as follows:

2 ESXi hosts: both on ESXi 7.0.2 (the hardware vendor supports this version)

1 VCSA on version 7.0.2 U2

2 iSCSI hosts on separate VLANs, both using MPIO and round robin.

The ESXi hosts have 2 NICs: both NICs have a VMkernel port for iSCSI, and the first NIC also has a VMkernel port for management/vMotion.

Both ESXi hosts also have internal NVMe SSDs containing VMs, and these VMs are uncontrollable as well. By which I mean: can't power off, reboot, or shut down. One of these VMs even showed a blue screen while its host was uncontrollable.
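For completeness, the multipathing state and round-robin policy can be checked from the ESXi shell. A sketch (the commands are wrapped in a function so nothing executes until you call it on a host):

```shell
# Sketch: inspect iSCSI multipathing on an ESXi host. The device list
# shows the Path Selection Policy per device (VMW_PSP_RR = round robin).
# Wrapped in a function; call show_mpio_state on the host itself.
show_mpio_state() {
    esxcfg-mpath -b                 # list all paths, grouped by device
    esxcli storage nmp device list  # per-device Path Selection Policy
}
```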

 

There have been two changes in the last week, so I am now wondering which might be the cause, or whether I am just unlucky and experiencing a bug of some sort.

1) I updated from 7.0.1 (both ESXi and VCSA) to 7.0.2

2) I implemented more VLANs, especially for the management VMkernel port (vMotion and management), and most of the VMs are now bound to a DPortGroup on VLAN 10 instead of the default VLAN 1. I made this DPortGroup on a DSwitch that both hosts were already attached to.

After implementing the updates and the VLANs, everything worked fine for about 48 hours. Then most VMs became unresponsive/crashed, and one Windows domain controller even showed a blue screen: HAL INITIALIZATION FAILED.

I had to hard reset the ESXi hosts to make them function again. After everything was up and running again, the same thing happened about 48 hours later. I then started doubting myself and the VLAN configuration I had made. So I started over: I quickly moved all VMs to host A and completely reinstalled host B, which had originally been hosting almost all VMs due to the prior error. I did not reinstall host A, since it hadn't frozen/become uncontrollable and was now hosting all VMs. Again about 48 hours later, host A was now the host that was uncontrollable and also needed a hard reset to make it function again. After that host was up, I migrated all VMs to host B and reinstalled host A as well.

Both hosts are now reinstalled; VCSA is still as it was (upgraded from 6.x to 7.x and now to 7.0.2 U2).

I did run into a lot of trouble while updating both ESXi and VCSA through the Lifecycle Manager. Ultimately both needed a manual update using the ISO files instead of the usual update procedure. I had never had to do it this way before.

I also found that after upgrading VCSA to 7.0.2 U2 I needed to upgrade the DSwitch. And somehow it kept saying "the update is still in progress" on one of the ESXi hosts. Perhaps something went wrong in this phase that might explain the problem I am having?

I am still worried that this problem might reoccur, and I have no clue what might be wrong.

It might be my own mistake in some configuration that I am not aware of. It might be some network/vlan setting. Or perhaps there is an issue with ESXi / VCSA 7.0.2?

Is there perhaps anyone with more experience and more insight into these problems? I also thought about installing a new VCSA and migrating, but that didn't work due to the version being the same. The reason I am thinking about reinstalling VCSA is that ever since upgrading VCSA from 6.x to 7.x it has always shown a warning in vSphere health about ESXi host connectivity.

I have verified that there is no such problem: I can see all UDP heartbeats on port 902 coming in and going out exactly every 10 seconds. I have also had a frequent/permanent low-memory situation on VCSA since updating to VCSA 7.x, so I added another 2 GB of RAM. I use the tiny deployment size and have now added 2 GB of RAM for the second time, so VCSA has a total of 14 GB instead of the 10 GB I always used to dedicate to it.
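In case anyone wants to repeat the heartbeat check: this is roughly how it can be done from the ESXi shell, assuming the tcpdump-uw tool that ships with ESXi and vmk0 as the management interface (adjust for your setup; wrapped in a function so nothing runs until you call it on the host):

```shell
# Sketch: watch the host<->VCSA heartbeat traffic (UDP port 902) on the
# management VMkernel interface. Assumes ESXi's bundled tcpdump-uw and
# vmk0; nothing executes until check_heartbeats is called on the host.
check_heartbeats() {
    vmk="${1:-vmk0}"
    # You should see one heartbeat roughly every 10 seconds per direction.
    tcpdump-uw -i "$vmk" -n udp port 902
}
```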

To make my long story short: I am lost. Am I doing something wrong, or is it likely that there is a bug, a broken driver, or something else going wrong?

53 Replies
A13x
Hot Shot

These are the steps I performed before the patch.

Stop vpxa and hostd; these must be stopped first:

/etc/init.d/hostd stop
/etc/init.d/vpxa stop

Check for dead paths: esxcfg-mpath -L

Identify the SD card device and check for anything dead:

esxcli storage core device world list

Mine is vmhba32, so I run:

esxcfg-rescan -d vmhba32

Check again

esxcli storage core device world list

If the dead device still exists: esxcfg-rescan -d vmhba32

esxcfg-rescan -u vmhba32

Wait at least 5 minutes for the rescan to complete.

then

/etc/init.d/hostd start
/etc/init.d/vpxa start

The trick is to clear it all, wait for a while, and then start the services. If you do not wait long enough, or if the paths are not cleared, you need to do it all again.
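To put it all in one place, here is the whole sequence as a small script. This is a sketch: vmhba32 is my adapter (substitute yours), and the steps are wrapped in a function so nothing runs until you call it on the affected host.

```shell
#!/bin/sh
# Sketch of the recovery sequence above, for an ESXi host hit by the
# SD-card/USB boot-device bug. Nothing executes until you call
# recover_usb_boot on the host. vmhba32 is an example adapter name.
recover_usb_boot() {
    adapter="${1:-vmhba32}"

    # 1. Stop the management agents first.
    /etc/init.d/hostd stop
    /etc/init.d/vpxa stop

    # 2. Check for dead paths and inspect the boot device.
    esxcfg-mpath -L
    esxli_out=$(esxcli storage core device world list)
    echo "$esxli_out"

    # 3. Rescan the adapter: -d to remove the dead device (repeat if it
    #    persists), then -u to update the remaining paths.
    esxcfg-rescan -d "$adapter"
    esxcfg-rescan -u "$adapter"

    # 4. Give the rescan at least 5 minutes to settle before restarting
    #    the services; otherwise you will have to do it all again.
    sleep 300

    /etc/init.d/hostd start
    /etc/init.d/vpxa start
}
```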

 

That's assuming you have the SD card bug and the log is spammed with entries like: local6.info: vmkernel: cpu24:2097581)ScsiVmas: 1057: Inquiry for VPD page 00 to device mpx.vmhba32:C0:T0:L0 failed with error Timeout
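You can check for that signature quickly with grep. This demo runs the pattern against a sample line so it is self-contained; on the host you would point it at /var/log/vmkernel.log instead:

```shell
# Demo: match the SD-card-bug log signature. On a live host, replace the
# echo with: grep -c 'Inquiry for VPD page .* failed with error Timeout' /var/log/vmkernel.log
sample='local6.info: vmkernel: cpu24:2097581)ScsiVmas: 1057: Inquiry for VPD page 00 to device mpx.vmhba32:C0:T0:L0 failed with error Timeout'
echo "$sample" | grep -c 'Inquiry for VPD page .* failed with error Timeout'
# prints 1 (the sample line matches)
```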

 

 

HCBV
Enthusiast

I have an identical situation right now.

Did all the same tricks/commands, and rescanning leads to: Connection failed.

Arthos
Enthusiast

HCBV,

1. Where is ESXi installed? On USB? ESXi 7.0.2 has issues with USB devices.

2. Could you please share the errors from the log file?

Thanks.

 

HCBV
Enthusiast

My ESXi is installed on a dual SD card that uses the USB bus.

I believe I posted the log files before, but I will try to get a new one.

I didn't check for dead paths before rebooting the machine; currently it is working:

vmhba33:C0:T0:L0 state:active mpx.vmhba33:C0:T0:L0 vmhba33 0 0 0 NMP active local usb.vmhba33 usb.0:0

 

EDIT:

 

vmkernel.log is full of vmhba33 errors and faults, like these:

2021-06-08T08:43:56.206Z cpu11:2097365)WARNING: NMP: nmpUnclaimPath:1732: NMP device "mpx.vmhba33:C0:T0:L0" quiesce state change failed: Busy

2021-06-08T08:44:17.878Z cpu16:2451749)Vol3: 3711: Could not open device 'mpx.vmhba33:C0:T0:L0:7' for probing: No connection
2021-06-08T08:44:17.878Z cpu10:2451751)Vol3: 3711: Could not open device 'mpx.vmhba33:C0:T0:L0:5' for probing: No connection
2021-06-08T08:44:17.878Z cpu16:2451749)Vol3: 3711: Could not open device 'mpx.vmhba33:C0:T0:L0:7' for probing: No connection
2021-06-08T08:44:17.878Z cpu10:2451751)Vol3: 3711: Could not open device 'mpx.vmhba33:C0:T0:L0:5' for probing: No connection
2021-06-08T08:44:17.878Z cpu16:2451749)Vol3: 3711: Could not open device 'mpx.vmhba33:C0:T0:L0:7' for probing: No connection
2021-06-08T08:44:17.878Z cpu10:2451751)Vol3: 3711: Could not open device 'mpx.vmhba33:C0:T0:L0:5' for probing: No connection
2021-06-08T08:44:17.878Z cpu16:2451749)Vol3: 3711: Could not open device 'mpx.vmhba33:C0:T0:L0:7' for probing: No connection
2021-06-08T08:44:17.878Z cpu10:2451751)Vol3: 3711: Could not open device 'mpx.vmhba33:C0:T0:L0:5' for probing: No connection
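Tallying those probe failures per device path makes it easier to see which partitions are affected. A quick self-contained sketch using sample lines like the ones above (on the host you would feed it /var/log/vmkernel.log instead):

```shell
# Count "Could not open device" probe failures per device path.
# Uses embedded sample lines; on a host, replace the here-doc with
# /var/log/vmkernel.log as grep's input file.
cat <<'EOF' | grep -o "mpx\.vmhba33:C0:T0:L0:[0-9]*" | sort | uniq -c
2021-06-08T08:44:17.878Z cpu16:2451749)Vol3: 3711: Could not open device 'mpx.vmhba33:C0:T0:L0:7' for probing: No connection
2021-06-08T08:44:17.878Z cpu10:2451751)Vol3: 3711: Could not open device 'mpx.vmhba33:C0:T0:L0:5' for probing: No connection
2021-06-08T08:44:17.878Z cpu16:2451749)Vol3: 3711: Could not open device 'mpx.vmhba33:C0:T0:L0:7' for probing: No connection
EOF
# prints a count per partition, e.g. 1x ...L0:5 and 2x ...L0:7
```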

 

I am running with:

vmkusb 0.1-1vmw.702.0.0.17867351 VMW VMwareCertified 2021-05-02

Arthos
Enthusiast

HCBV,

 

You should move the boot device off USB if you are planning to use the host in a production environment. If USB is the only possible boot device, wait for the upgrade.

Thanks.

HCBV
Enthusiast

I currently only have a USB device as a bootable option.

Is there a resource or page to follow regarding the driver and the upcoming update? Is the beta USB driver perhaps available?

actyler1001
Enthusiast

This is unreal.  I have this problem too; I've spent hours of my life reloading ESXi and removing VIBs I thought might be the root cause.  It's completely unacceptable that VMware still distributes this broken build of ESXi.  I'm on VMware ESXi 7.0.2, build 17867351.

Regards,

Adam Tyler

actyler1001
Enthusiast

So get this: I opened a case with VMware to get a copy of the updated VIB (VMW_bootbank_vmkusb_0.1-2vmw.702.0.20.45179358.vib) and they won't provide it.  They want to waste my time and collect logs while my hosts are going offline.  No, I just need the fixed VIB, thanks; I'll let you know if it solves my problem.  So frustrating.
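For what it's worth, once the VIB file is actually in hand, the install itself is a one-liner from the ESXi shell. A sketch, wrapped in a function so nothing runs here; the file path is a placeholder, and --no-sig-check is only needed for an unsigned build (which, as noted elsewhere in the thread, is incompatible with Secure Boot):

```shell
# Sketch: install a downloaded vmkusb VIB on an ESXi host, then reboot.
# The path is a placeholder; esxcli requires an absolute path or URL.
# --no-sig-check is only for unsigned builds and breaks Secure Boot.
install_vmkusb_vib() {
    vib="${1:?usage: install_vmkusb_vib /absolute/path/to/vmkusb.vib}"
    esxcli software vib install -v "$vib" --no-sig-check
    reboot
}
```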

Regards,

Adam Tyler

IRIX201110141
Virtuoso

They accepted our #PR within minutes without wanting to see any logs... and closed it one second later because it's a known issue that will be "corrected" in the upcoming U3.

Regards,
Joerg

actyler1001
Enthusiast

Yeah, I've had this poor support experience a few times now with VMware support.  They want logs, and it takes forever to collect them; then they sit on them for a few days.  They don't know how to read them, and by then you've solved your own problem.  It makes me wonder why we maintain a support agreement at all...

Sounds like you won the lottery and had someone actually help you when you reached out.  Good for you.

Regards,

Adam Tyler

A13x
Hot Shot

Just get the updated one with Secure Boot supported.

actyler1001
Enthusiast

Not sure I follow.  Has a fix been posted somewhere, or?

Regards,

Adam Tyler

HCBV
Enthusiast

I still have not been able to get the new driver that should fix this issue. Am I missing something here?

IRIX201110141
Virtuoso

The U3 release, which contains the fixed driver, is not publicly available yet.

Regards
Joerg

LarryBlanco2
Expert

Same POOR VMWARE SUPPORT experience here as well.  I'm scratching my head at the direction VMware is going with support.  From a customer's perspective, it is definitely the wrong way.

I can understand that the USB VIB that is/was available to address the problem may not be ready for prime time, but if it addresses the problem for the masses with few side effects, that is better than nothing at all.  Sounds like a vaccine, no?

Regardless, I've been unable to get a support technician on the phone and have had to wait next to my phone to hear back from them.  If you miss the call, because you are actually working, then the whole thing starts all over.  Not very productive from a customer's standpoint.  It may work great for VMware, but who pays VMware for support?  We do.

 

 

IRIX201110141
Virtuoso

Whenever possible you should create the #SR via the my.vmware.com portal rather than over the phone in the first place.

 

Years ago a human picked up the phone; today it's some kind of computer voice recognition with bad quality. I have one special case where I am unable to open a support call 😞

Regards
Joerg

plannue
Contributor

Wow, just had this happen in my environment (UCS blade farm, all SD cards). Not a fun evening.

This has to be one of the worst bugs ever. It seems like every troubleshooting step in the book made the situation worse, especially when you run ls /vmfs/volumes and it freezes up both the SSH session AND the DCUI indefinitely.

Unbelievable that support won't hand out the fixed VIB without going through all sorts of hassle. Very displeasing.

LarryBlanco2
Expert

I hear ya.  We ended up rolling back to U1 because of just that.  Most of the time we are able to recover, but those few times when you can't are the ones where your boss comes to you asking what the [expletive] is going on and what you are doing to prevent it from happening again.  You can say it's the VMware SD card bug only so many times before it gets old.

 

Also, since the release of the fix kept being pushed out, we had no other choice.

 

Larry

A13x
Hot Shot

What I find more annoying is when they ask you, "When will this be fixed?"  Something like this should have been fixed much sooner.  Instead they released an unsigned patch which does work, but then prevents you from using Secure Boot.

Rolling back was not an option for us: too many hosts and too much work.

Getting the patch from VMware was not an issue; I provided them with all the info bundles, logs, and snippets, proved it was the same error, and got the patch straight away. It is possible to use a script as a workaround to bypass this, but it's not something you would do in a prod environment.

Still not as bad as a few years back, when there were VMware bugs and issues with every release.

actyler555
Enthusiast

It's absolutely inexcusable on VMware's part.  The fact that this build shipped with the USB/SD card bug is bad enough, but dragging their feet for months with no official fix leaves me speechless.  It's mounting evidence that VMware is no longer the go-to product when it comes to virtualization platforms.  There are many other options out there.

For now, I've downgraded to 6.x and will stay there until the bitter end.  Hopefully VMware will piss off enough people that version 7 actually becomes something you can comfortably deploy in a production environment at some point.

Regards,
Adam Tyler
