HCBV
Enthusiast

ESXi 7.0.2 becomes uncontrollable


Hello all,

I have found myself in a situation where one of my two ESXi hosts becomes completely uncontrollable. It is still pingable, reachable on its web interface and also reachable through VCSA. But it won't do anything: it won't power off VMs, it won't show status. Nothing at all.

The story is quite long, but I'll try to give as much background info as possible that might be related. I hope you can follow my thoughts.

When the host is uncontrollable it still allows SSH logins, but even through the CLI the VMs won't power off and cannot be killed.
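For completeness, this is roughly what I tried over SSH (the world ID below is just an example; yours will differ):

```shell
# List running VM worlds to find the stuck VM's World ID
esxcli vm process list

# Try to stop the VM, escalating from a guest-level shutdown to a hard kill
# (replace 12345 with the World ID from the list above)
esxcli vm process kill --type=soft --world-id=12345
esxcli vm process kill --type=hard --world-id=12345
esxcli vm process kill --type=force --world-id=12345
```

Even the force kill did nothing in my case.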

Rebooting does not work either; it just keeps saying that "a restart is in progress" and never continues to actually restart. To be fair: I waited 15 minutes or longer before I decided to pull the plug.

The only thing that works is to do a hard reset through IPMI.

This situation has happened 3 times in a row over a course of 6 days.

 

My setup is as follows:

2 ESXi hosts, both on ESXi 7.0.2 (the hardware vendor supports this version)

1 VCSA on version 7.0.2 u2

2 iSCSI hosts on separate VLANs, both using MPIO and round robin.

The ESXi hosts have 2 NICs: both NICs have a VMkernel port for iSCSI, and the first NIC also has a VMkernel port for management/vMotion.

Both ESXi hosts also have internal NVMe SSDs containing VMs, and these VMs are uncontrollable as well. By which I mean: I can't power them off, reboot them or shut them down. One of them even showed a blue screen while its host was uncontrollable.

 

There have been two changes in the last week, so I am now wondering which might be the cause, or whether I am just unlucky and experiencing a bug of some sort.

1) I updated from 7.0.1 (both ESXi and VCSA) to 7.0.2

2) I implemented more VLANs, especially for the management VMkernel (vMotion and management), and most of the VMs are now bound to a DPortGroup on VLAN 10 instead of the default VLAN 1. I made this DPortGroup on a DSwitch that both hosts were already attached to.

After implementing the updates and the VLANs everything worked fine for about 48 hours. Then most VMs became unresponsive/crashed, and one of the Windows domain controllers even showed a blue screen: "HAL INITIALIZATION FAILED".

I had to hard reset the ESXi hosts to make them function again. After everything was up and running again, the same thing happened about 48 hours later. I then started doubting myself and the VLAN configuration I had made, so I redid everything: I quickly moved all VMs off the frozen host and completely reinstalled it. Host B was originally hosting almost all VMs due to the prior error, so host B got the complete reinstall; I did not reinstall host A, since it hadn't frozen/become uncontrollable and was now hosting all VMs. Again about 48 hours later, host A was now the host that became uncontrollable and also needed a hard reset to make it function again. After that host was up I migrated all VMs to host B and reinstalled host A as well.

Both hosts are now reinstalled; VCSA is still as it was (upgraded from 6.x to 7.x and now to 7.0.2 U2).

I did run into a lot of trouble while updating both ESXi and VCSA through Lifecycle Manager. Ultimately both needed a manual update using the ISO files instead of the usual update procedure. I had never had to do it that way before.

I also found that after upgrading VCSA to 7.0.2 U2 I needed to upgrade the DSwitch. And somehow it kept saying "the update is still in progress" on one of the ESXi hosts. Perhaps something went wrong in this phase that might explain the problem I am having?

I am still worried that this problem might reoccur, and I have no clue what might be wrong.

It might be my own mistake in some configuration that I am not aware of. It might be some network/vlan setting. Or perhaps there is an issue with ESXi / VCSA 7.0.2?

Is there perhaps anyone with more experience and more insight into these problems? I also started thinking about installing a new VCSA and migrating to it, but that didn't work due to the version being the same. The reason I am thinking about reinstalling VCSA is that ever since upgrading VCSA from 6.x to 7.x, it has always shown a warning in vSphere Health about ESXi host connectivity.

I have verified that there is no such problem: I can see all UDP heartbeats on port 902 coming in and going out exactly every 10 seconds. I have also had a frequent/permanent low-RAM situation on VCSA since updating to VCSA 7.x, and have now added another 2 GB of RAM to VCSA, for the second time. I use the tiny installation, so it now has a total of 14 GB instead of the 10 GB I always used to dedicate to VCSA.
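For what it's worth, this is how I watched the heartbeats on the host itself (assuming vmk0 is the management VMkernel interface; adjust to your setup):

```shell
# Capture the host<->VCSA heartbeat traffic on UDP 902
# (vmk0 is assumed to be the management VMkernel port)
tcpdump-uw -i vmk0 -nn udp port 902
```

One datagram in each direction roughly every 10 seconds is what I see.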

To make my long story short: I am lost. Am I doing something wrong, or is it likely that there is a bug, a broken driver, or something else going wrong?


Accepted Solutions
A13x
Hot Shot

These are the steps I performed before the patch.

Stop vpxa and hostd; these must be stopped first:

/etc/init.d/hostd stop
/etc/init.d/vpxa stop

Check for dead paths:

esxcfg-mpath -L

Identify the SD card device and check for anything dead:

esxcli storage core device world list

Mine is vmhba32, so I run:

esxcfg-rescan -d vmhba32

Check again:

esxcli storage core device world list

If the dead entries still exist:

esxcfg-rescan -d vmhba32
esxcfg-rescan -u vmhba32

Wait at least 5 minutes because of the rescan.

Then:

/etc/init.d/hostd start
/etc/init.d/vpxa start

The trick is to clear it all, wait for a while, and only then start the services. If you do not wait long enough, or if the dead entries are not cleared, you need to do it all again.

 

That's assuming you have the SD card bug and the vmkernel log is spammed with entries like: local6.info: vmkernel: cpu24:2097581)ScsiVmas: 1057: Inquiry for VPD page 00 to device mpx.vmhba32:C0:T0:L0 failed with error Timeout
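A quick way to check whether you are hitting that signature (these are the standard ESXi log locations; the pattern is taken from the message above):

```shell
# Count occurrences of the VPD inquiry timeout in the vmkernel log
grep -c "failed with error Timeout" /var/log/vmkernel.log

# Show the most recent warnings involving the SD card device
grep "mpx.vmhba32:C0:T0:L0" /var/log/vmkwarning.log | tail -n 20
```

If the count keeps climbing, the workaround above is worth trying.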

 

 


53 Replies
vbondzio
VMware Employee

If the hosts allow access via SSH, can you try to just restart the services when this happens again? What's in the vmkwarning.log when things start to go wrong? My guess would be some driver issue (if it is either just iSCSI _or_ NVMe-based VMs crashing) that is causing hostd to fail.

HCBV
Enthusiast

I tried restarting the management service, but that did not help.

Which service should I try to restart if this happens again?

And is there a preferred CLI command to restart the services?

vbondzio
VMware Employee

If services.sh restart doesn't work, then it isn't a transient error hanging hostd but rather something that persists. Does the service restart fail anywhere, or does it complete? Do e.g. the hostd logs hang anywhere / throw any errors when you tail them? Definitely look at the vmkernel / vmkwarning logs should this happen again.
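To follow them all in one go on the host (standard ESXi 7 log locations):

```shell
# Follow hostd plus the kernel logs while reproducing the hang
tail -f /var/log/hostd.log /var/log/vmkernel.log /var/log/vmkwarning.log
```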

HCBV
Enthusiast

After the last time one of my hosts became uncontrollable it was, as mentioned before, reinstalled from scratch.

Since then it has not yet happened again. So I hope it will stay like this and no further issues will arise.

HCBV
Enthusiast

This is now happening again on one of the two hosts.

I tried restarting the management services; it did not help.

Before restarting the services I was able to log in to the host on its own web interface, but after restarting, it no longer responds. The local console won't respond anymore either.

Any suggestions?

vbondzio
VMware Employee

So the service restart goes through, and hostd is no longer responsive enough for the ESXi host client. As I said, tail the vmkernel / vmkwarning logs when this is happening; if SSH / DCUI doesn't work, check the log buffer via KVM (Alt-F12).

HCBV
Enthusiast

I managed to get the local console to respond, enabled ssh and used services.sh restart.

This went fine.

The host seems to be responding again. All VMs on the host show "disconnected" as their status. Trying to solve this now.

@vbondzio I will try to dig up the requested logs to see what is going on.

 

EDIT:

While typing this message the VMs have restored themselves? They now show a normal status again. It might have been VCSA that didn't recognize them yet.

 

Managed to tail the vmkernel log; I have zero idea what this means:

2021-04-20T10:50:08.737Z cpu2:2211223 opID=f88abd79)NetDVS: 8732: Failed to clear global property com.vmware.vswitch.pvlanMap on DVS 50 13 c7 9b 46 5e 71 6e-66 43 e4 00 0a 78 41 c3 as it is not set.
2021-04-20T10:50:08.737Z cpu2:2211223 opID=f88abd79)NetDVS: 8732: Failed to clear global property com.vmware.etherswitch.mirrorSessions on DVS 50 13 c7 9b 46 5e 71 6e-66 43 e4 00 0a 78 41 c3 as it is not set.
2021-04-20T10:50:08.737Z cpu2:2211223 opID=f88abd79)NetDVS: 8732: Failed to clear global property com.vmware.etherswitch.lacp on DVS 50 13 c7 9b 46 5e 71 6e-66 43 e4 00 0a 78 41 c3 as it is not set.
2021-04-20T10:50:08.737Z cpu2:2211223 opID=f88abd79)NetDVS: 8732: Failed to clear host property com.vmware.common.host.lacp.extraconfig on DVS 50 13 c7 9b 46 5e 71 6e-66 43 e4 00 0a 78 41 c3 as it is not set.
2021-04-20T10:50:08.738Z cpu2:2211223 opID=f88abd79)netioc: NetIOCSetRespoolVersion:237: Set netioc version for portset: DvsPortset-0 to 3,old version: 3
2021-04-20T10:50:08.738Z cpu2:2211223 opID=f88abd79)netioc: NetIOCSetupUplinkReservationThreshold:127: Set threshold for portset: DvsPortset-0 to 75, old threshold: 75
2021-04-20T10:50:08.740Z cpu2:2211223 opID=f88abd79)netioc: NetIOCPortsetNetSchedStatusSet:1203: Set sched status for portset: DvsPortset-0 to Inactive, old:Inactive
2021-04-20T10:50:08.740Z cpu2:2211223 opID=f88abd79)NetDVS: 8732: Failed to clear global property com.vmware.common.vlanmtucheck.deploy on DVS 50 13 c7 9b 46 5e 71 6e-66 43 e4 00 0a 78 41 c3 as it is not set.
2021-04-20T10:50:08.741Z cpu2:2211223 opID=f88abd79)NetDVS: 8732: Failed to clear global property com.vmware.common.teamcheck.deploy on DVS 50 13 c7 9b 46 5e 71 6e-66 43 e4 00 0a 78 41 c3 as it is not set.
2021-04-20T10:50:08.741Z cpu2:2211223 opID=f88abd79)NetDVS: 8732: Failed to clear host property com.vmware.common.host.volatile.upgrade70OrLater on DVS 50 13 c7 9b 46 5e 71 6e-66 43 e4 00 0a 78 41 c3 as it is not set.

 

And the vmkwarning:

2021-04-20T10:42:26.841Z cpu0:2208992)WARNING: NTPClock: 1404: system clock stepped to 1618915346.841535000, no longer synchronized to upstream time servers
2021-04-20T10:42:30.654Z cpu8:2210006)WARNING: VisorFS: 1091: Attempt to setattr non sticky dir/file from tar mount
2021-04-20T10:42:31.648Z cpu1:2210664)WARNING: VisorFS: 1091: Attempt to setattr non sticky dir/file from tar mount
2021-04-20T10:42:32.015Z cpu13:2210760)WARNING: VisorFS: 1091: Attempt to setattr non sticky dir/file from tar mount
2021-04-20T10:42:32.376Z cpu14:2210791)WARNING: VisorFS: 1091: Attempt to setattr non sticky dir/file from tar mount
2021-04-20T10:42:32.637Z cpu7:2210832)WARNING: VisorFS: 1091: Attempt to setattr non sticky dir/file from tar mount
2021-04-20T10:42:36.295Z cpu2:2210786)WARNING: VisorFS: 1091: Attempt to setattr non sticky dir/file from tar mount
2021-04-20T10:42:36.383Z cpu11:2210981)WARNING: VisorFS: 1091: Attempt to setattr non sticky dir/file from tar mount
2021-04-20T10:42:36.471Z cpu16:2210980)WARNING: VisorFS: 1091: Attempt to setattr non sticky dir/file from tar mount
2021-04-20T10:46:48.152Z cpu1:2209008)WARNING: NTPClock: 1711: system clock synchronized to upstream time servers

 

I have the idea that somehow the storage / storage driver might be the problem; it seems to go down first. The reason I have this idea is that I noticed my backups reported the issue first, and the message there relates to storage issues. I am now also getting update notifications for the lsi_mr3 driver on the hosts. I will look up the details of this update to see if it mentions anything that I am currently experiencing. Meanwhile I will look further into retrieving log files for diagnosis; I am not experienced in that yet.

HCBV
Enthusiast

It's happening again, but now on the other host (my 2nd host) in my 2-node cluster.

The first thing I noticed is that Veeam backups are failing with the error:

Getting VM info from vSphere
Error: NFC storage connection is unavailable. Storage: [stg:datastore-425,nfchost:host-424,conn:10.0.1.40]. Storage display name: [DSI2]. Failed to create NFC download stream. NFC path: [nfc://conn:10.0.1.40,nfchost:host-424,stg:datastore-425@DC01/DC01.vmx].

The storage (DSI2) is the host's internal NVMe disk that only hosts a domain controller.

I am trying to get the aforementioned log files, because it seems that last time I was too late: the log only contained the last 24 hours, and the error occurred during the weekend.

For now: I can't retrieve the logs yet; SCP isn't working with the host.

Restarting the services did not work this time; the host stays unreachable through VCSA and all VMs show "disconnected". Also, after restarting all services, I am still unable to grab the log file with SCP. It says "stalled" after about 3 minutes.

Edit:

Restarting the management network and management agents, both through KVM and using the DCUI, made SCP work again so I could retrieve the log files.

HCBV
Enthusiast

Managed to review the vmkwarning log. It seems that there is a problem with the boot device? (vmhba32 is my dual SD card.)

2021-04-20T23:20:56.648Z cpu0:2097380)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba32:C0:T0:L0" state in doubt; requested fast path state update...
2021-04-20T23:20:59.359Z cpu0:2097380)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba32:C0:T0:L0" state in doubt; requested fast path state update...
2021-04-20T23:20:59.359Z cpu10:2097340)WARNING: HBX: 2847: 'LOCKER-5fb930c9-d2403718-3a91-f86eee1a35ca': HB at offset 3145728 - Failed to clear journal from HB: Timeout:
2021-04-20T23:20:59.359Z cpu10:2097340)WARNING: [HB state abcdef02 offset 3145728 gen 510819 stampUS 963596501181 uuid 6070adb3-755f08cc-a203-f86eee1a35ca jrnl <FB 0> drv 24.82 lockImpl 1 ip 10.0.1.52]
2021-04-20T23:20:59.359Z cpu10:2097340)WARNING: Vol3: 2953: 'LOCKER-5fb930c9-d2403718-3a91-f86eee1a35ca': Failed to clear journal address in on-disk HB. This could result in leak of journal block at <type 6 addr 0>.
2021-04-20T23:21:14.923Z cpu6:2097380)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba32:C0:T0:L0" state in doubt; requested fast path state update...
2021-04-20T23:21:39.361Z cpu0:2097380)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba32:C0:T0:L0" state in doubt; requested fast path state update...
2021-04-20T23:21:54.923Z cpu1:2097380)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba32:C0:T0:L0" state in doubt; requested fast path state update...

2021-04-20T23:28:19.380Z cpu0:2100160)WARNING: Fil3: 1534: Failed to reserve volume f533 28 1 5fb930c9 d2403718 6ef83a91 ca351aee 0 0 0 0 0 0 0

From this point on in the log the last message keeps repeating. I checked to see what vmhba32 is, and it is the USB storage controller; I believe this is the dual SD card that ESXi is installed on. I can't get the host to function again yet. I want to move the VMs and then install new SD cards to see if this helps.
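For anyone who wants to double-check which driver and device sit behind vmhba32 on their own host (these list commands are generic, not specific to my hardware):

```shell
# List all storage adapters with their drivers; on an SD/USB boot device
# vmhba32 typically shows the vmkusb driver
esxcli storage core adapter list

# Alternative view including the adapter description
esxcfg-scsidevs -a
```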

 

Edit: after rebooting the host I see a lot of new messages in vmkwarning with the following text:

VisorFS: 1091: Attempt to setattr non sticky dir/file from tar mount

HCBV
Enthusiast

The problems still persist.

This time I am not able to fix the situation by restarting services and/or restarting the management agents.

During services.sh restart certain services fail:

- vvold stop failed with status 3

- vaai-nasd stop failed with status 3

All other services do start fine after multiple tries, but these two keep failing. I haven't found out yet what "status 3" means.

Is it possible that the boot device (dual SD card module) has anything to do with this situation? Or is the boot device only involved during the initial boot, with the services being restarted from memory?

Hopefully there is someone out here that can help me pinpoint the problem.

HCBV
Enthusiast

Just received a huge number of patches, mostly related to storage drivers. Could this have anything to do with the issues I am experiencing?

I have not been able to find a website with more info about these patches; is this available somewhere? I would like to read up on the fixes to see if anything resembles the problems I am experiencing.

 

 

The number of patch definitions downloaded (critical/total): ESX: 0/58. All entries below have Impact: Important and Products: embeddedEsx 7.0, with Release date 2021-04-09 unless noted otherwise:

- Broadcom-ELX-IMA-plugin_12.0.1200.0-4vmw.702.0.0.17867351 - IMA plugin for elxiscsi driver
- Broadcom-ELX-brcmfcoe_12.0.1500.1-2vmw.702.0.0.17867351 - Emulex FCoE Driver
- Broadcom-ELX-brcmnvmefc_12.8.298.1-1vmw.702.0.0.17867351 - Broadcom NVMeF FC Driver
- Broadcom-ELX-lpfc_12.8.298.3-1vmw.702.0.0.17867351 - Emulex FC Driver
- Broadcom-bnxt-Net-RoCE_216.0.0.0-1vmw.702.0.0.17867351 - Broadcom NetXtreme-E Network and ROCE/RDMA Drivers for VMware ESXi
- Broadcom-elxiscsi_12.0.1200.0-8vmw.702.0.0.17867351 - iSCSI driver for VMware ESX
- Broadcom-elxnet_12.0.1250.0-5vmw.702.0.0.17867351 - net driver for VMware ESX
- Broadcom-lpnic_11.4.62.0-1vmw.702.0.0.17867351 - net driver for VMware ESX
- Broadcom-lsi-mr3_7.716.03.00-1vmw.702.0.0.17867351 - Broadcom Native MegaRAID SAS
- Broadcom-lsi-msgpt2_20.00.06.00-3vmw.702.0.0.17867351 - Avago (LSI) Native 6Gbps SAS MPT Driver
- Broadcom-lsi-msgpt35_17.00.02.00-1vmw.702.0.0.17867351 - Broadcom Native 12Gbps SAS/PCIe MPT Driver
- Broadcom-lsi-msgpt3_17.00.10.00-2vmw.702.0.0.17867351 - Avago (LSI) Native 12Gbps SAS MPT Driver
- Broadcom-lsiv2-drivers-plugin_1.0.0-5vmw.702.0.0.17867351 - LSI NATIVE DRIVERS LSU Management Plugin
- Broadcom-ntg3_4.1.5.0-0vmw.702.0.0.17867351 - Broadcom NetXtreme I ESX VMKAPI ethernet driver
- Cisco-nenic_1.0.33.0-1vmw.702.0.0.17867351 - Cisco VIC Native driver for VMware ESX
- Cisco-nfnic_4.0.0.63-1vmw.702.0.0.17867351 - Cisco UCS VIC Native fNIC driver
- ESXi70U2a-17867351 (Release date 2021-04-29) - VMware ESXi 7.0.2 Patch Release
- ESXi_7.0.2-0.0.17867351 - ESXi Component - core ESXi VIBs
- HPE-hpv2-hpsa-plugin_1.0.0-3vmw.702.0.0.17867351 - HPSA LSU Management Plugin
- HPE-nhpsa_70.0051.0.100-2vmw.702.0.0.17867351 - HPSA native driver
- Intel-NVMe-Vol-Mgmt-Dev-Plugin_2.0.0-2vmw.702.0.0.17867351 - INTEL VMD LSU Management Plugin
- Intel-SCU-rste_2.0.2.0088-7vmw.702.0.0.17867351 - rste: SCU SAS/SATA for VMware ESX
- Intel-Volume-Mgmt-Device_2.0.0.1152-1vmw.702.0.0.17867351 - Intel NVME Driver with VMD Technology
- Intel-i40en_1.8.1.136-1vmw.702.0.0.17867351 - Network driver for Intel(R) X710/XL710/X722 Adapters
- Intel-igbn_1.4.11.2-1vmw.702.0.0.17867351 - Network driver for Intel(R) Gigabit Server Adapters
- Intel-irdman_1.3.1.19-1vmw.702.0.0.17867351 - Network driver for Intel(R) X722 and E810 based RDMA Adapters
- Intel-ixgben_1.7.1.35-1vmw.702.0.0.17867351 - Network driver for Intel(R) 10 Gigabit NIC
- Intel-ne1000_0.8.4-11vmw.702.0.0.17867351 - Networking Driver for Intel PRO/1000 Family Adapters
- MRVL-Atlantic-Driver-Bundle_1.0.3.0-8vmw.702.0.0.17867351 - Marvell AQtion Ethernet Controllers Network driver for VMware ESXi
- MRVL-E3-Ethernet-iSCSI-FCoE_1.0.0.0-1vmw.702.0.0.17867351 - QLogic NetXtreme II 10 Gigabit Ethernet FCoE and iSCSI E3 Drivers for VMware ESXi
- MRVL-E3-Ethernet_1.1.0.11-1vmw.702.0.0.17867351 - Network driver for QLogic NetXtreme II PCI/PCIe Gigabit Ethernet Adapters
- MRVL-E4-CNA-Driver-Bundle_1.0.0.0-1vmw.702.0.0.17867351 - QLogic FastLinQ 10/25/40/50/100 GbE Ethernet and RoCE/RDMA Drivers for VMware ESXi
- MRVL-QLogic-FC_4.1.14.0-5vmw.702.0.0.17867351 - Qlogic Native FC driver
- Mellanox-nmlx4_3.19.16.8-2vmw.702.0.0.17867351 - Mellanox Technologies ConnectX-3/Pro Core Ethernet and RoCE Drivers for VMware ESXi
- Mellanox-nmlx5_4.19.16.10-1vmw.702.0.0.17867351 - Mellanox Technologies ConnectX-4/5 Core Ethernet and RoCE Drivers for VMware ESXi
- Microchip-smartpqi_70.4000.0.100-6vmw.702.0.0.17867351 - SmartPqi Native driver
- Microchip-smartpqiv2-plugin_1.0.0-6vmw.702.0.0.17867351 - SMARTPQI LSU Management Plugin
- Micron-mtip32xx-native_3.9.8-1vmw.702.0.0.17867351 - P32x/P42x PCIe SSD
- Solarflare-NIC_2.4.0.2010-4vmw.702.0.0.17867351 - Networking driver for Solarflare XtremeScale SFC9xxx Ethernet Controller
- VMware-NVMe-PCIe_1.2.3.11-1vmw.702.0.0.17867351 - Non-Volatile memory controller driver
- VMware-NVMeoF-RDMA_1.0.2.1-1vmw.702.0.0.17867351 - VMware NVME over RDMA Driver
- VMware-VM-Tools_11.2.5.17337674-17867351 - ESXi Tools Component
- VMware-ahci_2.0.9-1vmw.702.0.0.17867351 - VMware Native AHCI Driver
- VMware-icen_1.0.0.10-1vmw.702.0.0.17867351 - Network driver for Intel(R) E810 Adapters
- VMware-iser_1.1.0.1-1vmw.702.0.0.17867351 - VMware Native iSER Driver
- VMware-nvme-pcie-plugin_1.0.0-1vmw.702.0.0.17867351 - NVME PCIe LSU Management Plugin
- VMware-nvme-plugin_1.2.0.42-1vmw.702.0.0.17867351 - esxcli plugin for VMware nvme driver
- VMware-nvmxnet3-ens_2.0.0.22-1vmw.702.0.0.17867351 - Network ENS driver for VMware vmxnet3 Virtual Ethernet Controller
- VMware-nvmxnet3_2.0.0.30-1vmw.702.0.0.17867351 - Network driver for VMware vmxnet3 Virtual Ethernet Controller
- VMware-oem-dell-plugin_1.0.0-1vmw.702.0.0.17867351 - OEM DELL LSU Management Plugin
- VMware-oem-hp-plugin_1.0.0-1vmw.702.0.0.17867351 - OEM HP LSU Management Plugin
- VMware-oem-lenovo-plugin_1.0.0-1vmw.702.0.0.17867351 - OEM LENOVO LSU Management Plugin
- VMware-pvscsi_0.1-2vmw.702.0.0.17867351 - PVSCSI Driver
- VMware-vmkata_0.1-1vmw.702.0.0.17867351 - ATA Driver
- VMware-vmkfcoe_1.0.0.2-1vmw.702.0.0.17867351 - Native Software FCoE Driver for VMware ESX
- VMware-vmkusb_0.1-1vmw.702.0.0.17867351 - USB Driver
- esx-update_7.0.2-0.0.17867351 - ESXi Install/Upgrade Component
- TOOLS-17901792 (Release date 2021-04-29, Products: embeddedEsx 7.0.*) - VMware Tools 11.2.6 Async Release

A13x
Hot Shot

I have noticed the exact same issue with hosts I have deployed as well. The only workaround I have found is a fresh install of the Dell EMC customized VMware ESXi 7.0 U2 A01 image, which was released a few days ago. It solves the stability and NSX-V DFW netcpa issues.

I am working on another workaround, as I do not wish to rebuild all hosts.

HCBV
Enthusiast

I have also found this to be the only fix so far.

A full reinstall of the ESXi host does work, so far.

ICTSystems
Contributor

I think this might be a "known issue"; we ran through a similar issue with support.

We have had two hosts over a few weeks stop sending stats to vCenter first. If you leave it (which we did the second time; the first time we tried restarting the management services, which made the host go into a disconnected state within a few minutes), it disconnects from vCenter and then you get isolated guests. SSH may or may not work depending on some random factor, and if you do get on, trying to cat log files to see what is going on with the host hangs. To resolve this we have to shut down all guests through their OS and then hard reset the host, which by then will not restart via the console screen option; as the original poster says, it just hangs. When the host goes offline the guests can be restarted on other hosts.

I have been told by support that it is because communication with the SD card fails, and that the fix is in the next update to VMware.

A13x
Hot Shot

Contact VMware and request VMW_bootbank_vmkusb_0.1-2vmw.702.0.20.45179358.vib

This fixes the SD card issue.
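In case it helps others, installing a single VIB from a datastore looks roughly like this (the datastore path is just an example; support will give you the exact file and procedure for the debug VIB):

```shell
# Copy the VIB to a datastore first, then install it by absolute path
esxcli software vib install -v /vmfs/volumes/datastore1/VMW_bootbank_vmkusb_0.1-2vmw.702.0.20.45179358.vib

# Reboot the host afterwards so the new vmkusb driver is loaded
reboot
```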

ICTSystems
Contributor

Thanks A13x, good info.

We did some googling based on our problems and the responses to this thread to see how widespread our issue might be, and it does indeed seem to be quite a "thing" for 7.0.2.

E.g.:

FYI vmkusb is buggy in 7.x (local storage failure if USB/SD-based) (vpnwp.com)

I have asked VMware support for comment on your suggested VIB update. We are running on HPE Synergy-based hardware, so I wonder if this is a globally used VIB or one that doesn't carry over to the custom builds HPE/Dell/whatever create.

A13x
Hot Shot

I've used the same VIB on my HPE and Dell hosts, and the SD card storage issue has been fixed. There is a temporary workaround that involves rescanning vmhba32 and stopping hostd and vpxa to bring the host back online. If you do not install the VIB, you will be doing this each time the host drops off.

HCBV
Enthusiast

What a coincidence: this issue did not reappear for me after completely reinstalling all hosts, until this past night. My main host is now unresponsive, and it hosts the most important parts of the virtualized environment.

I tried restarting the services to gain access like I did before; previously that allowed me to migrate VMs to another host, but this time it does not work.

Can someone please elaborate on the commands needed for the hostd/vpxa workaround? I will try to open a support case and request the VIB file in the meantime.

EDIT: I tried /etc/init.d/hostd restart and /etc/init.d/vpxa restart but this does not seem to do the trick yet. Perhaps I misunderstood.

ICTSystems
Contributor

Again timely: we had another host failure last night, slightly different this time. The host just disconnected in vCenter, but with no loss of stats from the host in vCenter before the disconnect. I tried the rescan idea and it didn't work; after a long time the command prompt returned, I restarted the services, and there was no change in the host's status in vCenter. I tried the HBA scan again and got the message "Connection failed"...

Performing a rescan of the storage on an ESXi host (1003988) (vmware.com)

Restarting the Management agents in ESXi (1003490) (vmware.com)

I have also talked to VMware support about the suggested VIB to install, and it is a debug version that is issued on a case-by-case basis by VMware support, it seems, after investigation of the individual case.