- Recently upgraded hosts to 6.7 and vCenter to 6.7a
- Hosts are 'not responding' in vCenter Server
- Can ping
- Cannot access the web interface or log in via SSH
- Can access the console, but after entering login credentials and pressing Enter it freezes (the cursor keeps blinking)
- If you remove the host from the inventory and shut down a virtual machine on the host, everything comes back online and the host can be re-added to vCenter
- Four identical hosts; this has happened on three of the four (twice on one)
- The host that has had this issue twice will not come back after trying the above method and is completely unresponsive at the console
And what do you see in vmkernel.log and hostd.log on the affected hosts?
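For anyone else hitting this, both logs live at the standard ESXi paths, and grep keeps the noise manageable. A minimal sketch, shown against a single sample line in the hostd format so it can run anywhere; on a host you would point grep at the real files:

```shell
# On an ESXi host the relevant files are /var/log/vmkernel.log and /var/log/hostd.log,
# e.g.:  grep -E ' (error|warning) ' /var/log/hostd.log | tail -n 20
# Demonstrated here against one sample hostd-style line:
line='2018-05-29T09:29:30.273Z error hostd[2099052] [Originator@6876 sub=Cimsvc] IPMI SEL unavailable'
echo "$line" | grep -cE ' (error|warning) '   # prints 1 (one matching line)
```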
How did you perform the upgrade?
I updated the hosts via Update Manager.
I pulled the logs from one of the hosts I was able to get back online. Here's what was in hostd:
--> [context]zKq7AVICAgAAAMKpfAAVaG9zdGQAAHyZNWxpYnZtYWNvcmUuc28AAADAGwBgsBcBWbxkaG9zdGQAAS5JzIKK4QABbGlidmltLXR5cGVzLnNvAANnIA9saWJ2bW9taS5zbwADTCwPA4pKHAMdmRwDAaIcAxlRHAPwZA0DbNoPA3SgHwH148EAJTAoAAM0KAA7DzYEa4AAbGlicHRocmVhZC5zby4wAAXtmg5saWJjLnNvLjYA[/context]
count_events: starting communication with bmc over ipmi driver
count_events: GET_SEL_REPO_INFO returned {version: 0x51, count 41, free 15728,add_stamp 1380738318, erase_stamp 1358956536 op_support 2}
IPMI SEL sync took 0 seconds 0 sel records, last 41
2018-05-29T09:29:30.273Z error hostd[2099052] [Originator@6876 sub=Cimsvc] IPMI SEL unavailable
2018-05-29T09:29:30.274Z warning hostd[2099762] [Originator@6876 sub=Hostsvc.VFlashManager opID=e3ea772f] GetVFlashResourceRuntimeInfo: vFlash is not licensed, not supported
2018-05-29T09:29:59.882Z warning hostd[2099938] [Originator@6876 sub=Hostsvc.Tpm20Provider opID=e3ea776e user=root] Unable to retrieve TPM/TXT status. TPM functionality will be unavailable. Failure reason: Unable to get node: Sysinfo error: Not foundSee VMkernel log for details..
2018-05-29T09:29:59.918Z error hostd[2099938] [Originator@6876 sub=Hostsvc.VFlashManager opID=e3ea776e user=root] CheckLicense: vFlash is not licensed. error = [N5Vmomi9DataArrayINS_18LocalizableMessageEEE:0x000000b0b88b7180]
2018-05-29T09:29:59.923Z warning hostd[2099938] [Originator@6876 sub=Hostsvc.Tpm20Provider opID=e3ea776e user=root] Unable to retrieve TPM/TXT status. TPM functionality will be unavailable. Failure reason: Unable to get node: Sysinfo error: Not foundSee VMkernel log for details..
2018-05-29T09:29:59.964Z warning hostd[2099938] [Originator@6876 sub=Hostsvc.VFlashManager opID=e3ea776e user=root] GetVFlashResourceRuntimeInfo: vFlash is not licensed, not supported
2018-05-29T09:29:59.968Z warning hostd[2099938] [Originator@6876 sub=Hostsvc.VFlashManager opID=e3ea776e user=root] GetVFlashResourceRuntimeInfo: vFlash is not licensed, not supported
2018-05-29T09:30:00.032Z warning hostd[2099885] [Originator@6876 sub=Statssvc] Calculated write I/O size 589477 for scsi0:0 is out of range -- 589477,prevBytes = 27990022656 curBytes = 28010064896 prevCommands = 1280828curCommands = 1280862
2018-05-29T09:30:00.565Z error hostd[2099053] [Originator@6876 sub=PropertyProvider opID=e3ea7773 user=root] Unexpected fault reading property: 000000b0622e1da0, IsSourceAvailable: N5Vmomi5Fault12NotSupported9ExceptionE(Fault cause: vmodl.fault.NotSupported
--> )
And here's what was in vmkernel:
2018-06-01T21:51:50.858Z cpu4:2386360)MemSchedAdmit: 477: uw.2386360 (827751) extraMin/extraFromParent: 33/33, sioc (809) childEmin/eMinLimit: 14066/14080
2018-06-01T21:51:50.858Z cpu4:2386360)MemSchedAdmit: 470: Admission failure in path: sioc/storageRM.2386360/uw.2386360
2018-06-01T21:51:50.858Z cpu4:2386360)MemSchedAdmit: 477: uw.2386360 (827751) extraMin/extraFromParent: 256/256, sioc (809) childEmin/eMinLimit: 14066/14080
2018-06-01T21:51:50.940Z cpu1:2387625)ScsiVsi: 2899: Can't set the maxPathQueueDepth value to more than device advertised maxPathQueueDepth 128
Bump. Two of the hosts have gone into an unresponsive state again.
At this point you should open an SR and have VMware investigate.
Did you get a resolution to this problem?
We opened a case with VMware last year and they were unable to find the root cause.
We have been battling this for the past year, ever since our 6.5 upgrade. It is quite intermittent; 6-7 hosts affected in total.
Here is our current environment to compare.
ESXi 6.5.0, 8935087
Cisco UCS B200 M4 latest drivers and UCS blade package 3.2(3d)
nenic - 1.0.16.0
fnic - 1.6.0.37
Backup software Veeam 9.5.0.1922
Thank you,
Phil
I have what sounds like the same issue. Hosts are non-responsive, but the VMs seem OK. One host locked up after I entered the root password and is still stuck on the password screen; the Alt-F# keys work, but nothing else. On another host I got logged on, but once I reached the troubleshooting screen it locked up. If I can get that far, restarting the management agents works, but getting there is the problem. I have tried connecting with PowerCLI, but Connect-VIServer times out.
Sometimes the lockup on the console will suddenly unfreeze on its own, and I can then get to the management agent restart and bring the host back up. No clue as to what triggers either the problem or the console lockup.
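When SSH does respond, the management agents can also be restarted from a shell session instead of the DCUI menu. A sketch of the usual 6.x commands (these only exist on an ESXi host, so treat this as a config fragment):

```shell
# Restart just the two agents vCenter talks to (roughly what the DCUI's
# "Restart Management Agents" option does):
/etc/init.d/hostd restart
/etc/init.d/vpxa restart

# Or restart all management services at once (more disruptive):
# services.sh restart
```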
In my case, I just upgraded to the latest patches of 6.5 U2 with the Hyperthreading Mitigation features. I have set the flag, and so far problems have only happened on hosts that have had the flag set but have not yet rebooted. It is still too early to tell if this is a coincidence. I am pushing through the reboots as fast as I can to eliminate this as a factor; I still have 16 hosts to go. I set the flag via script three days ago and am still doing reboots (a weekend intervened).
I'm seeing similar errors (thousands & thousands of them; 8 lines every 30 seconds) and I also have a 6.7 host upgraded from 6.5U2.
The host works fine though (mostly). I do have some strange intermittent connectivity issues with a web application running on one of the VMs.
This is an HP DL380 Gen9, and the similar errors I'm seeing are:
"2018-09-06T15:41:52.976Z cpu10:2099148)MemSchedAdmit: 470: Admission failure in path: nicmgmtd/nicmgmtd.2099148/uw.2099148"
"2018-09-06T15:41:52.976Z cpu10:2099148)MemSchedAdmit: 477: uw.2099148 (9114) extraMin/extraFromParent: 117/117, nicmgmtd (806) childEmin/eMinLimit: 2479/2560"
Your post is the only thing I hit when searching.
I disconnected one of the NIC cards that I hoped was associated with the errors, and the errors stopped for several hours, but then started back up...
You are not alone. We have the same issue on newly installed Dell PowerEdge R640 vSAN Ready Nodes with a clean 6.7 installed from scratch. Some of our CentOS 7 guests (latest patches and open-vm-tools) suddenly just start dropping off. The guests and the hosts seem fine, but we have zero connectivity on certain interfaces. For example, on some, the management interfaces work fine, but the services/Internet interfaces drop off and have no connectivity.
I've opened an SR, and hope VMware comes back with something soon.
Hello guys,
The same issue here on 10 of our ESXi 6.7 hosts on DL380 Gen10 with vSAN.
2018-09-18T13:23:34.015Z cpu25:2100568)MemSchedAdmit: 470: Admission failure in path: nicmgmtd/nicmgmtd.2100568/uw.2100568
2018-09-18T13:23:34.015Z cpu25:2100568)MemSchedAdmit: 477: uw.2100568 (12331) extraMin/extraFromParent: 186/186, nicmgmtd (796) childEmin/eMinLimit: 2443/2560
About 1-2 lines per second in /var/log/vmkernel.log.
Any progress on the SR / any statements from VMware?
Please share your info.
Thx!
Regards,
JK
Which vendor's HBA is in there?
Try changing the queue depth to 64 and rebooting the host;
you can follow this KB.
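For reference, queue depth is set via HBA driver module parameters. A hedged sketch for the Cisco fnic driver mentioned earlier in the thread; the module and parameter names differ per vendor (Emulex and QLogic use different ones, per the KB), so check yours first:

```shell
# Set the fnic per-LUN queue depth to 64 (Cisco UCS fnic driver):
esxcli system module parameters set -m fnic -p fnic_max_qdepth=64

# Verify the parameter took, then reboot the host for it to take effect:
esxcli system module parameters list -m fnic | grep fnic_max_qdepth
```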
We were using the built-in Broadcom quad-port GigT NICs, but switched the traffic to an HP FLR 10GigT Intel-based card (simply a guess, given Broadcom's driver track record).
I haven't disabled the Broadcom cards entirely, just moved all the traffic to the other NICs, but the errors have continued to fill the logs, and we still have intermittent connectivity/responsiveness issues with one of the hosts...
Thanks for the idea Rajeev,
We're not currently using the NIC types mentioned in that KB article.
After disabling the embedded Broadcom quad-port NIC last Saturday, the "admission failure" messages all stopped that day and have not returned, for what that's worth.
I haven't collected any new feedback from users about the intermittent connectivity issues yet, so I don't know if that helped anything beyond getting rid of log bloat...
Hello,
I found a similar case with "admission failure" messages reported. Can you try disabling netqueue on the card?
esxcli network nic queue loadbalancer set --rsslb=off -n vmnicX
Thanks,
James
It looks like we had (have?) the same issue. Two ESXi hosts on two separate occasions have locked up in the way you have described. VMKERNEL is full of this error:
ScsiVsi: 2899: Can't set the maxPathQueueDepth value to more than device advertised maxPathQueueDepth 128.
We put in a ticket with VMware, but they were unable to resolve it. They suggested executing an NMI when it locks up so that it generates a kernel dump.
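If the host is wedged hard, the NMI usually has to come from out-of-band management. One common way is via IPMI (the BMC address and credentials below are placeholders; iLO and iDRAC web UIs also expose a "Generate NMI" button that does the same thing):

```shell
# Send a diagnostic interrupt (NMI) to the host through its BMC,
# which makes a hung ESXi host purple-screen and write a kernel dump:
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <pass> chassis power diag
```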
Hello,
This is a known issue with 6.7. We were able to collect an NMI vmkernel dump and share it with the engineering team, which is currently working on it.
Will share updates as soon as we hear anything.
Thanks,
MS
Did they ever respond? I just updated three of my hosts to 6.7 and am experiencing the same issues. Occasionally my hosts flip to 'not responding' in vCenter, and I'm able to correct it by restarting the host agents. I'm wondering if it has to do with the HTAware Mitigation, so I will disable it for now and see if the situation improves.
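In case it helps anyone testing the same theory, the 6.7 hyperthreading mitigation is a kernel boot option that can be checked and toggled with esxcli (a reboot is required for a change to take effect; this is a host-only config fragment):

```shell
# Show the current L1TF hyperthreading mitigation setting:
esxcli system settings kernel list -o hyperthreadingMitigation

# Disable it for testing (takes effect after the next reboot):
esxcli system settings kernel set -s hyperthreadingMitigation -v FALSE
```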
Hardware is Cisco UCS B200 M3s with Intel E5-2660 v2 and FW 4.0(1b). Storage is all iSCSI over the VIC 1240 NICs (VNX and Nimble arrays). Hardware, FW, and drivers all match up to the VMware HCL.
No, that is a different issue. In this case you cannot restart the management agents, and the DCUI also hangs. You might be encountering a different issue, I guess. Better to open a support request.
Hi, did you get a resolution to this issue from VMware yet? I'm having the same issue on 2 hosts. We have over 100 other hosts that are OK.