- Recently upgraded hosts to 6.7 and vCenter to 6.7a
- Hosts are 'not responding' in vCenter Server
- Can ping
- Cannot access the web interface or log in via SSH
- Can access the console, but after entering login credentials and pressing Enter it freezes (the cursor keeps blinking)
- If you remove the host from the inventory and shut down a virtual machine on the host, everything comes back online and the host can be re-added to vCenter
- Four identical hosts; this has happened on three of the four (twice on one)
- The host that has had this issue twice will not come back after trying the above method and is completely unresponsive at the console
And what do you see in vmkernel.log and hostd.log on the affected hosts?
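For anyone else hitting this, both logs live at the standard ESXi paths, and grep keeps the noise manageable. A minimal sketch, shown against a single sample line in the hostd format so it can run anywhere; on a host you would point grep at the real files:

```shell
# On an ESXi host the relevant files are /var/log/vmkernel.log and /var/log/hostd.log,
# e.g.:  grep -E ' (error|warning) ' /var/log/hostd.log | tail -n 20
# Demonstrated here against one sample hostd-style line:
line='2018-05-29T09:29:30.273Z error hostd[2099052] [Originator@6876 sub=Cimsvc] IPMI SEL unavailable'
echo "$line" | grep -cE ' (error|warning) '   # prints 1 (one matching line)
```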
How did you perform the upgrade?
I updated the hosts via Update Manager.
I pulled the logs from one of the hosts I was able to get back online. Here's what was in hostd:
--> [context]zKq7AVICAgAAAMKpfAAVaG9zdGQAAHyZNWxpYnZtYWNvcmUuc28AAADAGwBgsBcBWbxkaG9zdGQAAS5JzIKK4QABbGlidmltLXR5cGVzLnNvAANnIA9saWJ2bW9taS5zbwADTCwPA4pKHAMdmRwDAaIcAxlRHAPwZA0DbNoPA3SgHwH148EAJTAoAAM0KAA7DzYEa4AAbGlicHRocmVhZC5zby4wAAXtmg5saWJjLnNvLjYA[/context]
count_events: starting communication with bmc over ipmi driver
count_events: GET_SEL_REPO_INFO returned {version: 0x51, count 41, free 15728,add_stamp 1380738318, erase_stamp 1358956536 op_support 2}
IPMI SEL sync took 0 seconds 0 sel records, last 41
2018-05-29T09:29:30.273Z error hostd[2099052] [Originator@6876 sub=Cimsvc] IPMI SEL unavailable
2018-05-29T09:29:30.274Z warning hostd[2099762] [Originator@6876 sub=Hostsvc.VFlashManager opID=e3ea772f] GetVFlashResourceRuntimeInfo: vFlash is not licensed, not supported
2018-05-29T09:29:59.882Z warning hostd[2099938] [Originator@6876 sub=Hostsvc.Tpm20Provider opID=e3ea776e user=root] Unable to retrieve TPM/TXT status. TPM functionality will be unavailable. Failure reason: Unable to get node: Sysinfo error: Not foundSee VMkernel log for details..
2018-05-29T09:29:59.918Z error hostd[2099938] [Originator@6876 sub=Hostsvc.VFlashManager opID=e3ea776e user=root] CheckLicense: vFlash is not licensed. error = [N5Vmomi9DataArrayINS_18LocalizableMessageEEE:0x000000b0b88b7180]
2018-05-29T09:29:59.923Z warning hostd[2099938] [Originator@6876 sub=Hostsvc.Tpm20Provider opID=e3ea776e user=root] Unable to retrieve TPM/TXT status. TPM functionality will be unavailable. Failure reason: Unable to get node: Sysinfo error: Not foundSee VMkernel log for details..
2018-05-29T09:29:59.964Z warning hostd[2099938] [Originator@6876 sub=Hostsvc.VFlashManager opID=e3ea776e user=root] GetVFlashResourceRuntimeInfo: vFlash is not licensed, not supported
2018-05-29T09:29:59.968Z warning hostd[2099938] [Originator@6876 sub=Hostsvc.VFlashManager opID=e3ea776e user=root] GetVFlashResourceRuntimeInfo: vFlash is not licensed, not supported
2018-05-29T09:30:00.032Z warning hostd[2099885] [Originator@6876 sub=Statssvc] Calculated write I/O size 589477 for scsi0:0 is out of range -- 589477,prevBytes = 27990022656 curBytes = 28010064896 prevCommands = 1280828curCommands = 1280862
2018-05-29T09:30:00.565Z error hostd[2099053] [Originator@6876 sub=PropertyProvider opID=e3ea7773 user=root] Unexpected fault reading property: 000000b0622e1da0, IsSourceAvailable: N5Vmomi5Fault12NotSupported9ExceptionE(Fault cause: vmodl.fault.NotSupported
--> )
And here's what was in vmkernel:
2018-06-01T21:51:50.858Z cpu4:2386360)MemSchedAdmit: 477: uw.2386360 (827751) extraMin/extraFromParent: 33/33, sioc (809) childEmin/eMinLimit: 14066/14080
2018-06-01T21:51:50.858Z cpu4:2386360)MemSchedAdmit: 470: Admission failure in path: sioc/storageRM.2386360/uw.2386360
2018-06-01T21:51:50.858Z cpu4:2386360)MemSchedAdmit: 477: uw.2386360 (827751) extraMin/extraFromParent: 256/256, sioc (809) childEmin/eMinLimit: 14066/14080
2018-06-01T21:51:50.940Z cpu1:2387625)ScsiVsi: 2899: Can't set the maxPathQueueDepth value to more than device advertised maxPathQueueDepth 128
Bump. Two of the hosts have gone into an unresponsive state again.
At this point you should open an SR and have VMware investigate.
Did you get a resolution to this problem?
We opened a case with VMware last year and they were unable to find the root cause.
We have been battling this for the past year, ever since our 6.5 upgrade. It is quite intermittent; 6-7 hosts affected in total.
Here is our current environment to compare.
ESXi 6.5.0, 8935087
Cisco UCS B200 M4 latest drivers and UCS blade package 3.2(3d)
nenic - 1.0.16.0
fnic - 1.6.0.37
Backup software Veeam 9.5.0.1922
Thank you,
Phil
I have what sounds like the same issue. Hosts are non-responsive, but the VMs seem OK. One host locked up after I entered the root password and is still stuck on the password screen; the Alt-F# keys work, but nothing else. On another host I got logged on, but once I reached the troubleshooting screen it locked up. If I can get that far, restarting the management agents works, but getting there is the problem. I have tried connecting with PowerCLI, but Connect-VIServer times out.
Sometimes the lockup on the console will suddenly unfreeze on its own, and I can then get to the management agent restart and bring the host back up. No clue as to what triggers either the problem or the console lockup.
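When SSH does respond, the management agents can also be restarted from a shell session instead of the DCUI menu. A sketch of the usual 6.x commands (these only exist on an ESXi host, so treat this as a config fragment):

```shell
# Restart just the two agents vCenter talks to (roughly what the DCUI's
# "Restart Management Agents" option does):
/etc/init.d/hostd restart
/etc/init.d/vpxa restart

# Or restart all management services at once (more disruptive):
# services.sh restart
```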
In my case, I just upgraded to the latest patches of 6.5 U2 with the Hyperthreading Mitigation features. I have set the flag, and so far problems have only happened on hosts that have had the flag set but have not yet rebooted. It is still too early to tell if this is a coincidence. I am pushing through the reboots as fast as I can to eliminate this as a factor; I still have 16 hosts to go. I set the flag via script three days ago and am still doing reboots (a weekend intervened).
I'm seeing similar errors (thousands & thousands of them; 8 lines every 30 seconds) and I also have a 6.7 host upgraded from 6.5U2.
The host works fine though (mostly). I do have some strange intermittent connectivity issues with a web application running on one of the VMs.
This is an HP DL380 Gen9, and the similar errors I'm seeing are:
"2018-09-06T15:41:52.976Z cpu10:2099148)MemSchedAdmit: 470: Admission failure in path: nicmgmtd/nicmgmtd.2099148/uw.2099148"
"2018-09-06T15:41:52.976Z cpu10:2099148)MemSchedAdmit: 477: uw.2099148 (9114) extraMin/extraFromParent: 117/117, nicmgmtd (806) childEmin/eMinLimit: 2479/2560"
Your post is the only thing I hit when searching.
I disconnected one of the NIC cards that I hoped was associated with the errors, and the errors stopped for several hours, but then started back up...
You are not alone. We have the same issue on newly installed Dell PowerEdge R640 vSAN Ready Nodes with a clean 6.7 installed from scratch. Some of our CentOS 7 guests (latest patches and open-vm-tools) suddenly just start dropping off. The guests and the hosts seem fine, but we have zero connectivity on certain interfaces. For example, on some, the management interfaces work fine, but the services/Internet interfaces drop off and have no connectivity.
I've opened an SR, and hope VMware comes back with something soon.
Hello guys,
The same issue here on 10 of our ESXi 6.7 hosts on DL380 Gen10 with vSAN.
2018-09-18T13:23:34.015Z cpu25:2100568)MemSchedAdmit: 470: Admission failure in path: nicmgmtd/nicmgmtd.2100568/uw.2100568
2018-09-18T13:23:34.015Z cpu25:2100568)MemSchedAdmit: 477: uw.2100568 (12331) extraMin/extraFromParent: 186/186, nicmgmtd (796) childEmin/eMinLimit: 2443/2560
About 1-2 lines per second in /var/log/vmkernel.log.
Any progress on the SR / any statements from VMware?
Please share your info.
Thx!
Regards,
JK
Which vendor's HBA is in there?
Try changing the queue depth to 64 and rebooting the host;
you can follow this KB.
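For reference, queue depth is set via HBA driver module parameters. A hedged sketch for the Cisco fnic driver mentioned earlier in the thread; the module and parameter names differ per vendor (Emulex and QLogic use different ones, per the KB), so check yours first:

```shell
# Set the fnic per-LUN queue depth to 64 (Cisco UCS fnic driver):
esxcli system module parameters set -m fnic -p fnic_max_qdepth=64

# Verify the parameter took, then reboot the host for it to take effect:
esxcli system module parameters list -m fnic | grep fnic_max_qdepth
```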
We were using the built-in Broadcom quad-port GigT NICs, but switched the traffic to an HP FLR 10GigT Intel-based card (simply a guess, given Broadcom's driver track record).
I haven't disabled the Broadcom cards entirely, just moved all the traffic to the other NICs, but the errors have continued to fill the logs, and we still have intermittent connectivity/responsiveness issues with one of the hosts...
Thanks for the idea Rajeev,
We're not currently using the NIC types mentioned in that KB article.
After disabling the embedded Broadcom quad-port NIC last Saturday, the "admission failure" messages all stopped that day and have not returned, for what that's worth.
I haven't collected any new feedback from users about the intermittent connectivity issues yet, so I don't know if that helped anything beyond getting rid of log bloat...
Hello,
I found a similar case with "admission failure" messages reported. Can you try disabling netqueue on the card?
esxcli network nic queue loadbalancer set --rsslb=off -n vmnicX
Thanks,
James
It looks like we had (have?) the same issue. Two ESXi hosts on two separate occasions have locked up in the way you have described. VMKERNEL is full of this error:
ScsiVsi: 2899: Can't set the maxPathQueueDepth value to more than device advertised maxPathQueueDepth 128.
We put in a ticket with VMware, but they were unable to resolve it. They suggested executing an NMI when it locks up so that it generates a kernel dump.
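If the host is wedged hard, the NMI usually has to come from out-of-band management. One common way is via IPMI (the BMC address and credentials below are placeholders; iLO and iDRAC web UIs also expose a "Generate NMI" button that does the same thing):

```shell
# Send a diagnostic interrupt (NMI) to the host through its BMC,
# which makes a hung ESXi host purple-screen and write a kernel dump:
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <pass> chassis power diag
```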
Hello,
This is a known issue with 6.7. We were able to collect an NMI vmkernel dump and share it with the engineering team, which is currently working on it.
Will share updates as soon as we hear anything.
Thanks,
MS
Did they ever respond? I just updated three of my hosts to 6.7 and am experiencing the same issues. Occasionally my hosts flip to 'not responding' in vCenter, and I'm able to correct it by restarting the host agents. I'm wondering if it has to do with the HTAware Mitigation, so I will disable it for now and see if the situation improves.
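In case it helps anyone testing the same theory, the 6.7 hyperthreading mitigation is a kernel boot option that can be checked and toggled with esxcli (a reboot is required for a change to take effect; this is a host-only config fragment):

```shell
# Show the current L1TF hyperthreading mitigation setting:
esxcli system settings kernel list -o hyperthreadingMitigation

# Disable it for testing (takes effect after the next reboot):
esxcli system settings kernel set -s hyperthreadingMitigation -v FALSE
```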
Hardware is Cisco UCS B200 M3s with Intel E5-2660 v2 and FW 4.0(1b). Storage is all iSCSI over the VIC 1240 NICs (VNX and Nimble arrays). Hardware, FW, and drivers all match up to the VMware HCL.
No, that is a different issue. In this case you cannot restart the management agents, and the DCUI also hangs. You might be encountering a different issue, I guess. Better to open a support request.
Hi, did you get a resolution to this issue from VMware yet? I'm having the same issue on 2 hosts. We have over 100 other hosts that are OK.