Contributor

unresponsive host

Hi

Since I upgraded to VMware ESXi 6 and vCenter 6, I have the following issue:

The host shows as greyed out and is not responding.

All the VMs on that host are also greyed out and show as disconnected.

I can connect to the host directly, but it is unresponsive.

If I reboot the host, it reconnects fine and everything works. This has happened on 2 hosts so far; one was an HP and the other was an Intel.

Any help or input will really be appreciated.

Thanks

26 Replies

I would recommend restarting the management agents:

VMware KB: Restarting the Management agents on an ESXi or ESX host

What happens if you right-click the disconnected server in vCenter and select Reconnect? Does it ask you for credentials?

Maybe the host lost its certificates during the upgrade to vSphere 6.0, and vCenter now thinks it could be another server.

If so, you have to reconnect them manually.
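On the host itself, the agents can be restarted from an SSH or console session; a minimal sketch using the standard ESXi service scripts (run this on the affected host, not on vCenter):

```shell
# Restart the ESXi management agents on the affected host.
# hostd is the host management service, vpxa is the vCenter agent.
/etc/init.d/hostd restart
/etc/init.d/vpxa restart

# Alternatively, restart all management services at once; note this
# briefly drops every management connection to the host:
# services.sh restart
```

Running VMs are not affected by restarting these agents, only management connectivity.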

------------------------------------------------------------------------------- If you found this or any other answer helpful, please consider awarding points (use the Correct or Helpful buttons). Regards from Switzerland, B. Fernandez http://vpxa.info/
Contributor

Hi,

I haven't tried to disconnect and reconnect the host, but I did try to right-click and select Connect. This did nothing.

When I have the problem again, I'll try to restart the management agents.

Contributor

Hello all!

I had the same issue a few days ago in two different environments: one standalone free ESXi 6.0 hypervisor and one two-node cluster managed by vCenter Server Appliance 6.0.

I tried to reconnect the host, but it didn't work for me.

At the DCUI I tried to enter my password, but the host did not respond. Only a reboot solved my problem. After that everything was fine.

I'm running ESXi 6.0 on a Fujitsu RX200 S6 and an RX200 S7.

Please let me know if there is a fix for this issue.

Regards,

schulzman

Hot Shot

Hi,

I have had something similar on an upgraded test host: the server would randomly disconnect, and a reboot resolved it. Eventually the host wouldn't reconnect to vCenter at all.

The fix for me was to uninstall the vpxa agent, restart the host, and then reconnect to vCenter (as though connecting a new host).

R

Enthusiast

Could you please confirm how you uninstalled the vpxa agent?

Contributor

If you're seeing this in your vmkernel.log at the time of the disconnect, it could be related to an issue that will one day be described at the link below (it is not live at this time). We see this after a random amount of time, and nothing VMware technical support could do, apart from rebooting the host, helped.

http://kb.vmware.com/kb/2124669

vmkernel.log:

2015-07-19T08:22:35.552Z cpu0:33257)WARNING: LinNet: netdev_watchdog:3678: NETDEV WATCHDOG: vmnic4: transmit timed out
2015-07-19T08:22:35.552Z cpu0:33257)WARNING: at vmkdrivers/src_92/vmklinux_92/vmware/linux_net.c:3707/netdev_watchdog()(inside vmklinux)
2015-07-19T08:22:35.552Z cpu0:33257)Backtrace for current CPU #0,worldID=33257, rbp=0x430609af4380
2015-07-19T08:22:35.552Z cpu0:33257)0x4390cf49be10:[0x418029896b4e]vmk_LogBacktraceMessage@vmkernel#nover+0x22 stack: 0x430609af4380, 0
2015-07-19T08:22:35.552Z cpu0:33257)0x4390cf49be30:[0x418029f1e7b7]watchdog_work_cb@com.vmware.driverAPI#9.2+0x27f stack: 0x430609ac3ce
2015-07-19T08:22:35.552Z cpu0:33257)0x4390cf49bea0:[0x418029f44a5f]vmklnx_workqueue_callout@com.vmware.driverAPI#9.2+0xd7 stack: 0x4306
2015-07-19T08:22:35.552Z cpu0:33257)0x4390cf49bf30:[0x41802984f872]helpFunc@vmkernel#nover+0x4e6 stack: 0x0, 0x430609ac3ce0, 0x27, 0x0,
2015-07-19T08:22:35.552Z cpu0:33257)0x4390cf49bfd0:[0x418029a1231e]CpuSched_StartWorld@vmkernel#nover+0xa2 stack: 0x0, 0x0, 0x0, 0x0,
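To check whether a host that dropped off is hitting this same signature, you can grep a copy of its vmkernel.log for the watchdog message; a minimal sketch, using an illustrative sample file rather than a real host's log:

```shell
# Hedged sketch: build a tiny sample log and count watchdog hits.
# On a real ESXi host the file lives at /var/log/vmkernel.log.
cat > /tmp/vmkernel.sample.log <<'EOF'
2015-07-19T08:22:35.552Z cpu0:33257)WARNING: LinNet: netdev_watchdog:3678:
NETDEV WATCHDOG: vmnic4: transmit timed out
2015-07-19T08:22:36.000Z cpu0:33257)Backtrace for current CPU #0
EOF

# Count lines containing the watchdog signature; prints 1 for the sample.
grep -c "NETDEV WATCHDOG" /tmp/vmkernel.sample.log
```

A count greater than zero around the time of a disconnect suggests the same transmit-timeout issue.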

Contributor

sdnbtech, have you heard or seen any updates on the issue you described? I haven't been able to get an update on the status of a fix from VMware for a few weeks, even after confirming that VMware engineering is working on a solution. A host downgrade to 5.5 was the only recommendation, aside from rebooting the 6.0 hosts each time networking drops.

Enthusiast

I seem to be having very similar issues:

2015-08-11T11:14:53.340Z cpu23:33256)WARNING: LinNet: netdev_watchdog:3678: NETDEV WATCHDOG: vmnic4: transmit timed out

2015-08-11T11:14:53.340Z cpu23:33256)<6>ixgbe 0000:41:00.0: vmnic4: Fake Tx hang detected with timeout of 160 seconds

When this happens, both ports on a dual-port NIC die at the same time and only a reboot fixes it. I opened an SR with VMware support with a reference back to here and the not-yet-existing KB posted above, and I will follow up if/when I hear something back on this.
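Since the log implicates the ixgbe driver, it is worth capturing the exact driver and firmware versions when opening an SR; a sketch using standard esxcli commands (vmnic4 is the NIC from the log above, substitute your own):

```shell
# List all physical NICs with their drivers; look for the ixgbe ones.
esxcli network nic list

# Show driver version, firmware version, and link details for the
# affected NIC (vmnic4 here is just this thread's example).
esxcli network nic get -n vmnic4
```

Having these details in the SR lets support match your case against the known driver issue quickly.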

Enthusiast

Troubleshooting a non-responsive host without looking at the logs is not really effective. You can open a service request with VMware.

Enthusiast

Please share the log details; without logs it is hard to find the root cause. Storage might also be the reason: the APD recovery issue is still unresolved in 6.0.

What about the VMs on the host: are they still live when the host goes unresponsive? Even a time-sync problem can make a host show as disconnected.

Enthusiast

Confirmed what sdnbtech stated above. The "transmit timed out" is a known issue. There is no ETA on a time frame for a release yet, and they are not very forthcoming with details. Basically I was told to downgrade if this issue is affecting me, as there is no workaround. The engineer I spoke to says he sees this at least once a week.

Contributor

I checked this morning and there are a few options:

1) Apply a debug build of ESXi that will still be affected by the problem but will gather more information for the development team.

2) Run a script at each boot of each ESXi server; they believe this fixes the issue entirely, but it can cause performance degradation.

3) Downgrade to 5.5 or below.

My case has now been open for 60 days regarding this issue. It's very disappointing.

Contributor

Hello,

I have the same problem with two HP DL580 G7 servers.

Any chance you could share the script to run on each reboot?

Thanks!

Enthusiast

The fix script for this is now available here.

That KB article seems to have been published today, the same day 6.0 U1 came out. There is no mention that 6.0 U1 fixes this problem. In fact, it specifically states "After upgrading to or installing ESXi 6.0.x and ESXi 6.0 Update 1, you may experience these symptoms", so one would have to assume the problem still persists in 6.0 U1 as well. If it were fixed in 6.0 U1, I would expect the article to say as much.
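For anyone applying the KB's boot-time script: the usual way to run something at every boot on ESXi is to call it from /etc/rc.local.d/local.sh, which persists across reboots. A hedged sketch; the script name and datastore path below are placeholders, use whatever the KB actually ships:

```shell
# Copy the KB's script somewhere persistent, e.g. a datastore
# ("fix-script.sh" and "datastore1" are placeholder names).
# chmod +x /vmfs/volumes/datastore1/fix-script.sh

# Then edit the boot-time hook and add a call to the script
# ABOVE the final "exit 0" line:
vi /etc/rc.local.d/local.sh
#   /vmfs/volumes/datastore1/fix-script.sh
```

This only covers persistence across reboots; the script itself comes from the KB, not from this sketch.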

Enthusiast

Hi,

Check the article below; it could be a firewall issue on the vCenter side. Hope this helps:

VMware KB: ESXi/ESX hosts enter a Not Responding state after connecting to vCenter Server
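A quick way to test the kind of connectivity that KB covers is to check the management ports from the vCenter side (vCenter reaches hosts on TCP 443 and 902, and hosts send heartbeats back to vCenter on UDP 902). A sketch using netcat, with "esxi01" as a placeholder hostname:

```shell
# From the vCenter Server (or any machine on its network), check that
# the host's management ports are reachable. "esxi01" is a placeholder.
nc -z -w 3 esxi01 443 && echo "443 open"
nc -z -w 3 esxi01 902 && echo "902 open"
```

If TCP is fine but hosts still flap to "Not Responding", the UDP 902 heartbeat path from host to vCenter is the next thing to check with your firewall team.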

Enthusiast

Thanks for the info. I guess I will wait for the patch; I don't want to run into performance issues :(

VMware really needs to have a warning on their download page that references this KB.
