Solved: Disconnected ESXi host is unable to reconnect to VCenter

proxb · ‎10-14-2010

ESXi 4.0.0 Releasebuild-261974

VCenter 4.0.0 Build 162856

VSphere 4.0.0 Build 162856

I have one host out of 16 that recently became disconnected from VCenter. I have spent some time researching ways to re-join the host to VCenter, but each one has failed so far. I am avoiding a reboot of the host because production VMs are residing on it. Also, I cannot access the webpage for the host, which works fine for the other hosts we have running: https://

I am able to ping both the IP address and the fully qualified domain name of the host successfully.

I have done the following things in hopes of resolving it, all of which have failed:

1. Restarted the Management Agents on the ESXi host from the System Customization window.

2. Tested the management network successfully and restarted the management network.

3. Attempted to reconnect to the host via VSphere.

4. Logged into Tech Support Mode and ran the following command: /sbin/services.sh restart. When I run this command, I notice that the vmware-aam service fails to start.

Error messages:

1. (When attempting to reconnect) Cannot contact the specified host. The host may not be available on the network, a network configuration problem may exist, or the management services on this host may not be responding.

2. (From Management Agent log) panic HttpNfcSvc Another process is listening on port 12001; Please make sure other instances of hostd are not running; Failed to initialize httpnfc service.

3. (From VirtualCenter Agent log) 0x1499bb90 error 'App' Failed to discover version for authenticating to host agent.; could not resolve version for authenticating to host agent.; Creating temporary connect spec: localhost:443.

4. (From VirtualCenter Agent log) 0x1499bb90 error 'App' SSLStreamImp::BIORead (0x2e408bb0) timed out; SSL Connect failed with BIO Error

5. (From VirtualCenter Agent log) HttpUtil::ExecuteRequest] Error in sending request - SSL Exception: The SSL handshake timed out local:127.0.0.1:63355 peer:127.0.0.1:443
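Taken together, the symptoms above (ping works, the host webpage does not load, and the agent log shows the SSL handshake to port 443 timing out) look like a service that still accepts TCP connections but never answers. Below is a minimal sketch of how that could be checked from an admin workstation. It is not a VMware tool; the function name probe_https and the three result labels are made up for illustration. One caveat: a very old host that only offers SSL/TLS versions a modern Python refuses would also be reported as "tcp_only", so treat that result as a hint rather than proof.

```python
import socket
import ssl


def probe_https(host, port=443, timeout=5.0):
    """Classify a host as 'unreachable', 'tcp_only', or 'tls_ok'.

    'unreachable': the TCP connection itself failed or timed out.
    'tcp_only':    TCP connected, but the TLS handshake failed or timed out
                   (what a hung management service could look like).
    'tls_ok':      the TLS handshake completed.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable"

    # Certificate checks are disabled on purpose: hosts usually present
    # self-signed certificates, and we only care whether the handshake finishes.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with ctx.wrap_socket(sock, server_hostname=host):
            return "tls_ok"
    except (ssl.SSLError, OSError):
        # Covers handshake timeouts as well as protocol errors.
        return "tcp_only"
    finally:
        sock.close()
```

Calling probe_https with the FQDN of the affected host and comparing the result with one of the healthy hosts would show whether the problem is at the network level or in the management service itself.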

Any suggestions would be greatly appreciated! Let me know if you need more information to work with as well.

Thanks!

Boe

dkraut · ‎10-14-2010

Since you've already restarted the management agents, etc., it's really looking like a host reboot is the next step. The VMs should still be up and running. Can you RDP (Windows) or SSH (*nix) into them and shut them down gracefully? Once that's done, reboot the host.

GreatWhiteTec · ‎10-15-2010

This is an ugly one. At least a reboot, if not a re-install. To minimize downtime, you can shut down the VMs, remove them from inventory, and bring them up on other hosts or another cluster.

proxb · ‎10-15-2010

Thanks guys! I talked to VMware support and the tech said the same thing: a reboot seems like the only option left. So I will go through our standard notification process, reboot the host later this afternoon, and hope for the best. I will post back the results of the reboot.

proxb · ‎01-20-2011

Sorry, got completely lost in work and never updated this. After a reboot, the ESXi host was able to reconnect to vCenter. Thanks again for all of the help!

erickdiaz · ‎02-15-2012

I just had this problem twice this week. The first time, I was advised to reboot the ESX host. Since it was a Saturday, I was able to arrange downtime for production, but the second time was a Wednesday, so I decided to spend a bit more time researching and was able to bring the ESX host back online.

I found your notes and went through all those steps too. Since all essential VMs were up, I had a bit more time to take it piece by piece.

I remoted into the console and started looking at Messages under "View System Logs". I noticed some problems reported about losing connectivity to a LUN. I also checked in vCenter when the host had last reported in. Putting all the pieces together, this happened during heavy activity on the SAN side (I later found that it was caused by a sudden disconnection from one of the LUNs on an old SAN, an MD3000i).

Since I was able to SSH in, I did the following.

First, I ran esxcfg-mpath -l | more and reviewed the state of each of the LUN paths:

iqn.1998-01.com.vmware:XXXXXXXX-ESX02-439cccfc-00023d000008,iqn.1984-05.com.dell:powervault.md3000i.60024e80005b8c41000000004a0cb44b,t,1-

Runtime Name: vmhba37:C7:T12:L31

Device: No associated device

Device Display Name: No associated device

Adapter: vmhba37 Channel: 7 Target: 12 LUN: 31

Adapter Identifier: iqn.1998-01.com.vmware:XXXXXXXXX-ESX02-439cccfc

Target Identifier: 00023d000008,iqn.1984-05.com.dell:powervault.md3000i.60024e80005b8c41000000004a0cb44b,t,1

Plugin: MASK_PATH

State: dead

Transport: iscsi

You may have several of these, depending on the number of paths to your SAN. After that, run the following command:

esxcfg-rescan vmhba37

This basically tells the storage services to refresh their connections to each LUN. The vmhba name is listed under the Adapter field.
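With many paths, picking out the dead ones by eye gets tedious. Here is a rough sketch of how the path listing could be filtered on a workstation, assuming the output was saved to a text file and has the same "Runtime Name:" and "State:" fields as the listing above. The function name dead_path_adapters is made up for illustration, and the second path in the usage sample is invented.

```python
def dead_path_adapters(mpath_output):
    """Return the sorted adapter names (e.g. 'vmhba37') that own a dead path.

    Expects text in the detailed path-listing format, where each path has a
    'Runtime Name: vmhbaN:Cx:Ty:Lz' line followed later by a 'State:' line.
    """
    adapters = set()
    runtime = None
    for raw in mpath_output.splitlines():
        line = raw.strip()
        if line.startswith("Runtime Name:"):
            # Remember which path the following fields belong to.
            runtime = line.split(":", 1)[1].strip()
        elif line.startswith("State:") and runtime is not None:
            state = line.split(":", 1)[1].strip()
            if state == "dead":
                # The adapter is the first component of the runtime name.
                adapters.add(runtime.split(":", 1)[0])
            runtime = None
    return sorted(adapters)


if __name__ == "__main__":
    sample = """
    Runtime Name: vmhba37:C7:T12:L31
    Plugin: MASK_PATH
    State: dead
    Transport: iscsi

    Runtime Name: vmhba33:C0:T0:L1
    State: active
    Transport: iscsi
    """
    for adapter in dead_path_adapters(sample):
        # Print the rescan command to run on the host for each affected adapter.
        print("esxcfg-rescan", adapter)
```

The script only prints the rescan commands; running them on the host is still a manual step, which seems the safer choice on a production box.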

It took about 10 minutes, and then the ESX host resumed services. I started to vMotion all servers off the host and then decided to do a clean restart without affecting production.

I would suggest that anybody first look at the messages in the log to identify the root of the problem. This time I was able to resolve it without affecting any active application.
