VMware Cloud Community
proxb
Contributor

Disconnected ESXi host is unable to reconnect to vCenter

ESXi 4.0.0 Releasebuild-261974

vCenter 4.0.0 Build 162856

vSphere 4.0.0 Build 162856

I have one host out of 16 that recently became disconnected from vCenter. I have spent some time researching ways to rejoin the host to vCenter, but each one has failed so far. I am avoiding a reboot of the host because production VMs are running on it. Also, I cannot access the host's web page, which works fine for our other hosts: https://

I am able to ping both the IP address and the fully qualified domain name of the host successfully.
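
A successful ping only shows that the host answers ICMP; it says nothing about whether the management service behind the https:// page is accepting connections. As a rough sketch (not a VMware tool), the Python snippet below, run from an admin workstation, separates "port closed" from "port open but the TLS handshake never completes". The function name probe_https, the 5-second timeout, and the result labels are illustrative assumptions; port 443 is taken from the https URL.

```python
import socket
import ssl


def probe_https(host, port=443, timeout=5.0):
    """Return 'unreachable', 'tcp_only', or 'tls_ok' for host:port.

    'tcp_only' means the TCP connection opened but the TLS handshake
    failed or timed out.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, timed out, or name resolution failed.
        return "unreachable"

    # Hosts often present self-signed certificates, so verification is
    # disabled here; the probe only asks whether the handshake completes.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with context.wrap_socket(sock, server_hostname=host):
            return "tls_ok"
    except (ssl.SSLError, OSError):
        return "tcp_only"
    finally:
        sock.close()
```

A result of "tcp_only" for the affected host, alongside "tls_ok" for a healthy one, would be consistent with a management agent that still holds the port but has stopped responding. That is an interpretation, not a confirmed diagnosis.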

I have done the following things in hopes of resolving it, all of which have failed:

1. Restarted the Management Agents on the ESXi host from the System Customization window.

2. Tested the Management Network successfully, then restarted the Management Network.

3. Attempted to reconnect to the host via vSphere.

4. Logged into Tech Support Mode and ran /sbin/services.sh restart, during which I noticed that the vmware-aam service fails to start.

Error messages:

1. (When attempting to reconnect) Cannot contact the specified host. The host may not be available on the network, a network configuration problem may exist, or the management services on this host may not be responding.

2. (From Management Agent log) panic HttpNfcSvc Another process is listening on port 12001; Please make sure other instances of hostd are not running; Failed to initialize httpnfc service.

3. (From VirtualCenter Agent log) 0x1499bb90 error 'App' Failed to discover version for authenticating to host agent.; could not resolve version for authenticating to host agent.; Creating temporary connect spec: localhost:443.

4. (From VirtualCenter Agent log) 0x1499bb90 error 'App' SSLStreamImp::BIORead (0x2e408bb0) timed out; SSL Connect failed with BIO Error

5. (From VirtualCenter Agent log) [HttpUtil::ExecuteRequest] Error in sending request - SSL Exception: The SSL handshake timed out local:127.0.0.1:63355 peer:127.0.0.1:443
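
Taken together, the log excerpts above hint at a local problem rather than a network one: the failing SSL connection has a loopback peer on port 443, and the Management Agent log complains about a port that is already in use. The sketch below shows one way to pull those two signals out of saved log text. It is a minimal illustration only. The labels, the function names (triage, loopback_peers), and the regular expressions are assumptions based solely on the lines quoted here, not a documented log format.

```python
import re

# (pattern, label) pairs derived only from the excerpts quoted in this post.
SIGNATURES = [
    (re.compile(r"Another process is \w+ on port (\d+)"), "port_conflict"),
    (re.compile(r"SSL handshake timed out", re.IGNORECASE), "ssl_handshake_timeout"),
    (re.compile(r"BIO Error", re.IGNORECASE), "ssl_bio_error"),
]

PEER = re.compile(r"peer:(\S+)")


def triage(log_text):
    """Return the labels of all signatures found, in order of first appearance."""
    found = []
    for line in log_text.splitlines():
        for pattern, label in SIGNATURES:
            if pattern.search(line) and label not in found:
                found.append(label)
    return found


def loopback_peers(log_text):
    """Return 'host:port' peers that point at the local machine."""
    peers = []
    for peer in PEER.findall(log_text):
        host = peer.rsplit(":", 1)[0]
        if host.startswith("127.") or host == "localhost":
            peers.append(peer)
    return peers
```

If loopback_peers returns an entry such as 127.0.0.1:443, the timeout happened between two services on the host itself, which suggests looking at the host's own management agent before the network.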

Any suggestions would be greatly appreciated! Let me know if you need more information to work with as well.

Thanks!

Boe


Accepted Solutions
GreatWhiteTec
VMware Employee

This is an ugly one. It will take at least a reboot, if not a re-install. To minimize downtime, you can shut down the VMs, remove them from inventory, and bring them up on other hosts or another cluster.

___________________

A+, DCSE, MCP, MCSA, MCSE, MCTS, MCITP, MCDBA, NCDA, VCP4


dkraut
Enthusiast

Since you've already restarted the management agents, etc., a host reboot really looks like the next step. The VMs should still be up and running. Can you RDP (Windows) or SSH (*nix) into them and shut them down gracefully? Once that's done, reboot the host.

proxb
Contributor

Thanks, guys! I talked to VMware support, and the tech is saying the same thing: a reboot seems to be the only option left. So I will go through our standard notification process, reboot the host later this afternoon, and hope for the best. I will post back the results of the reboot.

proxb
Contributor

Sorry, I got completely lost in work and never updated this. After a reboot, the ESXi host reconnected to vCenter. Thanks again for all of the help!

erickdiaz
Contributor

I just had this problem twice this week. The first time, I was advised to reboot the ESX host. Since it was a Saturday, I was able to arrange downtime for production. The second time was a Wednesday, so I decided to spend a bit more time researching and was able to bring the ESX host back online.

I found your notes and went through all of those steps too. Since all essential VMs were up, I had a bit more time to take it piece by piece.

I connected remotely to the console and started looking at Messages under "View System Logs". I noticed some entries reporting lost connectivity to a LUN. I also checked in vCenter when the host had last reported in. Putting the pieces together, this happened during heavy activity on the SAN side (I later found it was caused by a sudden disconnection from one of the LUNs on an old SAN, an MD3000i).

Since I was able to SSH in, I did the following:

I ran esxcfg-mpath -L | more and reviewed the state of each of the LUN connections:

iqn.1998-01.com.vmware:XXXXXXXX-ESX02-439cccfc-00023d000008,iqn.1984-05.com.dell:powervault.md3000i.60024e80005b8c41000000004a0cb44b,t,1-
   Runtime Name: vmhba37:C7:T12:L31
   Device: No associated device
   Device Display Name: No associated device
   Adapter: vmhba37 Channel: 7 Target: 12 LUN: 31
   Adapter Identifier: iqn.1998-01.com.vmware:XXXXXXXXX-ESX02-439cccfc
   Target Identifier: 00023d000008,iqn.1984-05.com.dell:powervault.md3000i.60024e80005b8c41000000004a0cb44b,t,1
   Plugin: MASK_PATH
   State: dead
   Transport: iscsi

You may have several of those, depending on the number of paths to your SAN. After that, I ran the following command:

esxcfg-rescan vmhba37

This tells the storage services to refresh their connections to each LUN. The vmhba name is listed in the Adapter field of the path listing above.
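
With many paths, picking out the dead ones by eye through more gets tedious. The sketch below parses a saved copy of the path listing on a workstation, finds the adapters that have at least one dead path, and prints the matching esxcfg-rescan commands to run on the host. It assumes the record layout shown in this post: an unindented path identifier followed by indented "Key: value" lines. The function names and the shortened sample data (including the second, active path) are made up for illustration.

```python
import re

# Shortened, partly hypothetical sample in the layout shown in this post.
SAMPLE = """\
iqn.1998-01.com.vmware:host-a,iqn.1984-05.com.dell:powervault.md3000i.x,t,1-
   Runtime Name: vmhba37:C7:T12:L31
   Device: No associated device
   Adapter: vmhba37 Channel: 7 Target: 12 LUN: 31
   Plugin: MASK_PATH
   State: dead
   Transport: iscsi

iqn.1998-01.com.vmware:host-a,iqn.1984-05.com.dell:powervault.md3000i.y,t,1-
   Runtime Name: vmhba37:C0:T0:L1
   Adapter: vmhba37 Channel: 0 Target: 0 LUN: 1
   State: active
   Transport: iscsi
"""


def parse_paths(listing):
    """Split a path listing into one dict per path record."""
    paths = []
    current = None
    for line in listing.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new record (the path identifier).
            current = {"Path": line.strip()}
            paths.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only; values may contain colons.
            key, _, value = line.strip().partition(":")
            current[key.strip()] = value.strip()
    return paths


def adapters_with_dead_paths(paths):
    """Return the sorted adapter names that own at least one dead path."""
    adapters = set()
    for path in paths:
        if path.get("State") == "dead":
            # "vmhba37 Channel: 7 Target: 12 LUN: 31" -> "vmhba37"
            match = re.match(r"(vmhba\d+)", path.get("Adapter", ""))
            if match:
                adapters.add(match.group(1))
    return sorted(adapters)


def rescan_commands(adapters):
    """Build one esxcfg-rescan command line per adapter."""
    return ["esxcfg-rescan " + name for name in adapters]


if __name__ == "__main__":
    for command in rescan_commands(adapters_with_dead_paths(parse_paths(SAMPLE))):
        print(command)
```

The script only prints the commands; running them is left as a manual step on the host so that nothing is rescanned without a person checking the list first.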

It took about 10 minutes, and then the ESX host resumed services. I started vMotioning all servers off the host and then decided to do a clean restart without affecting production.

I would suggest that anybody first look at the messages in the log to identify the root of the problem. This time I was able to resolve it without affecting any active application.
