VMware Cloud Community
daviisfuskk
Contributor

4 ESXi 5.5 hosts disconnected at the same time and cannot be added back into the cluster

My environment runs more than 200 guests on a cluster of 7 ESXi 5.5 hosts. This is what happened:

1. One day, 6 guests went down, so I used vMotion to migrate them to another host in the cluster. The guests went back to normal.

2. I then ran a datastore rescan.

3. After that, 4 hosts in the cluster disconnected at the same time. I could not manage those hosts or the guests running on them.

4. I removed 2 of the 4 hosts from the cluster so that I could add them back, but adding them failed with a request timeout. I also could not connect to those hosts directly with vSphere Client.

Please suggest the likely underlying problem and possible solutions.

Important: I really cannot restart any of the hosts, because that would affect production and have a serious impact on the whole business.


The hostd log is from one of the hosts that were removed from the cluster and could not be added back in.

If you have further questions or need more information to help with this case, please leave a message here or contact me directly at daviisfuskk@outlook.com

Thank you very much in advance for your kind help.

8 Replies
jburen
Expert

If they all went down at the same time, something they have in common could be the cause. Check your storage and confirm that every datastore is still presented to all your hosts. Also check the health of the storage itself.

When a host cannot reach its storage, the hostd process can come under massive load, which prevents management traffic from being processed.

Consider giving Kudos if you think my response helped you in any way.
NathanosBlightc
Commander

Hello

First of all, when you run into a situation like this, restart the management agents on the host and try to connect again. (See KB1003490; I suggest using the ESXi Shell or SSH rather than vSphere Client.)
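As a rough illustration of the SSH route, here is a minimal Python sketch that builds the two agent restart commands for one host and only prints them unless told to run them. It assumes SSH is enabled on the host, that `root` is the login user, and that the agents are restarted with the `/etc/init.d/hostd` and `/etc/init.d/vpxa` init scripts; the function names and the host name in the usage note are hypothetical.

```python
import shlex
import subprocess

# Init scripts for the two ESXi management agents (hostd and the vCenter agent vpxa).
AGENT_RESTART_COMMANDS = (
    "/etc/init.d/hostd restart",
    "/etc/init.d/vpxa restart",
)


def build_ssh_commands(host, user="root"):
    """Return one ssh argument list per agent restart command."""
    return [["ssh", f"{user}@{host}", cmd] for cmd in AGENT_RESTART_COMMANDS]


def restart_agents(host, user="root", dry_run=True):
    """Restart the management agents over SSH.

    With dry_run=True (the default) nothing is executed; each command is
    returned as a printable string paired with None. With dry_run=False
    each command is run and paired with its exit code.
    """
    results = []
    for argv in build_ssh_commands(host, user):
        if dry_run:
            results.append((shlex.join(argv), None))
        else:
            proc = subprocess.run(argv, capture_output=True, text=True, timeout=120)
            results.append((shlex.join(argv), proc.returncode))
    return results
```

Calling `restart_agents("esxi01.example.local")` only returns the command strings so they can be reviewed first; pass `dry_run=False` to actually run them. Restarting the agents should not restart the guests, but it is still worth trying on one host before the others.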

Second, you mentioned that the hosts disconnected after you rescanned the datastore. Either way, did you try to ping the disconnected ESXi hosts from the vCenter Server at that moment?
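A ping only shows that the host answers ICMP. As a complement, the sketch below checks from the vCenter Server whether the management ports accept TCP connections. It assumes the default ports 443 and 902 are in use; the function names and the host names in the usage note are hypothetical.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_hosts(hosts, ports=(443, 902), timeout=3.0):
    """Map each host to a {port: reachable} dictionary."""
    return {
        host: {port: tcp_port_open(host, port, timeout) for port in ports}
        for host in hosts
    }
```

For example, `check_hosts(["esxi01.example.local", "esxi02.example.local"])` returns one entry per host. A host that answers ping but refuses or times out on 443 suggests that the management agents are hung rather than that the network is down.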

Also, is vSphere HA configured in your cluster, and if so, was that datastore selected as a heartbeat datastore? Is it a busy storage system with a high I/O rate?

Please mark my comment as the Correct Answer if this solution resolved your problem
scott28tt
VMware Employee

To add another consideration, 5.5 has been out of support for some time - once you get things back up and running you should seriously consider an upgrade to a supported version, particularly since you are clearly running business critical workloads.


-------------------------------------------------------------------------------------------------------------------------------------------------------------

Although I am a VMware employee I contribute to VMware Communities voluntarily (ie. not in any official capacity)
VMware Training & Certification blog
daviisfuskk
Contributor

Thank you very much for your kind help.

Here is more information.

After I rescanned the datastores on the problematic hosts, I found an old datastore that had been removed and is no longer used; it showed as greyed out. The host disconnected from vCenter afterwards.

I already restarted the vpxa and hostd services through SSH, but it still doesn't work. I still cannot add the hosts back in.

Is there any way I can add the hosts back in without having to restart the host?

NathanosBlightc
Commander

Did you add the hosts by IP address or by FQDN? If you used the name rather than the IP, are you sure the DNS server is answering the vCenter Server correctly?
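One way to check the DNS side from the vCenter Server is a quick forward and reverse lookup. The sketch below uses only the Python standard library; the function names and the host name and address in the usage note are hypothetical, so substitute the FQDN and management IP you actually used when adding the host.

```python
import socket


def forward_lookup(name):
    """Return the sorted IP addresses the local resolver gives for name, or [] on failure."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def reverse_lookup(ip):
    """Return the primary host name registered for ip, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def resolves_to(name, expected_ip, lookup=forward_lookup):
    """Return True if expected_ip is among the addresses name resolves to."""
    return expected_ip in lookup(name)
```

For example, `resolves_to("esxi01.example.local", "10.0.0.11")` should be True and `reverse_lookup("10.0.0.11")` should return the same FQDN. If either check fails, try adding the host by IP address to rule DNS out.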

As for the old unused datastore, it seems to be an orphaned object. Is it local storage or shared? If it is a local datastore, please remove it directly on the ESXi host. If it belongs to shared storage, please check the log files of that storage system from the time of the failure.

daviisfuskk
Contributor

Thank you again for your reply.

I don't have the log of the storage because it has long been removed.

Do you have any other suggestions for making the host connected again?

daviisfuskk
Contributor

I have restarted but it's still not working.

NathanosBlightc
Commander

So I think it would be a good idea to upgrade the virtual infrastructure you manage to a supported version, as scott28tt mentioned before. That upgrade may fix your problem.
