VMware Cloud Community
drivera01
Enthusiast

ESXi 5.0 node disconnected and unresponsive in vCenter (due to storage issues)

Over the last few months we have had a rash of nodes become unresponsive and disconnected in vCenter. I know the root cause is storage provisioning and/or zoning.

My question is about what happens after the zoning, or whatever the underlying issue was, has been fixed: the node stays in this state. I lose the connection in vCenter, and I could not even connect directly to the host with the vSphere Client. I was able to get into it via the console. I tried restarting the management services, and I tried rescanning the vmhba from the command line. That failed with an error: "Connect to localhost failed".

I have yet to have any luck recovering from this scenario. I end up having to reboot the node to fix it.
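
For reference, the console steps described above can be sketched roughly as follows. This is only an illustration, not a verified fix. The init scripts (`/etc/init.d/hostd`, `/etc/init.d/vpxa`) and `esxcli storage core adapter rescan --all` are standard on ESXi 5.x, but the ordering, the hypothetical `run_recovery` helper, and the dry-run default are assumptions. Since `esxcli` talks to hostd, a "Connect to localhost failed" error typically means hostd itself is not responding, so the rescan is placed last.

```python
import subprocess

# Assumed order: restart the host agent, then the vCenter agent, then rescan.
RECOVERY_STEPS = [
    ["/etc/init.d/hostd", "restart"],
    ["/etc/init.d/vpxa", "restart"],
    ["esxcli", "storage", "core", "adapter", "rescan", "--all"],
]


def run_recovery(steps=RECOVERY_STEPS, dry_run=True):
    """Run each step in order and stop at the first failure.

    Returns a list of (command line, status) tuples. With dry_run=True
    nothing is executed and every step is reported as "skipped".
    """
    results = []
    for cmd in steps:
        line = " ".join(cmd)
        if dry_run:
            results.append((line, "skipped"))
            continue
        try:
            # No timeout here: a command can block while storage paths are down.
            rc = subprocess.call(cmd)
        except OSError as exc:
            # Raised when the binary is missing, e.g. when not run on an ESXi host.
            results.append((line, "error: %s" % exc))
            break
        results.append((line, "ok" if rc == 0 else "failed (rc=%d)" % rc))
        if rc != 0:
            break
    return results


if __name__ == "__main__":
    for line, status in run_recovery(dry_run=True):
        print("%-50s %s" % (line, status))
```

With the default `dry_run=True` the script only prints the planned commands, which makes it safe to review before running anything on a host that is already in a bad state.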

Are there any recommendations on how to fix this issue? I am sure it will happen again until our network/storage team irons out their process.

3 Replies
a_nut_in
Expert

Hi David,

What you are seeing sounds like an APD (All Paths Down) scenario. A generic Google search for "APD+VMware" will show you just how disruptive it can be. With ESXi 5.x this has been contained to an extent, so you should not be hitting this issue as much.

Which brings me to the question: which ESX/ESXi versions and patch levels are you running in the environment? Some patch levels are more at risk from a storage issue than higher builds.
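
If it helps when collecting that, `vmware -v` on the host console prints a one-line version string. A minimal sketch for pulling the version and build number out of such a line is below. The `parse_vmware_version` helper is hypothetical, and the sample line is just an example of the format, not taken from your host.

```python
import re


def parse_vmware_version(text):
    """Extract product, version, and build from a `vmware -v` style line.

    Returns a dict, or None if the line does not match the assumed format
    "VMware ESXi <version> build-<number>".
    """
    match = re.search(r"VMware (ESXi?)\s+(\d+(?:\.\d+)*)\s+build-(\d+)", text)
    if match is None:
        return None
    return {
        "product": match.group(1),
        "version": match.group(2),
        "build": int(match.group(3)),
    }


if __name__ == "__main__":
    # Example input only; paste the real output from the affected host.
    sample = "VMware ESXi 5.0.0 build-469512"
    print(parse_vmware_version(sample))
```

Comparing the build number across hosts is usually the quickest way to see whether they are all at the same patch level.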

Regards

a

Do remember to mark my post as "helpful" or "correct" if I've helped resolve or answer your query!
drivera01
Enthusiast

I am at the base 5.0.0

Right now, I am getting flak for wanting to go up to the highest, which I think is 5.0 U2?

For sure I cannot go to 5.1 due to some other incompatibility issues.

Are there no “official” docs from VMware stating why we must upgrade to possibly mitigate this issue?

That would be the only leg I have to stand on…

Thank you

a_nut_in
Expert

Could you also share the following:

1. Exact make, model, and firmware level of your SAN

2. The hostd and vmkernel logs from one of the affected hosts at the time of the last incident (if available), or just the latest log file generated

Thanks

a
