Over the last few months we have had a rash of nodes become unresponsive and disconnect in vCenter. I know the root cause is storage provisioning and/or zoning.
My question is about what happens after the zoning, or whatever the underlying issue is, has been fixed: the node stays in this state. I lose the connection in vCenter, and I could not even connect directly to the host via the VI Client. I was able to get into it via the console. I tried restarting the management services and rescanning the vmhba from the command line. That failed with an error: "Connect to localhost failed"
I have yet to have any luck recovering from this scenario. I end up having to reboot the node to fix it.
Are there any recommendations on how to fix this issue? I am sure it will happen again until our network/storage team irons out their process.
Hi David,
What you are seeing sounds like an APD (All Paths Down) scenario. A generic search for "APD+VMware" on Google will show you just how disruptive it can be. With ESXi 5.x this has been contained to an extent, and you should not be hitting this issue as much.
Which brings me to the question: which ESX/ESXi versions and patch levels are you running in the environment? Some patch levels are more at risk from a storage issue than higher builds.
Regards
a
I am at the base 5.0.0.
Right now, I am getting flak for going up to the highest, which I think is 5.0 U2?
For sure I cannot go to 5.1 due to some other incompatibility issues.
Are there no “official” docs from VMware stating why we must upgrade to possibly mitigate this issue?
That would be the only leg I have to stand on…
Thank you
Could you also share the following:
1. Exact make, model, and firmware level of your SAN
2. The hostd and vmkernel logs from one of the affected hosts at the time of the last incident (if available), or just the latest log file generated
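If it helps to narrow the logs down before posting them, here is a rough sketch of a keyword filter, assuming the log files have been copied off the host to a workstation with Python 3 (on ESXi 5.x they normally live at /var/log/hostd.log and /var/log/vmkernel.log). The function name and the keyword list are my own assumptions for illustration, not anything official; adjust the keywords to whatever your logs actually contain.

```python
import re
import sys

# Hypothetical search terms (an assumption, not an official list of messages).
DEFAULT_KEYWORDS = ("APD", "all paths down", "permanent device loss", "lost connectivity")


def filter_storage_events(lines, keywords=DEFAULT_KEYWORDS):
    """Return the lines that contain any keyword, case-insensitively."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    # Strip the trailing newline so the hits print cleanly.
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


if __name__ == "__main__":
    # Usage: python filter_logs.py hostd.log vmkernel.log
    for path in sys.argv[1:]:
        # errors="replace" keeps the scan going past any undecodable bytes.
        with open(path, errors="replace") as fh:
            for hit in filter_storage_events(fh):
                print(f"{path}: {hit}")
```

This is a plain substring match, so expect some noise. It is only meant to help find the time window of the incident, not to replace sending the full logs.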
Thanks
a