Manual Automation

There's nothing like going into work Monday morning only to find that one of your ESX hosts is listed as "not responding" in VirtualCenter. Using HP's iLO, I tried restarting the management network. No change. The VMs were still running and functioning normally, and so was the host; there just seemed to be a communications problem between the host and VirtualCenter. I made a quick call to VMware technical support, and they had me restart the VirtualCenter server service. Voilà, communications were restored and the host's status in VirtualCenter returned to normal.

I didn't spend much time on root-cause analysis because this environment was not in production yet, but I suspected a network interruption from which host-to-VirtualCenter communications never recovered.

Now let me just say something about VMware technical support. I've worked support incidents with many hardware and software vendors over the years and have to say that VMware has their act together when it comes to product support. I'm not saying they're perfect, but I've received consistent quality support from these guys going back to my ESX Server 1.5 days. They're worth the money and I wouldn't run a Virtual Infrastructure environment without them.


So a week later, it happens again. I open another case with VMware, referencing the previous one. This time, neither restarting the management network nor restarting the VirtualCenter server service works. The support tech reviews some additional logs and is basically stumped. The only thing left for me to do is manually shut down the VMs running on this host and reboot it. This fixes the problem, but doesn't explain why it happened in the first place. The tech is going to review a new set of logs I just uploaded and let me know if he finds anything. While there isn't much more that can be done at this point, this always seems to me like a "don't call me, I'll call you" kind of resolution.


Before he has a chance to call me back, it happens a third time on the same host! Same symptoms, same results. The same tech doesn't find anything in the logs from the previous incident, so he escalates to senior-level VirtualCenter support. We discover a new symptom: the host seems to have lost connectivity with the storage, even though the VMs are still running fine (strange but true).


The senior tech said something that jogged my memory: while this server survived our 4-day hardware burn-in test, we had problems connecting to the management console very early on, to the point where we had to pull the USB key fob and reinstall it. (Keep in mind we're running ESXi.)


To be safe, I installed a new USB key fob and the problem has not occurred again. It's been about three weeks since I wrote this entry. Moral of the story: don't automatically rule out the hardware even when the problem appears to be with the software.
