Re: Unresponsive ESX (no CLI / console)

MCEnforcer · 11-09-2011

I have an ESX host which responds to ping but nothing else (SSH, vMA or even the direct console). All the VMs are running OK, so fixing the host is not a major problem.
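For anyone comparing notes on the same symptom (ping answers, management does not), here is a rough sketch of a quick TCP check against the usual management ports. The host name is a made-up placeholder, and the port list (22 for SSH, 443 for the vSphere Client / API, 902 for vmware-authd) is an assumption about a default setup rather than anything confirmed for this host.

```python
import socket

# Hypothetical host name: replace with the affected ESX host.
HOST = "esx01.example.local"

# Assumed default management ports; adjust if your setup differs.
PORTS = {22: "SSH", 443: "HTTPS (vSphere Client / API)", 902: "vmware-authd"}


def check_ports(host, ports, timeout=3.0):
    """Return {port: bool}, True when a TCP connection to the port succeeds."""
    results = {}
    for port in ports:
        try:
            # A completed TCP handshake is enough; we close straight away.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts and name lookup failures.
            results[port] = False
    return results


if __name__ == "__main__":
    for port, is_open in check_ports(HOST, PORTS).items():
        state = "open" if is_open else "no response"
        print(f"{port:>5}  {PORTS[port]:<30} {state}")
```

A port that accepts the connection only shows the listener is alive; as with the console here, the service behind it can still hang before it gets to authentication.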

I noticed that I could migrate VMs if they were powered off, so I shut one down from the OS (Windows). It is definitely off, however vCenter reports that it is powered on and that VMware Tools is running (not likely).

Unfortunately this VM won't migrate and won't power on. I have tried to re-add it to the inventory but I get the usual "The specified key, name or identifier already exists".

Normally I would restart the management agents, however without some sort of CLI access this is not possible. Restarting the host is not an option at this stage.

The local console gets as far as asking for the username (root), however it never asks for a password and eventually goes back to the username prompt.

I have logged this with VMware support, however they are yet to respond, so any alternative assistance would be appreciated.

calvinrobinson · 11-10-2011

That's interesting. I experienced something similar once, where a customer's ESX host exhibited erratic connectivity over its service console vmnics and lost all connectivity over the vmnics used for vMotion. This was down to some sort of documented "bug" related to the processor at the time, however I recall not being able to evacuate all the VMs from the host. Eventually the customer gracefully shut down all the VMs and bounced the host. What version of ESX and what update are you running?

VCP4, VCP5, Cloud Infrastructure Consultant

MCEnforcer · 11-10-2011

Calvin

Thanks for your response, this was an odd one.

VMware support finally came back and logged on to the various systems. Their pronouncement was that it was a SAN issue, which we (sort of) doubted as 20 out of 21 hosts were working OK.

Their only solution was to log on to each VM, shut them all down and then reboot the host, which we weren't able to schedule until the weekend.

Oddly enough, I tried an SSH connection this evening and was able to log on. I then tried reconnecting the host in vCenter and it reappeared. After frantically migrating all the VMs to another system I rebooted the box, so we shall see ...

I think they may have been on the right lines when they suggested SAN issues, as we have had significant performance issues across the board and have had to take one of the arrays out of action for an engineer visit. We have been operating on reduced spindles and controllers / NICs for a few days, thus increasing disk latency. It could be that the tolerance on this machine was slightly lower due to other unknown factors.

To answer your other question, we are running 4.0.0 398348 on this machine, soon to be upgraded to 5.0.0 504890. We never got around to 4.1.

Thanks again, yours was the only response.

Martin

calvinrobinson · 11-10-2011

Awesome, glad you got it working and got the VMs evacuated. I've seen hosts freak out from APD (All Paths Down) conditions, so storage is definitely a plausible culprit.

VCP4, VCP5, Cloud Infrastructure Consultant

Ollfried · 11-10-2011

What about the logs? Do you have a syslog server in place?