So... I have a challenge and am curious what you all think.
Our operations team has a prod host that is:
Disconnected in vCenter
Operational as far as HA thinks
Has running VMs with no issue
Has no SSH or iLO console access
Responds to pings
Has no errors on the switch for either SC port
Where does one go from here? If HA is communicating, can it send instructions? Any thoughts?
My guess is that hostd has crashed, which in most cases will require a reboot to fix. Given that you have no SSH or any other kind of remote access, you will have to be in front of the console itself. Restarting hostd may fix the issue if it isn't completely hosed:
service mgmt-vmware restart
Even using PowerCLI won't work, because it won't be able to restart any of the management agents.
http://communities.vmware.com/thread/236538
Have you tried right-clicking the host and choosing "Connect"?
Hi there
A few things to check:
Are you able to manage it directly with your VI Client?
Have you tried restarting the management agents (at the console, type: service mgmt-vmware restart)?
Have you tried restarting the VC agent (at the console, type: /etc/rc.d/init.d/vmware-vpxa restart)?
Regards
-
a CraZy PeNguIn
If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful
That's the thing. I cannot SSH in, and therefore cannot issue remote commands. I cannot get on the "true" console either. It's disconnected in vCenter, so no good there either.
I keep going back to the fact that HA is working fine, since it shows the agent running in the cluster (viewing another cluster node's vpx logs...). Can HA somehow restart the management services?
What about the physical console of the ESX host? Are you able to access that, either directly or through iLO?
It sounds like the local file system has gone read only. This has happened to me a couple times on some older boxes.
The only way to recover is a reboot. I'd suggest remoting into the guests and shutting them down cleanly.
Once they are down, reboot the host.
There isn't really much else I could do to remedy the situation.
Good luck.
Jase McCarty
Co-Author: VMware ESX Essentials in the Virtual Data Center (ISBN:1420070274) Auerbach
Co-Author: VMware vSphere 4 Administration Instant Reference (ISBN:0470520728) Sybex
Please consider awarding points if this post was helpful or correct
Jase -
That's what I am afraid of. I am waiting on one last resort from VMware Engineering. We shall see...
Thanks everyone.
In my case, the root of the issue was that I was running ESX 3.5 U3 on an IBM x440 (unsupported), and the firmware of the local disks didn't jibe with U3.
I rebuilt the box with 3.5 U2, and didn't have the problem after that. Fortunately I don't have those x440's in production anymore.
Jase McCarty
Keep in mind it could be a hostd issue. If you get the right VMware TSE, they may be able to fix it without a reboot.
Good Luck!!
What backup software / process are you using?
This sounds very close to the issue I just had (other than the disconnected host, which was the only command available to one of the Jr. admins), because PhdVirtual esXpress 3.6.10 had a problem dealing with the year 2010.
We don't use a backup agent unfortunately. I think we are stuck.
No, HA won't restart the management services. If you run 'telnet hostnameorIP 443' and don't get a blank screen or any other kind of response, then things are looking really bad as far as avoiding a reboot of the VMs/host.
If you get a response on 443, you can try connecting directly to the host with the client.
At the physical/iLO/KVM screen, try Alt-F3 or another F# key to see if you can get an alternate console to come up, too.
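For anyone who would rather script the port 443 check than use telnet, here is a minimal Python sketch of the same idea. The function name port_responds and the 5-second default timeout are my own choices, not anything from VMware. Note that a successful connection only shows that something is listening on the port; it does not prove hostd is healthy.

```python
import socket


def port_responds(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Rough equivalent of 'telnet hostnameorIP 443' getting a blank screen.
    The name and defaults are illustrative assumptions.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call (replace the placeholder with the host's name or IP):
# print(port_responds("hostnameorIP", 443))
```

If this returns True, connecting directly to the host with the client is worth a try; if it returns False, the telnet test would most likely have failed as well.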
Just to echo what Jasemcarty mentioned: I had the exact same symptoms when the local filesystems went read only because of a RAID controller fault. I happened to have an SSH session up when it went, so I was able to do a bit of poking around. Not that it helped much, though. The only solution was to remotely log in to the machines, shut them down, and bring them back up on other hosts.
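If you do happen to have a shell open when this hits, one quick way to confirm the read-only theory is to look at the mount options the kernel reports. Below is a rough Python sketch that scans text in the /proc/mounts layout (device, mount point, filesystem type, options) and lists anything mounted "ro". It assumes you can still run cat /proc/mounts and paste the output somewhere Python is available; the function name and the sample data are made up for illustration.

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list includes 'ro'.

    Expects text in /proc/mounts format: device, mount point, fs type, options, ...
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3]
        # Options are comma separated, e.g. "ro,data=ordered".
        if "ro" in options.split(","):
            result.append(mount_point)
    return result


# Hypothetical sample, not taken from a real host:
SAMPLE = """\
/dev/sda2 / ext3 ro,data=ordered 0 0
/dev/sda1 /boot ext3 rw 0 0
/proc /proc proc rw 0 0
"""

print(read_only_mounts(SAMPLE))  # prints ['/']
```

A root filesystem showing up in that list when it is normally mounted read-write would fit the symptoms described above, though it will not tell you why it was remounted.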