Stumped

thickclouds · ‎01-13-2010

So... I have a challenge and am curious what you all think.

Our operations team has a prod host that is:

Disconnected in vCenter

Operational as far as HA thinks

Has running VMs with no issue

Has no SSH or iLO console access

Responds to pings

Has no errors on the switch for either SC port

Where does one go from here? If HA is communicating, can it send instructions? Any thoughts?

Charlie Gautreaux vExpert http://www.thickclouds.com

Troy_Clavell · ‎01-13-2010

My guess is hostd has crashed, which in most cases will require a reboot to fix. Given that you have no SSH access or any other kind of remote access, you will have to be in front of the console itself. Restarting hostd may fix the issue if it isn't completely hosed:

service mgmt-vmware restart

Even using PowerCLI won't work, because it won't be able to restart any of the management agents.

http://communities.vmware.com/thread/236538

Have you tried right-clicking the host and choosing "Connect"?

aCrazyPenguin · ‎01-13-2010

Hi there

A few things to check:

Are you able to manage it by connecting the VI Client directly to the host?
Have you tried restarting the management agents (at the console, type service mgmt-vmware restart)?
Have you tried restarting the VC agent (at the console, type /etc/rc.d/init.d/vmware-vpxa restart)?

Regards

-

a CraZy PeNguIn

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful

Andy Wood - VCP3 & 4 . MCITP:EA . MCSE:S . CCA . CCNA . Sec+ http://www.acrazypenguin.com

thickclouds · ‎01-13-2010

That's the thing. I cannot SSH in, and therefore cannot issue remote commands. I cannot get on the "true" console either. It's disconnected in vCenter, so no good there either.

I keep going back to the fact that HA is working fine, since it shows the agent running in the cluster (judging by another cluster node's vpx logs...). Can HA somehow restart the management services?

weinstein5 · ‎01-13-2010

What about the physical console of the ESX host? Are you able to access that, either directly or through iLO?

Jasemccarty · ‎01-13-2010

It sounds like the local file system has gone read-only. This has happened to me a couple of times on some older boxes.

The only way to recover it is a reboot. I'd suggest remoting into the guests and shutting them down cleanly.

Once they are down, reboot the host.

There isn't really much else I could do to remedy the situation.

Good luck.

Jase McCarty

http://www.jasemccarty.com

Co-Author: VMware ESX Essentials in the Virtual Data Center (ISBN:1420070274) Auerbach

Co-Author: VMware vSphere 4 Administration Instant Reference (ISBN:0470520728) Sybex

Jase McCarty - @jasemccarty

thickclouds · ‎01-13-2010

Jase -

That's what I am afraid of. I am waiting on one last resort from VMware Engineering. We shall see...

Thanks everyone.

Jasemccarty · ‎01-13-2010

In my case, the root of the issue was that I was running ESX 3.5 U3 on an IBM x440 (unsupported), and the firmware of the local disks didn't jibe with U3.

I rebuilt the box with 3.5 U2, and didn't have the problem after that. Fortunately I don't have those x440s in production anymore.

Troy_Clavell · ‎01-13-2010

Keep in mind it could be a hostd issue. If you get the right VMware TSE, they may be able to fix it without a reboot.

Good Luck!!

marvinms · ‎01-13-2010

What backup software / process are you using?

This sounds very close to an issue I just had (other than the disconnected host, which was the only command available to one of the Jr. admins), caused by PhdVirtual esXpress 3.6.10 having a problem dealing with the year 2010.

thickclouds · ‎01-13-2010

We don't use a backup agent, unfortunately. I think we are stuck.

danm66 · ‎01-13-2010

No, HA won't restart the management agents. If you run 'telnet hostnameorIP 443' and don't get a blank screen or any other kind of response, then things are looking really bad as far as avoiding a reboot of the VMs/host.

If you get a response on 443, you can try connecting directly to the host with the client.

At the physical/iLO/KVM screen, try Alt-F3 or another F# key to see if you can get an alternate console to come up, too.
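
The port 443 check described above can also be scripted from an admin workstation, which helps if telnet is not installed. The following is only a minimal sketch in Python 3: the function name port_is_open and the example host name are hypothetical, and a successful TCP connect says no more than the blank telnet screen does (something is listening on 443, not that hostd is healthy).

```python
import socket


def port_is_open(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    This mirrors 'telnet hostnameorIP 443': it only proves that something
    accepts connections on the port, nothing about the service behind it.
    """
    try:
        # create_connection resolves the name and opens the TCP socket;
        # the 'with' block closes it again straight away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example call (hypothetical host name, replace with your ESX host):
# port_is_open("esx01.example.com")
```

If this returns True, connecting the client directly to the host is worth a try; if it returns False, the same caveat as with telnet applies.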

timparkinsonShe · ‎01-15-2010

Just to echo what Jasemccarty mentioned: I had the exact same symptoms when the local filesystems went read-only because of a RAID controller fault. I happened to have an SSH session up when it went, so I was able to do a bit of poking around. Not that it helped much, though. The only solution was to remotely log in to the machines, shut them down and bring them back up on other hosts.
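
For anyone who does still have a shell open when this happens, one way to confirm the read-only theory is to look at the mount options in /proc/mounts. The sketch below is an illustration only, written for Python 3 on a generic Linux system: the function name read_only_mounts is hypothetical, and the ESX service console may ship an older Python or none that can run it, in which case reading /proc/mounts by eye gives the same information.

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose options include 'ro'.

    mounts_text is expected in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if "ro" in options:
            result.append(mount_point)
    return result


# Example call on a live Linux system:
# with open("/proc/mounts") as f:
#     print(read_only_mounts(f.read()))
```

Some filesystems are mounted read-only on purpose, so a hit only matters for a filesystem that is normally writable, such as / or /var.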
