Greetings,
After upgrading to the aforementioned versions (the newest of each), I'm having an issue where each host in the farm seems to randomly lose its connection to VirtualCenter. This breaks all manageability of the virtual machines, since they show as disconnected and can't be migrated or touched, but they're still up and running just fine. If I SSH into the affected box and run "service mgmt-vmware restart" (or stop and start), everything comes back up except "VMware ESX Server Host Agent Watchdog," which shows "FAILED." A status query shows that hostd is not running. Nothing I do seems to get it back up again. The last time this happened I rebooted the server and everything came back, but that's getting a little old.
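For anyone triaging the same symptom, here is a rough sketch of a check for whether hostd is actually alive. It scans /proc directly (the process name is field 2 of /proc/<pid>/stat), so it doesn't depend on ps; this is a generic sketch, not a VMware-documented procedure, and on a real ESX 3.x host you'd pair it with "service mgmt-vmware status".

```shell
#!/bin/sh
# Check whether a named daemon (here hostd, the ESX host agent) is running
# by scanning /proc. Field 2 of /proc/<pid>/stat is "(comm)"; simple names
# only -- process names containing spaces would misparse.
is_running() {
    target="($1)"
    for s in /proc/[0-9]*/stat; do
        set -- $(cat "$s" 2>/dev/null)
        [ "$2" = "$target" ] && return 0
    done
    return 1
}

if is_running hostd; then
    echo "hostd is running"
else
    echo "hostd is NOT running; try: service mgmt-vmware restart"
fi
```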
Thoughts?
As a follow-up... I do notice that when I run the "service mgmt-vmware restart" command I get the following error message:
"touch: creating '/var/lock/subsys/mgmt-vmare': Read-only file system"
OK, this one is hard, and I had to do this on some of my hosts also.
Here is what I did (before I eventually had to rebuild my VC database).
Disconnect the host from the cluster, then remove the host from the cluster in VC (this will remove ALL the VMs and associated templates). If you need your VMs or templates, migrate them first, though if your host keeps disconnecting that may prove awkward.
Then log in to the host directly; the cluster should have absolutely no reference to that host at all. Once you are on the host, delete the vimuser and vpxa users. Reboot the host, then log back in as root and make sure you ONLY see VMs: no leftover resource pools or anything else, just VMs and templates (which are probably VMs). If it's empty, even better, but make sure there is nothing from the previous cluster left on this host.
Now add the host back to the cluster while remaining logged in as root. At this point you have two VIC sessions: one directly to the host and one to the VC. As the machine is added you will see messages on the host, and after it's up and attached to the cluster, it should calm down and give you no more problems.
In my case I kept getting "resource added, logged in as root, root disconnected" every 10 seconds. This should not happen, and I finally had to start over with a new database to fix it. Previously this worked by just rebooting the host, removing the vpxa and vimuser accounts, and letting the cluster re-add them, but you will note that VC 2.5 does not re-add these, and I'm thinking there is some leftover state from previous hosts that isn't quite fixed yet.
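The host-side cleanup above can be sketched as a short script. Everything here is hypothetical, assumes the ESX 3.x service console, and should only be run as root AFTER the host has been removed from VC; DRYRUN=1 (the default) just prints each command so you can review before committing.

```shell
#!/bin/sh
# Dry-run-guarded sketch of the host cleanup: remove the VirtualCenter
# agent accounts and reboot. Set DRYRUN=0 to actually execute.
DRYRUN=${DRYRUN:-1}
run() {
    if [ "$DRYRUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run userdel vimuser      # remove the VC agent accounts left on the host
run userdel vpxa
run shutdown -r now      # reboot, then log back in as root and re-add the host
```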
I hope this works for you, because I sat for 12 hours, tried everything, and never really resolved the issue; maybe you will be better off than me. The messages give you a clue, and perhaps you can make better sense of them than I did.
This is the third high-profile crash of this server array since we upgraded. I've already had to completely reinstall one host to fix the first crash, and move filestores from one SAN to another to fix a communications lockup on our QLogic HBAs (again, a VMware issue). Now we're back to issue number 1, where I either reboot and maybe it works for a couple of days, follow the procedure outlined above, or reinstall (again). I'm really beginning to wonder whether VMware actually tests anything they release or whether they follow the Microsoft beta-test cycle.
Does this only happen on ESX hosts that were upgraded, or on fresh installs of ESX 3.5 as well? Just want to make sure I read your original note right: you're talking about an upgrade, correct?
Correct, this is only on hosts upgraded to 3.5 from 3.0.2 (or whatever the last version was; I don't have it in front of me at the moment).
There seems to be a new issue where a fully patched 3.0.2 host (build 63195) that has been upgraded to 3.5 will begin to experience kernel panics. It sounds like the workaround is to not install the Jan 2 patches before upgrading. VMware is aware of the issue and is apparently working on a fix.
Hmmm...
Any word on what to do if you have already done the upgrade? Or a timeframe for a fix? Or can you point me to where you found that documented somewhere?
I'm facing the same problem. It happened on a host that is not part of a cluster; the VMs are running fine. ESX 3.5 SP2, VC 2.5 SP3.
Is there any update? Any easy way to fix this?
I'm also having the same issues. Is there a fix?
Hi OlivR,
I think you can try the steps below; since your server is not part of a cluster, this may help.
First remove the machine from the VC inventory, then proceed as follows:
1. Find the vpx agent RPM installed on the ESX host:
   rpm -qa | grep vpx
2. Stop the management service:
   service mgmt-vmware stop
3. Remove the vpx RPM:
   rpm -e <vpx rpm>
4. Start the management service:
   service mgmt-vmware start
5. Go to /tmp and check whether a folder named "vmware-root" exists; if it doesn't, create it (it is just an empty folder).
6. Copy the vpx-upgrade-esx-7 or 6****** script from the upgrade folder of the VirtualCenter installation to /tmp on the ESX server. (By default the script is located in C:\program files\vmware\vmware virtualcenter\upgrade.)
7. Run the script on the ESX host:
   sh vpx-upgrade-esx-7 or 6******
8. Restart the management service:
   /etc/init.d/mgmt-vmware start
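The steps above, consolidated into one sketch. The RPM and script names here are placeholders, not real names: the vpx-upgrade-esx-* script name depends on your VC build, and you should substitute whatever "rpm -qa | grep vpx" actually reports. DRYRUN=1 (the default) prints each command instead of executing it; run for real as root on the host.

```shell
#!/bin/sh
# Dry-run-guarded sketch of the vpx agent reinstall procedure.
# Set DRYRUN=0 to actually execute the commands.
DRYRUN=${DRYRUN:-1}
run() { if [ "$DRYRUN" = "1" ]; then echo "would run: $*"; else "$@"; fi; }

run sh -c 'rpm -qa | grep vpx'        # 1. find the installed vpx agent RPM
run service mgmt-vmware stop          # 2. stop the management service
run rpm -e VMware-vpxa                # 3. placeholder name; use the RPM found in step 1
run service mgmt-vmware start         # 4. start the management service
run mkdir -p /tmp/vmware-root         # 5. hostd/vpxa scratch directory must exist
run sh /tmp/vpx-upgrade-esx-SCRIPT    # 6. placeholder; copy the real script from VC first
run /etc/init.d/mgmt-vmware restart   # 7. final restart of the management service
```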
-Karunakar
We seem to have the same issue with an ESX 3.5 U2 box. Did you guys ever solve this?