VMware Cloud Community
Zigoze
Contributor

Failure to restart VMware mgmt services

Hi,

When trying to run:

service mgmt-vmware restart

Stopping VMware ESX Server Management services:

VMware ESX Server Host Agent Watchdog

VMware ESX Server Host Agent

Starting VMware ESX Server Management services:

VMware ESX Server Host Agent (background)

Availability report startup (background)

The restart of the Server Host Agent fails on stopping. The host now appears disconnected in VirtualCenter, and I'm not able to reconnect; it says the host is not available on the network or not contactable.

The VMs are still working, and rebooting the ESX server is not an easy option. How can I solve this? Any ideas?

26 Replies
admin
Immortal

A reboot is probably best, but you could try killing the vmware-hostd process manually.

Use ps to find the process IDs.

ps aux | grep hostd

root 1579 0.0 0.3 4268 964 ? S Apr15 0:00 /bin/sh /usr/bin/vmware-watchdog -s hostd -u 60 -q 5 -c /usr/sbin/vmware-hostd-support /usr/sbin/vmware-hostd -u

root 1608 0.9 19.4 88548 52364 ? S Apr15 18:10 /usr/lib/vmware/hostd/vmware-hostd /etc/vmware/hostd/config.xml -u

kill -9 the watchdog first (in this case PID 1579), and then vmware-hostd. Do another ps to check they're gone, then try restarting mgmt-vmware.
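A minimal sketch of the lookup step, assuming the standard `ps aux` column layout (PID in column 2). The `find_pids` helper name is just illustrative, and the kill lines are left commented so nothing gets terminated by accident:

```shell
# List PIDs whose command line matches a pattern, excluding the grep itself.
find_pids() {
    ps aux | grep "$1" | grep -v grep | awk '{print $2}'
}

# Kill the watchdog before hostd, so it cannot respawn hostd in between:
# for pid in $(find_pids vmware-watchdog); do kill -9 "$pid"; done
# for pid in $(find_pids vmware-hostd);    do kill -9 "$pid"; done
```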

Alex




www.phdvirtual.com

Chamon
Commander

What we have had to do in the past is to kill the watchdog and hostd on the host and then restart them.

cd /var/run/vmware

and

cat vmware-hostd.PID

and

cat watchdog-hostd.PID

to get the PIDs of the services. Then run

kill -9 <PID>

for each of them. Be very careful that you type the correct PID, as you don't want to kill the wrong process!

once this is done

rm -f vmware-hostd.PID

rm -f watchdog-hostd.PID

Then restart the mgmt-vmware service.
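The sequence above can be sketched as one small helper. The directory and file names are the ones from this thread; the `reap_hostd` name and the dry-run guard are additions here, so you can eyeball the PIDs before actually killing anything:

```shell
# Read each PID from its file, kill it, then remove the stale .PID file.
reap_hostd() {   # usage: reap_hostd /var/run/vmware [no]  ("no" disables dry run)
    dir=$1; dry=${2:-yes}
    for f in vmware-hostd.PID watchdog-hostd.PID; do
        [ -f "$dir/$f" ] || continue
        pid=$(cat "$dir/$f")
        echo "found $f -> PID $pid"
        if [ "$dry" = no ]; then
            kill -9 "$pid"
            rm -f "$dir/$f"
        fi
    done
}

# On the host: reap_hostd /var/run/vmware no   # then: service mgmt-vmware restart
```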

All of our VMs continued to run, and the host came back into VC.

If it does not come back after restarting the service, you may need to remove the VirtualCenter agent and add the host back into VC. Stop the vpxa service and remove the package as follows.

Get the full package name by typing the command

rpm -qa | grep -i vmware-vpxa

This returns the agent package installed on the host. Then run the following to uninstall the package:

rpm -e VMware-vpxa-2.0.x.xxxxx

(where 2.0.x.xxxxx is the package version found in the previous step)
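A small sketch of the query step above, assuming `rpm -qa` prints one package per line. The `vpxa_pkg` helper name is just illustrative; the actual removal commands are shown commented since they only make sense on the host:

```shell
# Pick the vpxa package name out of `rpm -qa` output (read on stdin).
vpxa_pkg() {
    grep -i 'vmware-vpxa'
}

# On the host (not run here):
#   service vmware-vpxa stop
#   rpm -qa | vpxa_pkg | xargs rpm -e
```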

Chamon
Commander

Once you get the host back into VC, you should probably put it into maintenance mode, VMotion the VMs off, and reboot it as well.

Zigoze
Contributor

Hi,

Output of command:

ps aux | grep hostd

root 24309 0.0 0.0 3684 676 pts/0 S 18:33 0:00 grep hostd

So now I should kill this process with a command such as:

kill -9 24309

Is this correct? This won't affect my running VMs, right?

Zigoze
Contributor

But, I don't understand...

Result of command:

cat vmware-hostd.PID

2247[root@mpaesx4 vmware]#

Result of command:

cat watchdog-hostd.PID

cat: watchdog-hostd.PID: No such file or directory

So, to kill the process, should I use PID 2247 or PID 24309?

lamw
Community Manager

You should try a standard kill to see if that works and then if that still fails, then do it with -9 which is kill with prejudice.

kill 24309

then if that fails,

kill -9 24309

Also, FYI: hostd will probably have a parent (watchdog) process and a spawned child, so you should get back two processes.

Here is quick example:

[root@himalaya VMware]# ps -ef | grep hostd | grep -v grep
root      1511     1  0 Mar22 ?        00:00:00 /bin/sh /usr/bin/vmware-watchdog -s hostd -u 60 -q 5 -c /usr/sbin/vmware-hostd-support /usr/sbin/vmware-hostd -u
root      1521  1511  0 Mar22 ?        05:37:11 /usr/lib/vmware/hostd/vmware-hostd /etc/vmware/hostd/config.xml -u

Here you'll want to kill both the parent and the child, then restart the service, which will auto-respawn the child process.
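The plain-kill-then-`-9` escalation can be sketched generically; nothing here is VMware-specific, and the `soft_then_hard_kill` name is just illustrative:

```shell
# Try a graceful TERM first; escalate to SIGKILL only if the process survives.
soft_then_hard_kill() {
    pid=$1
    kill "$pid" 2>/dev/null
    sleep 1
    if kill -0 "$pid" 2>/dev/null; then   # still alive (or a zombie)?
        kill -9 "$pid" 2>/dev/null
    fi
}
```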

=========================================================================

William Lam

VMware vExpert 2009

VMware ESX/ESXi scripts and resources at:

Chamon
Commander

That is the PID of your grep command.

go to the /var/run/vmware directory and run

cat vmware-hostd.PID

and so on, as mentioned above. It will not kill your VMs. Once you kill the processes and delete the .PID files, restarting the mgmt-vmware service will recreate the .PID files for you.

Chamon
Commander

24309 is the PID for the grep command

the 2247 is the hostd PID.

(sorry )

We have had to do this several times unfortunately.

Message was edited by: Chamon

Zigoze
Contributor

When trying to kill process 2247 I get:

2247[root@mpaesx4 vmware]# kill 2247

-bash: kill: (2247) - No such process

Any other ideas?

Chamon
Commander

I would delete vmware-hostd.PID and watchdog-hostd.PID and then restart the service:

service mgmt-vmware restart

This will recreate the two .PID files, and the service should start up. Then, if the host still cannot be added back to VC, you would have to remove the vpxa agent and add the host back to VC.

Zigoze
Contributor

Removed vmware-hostd.PID. There is no watchdog-hostd.PID file to remove!

Restarting the mgmt service ended up with the same result when trying to stop the ESX Server Host Agent...

FAILED...

Chamon
Commander

Did it start? The stop would fail if the service was not running in the first place. If the services started back up, what do you get when you run:

service mgmt-vmware status

If it is up, try restarting it again. What do you have in /var/log/messages?

When we had this problem before, we had some "disk appears confused" events. Ours was a bad CD-ROM drive; we had to remove it to get the host working again.

Chamon
Commander

Any luck?

Zigoze
Contributor

Same problem:

# service mgmt-vmware restart

Stopping VMware ESX Server Management services:

VMware ESX Server Host Agent Watchdog

VMware ESX Server Host Agent

Starting VMware ESX Server Management services:

VMware ESX Server Host Agent (background)

Availability report startup (background)

The last entries in messages are:

Apr 16 19:09:31 mpaesx4 VMware[init]: Begin '/usr/sbin/vmware-hostd -u', min-uptime = 60, max-quick-failures = 5, max$

Apr 16 19:09:32 mpaesx4 watchdog-hostd: '/usr/sbin/vmware-hostd -u' exited after 1 seconds (quick failure 1)

Apr 16 19:09:32 mpaesx4 watchdog-hostd: Executing cleanup command '/usr/sbin/vmware-hostd-support'

Apr 16 19:09:32 mpaesx4 VMware[init]: /usr/bin/vmware-watchdog: line 192: 25118 Segmentation fault (core dumped) setsid $

Apr 16 19:09:32 mpaesx4 watchdog-hostd: Executing '/usr/sbin/vmware-hostd -u'

Apr 16 19:09:33 mpaesx4 VMware[init]: /usr/bin/vmware-watchdog: line 192: 25218 Segmentation fault (core dumped) setsid $

Apr 16 19:09:33 mpaesx4 watchdog-hostd: '/usr/sbin/vmware-hostd -u' exited after 1 seconds (quick failure 2)

Apr 16 19:09:33 mpaesx4 watchdog-hostd: Executing cleanup command '/usr/sbin/vmware-hostd-support'

Apr 16 19:09:33 mpaesx4 watchdog-hostd: Executing '/usr/sbin/vmware-hostd -u'

Apr 16 19:09:33 mpaesx4 VMware[init]: /usr/bin/vmware-watchdog: line 192: 25288 Segmentation fault (core dumped) setsid $

Apr 16 19:09:33 mpaesx4 watchdog-hostd: '/usr/sbin/vmware-hostd -u' exited after 0 seconds (quick failure 3)

Apr 16 19:09:33 mpaesx4 watchdog-hostd: Executing cleanup command '/usr/sbin/vmware-hostd-support'

Apr 16 19:09:34 mpaesx4 watchdog-hostd: Executing '/usr/sbin/vmware-hostd -u'

Apr 16 19:09:34 mpaesx4 VMware[init]: /usr/bin/vmware-watchdog: line 192: 25358 Segmentation fault (core dumped) setsid $

Apr 16 19:09:34 mpaesx4 watchdog-hostd: '/usr/sbin/vmware-hostd -u' exited after 0 seconds (quick failure 4)

Apr 16 19:09:34 mpaesx4 watchdog-hostd: Executing cleanup command '/usr/sbin/vmware-hostd-support'

Apr 16 19:09:34 mpaesx4 watchdog-hostd: Executing '/usr/sbin/vmware-hostd -u'

Apr 16 19:09:34 mpaesx4 VMware[init]: /usr/bin/vmware-watchdog: line 192: 25428 Segmentation fault (core dumped) setsid $

Apr 16 19:09:34 mpaesx4 watchdog-hostd: '/usr/sbin/vmware-hostd -u' exited after 0 seconds (quick failure 5)

Apr 16 19:09:34 mpaesx4 watchdog-hostd: Executing cleanup command '/usr/sbin/vmware-hostd-support'

Apr 16 19:09:35 mpaesx4 watchdog-hostd: Executing '/usr/sbin/vmware-hostd -u'

Apr 16 19:09:35 mpaesx4 VMware[init]: /usr/bin/vmware-watchdog: line 192: 25498 Segmentation fault (core dumped) setsid $

Apr 16 19:09:35 mpaesx4 watchdog-hostd: '/usr/sbin/vmware-hostd -u' exited after 0 seconds (quick failure 6)

Apr 16 19:09:35 mpaesx4 watchdog-hostd: Executing cleanup command '/usr/sbin/vmware-hostd-support'

Apr 16 19:09:35 mpaesx4 watchdog-hostd: End '/usr/sbin/vmware-hostd -u', failure limit reached

Apr 16 19:24:42 mpaesx4 watchdog-hostd: PID file /var/run/vmware/watchdog-hostd.PID not found

Apr 16 19:24:42 mpaesx4 watchdog-hostd: Unable to terminate watchdog: Can't find process

Apr 16 19:24:42 mpaesx4 watchdog-hostd: PID file /var/run/vmware/watchdog-hostd.PID not found

Apr 16 19:24:42 mpaesx4 watchdog-hostd: Begin '/usr/sbin/vmware-hostd -u', min-uptime = 60, max-quick-failures = 5, m$

Apr 16 19:24:42 mpaesx4 watchdog-hostd: Executing '/usr/sbin/vmware-hostd -u'

Apr 16 19:24:42 mpaesx4 VMware[init]: Begin '/usr/sbin/vmware-hostd -u', min-uptime = 60, max-quick-failures = 5, max$

Apr 16 19:24:42 mpaesx4 VMware[init]: /usr/bin/vmware-watchdog: line 192: 25625 Segmentation fault (core dumped) setsid $

Apr 16 19:24:42 mpaesx4 watchdog-hostd: '/usr/sbin/vmware-hostd -u' exited after 0 seconds (quick failure 1)

Apr 16 19:24:42 mpaesx4 watchdog-hostd: Executing cleanup command '/usr/sbin/vmware-hostd-support'

Apr 16 19:24:43 mpaesx4 watchdog-hostd: Executing '/usr/sbin/vmware-hostd -u'

Apr 16 19:24:43 mpaesx4 watchdog-hostd: '/usr/sbin/vmware-hostd -u' exited after 0 seconds (quick failure 2)

Apr 16 19:24:43 mpaesx4 watchdog-hostd: Executing cleanup command '/usr/sbin/vmware-hostd-support'

Apr 16 19:24:43 mpaesx4 VMware[init]: /usr/bin/vmware-watchdog: line 192: 25725 Segmentation fault (core dumped) setsid $

Apr 16 19:24:43 mpaesx4 watchdog-hostd: Executing '/usr/sbin/vmware-hostd -u'

Apr 16 19:24:43 mpaesx4 VMware[init]: /usr/bin/vmware-watchdog: line 192: 25800 Segmentation fault (core dumped) setsid $

I still cannot connect to the ESX host through VirtualCenter.

Chamon
Commander

What do you have in /var/log/vmware/hostd.log?

To get it back into VC, you will probably need to remove vpxa and then add it back into VC. Have you tried removing the host from VirtualCenter first and then adding it back? Or does it still show disconnected?

Zigoze
Contributor

hostd.log shows this:

Config target info loaded

Hw info file: /etc/vmware/hostd/hwInfo.xml

Config target info loaded

GetPropertyProvider failed for haTask-ha-root-pool-vim.Resour$

GetPropertyProvider failed for haTask-ha-root-pool-vim.Resour$

Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51526

Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-515$

Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51527

Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-515$

GetPropertyProvider failed for haTask-ha-root-pool-vim.Resourc$

GetPropertyProvider failed for haTask-ha-root-pool-vim.Resourc$

Hw info file: /etc/vmware/hostd/hwInfo.xml

Config target info loaded

Hw info file: /etc/vmware/hostd/hwInfo.xml

Config target info loaded

GetPropertyProvider failed for haTask-ha-root-pool-vim.Resourc$

GetPropertyProvider failed for haTask-ha-root-pool-vim.Resourc$

Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51567

Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51567

Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51568

Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51568

Hw info file: /etc/vmware/hostd/hwInfo.xml

Config target info loaded

Hw info file: /etc/vmware/hostd/hwInfo.xml

Config target info loaded

Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51595

Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-515$

Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51596

Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-515$

GetPropertyProvider failed for haTask-ha-root-pool-vim.Resour$

GetPropertyProvider failed for haTask-ha-root-pool-vim.Resour$

Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51606

Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51606

Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51607

[2009-04-16 17:03:29.498 'TaskManager' 21154736 info] Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51607

Hw info file: /etc/vmware/hostd/hwInfo.xml

Config target info loaded

Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51621

Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51$

Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51622

Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51$

Hw info file: /etc/vmware/hostd/hwInfo.xml

Config target info loaded

Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51633

Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51633

Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51634

Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-516$

GetPropertyProvider failed for haTask-ha-root-pool-vim.Resourc$

GetPropertyProvider failed for haTask-ha-root-pool-vim.Resourc$

GetPropertyProvider failed for haTask-ha-root-pool-vim.Resourc$

GetPropertyProvider failed for haTask-ha-root-pool-vim.Resourc$

Hw info file: /etc/vmware/hostd/hwInfo.xml

Config target info loaded

Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51651

Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-516$

Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51652

Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-516$

Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51660

Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-516$

Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-51661

Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-516$

GetPropertyProvider failed for haTask-ha-root-pool-vim.Resour$

GetPropertyProvider failed for haTask-ha-root-pool-vim.Resour$

Hw info file: /etc/vmware/hostd/hwInfo.xml

Config target info loaded

I did not try to remove vpxa or remove the host from VirtualCenter, because I still haven't been able to get the mgmt services to restart properly, so I believe removing the host from VirtualCenter wouldn't help. Also, I'm still not sure how to stop the vpxa service and remove the agent from the host.

But at this point I'm in exactly the same situation: when I try to restart the services, I get a failure on the ESX Server Host Agent. It says the stop fails but the start is OK (as I already mentioned). Yet although it says it started OK, VirtualCenter isn't able to contact the ESX host and gives a message that the management services are not responding.

Chamon
Commander

Your VMs are running, correct? I will see what I can find tonight. Did anything change just before you started having this issue? Why were you restarting the mgmt service?


Zigoze
Contributor

Yes, the VMs are running correctly.

Thanks so much for your help.

Nothing changed before the issue. We only had a problem with a VM (which is on another node of the cluster) that failed its VCB backup, stating that there was another operation pending. We had solved this issue before with a simple restart of the VMware mgmt services, which is why we were restarting the services on the hosts. All of the other nodes worked fine, but this one failed and the host became disconnected. Now I cannot get it to reconnect, because the mgmt services apparently are not restarting correctly on this host.

Bithound
Contributor

You shouldn't need to delete hostd; I don't think I ever have, certainly not recently. I'd look at other things first. What's your build? ("vmware -v"). Have you got anything in /var/core? I'd also suggest trying "df -h": I've seen hostd get mucked up if root or /var fills up. Anything in /var/core or /var/crash can probably go. What's your service console memory set to? Believe me, I wish I could shout this from the heavens: if you're still running the default 272 MB, increase it to 800! (Now that I think about it, it might also be worth checking "mount". I saw a service console the other day with a completely read-only filesystem; that one needed a reboot for sure.)
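The checks suggested above, as a quick sketch; the paths are the usual ESX 3.x service-console locations, so adjust as needed:

```shell
df -h /var                                # is root or /var filling up?
ls /var/core 2>/dev/null || true          # hostd core dumps piling up?
awk '$2 == "/" {print $4}' /proc/mounts   # options starting with "ro" mean a read-only root
```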
