Unable to connect to ESX 4 update 1 host

SandyB · ‎03-08-2010

The host is showing as disconnected in vCenter, and I can't reconnect because it times out. I can't connect directly to the host with the VI Client (VIC) either, as that also times out. I can SSH to the host, and I have tried restarting the following services:

mgmt-vmware

vmware-vpxa

Both restart OK, but I still can't connect to the host. The VMs on the host are still running, but I can't migrate them off because the host is disconnected.

I don't really want to cold boot the server.

Any ideas?

marcelo_soares · ‎03-08-2010

Try:

service sfcbd-watchdog stop

service wsman stop

service slpd stop

After that, watch the output of the "top" command until the first load average figure drops to somewhere near 1.00.
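The suggestion above could be scripted roughly as follows. This is only a sketch: the function names, the polling interval, and the number of tries are illustrative assumptions, and the load average is read from /proc/loadavg rather than from "top".

```shell
#!/bin/sh
# Sketch only: stop the three services named above, then wait for the
# 1-minute load average to settle. Timings and names are assumptions.

# Print the 1-minute load average (first field of /proc/loadavg by default).
load1() {
    cut -d' ' -f1 "${1:-/proc/loadavg}"
}

# Succeed when load $1 is at or below threshold $2 (awk does the float compare).
load_at_most() {
    awk -v l="$1" -v t="$2" 'BEGIN { exit !(l + 0 <= t + 0) }'
}

stop_services() {
    for svc in sfcbd-watchdog wsman slpd; do
        service "$svc" stop
    done
}

# Poll every $2 seconds, up to $3 tries, until the load is at or below $1.
wait_for_load() {
    limit=$1; interval=${2:-10}; tries=${3:-30}
    i=0
    while [ "$i" -lt "$tries" ]; do
        cur=$(load1)
        echo "load average (1 min): $cur"
        load_at_most "$cur" "$limit" && return 0
        sleep "$interval"
        i=$((i + 1))
    done
    return 1
}
```

Something like "stop_services; wait_for_load 1.5" would then run the whole sequence; the 1.5 threshold is an arbitrary reading of "near 1.00".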

Marcelo Soares

VMWare Certified Professional 310/410

Virtualization Tech Master

Globant Argentina

Consider awarding points for "helpful" and/or "correct" answers.


SandyB · ‎03-08-2010

I've stopped the three services.

Here is the output from "top":

top - 12:40:26 up 22 days, 21:13,  1 user,  load average: 3.00, 3.00, 3.00
Tasks:  83 total,   2 running,  81 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 96.9%id,  0.0%wa,  0.0%hi,  3.1%si,  0.0%st
Mem:    802296k total,   284744k used,   517552k free,    18220k buffers
Swap:  1638620k total,      108k used,  1638512k free,   173568k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6824 root      16   0 88128  39m  25m D  0.0  5.0   0:16.20 vmware-hostd
19795 root      18   0 72004  27m  23m S  0.0  3.6   0:00.31 vpxa
32282 root      25   0  137m  21m 2696 S  0.0  2.7   0:00.89 vmware-vimdump
 2769 ntp       15   0 19148 4840 3748 S  0.0  0.6   0:02.22 ntpd
 4099 root      15   0 88032 3324 2584 R  0.0  0.4   0:01.30 sshd
20407 root      15   0  163m 2256 1248 S  0.0  0.3   1:01.65 ftbackbone
20390 root      15   0  163m 1748 1268 S  0.0  0.2   0:01.14 ftbb
 4105 root      15   0 63560 1504 1184 S  0.0  0.2   3:38.61 bash
 2869 root      25   0 63392 1388 1136 S  0.0  0.2   0:00.05 vmware-watchdog
 2810 root      25   0 63392 1384 1136 S  0.0  0.2   0:00.04 vmware-watchdog
 2849 root      25   0 63392 1384 1136 S  0.0  0.2   0:00.04 vmware-watchdog
19786 root      23   0 63424 1300 1056 S  0.0  0.2   0:00.03 vmware-watchdog
 4944 root       5 -10  3152 1240  868 S  0.0  0.2   1:06.70 vmkload_app
22342 root       5 -10  3152 1240  868 S  0.0  0.2   2:11.78 vmkload_app
30205 root       5 -10  3152 1240  868 S  0.0  0.2   1:04.06 vmkload_app
13248 root       5 -10  3152 1236  868 S  0.0  0.2   2:10.10 vmkload_app
13252 root       5 -10  3152 1236  868 S  0.0  0.2   1:34.17 vmkload_app
13254 root       6 -10  3152 1236  868 S  0.0  0.2   1:42.85 vmkload_app
15489 root       5 -10  3152 1236  868 D  0.0  0.2   3:45.64 vmkload_app
20368 root       5 -10  3152 1236  868 S  0.0  0.2   1:39.86 vmkload_app
21784 root       5 -10  3152 1236  868 S  0.0  0.2   1:53.32 vmkload_app
13415 root       5 -10  3152 1232  868 S  0.0  0.2   0:32.75 vmkload_app
30019 root       5 -10  3152 1232  868 S  0.0  0.2   1:52.73 vmkload_app
 3037 root       5 -10  3152 1228  868 S  0.0  0.2   0:09.46 vmkload_app
13250 root       5 -10  3152 1228  868 S  0.0  0.2   1:33.76 vmkload_app
13920 root       6 -10  3152 1228  868 S  0.0  0.2   0:16.46 vmkload_app
16095 root       5 -10  3152 1228  868 S  0.0  0.2   1:37.84 vmkload_app
17013 root       5 -10  3152 1228  868 D  0.0  0.2  50:17.64 vmkload_app
17296 root       5 -10  3152 1228  868 S  0.0  0.2   1:30.31 vmkload_app
17763 root       5 -10  3152 1228  868 S  0.0  0.2   1:44.01 vmkload_app
23700 root       5 -10  3152 1228  868 S  0.0  0.2   1:32.56 vmkload_app
 2733 root      15   0 60488 1208  664 S  0.0  0.2   2:35.43 sshd
 2818 root      11 -10  3148 1196  864 S  0.0  0.1   0:00.06 vmkload_app
 2877 root      15 -10  3148 1196  856 S  0.0  0.1   0:00.02 vmkload_app
 2915 root      15   0 72312 1152  576 S  0.0  0.1  21:04.18 crond
14925 root      15   0 12604 1048  808 R  0.0  0.1   0:00.05 top
24890 root      14 -10  3148 1048  704 S  0.0  0.1   0:00.07 vmkload_app
24746 root      17   0 21640  884  672 S  0.0  0.1   0:00.04 xinetd

Still can't connect to the host.
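On Linux, the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, so a flat 3.00 on an idle CPU may simply reflect blocked processes; the listing above shows three entries in state D (vmware-hostd and two vmkload_app). A small filter along these lines could pull them out. The helper name is hypothetical, and it assumes the procps "ps" on the service console accepts the -eo format option.

```shell
# Print PID and command of processes in uninterruptible sleep (state D).
# Reads "PID STAT COMMAND" rows on stdin, e.g. from: ps -eo pid,stat,comm
list_dstate() {
    awk '$2 ~ /^D/ { print $1, $3 }'
}
```

Typical use would be "ps -eo pid,stat,comm | list_dstate". Processes stuck in state D are usually waiting on I/O, which is one reason the storage checks suggested later in the thread are worth doing.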

marcelo_soares · ‎03-08-2010

Run "service mgmt-vmware stop" and make sure vmware-hostd disappears from the list you are seeing. If it doesn't, run "kill PID" on the vmware-hostd process until it goes down (you can repeat the kill or try "kill -9 PID").
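As a rough sketch of that sequence (not an official procedure), the stop, check, and kill steps could be wrapped like this. The function names and the 10-second grace period are assumptions, and "ps -C" is assumed to be available on the service console.

```shell
# Send TERM to PID $1, wait up to $2 seconds, then fall back to KILL.
# Returns 0 once the process is gone, 1 if it survives even KILL
# (possible when a process is stuck in uninterruptible sleep).
kill_escalate() {
    pid=$1; wait_s=${2:-10}
    kill "$pid" 2>/dev/null || true
    i=0
    while [ "$i" -lt "$wait_s" ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    kill -9 "$pid" 2>/dev/null || true
    sleep 1
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    return 0
}

# Stop the management service, then clean up any leftover vmware-hostd.
stop_hostd() {
    service mgmt-vmware stop
    for pid in $(ps -C vmware-hostd -o pid=); do
        kill_escalate "$pid" || echo "vmware-hostd ($pid) did not exit" >&2
    done
}
```

Calling "stop_hostd" runs the whole thing. If a process still refuses to exit after "kill -9", it is most likely blocked in the kernel, and no further signal will help.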


kish09 · ‎03-08-2010

Hi,

Please check the output of "df -h": is there enough free space available?

Check "service mgmt-vmware status" and note the PID. Does the PID change each time you restart the management services?

If everything is OK, we will move on to higher-level troubleshooting. It may be due to APD (all paths down), which can also cause this problem.

What is in hostd.log?
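The checks above could be gathered into one pass, sketched below. The helper names and the 90% threshold are illustrative, and /var/log/vmware/hostd.log is an assumed location for the log, so adjust it if the file lives elsewhere on your host.

```shell
# Print mount points whose usage is at or above $1 percent.
# Reads `df -P` output on stdin (POSIX format keeps one line per filesystem).
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit) print $6, $5
    }'
}

host_checks() {
    df -P | full_filesystems 90             # any nearly full filesystems?
    service mgmt-vmware status              # note the PID it reports
    tail -n 50 /var/log/vmware/hostd.log    # assumed log location
}
```

Running "host_checks" once before and once after a restart of mgmt-vmware makes it easy to compare the reported PIDs.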

Regards,

kishan

SandyB · ‎03-08-2010

Still no joy, I'm afraid. I have stopped the mgmt-vmware service and killed the PID of vmware-hostd, but I still can't connect to the host.

marcelo_soares · ‎03-08-2010

OK, did the load average drop after killing it? If yes, try starting it again and check whether it stays stable. After that, try connecting to the ESX host.

The APD or disk access issue is something you will need to check if this does not resolve the problem. Maybe schedule a host reboot...


SandyB · ‎03-08-2010

The load did drop, but after restarting the agents I still can't connect. There is plenty of free space, so that's not the issue.

Looks like I'll have to schedule a reboot for 5pm today.

kish09 · ‎03-08-2010

Before the reboot, can you check "esxcfg-mpath -l" for any dead paths and do a rescan? Also run this command: esxcfg-advcfg -s 1 /VMFS3/FailVolumeOpenIfAPD

This workaround is available only in Update 1. It changes what the vmkernel does when it detects the APD state for a storage device: it immediately fails to open a datastore volume if the device’s state is APD.
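A hedged sketch of those steps follows. It assumes that each path block printed by "esxcfg-mpath -l" contains a line such as "State: dead", and the function names and the adapter argument (for example vmhba1) are placeholders rather than values from this thread.

```shell
# Count paths reported as dead. Reads `esxcfg-mpath -l` output on stdin and
# assumes each path block contains a line such as "State: dead".
count_dead_paths() {
    grep -c 'State: dead'
}

# $1 is the storage adapter to rescan (placeholder, e.g. vmhba1).
apd_workaround() {
    dead=$(esxcfg-mpath -l | count_dead_paths)
    echo "dead paths: $dead"
    esxcfg-rescan "$1"
    esxcfg-advcfg -s 1 /VMFS3/FailVolumeOpenIfAPD   # command quoted in the post above
    esxcfg-advcfg -g /VMFS3/FailVolumeOpenIfAPD     # read the value back
}
```

For example, "apd_workaround vmhba1" would report the dead path count, rescan that adapter, and then apply and confirm the setting.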
