The host is showing as disconnected in vCenter. I can't reconnect it because the attempt times out, and I can't connect directly to the host with the VI Client (VIC) because that also times out. I can SSH to the host, and I have tried restarting the following services:
mgmt-vmware
vmware-vpxa
Both restart OK, but I still can't connect to the host. The VMs on the host are still running, but I can't migrate them off because the host is disconnected.
I don't really want to cold boot the server.
Any ideas?
Try:
service sfcbd-watchdog stop
service wsman stop
service slpd stop
After that, watch the output of the "top" command until the first load average figure drops to something near 1.00.
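As a rough sketch of that check, the helpers below pull the 1-minute load average out of a "top" or "uptime" header line and compare it to a threshold. The function names and the 1.50 cut-off are illustrative assumptions, not part of any VMware tooling.

```shell
# Print the 1-minute load average from a line that ends in
# "load average: a, b, c" (first line of "top", or "uptime" output).
load1() {
    sed -n 's/.*load average: *\([0-9.]*\),.*/\1/p'
}

# Exit 0 when the load read from stdin is at or below the threshold.
# The default of 1.50 is an arbitrary reading of "near 1.00".
load_is_low() {
    threshold=${1:-1.50}
    load1 | awk -v t="$threshold" '
        NR == 1 { ok = ($1 + 0 <= t + 0) }
        END     { exit (ok ? 0 : 1) }'
}
```

For example: uptime | load_is_low 1.50 && echo "load has settled"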
Marcelo Soares
VMWare Certified Professional 310/410
Virtualization Tech Master
Globant Argentina
Consider awarding points for "helpful" and/or "correct" answers.
I've stopped the 3 services.
Here is the output from "top":
top - 12:40:26 up 22 days, 21:13, 1 user, load average: 3.00, 3.00, 3.00
Tasks: 83 total, 2 running, 81 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 96.9%id, 0.0%wa, 0.0%hi, 3.1%si, 0.0%st
Mem: 802296k total, 284744k used, 517552k free, 18220k buffers
Swap: 1638620k total, 108k used, 1638512k free, 173568k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6824 root 16 0 88128 39m 25m D 0.0 5.0 0:16.20 vmware-hostd
19795 root 18 0 72004 27m 23m S 0.0 3.6 0:00.31 vpxa
32282 root 25 0 137m 21m 2696 S 0.0 2.7 0:00.89 vmware-vimdump
2769 ntp 15 0 19148 4840 3748 S 0.0 0.6 0:02.22 ntpd
4099 root 15 0 88032 3324 2584 R 0.0 0.4 0:01.30 sshd
20407 root 15 0 163m 2256 1248 S 0.0 0.3 1:01.65 ftbackbone
20390 root 15 0 163m 1748 1268 S 0.0 0.2 0:01.14 ftbb
4105 root 15 0 63560 1504 1184 S 0.0 0.2 3:38.61 bash
2869 root 25 0 63392 1388 1136 S 0.0 0.2 0:00.05 vmware-watchdog
2810 root 25 0 63392 1384 1136 S 0.0 0.2 0:00.04 vmware-watchdog
2849 root 25 0 63392 1384 1136 S 0.0 0.2 0:00.04 vmware-watchdog
19786 root 23 0 63424 1300 1056 S 0.0 0.2 0:00.03 vmware-watchdog
4944 root 5 -10 3152 1240 868 S 0.0 0.2 1:06.70 vmkload_app
22342 root 5 -10 3152 1240 868 S 0.0 0.2 2:11.78 vmkload_app
30205 root 5 -10 3152 1240 868 S 0.0 0.2 1:04.06 vmkload_app
13248 root 5 -10 3152 1236 868 S 0.0 0.2 2:10.10 vmkload_app
13252 root 5 -10 3152 1236 868 S 0.0 0.2 1:34.17 vmkload_app
13254 root 6 -10 3152 1236 868 S 0.0 0.2 1:42.85 vmkload_app
15489 root 5 -10 3152 1236 868 D 0.0 0.2 3:45.64 vmkload_app
20368 root 5 -10 3152 1236 868 S 0.0 0.2 1:39.86 vmkload_app
21784 root 5 -10 3152 1236 868 S 0.0 0.2 1:53.32 vmkload_app
13415 root 5 -10 3152 1232 868 S 0.0 0.2 0:32.75 vmkload_app
30019 root 5 -10 3152 1232 868 S 0.0 0.2 1:52.73 vmkload_app
3037 root 5 -10 3152 1228 868 S 0.0 0.2 0:09.46 vmkload_app
13250 root 5 -10 3152 1228 868 S 0.0 0.2 1:33.76 vmkload_app
13920 root 6 -10 3152 1228 868 S 0.0 0.2 0:16.46 vmkload_app
16095 root 5 -10 3152 1228 868 S 0.0 0.2 1:37.84 vmkload_app
17013 root 5 -10 3152 1228 868 D 0.0 0.2 50:17.64 vmkload_app
17296 root 5 -10 3152 1228 868 S 0.0 0.2 1:30.31 vmkload_app
17763 root 5 -10 3152 1228 868 S 0.0 0.2 1:44.01 vmkload_app
23700 root 5 -10 3152 1228 868 S 0.0 0.2 1:32.56 vmkload_app
2733 root 15 0 60488 1208 664 S 0.0 0.2 2:35.43 sshd
2818 root 11 -10 3148 1196 864 S 0.0 0.1 0:00.06 vmkload_app
2877 root 15 -10 3148 1196 856 S 0.0 0.1 0:00.02 vmkload_app
2915 root 15 0 72312 1152 576 S 0.0 0.1 21:04.18 crond
14925 root 15 0 12604 1048 808 R 0.0 0.1 0:00.05 top
24890 root 14 -10 3148 1048 704 S 0.0 0.1 0:00.07 vmkload_app
24746 root 17 0 21640 884 672 S 0.0 0.1 0:00.04 xinetd
Still can't connect to the host.
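One thing worth noting in the output above: the CPU is almost idle, yet the load average sits at exactly 3.00, and three tasks are in state D (6824 vmware-hostd, 15489 and 17013 vmkload_app). On Linux the load average counts tasks in uninterruptible sleep as well as runnable ones, so this may point at stuck I/O rather than CPU load. A small sketch for listing such tasks follows; the function name is made up for illustration, and it assumes the column order shown in the header above (S is the 8th field, COMMAND the 12th).

```shell
# Read "top" batch output on stdin and print "PID COMMAND" for every
# task whose state column (field 8) is D, i.e. uninterruptible sleep.
d_state_tasks() {
    awk '$1 ~ /^[0-9]+$/ && $8 == "D" { print $1, $12 }'
}
```

For example: top -b -n 1 | d_state_tasks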
Run "service mgmt-vmware stop" and make sure vmware-hostd has disappeared from the list you are seeing. If it has not, run "kill PID" on the vmware-hostd process until it goes down (you can repeat the kill or try "kill -9 PID").
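The "kill, then kill -9" escalation can be sketched as a small helper. The name stop_pid and the 10-second default are assumptions for illustration. Bear in mind that a process in state D cannot act on any signal until its pending I/O returns, so even kill -9 may appear to do nothing for a while.

```shell
# Send TERM to a PID, wait up to N seconds for it to exit, then fall
# back to KILL. Returns 0 if the PID is gone afterwards, 1 if not.
stop_pid() {
    pid=$1
    tries=${2:-10}
    kill "$pid" 2>/dev/null
    while [ "$tries" -gt 0 ] && kill -0 "$pid" 2>/dev/null; do
        sleep 1
        tries=$((tries - 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        # Still present after the grace period: escalate to SIGKILL.
        kill -9 "$pid" 2>/dev/null
        sleep 1
    fi
    ! kill -0 "$pid" 2>/dev/null
}
```

For example, after "service mgmt-vmware stop": stop_pid 6824 10 || echo "vmware-hostd is still there" (6824 is the PID from the "top" output in this thread; use the current one).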
Marcelo Soares
Hi,
Please check the output of "df -h": is there enough free space available?
Check "service mgmt-vmware status" and note the PID, and also whether the PID changes each time you restart the management services.
If everything is OK, we will move on to higher-level troubleshooting. It may be due to APD (All Paths Down), which can also cause this problem.
What is in hostd.log?
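A quick way to spot a full filesystem, as a sketch only: the helper below reads "df -P" style output and prints any mount point at or above a given usage. The function name and the 90% default are arbitrary choices for illustration.

```shell
# Read "df -P" output on stdin and print "mountpoint use%" for every
# filesystem whose usage is at or above the limit (default 90).
full_filesystems() {
    limit=${1:-90}
    awk -v limit="$limit" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, $5
    }'
}
```

For example: df -P | full_filesystems 90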
Regards,
kishan
Still no joy, I'm afraid. I have stopped the mgmt-vmware service and killed the PID of vmware-hostd, but I still can't connect to the host.
OK, did the load average drop after killing it? If yes, try starting it again and check whether it stays stable. After that, try connecting to the ESX host.
The APD or disk access issue is something you will need to check if this does not resolve the problem. Maybe schedule a host reboot...
Marcelo Soares
The load did drop, but after restarting the agents I still can't connect. There is plenty of free space, so that's not the issue.
Looks like I'll have to schedule a reboot for 5pm today.
Before the reboot, can you check "esxcfg-mpath -l" for any dead paths and do a rescan? Also run this command: esxcfg-advcfg -s 1 /VMFS3/FailVolumeOpenIfAPD
This workaround is available only in Update 1. It changes what the vmkernel does when it detects the APD state for a storage device: it immediately fails any attempt to open a datastore volume on a device whose state is APD.
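For the dead-path check, a minimal sketch: count the lines in the "esxcfg-mpath -l" output that mention a dead path. This assumes dead paths are marked with the word "dead" in some form; the exact layout of that output differs between ESX releases, so compare it with what your host actually prints. The function name is made up for illustration.

```shell
# Count lines on stdin that mention "dead" in any letter case.
# ASSUMPTION: esxcfg-mpath marks dead paths with that word.
# Prints 0 (and still exits 0) when nothing matches.
count_dead_paths() {
    grep -ci 'dead' || true
}
```

For example: esxcfg-mpath -l | count_dead_paths. The current value of the advanced setting can be read with "esxcfg-advcfg -g /VMFS3/FailVolumeOpenIfAPD" before changing it with "-s 1".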