Hello,
We have a host, part of a cluster, with production virtual machines on it, on which the hostd and vpxa services are not responding; they cannot be killed or restarted even when logged in as root on the host.
My question is: is there a clean method of migrating the VMs with minimal downtime, since they are used in a production environment? The VMs are stored on shared disks accessible from the other hosts in the cluster.
Regards,
--
Hi,
Currently the ESXi host is in an unresponsive state where the vpxa and hostd services are not responding. Because the host is not responding, it is difficult to migrate the VMs from the existing ESXi host to another one, since the vpxa agent is unable to communicate with the vCenter Server.
1) One way is to restart the management services on the ESXi host (the vpxa and hostd services). As you said, you don't want to go for that.
2) Another way is to reboot the ESXi host, in which case HA will restart all the VMs on the other available ESXi hosts (subject to the availability of resources on those hosts).
3) As a last option, you can log in to the ESXi host directly, power off all the VMs, unregister them from the existing host, and re-register them on another vCenter-managed ESXi host in the cluster (see the command sketch after this list).
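For reference, here is a rough command-level sketch of options 1 and 3 using the standard ESXi 6.x shell tools. The <vmid>, <datastore>, and <vm> values are placeholders, and note that vim-cmd talks to hostd, so if hostd is truly dead these commands may fail on the affected host as well:
# Option 1: restart the management agents individually
[root@host:~] /etc/init.d/hostd restart
[root@host:~] /etc/init.d/vpxa restart
# Option 3: power off and unregister each VM on the stuck host...
[root@host:~] vim-cmd vmsvc/getallvms
[root@host:~] vim-cmd vmsvc/power.off <vmid>
[root@host:~] vim-cmd vmsvc/unregister <vmid>
# ...then re-register it on a healthy host in the cluster:
[root@otherhost:~] vim-cmd solo/registervm /vmfs/volumes/<datastore>/<vm>/<vm>.vmx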
-Sachin
Thank you for the answer; I'll see tomorrow which option would be the most appropriate.
I'm curious to know what you have tried thus far to make the determination that they are impossible to rectify.
Currently the host is listed as disconnected in the vSphere interface, but the VMs are still running.
When connected to the host, I can see vpxa and hostd processes running. When I try to kill/restart those processes, I get the following:
[root@host:~] /etc/init.d/vpxa restart
watchdog-vpxa: Terminating watchdog process with PID 67719
sh: can't kill pid 67719: No such process
But:
[root@host:~] ps |grep vpxa
231007 67719 vpxa-worker
67719 67719 vpxa
67735 67719 vpxa-worker
The same thing happens for hostd:
[root@host:~] ps -s | grep hostd
67292 67292 hostd-worker WAIT LOCK 0-39
68101 67292 hostd-worker WAIT LOCK 0-39
68102 67292 hostd-worker WAIT LOCK 0-39
68113 67292 hostd-worker WAIT LOCK 0-39
68119 67292 hostd-worker WAIT LOCK 0-39
68122 67292 hostd-worker WAIT LOCK 0-39
68970 67292 hostd-worker WAIT LOCK 0-39
68971 67292 hostd-worker WAIT LOCK 0-39
212913 67292 hostd-worker WAIT LOCK 0-39
119831 67292 hostd-worker WAIT LOCK 0-39
1497266 67292 hostd-worker WAIT FS 0-39
1497269 67292 hostd-worker WAIT LOCK 0-39
1497273 67292 hostd-worker WAIT LOCK 0-39
2842190 2842190 hostd WAIT LOCK 0-39
2842192 2842190 hostd-worker WAIT UFUTEX 0-39
2842193 2842190 hostd-worker WAIT UPOL 0-39
2842194 2842190 hostd-worker WAIT UPOL 0-39
2842195 2842190 hostd-worker WAIT UFUTEX 0-39
2842197 2842190 hostd-worker WAIT UFUTEX 0-39
2843309 2843309 hostdCgiServer WAIT UFUTEX 0-39
1179146 67292 hostd-worker WAIT FS 0-39
And:
[root@host:~] kill -9 67292
sh: can't kill pid 67292: No such process
I think that the situation is not recoverable, but any advice would be welcome.
Regards
--
What version and build of ESXi are you running here?
VMware ESXi 6.5.0
That's the version; what is the build?
build-5310538
Try services.sh restart first and see if the watchdog brings them down.
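For reference, from the ESXi shell as root that would be:
[root@host:~] services.sh restart
This restarts all the management agents on the host (including hostd and vpxa) under their watchdogs, so it is a heavier-handed alternative to restarting the two services individually.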
I don't recommend services.sh restart, as vpxa is already showing 'No such process', so it is a zombie now. It requires a reboot. If there are existing storage issues or LACP configured, services.sh restart can lead to other, bigger problems.
Thanks,
MS
There are other options. You can call watchdog.sh -r hostd, and the same command with vpxa substituted should also be tried, as shown below.
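For example (a sketch only; if the watchdog has lost track of the PID, as the 'No such process' errors above suggest, these may fail the same way):
[root@host:~] watchdog.sh -r hostd
[root@host:~] watchdog.sh -r vpxa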
Hello,
thanks everyone for the advice. The situation was not recoverable, so we proceeded with a reboot.
--
IG