VMware Cloud Community
iganchev
Contributor

Migration of a virtual machine when hostd and vpxa are not responding

Hello,

We have a host, part of a cluster, with production virtual machines on it. The hostd and vpxa services on this host are not responding, and they cannot be killed or restarted even when logged in as root on the host.

My question is: is there a clean method of migrating the VMs with minimal downtime, as they are used in a production environment? The VMs are stored on shared disks accessible from the other hosts in the cluster.

Regards,

--

12 Replies
bhards4
Hot Shot

Hi,

Currently the ESXi host is in an unresponsive state because the vpxa and hostd services are not responding. Since the host is not responding, it is difficult to migrate the VMs from the existing ESXi host to another ESXi host, as the vpxa agent is unable to communicate with vCenter Server.

1) One way is to restart the management services of the ESXi host (the vpxa and hostd services). As you said, you don't want to go for that.

2) Another way is to reboot the ESXi host, in which case HA will restart all the VMs on other available ESXi hosts (subject to the availability of resources on those hosts).

3) As a last option, you can log in to the ESXi host directly, power off all the VMs, unregister them from the existing ESXi host, and re-register them on another vCenter-managed ESXi host in the cluster (see the sketch below).
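
For option 3, a minimal sketch of the per-VM commands, assuming hostd is at least partially responsive on the affected host (vim-cmd talks to hostd, so if it is completely dead you may have to shut the guests down from inside the guest OS instead). The VM IDs and datastore/VM names below are placeholders:

[root@host:~] vim-cmd vmsvc/getallvms                    # list registered VMs and their IDs
[root@host:~] vim-cmd vmsvc/power.shutdown <vmid>        # graceful shutdown (needs VMware Tools); use power.off if it hangs
[root@host:~] vim-cmd vmsvc/unregister <vmid>            # remove the VM from this host's inventory
[root@otherhost:~] vim-cmd solo/registervm /vmfs/volumes/<datastore>/<vmname>/<vmname>.vmx   # re-register on a healthy host

After registering, the VM can be powered on from vCenter or with vim-cmd vmsvc/power.on <vmid> on the new host.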

-Sachin

iganchev
Contributor

Thank you for the answer. I will see tomorrow which option would be the most appropriate.

daphnissov
Immortal

I'm curious to know what you have tried thus far to make the determination that they are impossible to rectify.

iganchev
Contributor

Currently the host is listed as disconnected in the vSphere interface, but the VMs are still running.

When connected to the host, I can see the vpxa and hostd processes running. When trying to kill/restart those processes, I get the following:

[root@host:~] /etc/init.d/vpxa restart

watchdog-vpxa: Terminating watchdog process with PID 67719

sh: can't kill pid 67719: No such process

But:

[root@host:~] ps |grep vpxa

231007   67719  vpxa-worker                                     

67719    67719  vpxa                                            

67735    67719  vpxa-worker

Same thing for hostd.

[root@host:~] ps -s | grep hostd

67292    67292  hostd-worker                                       WAIT    LOCK    0-39

68101    67292  hostd-worker                                       WAIT    LOCK    0-39

68102    67292  hostd-worker                                       WAIT    LOCK    0-39

68113    67292  hostd-worker                                       WAIT    LOCK    0-39

68119    67292  hostd-worker                                       WAIT    LOCK    0-39

68122    67292  hostd-worker                                       WAIT    LOCK    0-39

68970    67292  hostd-worker                                       WAIT    LOCK    0-39

68971    67292  hostd-worker                                       WAIT    LOCK    0-39

212913   67292  hostd-worker                                       WAIT    LOCK    0-39

119831   67292  hostd-worker                                       WAIT    LOCK    0-39

1497266  67292  hostd-worker                                       WAIT    FS      0-39

1497269  67292  hostd-worker                                       WAIT    LOCK    0-39

1497273  67292  hostd-worker                                       WAIT    LOCK    0-39

2842190  2842190  hostd                                              WAIT    LOCK    0-39

2842192  2842190  hostd-worker                                       WAIT    UFUTEX  0-39

2842193  2842190  hostd-worker                                       WAIT    UPOL    0-39

2842194  2842190  hostd-worker                                       WAIT    UPOL    0-39

2842195  2842190  hostd-worker                                       WAIT    UFUTEX  0-39

2842197  2842190  hostd-worker                                       WAIT    UFUTEX  0-39

2843309  2843309  hostdCgiServer                                     WAIT    UFUTEX  0-39

1179146  67292  hostd-worker                                       WAIT    FS      0-39

And:

[root@host:~] kill -9 67292

sh: can't kill pid 67292: No such process

I think that the situation is not recoverable, but any advice would be welcome.

Regards

--

daphnissov
Immortal

What version and build of ESXi are you running here?

iganchev
Contributor

VMware ESXi 6.5.0

daphnissov
Immortal

That's the version; what is the build?

iganchev
Contributor

build-5310538

daphnissov
Immortal

Try services.sh restart first and see if the watchdog brings them down.
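
For reference, a rough sketch of what that looks like on the host. Note that services.sh restarts all of the management agents, not just hostd and vpxa, so expect the host's management connectivity to drop briefly; the running VMs themselves are not touched:

[root@host:~] services.sh restart        # restarts all ESXi management agents, including hostd and vpxa
[root@host:~] /etc/init.d/hostd status   # check whether hostd came back afterwards
[root@host:~] /etc/init.d/vpxa status    # same check for vpxa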

msripada
Virtuoso

I don't recommend services.sh restart, as vpxa is already showing "No such process", so it is a zombie now. This requires a reboot. If there are existing storage issues or LACP is configured, services.sh restart can lead to other, bigger problems.

Thanks,

MS

daphnissov
Immortal

There are other options. You can call watchdog.sh -r hostd, and the same with vpxa substituted, which should also be tried.
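
In console form, the suggestion above would look roughly like this. The -r option is taken from the suggestion as given; watchdog.sh behaviour can differ between ESXi builds, so check its usage output on the host before relying on it:

[root@host:~] watchdog.sh -r hostd       # restart the watchdog for hostd, per the suggestion above
[root@host:~] watchdog.sh -r vpxa        # the same with vpxa substituted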

iganchev
Contributor

Hello,

Thank you everyone for the advice. The situation was unrecoverable, so we proceeded with a reboot.

--

IG
