I have six hosts in a production cluster running vSphere 6.0. One of them decided to go on the fritz Sunday night. Here's what happened: I got an alert from my monitoring system that a single VM (we'll call it monkey1) was down (inaccessible from the network). I logged on remotely, logged into vCenter, and issued a Reset on monkey1. The task sat there trying until it eventually timed out with an "Operation timed out" message. I tried again and this time got "Another task is in progress." Since it wasn't a critical VM, I let it go for the evening and figured I'd look at it in the morning.
When I got in the next day, I did some looking online and found a great article from VMware explaining how to kill the processes of running VMs: VMware Knowledge Base
I went through and tried everything in that article, and nothing worked. The process list shows the VM's process ID running, but when I try to kill it, absolutely nothing happens. The command runs as if it's working, but the process never goes away and the VM never shuts down. Same story with the other methods in the article.
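For reference, the kill sequence from that KB article looks roughly like this (run from an SSH session on the ESXi host; WORLD_ID stands in for whatever world ID the list reports for monkey1):

```shell
# List the running VM worlds on this host and note monkey1's World ID
esxcli vm process list

# Try a graceful kill first, then escalate if the process won't die
esxcli vm process kill --type=soft --world-id=WORLD_ID
esxcli vm process kill --type=hard --world-id=WORLD_ID
esxcli vm process kill --type=force --world-id=WORLD_ID
```

In my case even the force kill returns as if it succeeded, but the world is still there on the next `esxcli vm process list`.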
So I opened a ticket with VMware. After some looking and troubleshooting, they concluded that the only way to fix this is to reboot the host. This host runs several other VMs, some of them critical to daily operations, so I wanted to migrate everything except monkey1 over to other hosts in the cluster. When I started a vMotion, it got to 72% and sat there for several minutes, which turned into an hour, which turned into even longer, until it eventually failed with the "Operation timed out" message. Hmmmm...I tried a couple of others with the same or similar results: sometimes they stall at 18%, sometimes at 58%, sometimes at 72%, but they always time out in the end. Here's the strange thing: some of them actually DID migrate to another host, even though I got the timeout message and the task never showed as finished. The VMware tech was baffled by this, stating "I've never seen this kind of behavior before."
In the meantime, my Veeam backup jobs show the same behavior for any VM on this host: the snapshot creation task goes to 100%, sticks there for several minutes, and eventually times out.
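For what it's worth, the stuck tasks are visible from the host shell as well. Something along these lines shows what hostd thinks is queued against a VM (VMID stands in for whatever ID `getallvms` reports):

```shell
# Find the VM's ID as hostd sees it
vim-cmd vmsvc/getallvms | grep monkey1

# See whether hostd still has tasks queued against that VM
vim-cmd vmsvc/get.tasklist VMID

# Check for leftover snapshot state from the failed backup jobs
vim-cmd vmsvc/snapshot.get VMID
```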
The problem right now is that I have nine VMs left on this host, some of which are critical and can't go down, especially during the day, and they will not migrate. I've tried each of them several times, and now all I'm getting is "Another task is in progress" on every one.
Does anyone out there have any insight at all into what could be causing this? I've asked VMware for escalation, but they refuse to do anything until I reboot the host. If I reboot the host now with all of these VMs running, they'll go into some "inaccessible" state until the host is back up, and I just can't do that. There has to be a way to either fix this with the host up or migrate the VMs off so it can be rebooted. Are there any daemons, processes, or services I can restart on the host that might clear some of this up???
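(I do know about restarting the management agents from the host shell, along the lines of the commands below, but I'm not sure whether that's safe, or enough, when there are stuck VM worlds like this. That's really what I'm asking about.)

```shell
# Restart just the host agent and the vCenter agent; running VM workloads
# keep running, but the host briefly disconnects from vCenter
/etc/init.d/hostd restart
/etc/init.d/vpxa restart

# Or restart all of the management agents in one go
services.sh restart
```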
I'm open to suggestions :-)