from the host's shell:
# ps axu | grep name_of_the_faulty_VM
# kill -9 above_found_PID
I had the same thing happen. I ended up putting the host in maintenance mode, which brought my VM back to life, but it could not migrate to another host in the cluster, so I had to reboot the server anyway.
If they won't power off, then:
ps -ax | grep "VMname"
kill -9 (returned PID)
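The two-step ps/kill dance above can be wrapped in a small helper. This is a minimal sketch, not verified on a live ESX service console; "Win2003-DB" is a placeholder VM name:

```shell
# Locate the PID of a hung VM's vmware-vmx process so it can be
# killed as a last resort. Reads `ps -ax` output on stdin.
find_vm_pid() {
    # Filter to vmware-vmx lines first, so grep's own process
    # (which matches the VM name but not vmware-vmx) is excluded.
    grep "vmware-vmx" | grep "$1" | awk '{print $1}'
}

# On a real host you would run:
#   PID=$(ps -ax | find_vm_pid "Win2003-DB")
#   kill -9 "$PID"
```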
If they won't start and it's none of the obvious causes (available memory, disk space, etc.), then:
restart the VMware Virtual Infrastructure Server service on the VC Server.
I got this issue too, and killing the process worked fine, but when I try to power the VM back on it fails with "Failed to power on VM: No swap file". The swap file is there; I can see it, but I can't delete it. I also tried restarting the VC service, but that doesn't help; it gives the same error when I try to start the VM.
(at the cmd prompt enter) cat /proc/vmware/vm/*/names
This lists the running VMs on the host server you are logged on to.
vmid=1069 pid=-1 cfgFile="/vmfs/volumes/45.../server1/server1.vmx" uuid="50..." displayName="server1"
vmid=1107 pid=-1 cfgFile="/vmfs/volumes/45.../server2/server2.vmx" uuid="50..." displayName="server2"
vmid=1149 pid=-1 cfgFile="/vmfs/volumes/45.../server3/server3.vmx" uuid="50..." displayName="server3"
vmid=1156 pid=-1 cfgFile="/vmfs/volumes/45.../server4/server4.vmx" uuid="50..." displayName="server4"
vmid=1170 pid=-1 cfgFile="/vmfs/volumes/45.../server5/server5.vmx" uuid="50..." displayName="server6"
vmid=1178 pid=-1 cfgFile="/vmfs/volumes/45.../server6/server6.vmx" uuid="50..." displayName="server6"
vmid=1188 pid=-1 cfgFile="/vmfs/volumes/45.../server7/server7.vmx" uuid="50..." displayName="server7"
vmid=1198 pid=-1 cfgFile="/vmfs/volumes/45.../server8/server8.vmx" uuid="50..." displayName="server8"
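For scripting, the vmid for a given display name can be pulled out of that listing. A sketch that assumes the one-line-per-VM key="value" layout shown above:

```shell
# Print the vmid for a given displayName, reading the output of
# `cat /proc/vmware/vm/*/names` on stdin.
vmid_for() {
    grep "displayName=\"$1\"" | sed 's/^vmid=\([0-9]*\).*/\1/'
}

# On a real host:
#   cat /proc/vmware/vm/*/names | vmid_for server3
```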
If you are running ESX 2.5, you can simply kill the vmx PID.
If you are running ESX 3.0.x, you first need to find the group ID that controls the VM's PID.
(at the cmd prompt enter) less -S /proc/vmware/vm/1149/cpu/status
vcpu vm type name uptime status costatus usedsec syssec wait waitsec idlesec (more...)
1149 1149 V vmm0:server3 350042.494 WAIT STOP 15968.954 518.916 COW 325800.734 322397.266 (more...)
Scroll right with the right arrow key to locate the "group" PID. In this case the group PID was 1148 (not shown in this example).
Now with the group PID you can kill the VM safely without corrupting the VM as posted earlier.
(at the cmd prompt enter) /usr/lib/vmware/bin/vmkload_app -k 9 1148
Warning: Apr 20 16:22:22.710: Sending signal '9' to world 1148.
THIS MEANS SUCCESS... if you receive a different message, the kill may not have worked.
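Instead of scrolling through less -S to find the group column, awk can pick it out. This sketch assumes the column is literally named "group" in the header row (it is off-screen in the sample above), so verify against your own cpu/status file:

```shell
# Print the value of the "group" column from a
# /proc/vmware/vm/<vmid>/cpu/status file read on stdin
# (header row first, then one data row per vcpu).
group_id() {
    awk 'NR==1 { for (i = 1; i <= NF; i++) if ($i == "group") col = i }
         NR==2 { print $col }'
}

# On a real ESX 3.0.x host:
#   group_id < /proc/vmware/vm/1149/cpu/status
#   /usr/lib/vmware/bin/vmkload_app -k 9 <printed group id>
```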
Hope this helps!
We started out with a VM that just would not power on. We created a new VMX and pointed it at the old VMDKs. The new VM powered on just fine; in our particular situation we then needed to delete the old VMX from inventory, but couldn't, because we got the "Operation failed since another task is in progress" message. We used ps -auwx | grep to find the PID and used kill (PID). When attempting the delete, the progress indicator stopped at 95% and timed out, then orphaned the machine. After that we were able to delete the .vmx from the COS.
OK I had this problem when deleting snapshots.
It would appear that a break in the snapshot chain caused the task to time out. This resulted in the error as posted above.
Attempting to VMotion the machine to another host fixed the problem, even though it posted the usual "snapshots aren't supported" warning.
Just happened to me this morning, I hope it doesn't happen again. I can't seem to find the cause, but SCSI Distributed File Lock popped in my head (since every time I want to do any operation to the VM, it says another task is in progress - can't even VMotion). I'm opening an SR, will keep you posted.
We have migrated to a brand new environment and it is still happening on a completely fresh install of 3.0.1. It would be good if someone could finally get an answer as to the cause, and an easier solution. Fingers crossed!
I am now having this problem. It only seems to happen over the weekend when we run our esxRanger Pro backups to an external HD. I have just logged a call for the issue.
We have esxRanger Pro as well, and back up our ESX 3.0.1 hosts (in full) using a dedicated physical server with an HBA disk connected to it. No problems overall with Ranger (a few hiccups), but when we've seen any snapshot-related issue such as an "another task is in progress" error that you can't end, including, rarely, a VM with an old snapshot that just isn't happy for whatever reason, we always, ALWAYS, have to shut down the given VM and then delete its snapshot(s). If that doesn't work (same error even after shutting down the VM), we VMotion the VMs off the host involved, reboot that host, and then delete the snapshot(s); that always works, and the task is then no longer in progress.
Snapshots aren't a perfect technology. Our own internal best practice is to make sure no snapshots are sitting out there on the VMFS volumes with the VMs unless absolutely necessary/expected. Third-party backup solutions such as Ranger merely tap the snapshot API to do their thing, with fancy scripting coupled with a decent GUI... pointing the finger at Ranger (and I don't think you are, right?) isn't going to fix the underlying snapshot technology's stability/reliability (again, in my opinion, not a perfect science/technology just yet).
Sorry that this post is a little longer than expected. My issues may be different from yours, but the symptoms are the same. I do not have VCB, esxRanger, or any snapshot-based backup method running, hence my issues may not be similar to the rest of yours. However, I promised to post the reply from the VMware rep after I opened an SR, so here it is:
"To answer your question, no this issue should not keep happening.
Did you rebuild the VM as I mentioned in my previous email? Did that help if
Another thing you may want to watch for is to prevent your CDROMs / Floppies
from referencing a non-existent ISO or .flp image. Better yet only have
CDROMs and Floppies "Connected" and "Start Connected" options enabled when
using them. Constant and repetitive seeks to the CDROMs and Floppies when
they have "Connected" and / or "Start Connected" enabled, needlessly consumes
CPU and can eventually hang the Guest OS.
Generally most of the time it is possible to kill a hung VM using the
procedures we have already noted. However sometimes the VM becomes
"orphaned" meaning the parent PID has been killed before the children PIDs.
Or the process becomes a "zombie". In both of these instances I have seen
where it becomes necessary to reboot the host to clear the process.
If you use VMotion you could use it to move running VMs off the host prior to
rebooting it so that those VMs do not experience any downtime.
I have not found any reason why the VM hung, as I stated during our phone
conversation. If you experience this problem again, please run the
vm-support script before rebooting the host so that we do not lose
information when the host reboots, and the process IDs are an exact
representation of the currently running system."
Hope that helps a little,
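The CD-ROM/floppy advice in that reply maps to .vmx settings along these lines. A hedged fragment: the key names are from memory of ESX 3.x-era configs, so verify them against your own .vmx before editing:

```
# Keep the virtual CD-ROM present but not connected at power-on,
# so the guest does not seek a missing ISO.
ide1:0.present = "TRUE"
ide1:0.startConnected = "FALSE"

# Same idea for the floppy device.
floppy0.present = "TRUE"
floppy0.startConnected = "FALSE"
```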
We've seen this too, on a Windows Server 2003 and a Solaris 10 VM. We're running VMware ESX Server 3.0.1 build-44686, and VirtualCenter 2.0.1. Killing the PID of the VM on the VI3 host, or using vmkload_app to kill the group PID of the VM, lets us start the VM again, but there is still no clue what puts the VMs in limbo. Since the VM shows up as powered on, the problem is not something that HA will help with. I'll outline what we have looked at, with the hope that it will add to the troubleshooting data and maybe jog others' ideas.
When VMs are in limbo, VMware Tools shows "not installed," and the VM cannot be powered off because "another operation is already in progress" even when none of us has performed an operation on the VM. Perhaps the other operation is something VC has tried to do (DRS?). Clearly the VM cannot be powered off and must be killed.
Could this relate to a "bad host" in the cluster? When a VM gets moved (DRS) to that host, it goes flaky? Half the time we've seen these limbo issues, they have been on a particular host; the other half of the time I was not able to check. I have yet to find VC logs which show which hosts a VM was migrated to by DRS, to see if this host might be involved with all the times we've had VMs go into limbo. Where is the history of which hosts a VM has lived on over time?
Could this be IO related? The Solaris VM is a Solaris jumpstart and NFS server, and the Windows VM is running Microsoft System Center Operations
Manager 2007 with SQL Server 2005 Enterprise Edition. The first time the Solaris VM had this issue, we were transferring a lot of images to it over NFS. The Windows VM does not see much action - it's there for testing and doesn't do much at the moment.
I have two Solaris VMs on this VMware cluster, both running 5.10 Generic_118855-36 (64-bit), and the same version of VMware Tools. Only one of the Solaris VMs has ended up in limbo. The second time it was in limbo, I was able to VMotion it to other hosts, but it stayed in limbo. We could also ping the VM, but could not SSH to it, connect to its serial console, or anything else; this is the first time I have seen any kind of response from the guest OS when a VM was in limbo. I wasn't at a place where I could see its VMware console. I have seen some Solaris VM issues with Sol10 before rev 11/06, or with the 32-bit kernel, but they don't apply to what we're running here. I wonder what makes the other Solaris VM (which has never gone into limbo) special? Admittedly it doesn't do much; it's there for some Samba access and other random testing.
We have two Windows Server 2003 Enterprise SP2 (not R2) VMs, and only one has been in limbo. The VM which hasn't been in limbo is an application server, using IIS. It hasn't seen a lot of action; the app is still being set up.
We could update this VMWare cluster to 3.0.2, and upgrade Virtual Center to 2.0.1 PL2, but I'd like to know something more concrete about this issue before tossing upgrades at the problem.
Has anyone else made headway or gotten additional feedback from VMware? I'm about to open a case, if only to add to the "me too" list.
Found this in another forum...
Apparently, this is a known bug and a fix is coming in the September time frame?
If you open a case with VMWare concerning this issue, reference SR# is 191595084 and you should be able to point your support guru in the right direction on this issue.
If anyone is using the beta of the fix please post.
I got the same error on 2 VMs SuSE-9 (64) and SuSE-10 (64) - Running ESX 3.01 - no snapshots or backups running. The NT guy says he had one hang like this also.
It happens infrequently; I have a dozen or more SuSE VMs and several dozen Windows VMs, running on a farm of 5 ESX servers.
Trying the vm-support -x (to get the ID), then vm-support -X ID.
It took a while (6 minutes), but in the end the VM is down!
# vm-support -x
VMware ESX Server Support Script 1.27
Available worlds to debug:
# vm-support -X 1189
VMware ESX Server Support Script 1.27
Can I include a screenshot of the VM 1189? : y
Can I send an NMI (non-maskable interrupt) to the VM 1189? This might crash the VM, but could aid in debugging : y
Can I send an ABORT to the VM 1189? This will crash the VM, but could aid in debugging : y
Preparing files: /
Grabbing data & core files for world 1189. This will take 5 - 10 minutes.
thx Shawn !