VMware Cloud Community
VMKR9
Expert

Power off VM - Operation Failed Since another task is in progress

I have a new ESX 3.0 environment that has been running for a few weeks now without too many issues. We are now seeing the error above when powering off some VMs: the VM shows in Virtual Center as powering off at 100% but just sits there for hours, and if you try to power it off again it says "Operation failed since another task is in progress." The VM is locked and there is no way to get it to respond. Has anyone seen this issue and know how to fix it?

The VM's log just shows this as the last entry:

Nov 03 07:09:23.694: vmx|
Nov 03 07:09:23.694: vmx|
Nov 03 07:09:23.694: vmx| VMXRequestReset
Nov 03 07:09:23.694: vmx| Stopping VCPU threads...

47 Replies
jftwp
Enthusiast

We have esxRanger Pro as well, and back up our ESX 3.0.1 hosts (in full) using a dedicated physical server with an HBA-connected disk. No problems (overall; a few hiccups) with Ranger, but when we've seen any snapshot-related issue such as 'another task is in progress' that you can't end, including rarely with a VM that has an old snapshot that just isn't happy for whatever reason, we always ALWAYS have to shut down the given VM and then delete its snapshot(s). If that doesn't work (same error even after shutting down the VM), we VMotion the VMs off the host involved, then reboot that host, then delete the snapshot(s) (always works) - the task is then no longer in progress.

Snapshots aren't a perfect technology. Our own internal 'best practice' is to make sure no snapshots are sitting out on the VMFS volumes with the VMs unless absolutely necessary/expected. Third-party backup solutions such as Ranger merely tap the snapshot API to do their thing with fancy scripting coupled with a decent GUI... pointing the finger at Ranger (and I don't think you are, right?) isn't going to fix the underlying snapshot technology's stability/reliability (again, in my opinion, not a perfect science/technology just yet).
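
For what it's worth, once the VM is shut down, the snapshot delete can also be done from the service console instead of the VI Client. A rough sketch for ESX 3.x (the /vmfs/volumes path below is just a placeholder for your own VM's .vmx):

# List the registered VMs to find the config file path
vmware-cmd -l
# Check whether the VM still has a snapshot hanging around
vmware-cmd /vmfs/volumes/myvol/myvm/myvm.vmx hassnapshot
# Commit / remove all snapshots for that VM
vmware-cmd /vmfs/volumes/myvol/myvm/myvm.vmx removesnapshots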

jandie
Enthusiast

Sorry that this post is a bit longer than expected. My issue may be different from yours, but the symptoms are the same. I do not have VCB, esxRanger, or any other snapshot-based backup method running, which is why I say my issue may not be the same as everyone else's. However, I promised to post the reply from the VMware rep after I opened an SR, so here it is:

"To answer your question, no this issue should not keep happening.

Did you rebuild the VM as I mentioned in my previous email? Did that help if

you did?

Another thing you may want to watch for is to prevent your CDROMs / Floppies

from referencing a non-existent ISO or .flp image. Better yet only have

CDROMs and Floppies "Connected" and "Start Connected" options enabled when

using them. Constant and repetitive seeks to the CDROMs and Floppies when

they have "Connected" and / or "Start Connected" enabled, needlessly consumes

CPU and can eventually hang the Guest OS.

Generally most of the time it is possible to kill a hung VM using the

procedures we have already noted. However sometimes the VM becomes

"orphaned" meaning the parent PID has been killed before the children PIDs.

Or the process becomes a "zombie". In both of these instances I have seen

where it becomes necessary to reboot the host to clear the process.

If you use VMotion you could use it to move running VMs off the host prior to

rebooting it so that those VMs do not experience any downtime.

I have not found any reason why the VM hung, as I stated during our phone

conversation. If you experience this problem again, please run the

vm-support script before rebooting the host so that we do not lose

information when the host reboots, and the process IDs are an exact

representation of the currently running system."
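
On the CD-ROM / floppy point, a quick way to spot VMs that still reference an ISO or .flp image is to grep the .vmx files on the VMFS volumes. A rough sketch (the glob assumes one folder per VM under /vmfs/volumes, so adjust the paths for your layout):

# Find CD-ROM / floppy devices backed by an ISO or .flp image
grep -iE "fileName.*\.(iso|flp)" /vmfs/volumes/*/*/*.vmx
# See which devices are set to connect at power-on
grep -i "startConnected" /vmfs/volumes/*/*/*.vmx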

Hope that helps a little,

jandie

ivanfetch
Contributor

Hello,

We've seen this too, on a Windows Server 2003 and a Solaris 10 VM. We're running VMware ESX Server 3.0.1 build-44686 and Virtual Center 2.0.1. Killing the PID of the VM on the VI3 host, or using vmkload_app to kill the group PID of the VM, lets us start the VM again, but there is still no clue as to what puts the VMs in limbo. Since the VM shows up as powered on, the problem is not something that HA will help with. I'll outline what we have looked at, in the hope that it will add to the troubleshooting data and maybe jog others' ideas.

When VMs are in limbo, VMware Tools shows "not installed," and the VM cannot be powered off because "another operation is already in progress," even when none of us has performed an operation on the VM. Perhaps the other operation is something VC has tried to do (DRS?). Clearly the VM cannot be powered off and must be killed.
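
For reference, the kill sequence we've been using, roughly, on the ESX 3.0.1 service console (the vmkload_app path and -k flag are from our own notes rather than official docs, and "myvm" is just a placeholder, so double-check on your build):

# List the world IDs of running VMs (same list that vm-support -x prints)
vm-support -x
# Find the vmware-vmx process for the stuck VM and kill it
ps auxwww | grep -i myvm.vmx
kill -9 <pid_of_the_vmware-vmx_process>
# If that doesn't clear it, kill the VM's world group by its world ID
/usr/lib/vmware/bin/vmkload_app -k 9 <world_id>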

Could this relate to a "bad host" in the cluster? When a VM gets moved (DRS) to that host, it goes flaky? Half the time we've seen these limbo issues, they have been on a particular host; the other half of the time I was not able to check. I have yet to find VC logs which show which hosts a VM was migrated to by DRS, to see if this host might be involved with all the times we've had VMs go into limbo. Where is the history of which hosts a VM has lived on over time?

Could this be IO related? The Solaris VM is a Solaris jumpstart and NFS server, and the Windows VM is running Microsoft System Center Operations Manager 2007 with SQL Server 2005 Enterprise Edition. The first time the Solaris VM had this issue, we were transferring a lot of images to it over NFS. The Windows VM does not see much action - it's there for testing and doesn't do much at the moment.

I have two Solaris VMs on this VMware cluster, both running 5.10 Generic_118855-36 (64-bit) and the same version of VMware Tools. Only one of the Solaris VMs has ended up in limbo. The second time it was in limbo, I was able to VMotion it to other hosts, but it stayed in limbo. We could also ping the VM, but could not SSH to it, connect to its serial console, or anything else - this is the first time I have seen any kind of response from the guest OS while a VM was in limbo. I wasn't at a place where I could see its VMware console. I have seen some Solaris VM issues with Sol10 before the 11/06 release, or with the 32-bit kernel, but they don't apply to what we're running here. I wonder what makes the other Solaris VM (which has never gone into limbo) special? Admittedly it doesn't do much; it's there for some Samba access and other random testing.

We have two Windows Server 2003 Enterprise SP2 (not R2) VMs, and only one has been in limbo. The VM which hasn't been in limbo is an application server, using IIS. It hasn't seen a lot of action; the app is still being set up.

We could update this VMware cluster to 3.0.2 and upgrade Virtual Center to 2.0.1 PL2, but I'd like to know something more concrete about this issue before tossing upgrades at the problem.

Has anyone else made headway or gotten additional feedback from VMware? I'm about to open a case, if only to add to the "me too" list.

shawnporter
Contributor

Found this in another forum...

http://supportforums.vizioncore.com/forums/2/3987/ShowThread.aspx

Apparently this is a known bug, and a fix is coming in the September time frame?

If you open a case with VMware concerning this issue, reference SR# 191595084 and you should be able to point your support guru in the right direction.

If anyone is using the beta of the fix please post.

harryc
Enthusiast

I got the same error on 2 VMs, SuSE 9 (64-bit) and SuSE 10 (64-bit), running ESX 3.0.1 - no snapshots or backups running. The NT guy says he had one hang like this also.

It happens infrequently; I have a dozen or more SuSE VMs and several dozen Windows VMs running on a farm of 5 ESX servers.

Trying vm-support -x (to get the ID), then vm-support -X ID.

Took a while (6 minutes), but in the end the VM is down!

# vm-support -x
VMware ESX Server Support Script 1.27
Available worlds to debug:
vmid=1079 aba-dc-qa
vmid=1089 aba-dc-pub-dev1
vmid=1107 chgbpmdb01dev
vmid=1117 chgsandbox01
vmid=1125 magic-dev
vmid=1133 chgbpmapp01dev
vmid=1143 aba-dc-dev
vmid=1148 timssnet-dev
vmid=1169 rdmcsdev01
vmid=1179 aba-dcqa-pub1
vmid=1189 SuSE10-DMZ
vmid=1198 SuSE10-LAN
# vm-support -X 1189
VMware ESX Server Support Script 1.27

Can I include a screenshot of the VM 1189? : y
Can I send an NMI (non-maskable interrupt) to the VM 1189? This might crash the VM, but could aid in debugging : y
Can I send an ABORT to the VM 1189? This will crash the VM, but could aid in debugging : y
Preparing files: /
Grabbing data & core files for world 1189. This will take 5 - 10 minutes.

thx Shawn !

ScratchMan
Contributor

Here is what we found was causing this in our environment:

One VM administrator uses his Virtual Infrastructure Client console to connect an ISO image located on his local machine to the VM's CD-ROM. He does not disconnect the ISO and leaves the VIC open. A second administrator connects to the same VM and either disconnects the ISO manually or tries to reboot the server. It seems that the Virtual Infrastructure Client then produces a popup window on the first administrator's console alerting him that the ISO has been disconnected. If this popup is not answered (in our case it was hidden behind another window, or the admin was gone for the day), the VM seems to hang indefinitely until the admin hits "OK." I guess this notification is the "task in progress." While in this state we receive the error "Operation Failed Since another task is in progress."

It seems that other similar scenarios may cause similar hangups. We have now been careful to disconnect ISO images from VMs when they are not in use.
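
If the popup can't be answered from the VIC (admin gone for the day), answering it from the service console may also clear the hang. A sketch for ESX 3.x; the .vmx path and the ide1:0 device name are assumptions, so check the VM's own .vmx for the actual CD-ROM entry:

# If the hang is a pending question on the VM, answer it from the console
vmware-cmd /vmfs/volumes/myvol/myvm/myvm.vmx answer
# Then make sure the CD-ROM device is disconnected (device name is a guess; verify in the .vmx)
vmware-cmd /vmfs/volumes/myvol/myvm/myvm.vmx disconnectdevice ide1:0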

I am curious if other admins see this issue.

Thanks,

Anthony S.

ISD622
Contributor

Thank you Mr. Stan, worked well!

mr_anderson1
Contributor

I was having the same problem on one of my ESXi 4 hosts.

We use Veeam Backup 4.0, and the snapshot process on that server had hung. The server is Debian 4 with VMware Tools installed.

The solution that worked for me was simply connecting to the iLO of the host, logging into the system as root, and restarting the management interfaces. Nice 30-second fix.
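
In case the iLO isn't handy, the same restart can usually be done from an SSH / Tech Support Mode session; the commands differ between ESXi and classic ESX (this is from memory, so verify on your build):

# ESXi 4.x: restart all management agents
/sbin/services.sh restart
# Classic ESX equivalents from the service console
service mgmt-vmware restart
service vmware-vpxa restart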
