Why do Virtual Center and ESX disagree on tasks in progress?

JMButler · 08-20-2006

Periodically we have tasks that time out in our Virtual Center, but continue running in the ESX. This makes the VM effectively impossible to control since the ESX will not allow another task to begin on the VM and Virtual Center does not see the task to allow it to be cancelled. The only solution we have found so far is to reboot the ESX.

We have found that the following general sequence of events gets us into the bad state:

1. Issue a task through the Virtual Center SDK.

2. The task remains at "In progress" for 15 minutes without actually beginning execution.

3. Virtual Center times out and returns an error for the task, but the ESX continues to run the task.

4. We issue another task through the SDK to clean up for the failure and the task fails due to "Another task in progress." There is no visible task in progress in Virtual Center.

5. At this point all tasks on this VM fail, including all attempts to shut down the VM, either softly or through a hard shutdown. This also means we cannot place the ESX in maintenance mode for a reboot, because the VM is still running and cannot be powered off by Virtual Center. Our only solution is to reboot the ESX with the VM still running.

If anyone has seen this problem and has some insight into either recovering from it or avoiding it, it would be greatly appreciated.

mstahl75 · 08-20-2006

Is the task actually running on the ESX server or is there just a file there, something like a lock, that makes it think there is? Have you checked any of the logs to determine what is going on?

You might be able to restart just the management agents instead of restarting the whole server. Though, that kind of depends on what is actually going on.

service mgmt-vmware restart

Maybe?

cloutidr · 08-21-2006

It doesn't appear to have any sort of lock files; it acts more like ESX maintains some sort of database with that information. I'd probably rather not be fooling around in there.

I also haven't tried just restarting the service which may work, but from our system's standpoint doesn't really make anything any easier compared to just restarting the ESX.

I have pulled out the applicable log entries for one instance of this happening. We clearly have one task that fails (times out according to VC) after 15 minutes, and after the ESX logs the task as complete, the next task fails, complaining that one is already in progress. It appears the task just doesn't get cleaned up properly when it fails. It is also interesting that all failed tasks, except the one that never really finishes, provide a dump in the log.

[2006-08-20 09:55:58.069 'App' 94997424 info] [VpxLRO] -- BEGIN task-2711 -- vm-11 -- [vm-11:reconfigure]

[2006-08-20 10:10:58.983 'App' 94997424 error] [vm.Reconfigure] Received unexpected exception

[2006-08-20 10:10:59.089 'App' 94997424 info] [VpxLRO] -- FINISH task-2711 -- vm-11 -- [vm-11:reconfigure]

[2006-08-20 10:11:02.141 'App' 53259184 info] [VpxLRO] -- BEGIN task-2865 -- vm-11 -- [vm-11:reconfigure]

[2006-08-20 10:11:02.771 'App' 53259184 warning] ============BEGIN FAILED METHOD CALL DUMP============

[2006-08-20 10:11:02.771 'App' 53259184 warning] Invoking [reconfigure] on [vim.VirtualMachine:512]

[2006-08-20 10:11:02.771 'App' 53259184 warning] Arg spec:

(vim.vm.ConfigSpec) {

dynamicType = ,

dynamicProperty = (vmodl.DynamicProperty) [],

key = "guestinfo.parameter",

value = "42"

}

]

}

[2006-08-20 10:11:02.772 'App' 53259184 warning] Fault Msg: "Operation failed since another task is in progress"

[2006-08-20 10:11:02.772 'App' 53259184 warning] ============END FAILED METHOD CALL DUMP============

[2006-08-20 10:11:02.863 'App' 53259184 error] [vm.Reconfigure] Received unexpected exception

[2006-08-20 10:11:02.864 'App' 53259184 info] [VpxLRO] -- FINISH task-2865 -- vm-11 -- [vm-11:reconfigure]

All future tasks continue to fail in the same manner with the "Operation failed since another task is in progress" message.
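For anyone who wants to spot this pattern in their own logs, here is a rough Python sketch that pairs the BEGIN/FINISH lines and flags tasks that hit the "another task is in progress" fault. It only assumes the line layout shown in the excerpt above. Treating the number after 'App' as a per-worker identifier is my guess from the excerpt, not something documented, and the names `summarize_tasks` and `SAMPLE_LOG` are made up for this example.

```python
import re
from datetime import datetime

# Line layout taken from the excerpt above: [timestamp 'App' number level] message
LINE_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) '[^']*' "
    r"(?P<worker>\d+) \w+\]\s*(?P<msg>.*)"
)
TASK_RE = re.compile(r"-- (?P<kind>BEGIN|FINISH) (?P<task>task-\d+) --\s*(?P<vm>[\w-]+)")
BUSY_TEXT = "another task is in progress"


def summarize_tasks(log_text):
    """Return {task_id: info} with the VM, duration in seconds, and a busy-fault flag."""
    open_by_worker = {}  # assumption: the number after 'App' identifies the worker
    tasks = {}
    for raw in log_text.splitlines():
        # Drop backslash escapes that some forum exports put in front of brackets.
        line = raw.replace("\\", "").strip()
        m = LINE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        worker, msg = m.group("worker"), m.group("msg")
        t = TASK_RE.search(msg)
        if t and t.group("kind") == "BEGIN":
            tasks[t.group("task")] = {
                "vm": t.group("vm"), "begin": ts, "seconds": None, "busy_fault": False,
            }
            open_by_worker[worker] = t.group("task")
        elif t:  # FINISH line
            info = tasks.get(t.group("task"))
            if info is not None:
                info["seconds"] = (ts - info["begin"]).total_seconds()
            open_by_worker.pop(worker, None)
        elif BUSY_TEXT in msg.lower() and worker in open_by_worker:
            tasks[open_by_worker[worker]]["busy_fault"] = True
    return tasks


# A few lines copied from the excerpt above.
SAMPLE_LOG = """
[2006-08-20 09:55:58.069 'App' 94997424 info] [VpxLRO] -- BEGIN task-2711 -- vm-11 -- [vm-11:reconfigure]
[2006-08-20 10:10:58.983 'App' 94997424 error] [vm.Reconfigure] Received unexpected exception
[2006-08-20 10:10:59.089 'App' 94997424 info] [VpxLRO] -- FINISH task-2711 -- vm-11 -- [vm-11:reconfigure]
[2006-08-20 10:11:02.141 'App' 53259184 info] [VpxLRO] -- BEGIN task-2865 -- vm-11 -- [vm-11:reconfigure]
[2006-08-20 10:11:02.772 'App' 53259184 warning] Fault Msg: "Operation failed since another task is in progress"
[2006-08-20 10:11:02.864 'App' 53259184 info] [VpxLRO] -- FINISH task-2865 -- vm-11 -- [vm-11:reconfigure]
"""

if __name__ == "__main__":
    for task_id, info in summarize_tasks(SAMPLE_LOG).items():
        print(task_id, info["vm"], info["seconds"], info["busy_fault"])
```

A task that runs for roughly 15 minutes and is then followed by busy faults on the same VM would match the sequence described in this thread. The script only reads logs; it does nothing to clear the stuck state.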

Again, any insight would be greatly appreciated.

mstahl75 · 08-21-2006

cloutidr wrote: "I also haven't tried just restarting the service which may work, but from our system's standpoint doesn't really make anything any easier compared to just restarting the ESX."

The main benefit, if it works, is that your VM will still be up and running rather than it going down with the host. Also, it only takes a little time for the service to restart.

cloutidr · 08-21-2006

The problem is that this is an automated system of sorts, and we respond to the failed task by powering off the VM. That task fails, but it does manage to get the VM into a state where, if it isn't off, it closely resembles being off. So it either manages to turn the VM off or crashes it. In either case, the system now requires manual intervention (for a number of reasons, not all VMware related). This is what we are really trying to avoid, and it is why it really does not make a difference whether the user has to reboot the ESX or restart the service.

Trumpeteer · 02-02-2007

The solution I use is as follows:

1) start a PuTTY session on the ESX machine

2) restart the VMware management agents, so the ESX machine and Virtual Center Server are in line again, with the command: service mgmt-vmware restart

3) find the processID of the hanging VM through: ps -ef | grep

This does the job every time. Now you can start up the VM again. All other VMs are not affected, and a restart of ESX is not necessary.
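If you want to script the process lookup from step 3, a minimal Python sketch could look like the one below. Note that the grep argument did not survive in the post above, so the search string here is just a placeholder you would have to choose yourself. The sample `ps -ef` output, the process names in it, and the function name `find_pids` are all invented for illustration and are not real ESX output. What to do with the PID afterwards is also not stated in the post, so the sketch stops at finding it.

```python
def find_pids(ps_output, needle):
    """Return the PIDs of `ps -ef` lines whose command contains `needle`."""
    pids = []
    for line in ps_output.splitlines():
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces)
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # skips the header row and any malformed lines
        cmd = fields[7]
        if cmd.split()[0] == "grep":
            continue  # ignore the grep process itself, as it also contains the needle
        if needle in cmd:
            pids.append(int(fields[1]))
    return pids


# Invented sample output; real process names and paths will differ.
SAMPLE_PS = """\
UID        PID  PPID  C STIME TTY          TIME CMD
root      1234     1  0 09:55 ?        00:00:12 example-vmx /vms/vm-a/vm-a.vmx
root      1300     1  0 09:56 ?        00:00:09 example-vmx /vms/vm-b/vm-b.vmx
root      2345  2200  0 10:12 pts/0    00:00:00 grep vm-a
"""

if __name__ == "__main__":
    print(find_pids(SAMPLE_PS, "vm-a"))
```

In practice you would feed it the real output, for example from `subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout`.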

trojanjo · 03-26-2007

I had to bounce the ESX 3.0.1 host.

The process for the machine with this same problem would not die, even after restarting the management interfaces. One note: after restarting the mgmt services, it let me try to start the VM again, but then I just had two PIDs for the guest... weird.


---- Visit my blog. http://www.2vcps.com
Follow me: http://twitter.com/jon_2vcps
