Hello,
We're currently installing a vCloud Director evaluation instance and I'm having a problem. We're deploying a single-cell environment, but the cell is killing itself off after 5-15 minutes of operation. The process seems to suffer a heartbeat failure and the cell dies. If I restart the vcd process, it comes back fine, but sure enough, 5-15 minutes later the cell is dead again, reportedly due to a heartbeat failure.
Here are the specifics of our environment:
When starting vcd we see it initialise fine and can log in without issue, attach a vCenter, set up some orgs, etc. There seems to be nothing we can't do. Below is a log sample of an instance that started at 13:04 and which died inexplicably at 13:22.
2011-04-14 13:17:52,890 | DEBUG | VC.TaskManager.NonActiveTaskCompletionsPurger | TaskManager | Marked 0 tasks for purging |
2011-04-14 13:17:52,891 | DEBUG | VC.TaskManager.NonActiveTaskCompletionsPurger | TaskManager | Purged 0 tasks. |
2011-04-14 13:18:52,904 | DEBUG | VC.TaskManager.NonActiveTaskCompletionsPurger | TaskManager | Marked 0 tasks for purging |
2011-04-14 13:18:52,904 | DEBUG | VC.TaskManager.NonActiveTaskCompletionsPurger | TaskManager | Purged 0 tasks. |
2011-04-14 13:19:52,917 | DEBUG | VC.TaskManager.NonActiveTaskCompletionsPurger | TaskManager | Marked 0 tasks for purging |
2011-04-14 13:19:52,917 | DEBUG | VC.TaskManager.NonActiveTaskCompletionsPurger | TaskManager | Purged 0 tasks. |
2011-04-14 13:20:52,931 | DEBUG | VC.TaskManager.NonActiveTaskCompletionsPurger | TaskManager | Marked 0 tasks for purging |
2011-04-14 13:20:52,931 | DEBUG | VC.TaskManager.NonActiveTaskCompletionsPurger | TaskManager | Purged 0 tasks. |
2011-04-14 13:21:39,856 | DEBUG | VC-1141059082Listener (178) | VcUpdateListenerImpl | Client-side timeout in Inner WFU loop, retry |
2011-04-14 13:21:39,865 | DEBUG | VC-1141059082Listener (178) | CompleteVlsiCallImpl | PropertyCollector.waitForUpdates method invoked on PropertyCollector:propertyCollector at https://10.128.21.4:443/sdk/vimService (session mgmt off) |
2011-04-14 13:21:52,945 | DEBUG | VC.TaskManager.NonActiveTaskCompletionsPurger | TaskManager | Marked 0 tasks for purging |
2011-04-14 13:21:52,945 | DEBUG | VC.TaskManager.NonActiveTaskCompletionsPurger | TaskManager | Purged 0 tasks. |
2011-04-14 13:22:50,281 | INFO | HeartbeatTimer | CellLivenessStatusServiceImpl | Marking cell as inactive |
2011-04-14 13:22:50,283 | INFO | HeartbeatTimer | CellLivenessStatusServiceImpl | Scheduler put on standby |
2011-04-14 13:22:52,958 | DEBUG | VC.TaskManager.NonActiveTaskCompletionsPurger | TaskManager | Marked 0 tasks for purging |
2011-04-14 13:22:52,958 | DEBUG | VC.TaskManager.NonActiveTaskCompletionsPurger | TaskManager | Purged 0 tasks. |
2011-04-14 13:23:52,972 | DEBUG | VC.TaskManager.NonActiveTaskCompletionsPurger | TaskManager | Marked 0 tasks for purging |
2011-04-14 13:23:52,972 | DEBUG | VC.TaskManager.NonActiveTaskCompletionsPurger | TaskManager | Purged 0 tasks. |
At that point, the cell stops responding and the web UI reports errors. If I restart the vcd service it resumes without issue.
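To pin down exactly when the heartbeat gives up across several restarts, the pipe-delimited log format above can be scanned with a small script. This is only a rough sketch: the field layout is inferred from the lines quoted in this thread, and the function names are made up for illustration.

```python
from datetime import datetime

# Timestamp layout used in the log lines quoted in this thread.
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_line(line):
    """Split one 'timestamp | level | thread | class | message |' line.

    Returns (timestamp, level, thread, logger, message), or None for lines
    that do not match, such as stack-trace lines.
    """
    parts = line.split("|", 4)
    if len(parts) < 5:
        return None
    try:
        ts = datetime.strptime(parts[0].strip(), LOG_TS_FORMAT)
    except ValueError:
        return None
    level, thread, logger = (p.strip() for p in parts[1:4])
    message = parts[4].strip().rstrip("|").strip()
    return ts, level, thread, logger, message


def heartbeat_events(lines):
    """Return (timestamp, message) for every entry logged by HeartbeatTimer."""
    events = []
    for line in lines:
        parsed = parse_line(line)
        if parsed and parsed[2] == "HeartbeatTimer":
            events.append((parsed[0], parsed[4]))
    return events


def seconds_until_inactive(lines):
    """Seconds from the first parsed entry to the first 'inactive' heartbeat entry."""
    first_ts = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        if first_ts is None:
            first_ts = parsed[0]
        if parsed[2] == "HeartbeatTimer" and "inactive" in parsed[4]:
            return (parsed[0] - first_ts).total_seconds()
    return None
```

Feeding it the lines of a debug log (for example `open(path).read().splitlines()`, where `path` is wherever your log lives) gives the HeartbeatTimer entries and the elapsed time before the cell was marked inactive, which makes it easier to compare runs.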
The vcloud-director-info log is just as devoid of clues:
2011-04-14 13:06:39,316 | INFO | VC-1141059082Listener (178) | VirtualCenterListener | VC connection state event 1141059082. Changing state to CONNECTED |
2011-04-14 13:06:39,334 | INFO | VC-1141059082Listener (178) | VcUpdateListenerImpl | VC 1141059082: Successfully Finished Initial Sync |
2011-04-14 13:22:50,281 | INFO | HeartbeatTimer | CellLivenessStatusServiceImpl | Marking cell as inactive |
2011-04-14 13:22:50,283 | INFO | HeartbeatTimer | CellLivenessStatusServiceImpl | Scheduler put on standby |
2011-04-14 13:24:30,877 | ERROR | CloudScheduler_QuartzSchedulerThread | QuartzSchedulerThread | quartzSchedulerThreadLoop: RuntimeException null |
java.lang.reflect.UndeclaredThrowableException
I've had a crack at changing the ActiveMQ heartbeat/failover testing frequencies in the Config table but with no success (they didn't even seem to get read?). I've also noticed that ActiveMQ is loading up on ports 61616 and 61613, but the doc seems to suggest that this should be on 61616 and 61611. Otherwise, as I said, everything else seems to work perfectly fine.
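For anyone wanting to confirm which of those ports actually accept connections, a plain TCP connect test is enough. The sketch below assumes nothing about vCloud Director itself; the host value is a placeholder, and the port numbers are simply the three mentioned above.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder host: replace with the address the cell is bound to.
    cell_host = "127.0.0.1"
    for port in (61616, 61613, 61611):
        state = "open" if port_is_open(cell_host, port) else "closed or unreachable"
        print(f"{cell_host}:{port} is {state}")
```

This only shows whether something is listening on each port, not what it is, so treat the result as a starting point rather than proof that ActiveMQ is configured correctly.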
I'm grateful for any and all help that you can offer!
Mike
Can you post the full stack trace of the exception in the debug log?
Thanks,
Sangeeta
Mike,
Have a look at post http://communities.vmware.com/message/1674087
This is not the same issue, but it does point to a couple of places you might be able to look for more diagnosis. In particular, check out the global.properties file to see if the wrong IP address is popping up in there.
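If it helps, here is a rough way to list every IPv4 address that appears in a properties file so an unexpected one stands out. It is only a sketch: it assumes simple `key = value` lines, the function name is made up, and it knows nothing about which keys global.properties really contains.

```python
import re

# Loose IPv4 matcher; it does not validate that each octet is <= 255.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_ipv4_entries(properties_text):
    """Return (line_number, key, address) for each IPv4 address found in a value."""
    entries = []
    for number, raw in enumerate(properties_text.splitlines(), start=1):
        line = raw.strip()
        # Skip blank lines and the two comment styles used in .properties files.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        for address in IPV4_PATTERN.findall(value):
            entries.append((number, key.strip(), address))
    return entries
```

Read the file into a string, pass it to `find_ipv4_entries`, and compare the addresses it returns with the ones the cell is supposed to be using.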
You've probably already looked, but these KB articles might help with some background:
http://kb.vmware.com/kb/1026292 (Cell Architecture)
http://kb.vmware.com/kb/1030954 (Cell Architecture Diagram)
http://kb.vmware.com/kb/1026294 (Multi-cell features - might help with explaining Heartbeat behaviour)