Wondering if anyone has experienced symptoms similar to the following:
A guest VM (Windows 2008 R2) under high memory and CPU load is live migrated to a different host within the cluster (no storage migration). Shortly after the migration completes, with no errors recorded in vSphere, CPU usage spikes to near 100% and VM performance degrades until the guest is rebooted, at which point it returns to expected levels. We're not even sure VMware is causing this, as load on the affected guests is quite high to begin with. We're trying to determine whether the live migration is exacerbating the issue.
I've opened a ticket with support, but all they could do was check over our vMotion setup. We were unable to reproduce the issue for them.
Thanks all for the replies. I was finally able to reproduce the issue for support, and after digging through the vmkernel.log file, support determined that the problem was caused by lingering effects of an All Paths Down (APD) event. They recommended that we reboot the hosts affected by the APD event. We haven't seen a recurrence since rebooting them.
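For anyone hitting something similar: a quick way to check for APD traces, assuming shell access to the ESXi host, is to grep the vmkernel log for APD-related entries. The sample log lines below are illustrative stand-ins, not exact ESXi output; on a real host you would point grep at /var/log/vmkernel.log instead of a temp file.

```shell
# Create a stand-in log file for illustration. On a real ESXi host,
# skip this step and grep /var/log/vmkernel.log directly.
cat > /tmp/vmkernel.log <<'EOF'
2024-01-01T00:00:01Z cpu3: device naa.600... has entered the All Paths Down state
2024-01-01T00:05:01Z cpu1: APD timeout event for device naa.600...
EOF

# Case-insensitive search for APD-related entries
grep -iE 'all paths down|apd' /tmp/vmkernel.log
```

If this turns up entries around the time of your vMotion events, it's worth raising with support before assuming a guest-level cause.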
Welcome to the Community - I have only seen this type of behavior when vMotioning a heavily loaded machine to a host that is already close to its limit. Was the vMotion manual, or triggered by DRS?
This was triggered by DRS. What would you define as "close to the limit" in such a scenario? Are we talking memory or CPU, or both?
Could be one, the other, or both, but since DRS triggered the vMotion, that's not it - DRS would not have moved the VM if the target had insufficient resources. What workload is this VM carrying?
It's a 2008 R2 Xenapp 6.5 server with 30+ sessions. CPU usage is typically 25-65% with 75-90% RAM usage.
Did you check your non-paged pool size? There's a chance you have a memory leak somewhere. Also check your VMware Tools version; if it's outdated, update it.
I have seen this behaviour in the past; it's usually related to high memory usage on the target host (90%+). Given time it will usually calm back down, but that can take a while. Is there a reason you're running these hosts so close to the edge?