Windows 2003 SP2 Enterprise
During a VMotion, particular guest operating systems see disk errors, which cause application issues
I currently have an issue where, if I VMotion a guest SQL server around in the cluster, the guest sees disk errors in the event log and SQL operation latency in the SQL logs. This causes problems for the application the guest runs, and requires rebooting or troubleshooting the distributed application servers. The owners of the guest OS want me to move them to a non-DRS ESX host so they don't have any issues, but then I can't apply updates/patches or distribute load without shutting down the guest. The SQL servers aren't under intense load, but the VMDKs are rather large. They have a 20GB C:\ drive and a 280GB D:\ drive. I've verified that their disk timeout setting at HKLM\SYSTEM\CurrentControlSet\Services\disk\Timeout is set to 60. All the hosts in the cluster are IBM 3650s with the same processors and the same settings enabled, and all of them (all 8) see the same LUN that the guest resides on.
Does anyone have any ideas that I could try? I don't have many rights over the guest OS itself, but I could work with its owners if it came to things like disk resizing, registry values, performance logging, etc. The VMotions complete correctly and in a timely fashion; it just appears to the guest that the disks vanish temporarily. Maybe it's the ESX host queue depth on the HBAs. I've no idea. I'm grasping at straws.
What type of storage are you using? Is it FC/iSCSI/NFS? Are you seeing SCSI errors in the system log in Windows? Are you seeing SCSI errors in the vmkernel log on the ESX hosts?
In addition to the storage type, if you could give us the exact vendor and model information, it would help us figure out whether there are any known issues or configuration options that you have to turn on.
Also, please try to post an extract of the Windows event log from the Event Viewer.
Left a few things out.
All of the ESX hosts are connected to a FC IBM SAN. We use an IBM SVC to do storage abstraction for us, the ESX host does not see any other storage controllers directly. We have a team that manages the SAN, and carves up LUNs to us when we ask for them.
Additionally, I reviewed the /var/log/vmkernel logs for the timeframe in question on both the vmotion source and vmotion destination host, and they are completely devoid of any errors. When I get into work tomorrow I'll attach a copy of the logs to this thread in case someone spots something I missed, but I don't think that will be the case.
As for the event logs and SQL logs, those I shall also grab and attach in the morning. Thanks for the replies guys, I appreciate knowing there are people willing to help out there.
Event Type: Information
Event Source: MSSQLSERVER
Event Category: (2)
Event ID: 833
Time: 2:19:48 AM
SQL Server has encountered 2 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file in database (16). The OS file handle is 0x00000808. The offset of the latest long I/O is: 0x000000099db400
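For anyone wanting to line these events up against migration times, here is a rough Python sketch, not a definitive tool. It pulls the occurrence count, the threshold in seconds, and the offset out of the 833 message text quoted above, and checks whether the stalled I/O interval overlaps a migration window. The function names and the idea of supplying the VMotion start and end times by hand are my own assumptions; nothing here comes from SQL Server or VMware tooling.

```python
import re
from datetime import datetime, timedelta

# Matches the text of SQL Server event 833 as quoted in this thread.
EVENT_833 = re.compile(
    r"encountered (?P<count>\d+) occurrence\(s\) of I/O requests "
    r"taking longer than (?P<threshold>\d+) seconds"
    r".*?offset of the latest long I/O is: (?P<offset>0x[0-9a-fA-F]+)",
    re.DOTALL,
)


def parse_event_833(message):
    """Return count, threshold and offset from an 833 message, or None."""
    m = EVENT_833.search(message)
    if m is None:
        return None
    return {
        "count": int(m["count"]),
        "threshold_s": int(m["threshold"]),
        "offset": int(m["offset"], 16),
    }


def overlaps_migration(event_time, threshold_s, start, end):
    """Hypothetical helper: does the stalled I/O interval overlap a migration?

    The stall is taken to span [event_time - threshold_s, event_time];
    start and end are the migration times, supplied by whoever runs this.
    """
    stall_start = event_time - timedelta(seconds=threshold_s)
    return stall_start <= end and event_time >= start
```

If the logged events only ever overlap migration windows, that points at the VMotion itself; if they also show up at other times, the storage path is the more likely suspect.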
During a VMotion or SVMotion it is possible that the VM will be unavailable until the VMotion/SVMotion completes. You will need to measure this on your system. This loss of disk/network access is expected.
Your application needs to be better tuned for this eventuality, or the client should reissue its request in case of this type of failure.
Edward L. Haletky
VMware Communities User Moderator
Author of the book 'VMWare ESX Server in the Enterprise: Planning and Securing Virtualization Servers', Copyright 2008 Pearson Education.
SearchVMware Blog: http://itknowledgeexchange.techtarget.com/virtualization-pro/
As well as the Virtualization Wiki at http://www.astroarch.com/wiki/index.php/Virtualization
The message you posted can be seen whether or not a VMotion occurs. I have a rather large database that processes huge transactions and frequently posts this type of message. This message can also be reported incorrectly.
How much memory are you giving this vm, and how long does a vmotion take? Are your ESX servers' CPU busy when you see these messages? Or are they busy overall?
You might want to check this KB: http://kb.vmware.com/KB/1268. If the problem persists, we will need more information, such as the back-end storage and the HBAs you are using on both the source and destination servers.
We might need the vmkernel.log from both the source and destination during the migration.
We are currently experiencing the same SAN disk connectivity issues with our ESX 3.5 HP Blade server cluster connected to EVA600 SAN and have suffered data corruption on our SQL VMs, requiring several of them to be restored from backup. The problem definitely seems to be triggered by VMotion across ESX servers, and we have had to set our DRS settings to conservative while we try to resolve the problem. Despite having support for the hardware and software via HP, we have yet to get anywhere near a solution.
It is interesting to note that we are not alone with this issue. Has anyone else experienced this problem, and/or found a solution?