I've got a weird problem. Some of my VMs will not VMotion; others VMotion just fine. When a VM fails to VMotion, it gets to 10% complete, pauses there, and then times out. It never completes.
All VMs are Win2k3, and almost all are deployed from the same template, so they have the same disk sizes, same RAM, same number of CPUs, etc. The only difference between some of the ones that will VMotion and the ones that won't is the name.
IBM HS21XM blades, ESX 3.5, VC 2.5U1.
I contacted VMware support about it; they looked at the logs and told me the problem was probably CPU masks or swapping on the hosts. I cleared the masks on the affected VMs and bumped up Service Console memory on the hosts, but the problem still occurs.
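One way to confirm the masks are really gone is to look at a VM's .vmx file directly from the Service Console; on ESX 3.5, custom CPU identification masks show up as cpuid.* entries. The datastore path below is an example, not your actual path:

```shell
# ESX 3.5 Service Console; the .vmx path is an example.
# Custom CPU identification masks appear as cpuid.* lines in the .vmx.
grep -i "cpuid" /vmfs/volumes/datastore1/myvm/myvm.vmx
# No output means no per-VM CPU masks remain for that VM.
```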
What about the load on these machines? I've had this issue (very rarely) when there was heavy load (in my case, a database application was running large queries).
On the VMs that will not VMotion, can you VMotion them manually? Do all the validation checks pass? Try that and see if you get an error showing what is preventing the VMotion. It could be a connected CD drive, or are you trying to VMotion to a host that does not share the same storage?
Hope this helps!
We had the same issue. There are 7 HP ProLiant 380 G5 servers in our ESX HA/DRS cluster, and 2 older machines don't support SSE4.1.
This KB will help you:
All hosts are identical IBM HS21XM blades. They all have the same CPU.
For the affected VMs, manual VMotions do not work either. Validation is 50/50: sometimes it succeeds without any issues, sometimes it shows a warning about preserving CPU features, but it never actually fails. I find it weird that I get any CPU warnings at all, because the CPUs on each host are the same. Ideas?
The load on these hosts is low, rarely exceeding 50% CPU or memory usage and usually down around 30% on both. That makes me wonder why there is any Service Console swapping going on at all. Remember that some VMs VMotion without a problem and some don't, so I don't think load is involved. If excessive load were causing the problem, wouldn't I have this issue with ALL VMs instead of just some of them?
Have you noticed whether it is the same VMs that fail? Are they failing to a certain ESX host, or to any ESX host? Or is it any VM failing at different times? Have you verified DNS settings? Ensure all your ESX hosts have an /etc/hosts file with all the other ESX hosts entered in it. Have you changed anything on your ESX hosts? I can't remember what caused this for me, whether it was ESX patches or something else, but I remember some VMotions failing. What I did then was SSH into each ESX host and vmkping every other ESX host on both of its IPs, the Service Console IP and the VMotion IP. After that I could VMotion successfully.
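A sketch of that connectivity test from the Service Console of one host (the IPs below are examples, substitute your own):

```shell
# ping tests the Service Console network; vmkping tests the VMkernel
# (VMotion) network -- both paths need to work between every pair of hosts.
ping -c 3 192.168.10.12      # another host's Service Console IP (example)
vmkping 192.168.20.12        # the same host's VMotion/VMkernel IP (example)
```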
Hope this helps!
It is the same VMs that fail. Every now and then, and this is rare, one of them will successfully VMotion to another host, but then it goes back to failing.
If a VM fails to VMotion to/from one host, it will fail to/from all hosts. I verified this by attempting to VMotion a VM from Host1 to all other hosts, then shutting it down, cold-migrating it to Host2, powering it on, and trying to VMotion to all other hosts again. I repeated this for every host in the cluster.
I have not tried vmkping between all the hosts to see if that resolves the problem. I didn't bother with this because some VMs VMotion just fine, so I figured VMotion itself was working without issue. I will attempt this and see what happens.
It's probably worth noting that I am in the process of migrating our hosts from ESX 3.5 to ESXi Installable. That is why I need to put all the hosts in maintenance mode, and it's how I noticed that some VMs are not VMotioning. I have this problem from/to both ESX 3.5 and ESXi, so I don't think it's related to the mix of hypervisor versions; I tested this by creating ESXi-only and ESX 3.5-only clusters, and the same problem occurs.
What error do you get when the VMotion fails?
I get the "Operation timed out" pop-up. Looking in Tasks, the only error on the host is that the Migrate Virtual Machine task shows the status "Operation timed out." Events shows "Migrating $vm off host," and the next event is "Failed to migrate $vm to $host," with no additional details.
Are all ESX hosts at the same version level? When a VM fails, you may want to check the Tasks & Events tab in VirtualCenter. I've seen this type of issue occasionally when there are duplicate VMkernel IP addresses; double-check these in your ESX configuration. When VMotion fails at 10%, it is typically network related.
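A quick way to compare VMkernel IPs across hosts from each Service Console (ESX 3.5 commands):

```shell
# Run on every ESX 3.5 host and compare the output; each VMkernel
# (VMotion) IP must be unique across the cluster.
esxcfg-vmknic -l
# The VMotion port group and its uplinks are visible per vSwitch:
esxcfg-vswitch -l
```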
kjb007... while they are not all at the same version level, VMotion will fail even between the hosts that are. Again, it is not all VMs, just some of them; most VMotion just fine.
Troy Clavell... thanks for the link. I double-checked a couple and, as I thought, none of our VMs have any reservations set, so I don't think that's the issue. It was a good idea, though.
OK, force a VMotion of one of the bad VMs. Then, after it fails, look at the vmware.log as well as the Tasks & Events tab in vCenter. There may also be messages in your vmkernel log at that time if there are issues.
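On ESX 3.5 classic, one way to watch both logs while re-running the failing VMotion (the datastore path is an example):

```shell
# VMkernel log on the source and destination ESX 3.5 hosts:
tail -f /var/log/vmkernel
# The per-VM vmware.log lives next to the .vmx on the datastore
# (example path; substitute your datastore and VM directory):
tail -f /vmfs/volumes/datastore1/myvm/vmware.log
```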
Make sure the network connections are all GbE, not 100 Mb. A 100 Mb connection can sometimes cause timeouts, although they would be random rather than hitting the same VMs over and over, unless the failing VMs have more memory in use.
ESXi... there is no vmkernel log, at least not that I can find.
vmware.log shows no errors.
Connections are all GbE.
New behavior to report. Some VMs migrate very quickly. Some hit 10% and then just time out. And others hit 10%, pause for a minute, then move along at normal speed to 90%, where they pause again for a minute or so, and then complete with a total time of about 2-3 minutes. There doesn't seem to be a correlation between RAM allocation and migration time; I tried bumping up the RAM on a quick-migrating VM and it didn't slow it down at all.
Check the Tasks & Events tab for VMs that fail migration. I understand the connections are GbE, but make sure the negotiated speed is actually 1000 Mbps and not 100. You should be able to see that in the vSwitch properties.
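The negotiated speed is also visible from the Service Console on ESX 3.5:

```shell
# Lists each physical NIC with its negotiated speed and duplex;
# look for "1000Mbps Full" on the uplinks carrying VMotion traffic.
esxcfg-nics -l
```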
Verified negotiated GbE on all interfaces.
The Tasks and Events for a failed migration go like this:
Task: Migrate Virtual Machine
Being migrated from hostA to hostB
Migrating off hostA
Failed to migrate to hostB... the details for this event only say "Failed to migrate $vm from hostA to hostB"
The Migrate Virtual Machine task simply shows a status of "Operation timed out," with the task details referencing the events above.
If all the interfaces are at 1 Gb and VMotion tasks are still timing out, then I'd look at the switches and network utilization next. I will occasionally get timeouts, but if I kick off the VMotion again, it will succeed. I usually see a couple of timeouts when I place a few servers into maintenance mode during patch time; that forces tens of VMs into VMotion at once, and occasionally a few will time out. Have you verified the state of the network when a VMotion fails?
This may be a long shot, but I had this problem a while back, and I can't remember exactly the steps I took. I know part of it was to first power off the VM (shut down the guest). Then change a setting or two in the VM config; even if you don't really change anything, click 'OK' to save the changes. Make sure the tools are up to date for these failed VMs by checking the appropriate box in the VM's settings. Then power it back on. Wait for the guest to boot, check that the tools show the most recent version and are running inside the guest. Then try a manual migration again.
If that doesn't work, power off the VM again (shut down the guest) and remove it from inventory. Register the VM again (on a completely different ESX server in the cluster) and try the steps again.
I seem to remember it was something completely odd, like the VC database having an incorrect registration entry for the VM, so VC couldn't 'select' the correct VM to VMotion. It hasn't happened for a while since I completely rebuilt the VC DB, which leads me to believe it's a VC DB issue and not a VM problem on ESX. But the steps above should fix it if I remember correctly.
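On ESX 3.5 classic, the unregister/re-register step can also be done from the Service Console with vmware-cmd; the .vmx path below is an example:

```shell
# Power the VM off first, then remove its registration on the current host:
vmware-cmd -s unregister /vmfs/volumes/datastore1/myvm/myvm.vmx
# ...then register it on a *different* ESX host in the cluster:
vmware-cmd -s register /vmfs/volumes/datastore1/myvm/myvm.vmx
```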
If they are the same CPUs, just double-check that the BIOS settings are the same, such as the Intel 64-bit and VT bits being turned on.
Other things to check:
CD/floppy still attached to the VM - if so, remove it.
The virtual machine's vNIC is attached to an internal-only network, or to a vSwitch not available on another ESX server - check the spelling of the vSwitches and which network the VM is connected to.
VMware Tools is currently installing in the virtual machine - wait for the install to complete or cancel the installation.
The virtual machine is stored on a datastore local to that ESX host - the VM needs to be on a centralized datastore (SAN, etc.) that is available to the other ESX hosts.
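A quick way to spot connected removable media is to grep the VM's .vmx file; the device names and path below are examples of the usual CD/floppy entries:

```shell
# Example path; typical CD-ROM and floppy devices are ide1:0 and floppy0.
grep -iE "ide1:0|floppy0" /vmfs/volumes/datastore1/myvm/myvm.vmx
# Look for present/startConnected = "TRUE" with an ISO or host-device backing.
```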