amurph
Contributor

Some VMs will not vMotion, others work just fine

I've got a weird problem. Some of my VMs will not vMotion; others vMotion just fine. When a VM fails to vMotion, it gets to about 10% complete, pauses there, and then times out. It never completes.

All VMs are Win2k3, and almost all are deployed from the same template, so they have the same size disks, same RAM, same number of CPUs, etc. The only difference between some of the ones that will vMotion and the ones that won't is the name.

IBM HS21XM blades, ESX 3.5, VC 2.5U1.

I contacted VMware support about it. They looked at the logs and told me the problem was probably CPU masks or swapping on the hosts. I cleared the masks on the affected VMs and bumped up service console memory on the hosts, but the problem still occurs.

Any ideas?

24 Replies
amurph
Contributor

I went over the basics the very first time it happened, none of those things are involved here but thanks for the input.

sfont3n
Enthusiast

Can you vmkping all the hosts from each other?

Also check the FT_HOSTS file to make sure everything is OK there.
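For reference, a quick connectivity check from the ESX 3.5 service console might look like this (the IP address is a placeholder; substitute the other host's actual VMkernel address):

```shell
# Run from the service console of each ESX host.
# 10.0.1.12 is a placeholder for another host's VMkernel (vMotion) IP.
vmkping 10.0.1.12

# If jumbo frames are enabled on the vMotion network, a large packet
# size can also expose MTU mismatches along the path:
vmkping -s 8972 10.0.1.12
```

If vmkping fails in either direction between any pair of hosts, vMotion between them will stall or time out.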

frangonzalez
Contributor

amurph,

Did you ever find a solution to your problem? I'm seeing similar behavior where vMotion pauses at 10% for a few minutes, quickly moves to over 90%, pauses for a couple more minutes, and then completes.

Thanks.

amurph
Contributor

It turned out to be storage related. We have multiple FC storage arrays, and I noticed that the problem only occurred on one of them. I Storage vMotioned a VM that was failing vMotions over to the good array, and it worked great. We tested a few more, and the issue went away 100% of the time. So we made a plan to migrate all running VMs over to the other storage array over the course of a couple of weeks. As we were doing this, vMotion performance on the problem array improved.

I think it's safe to say that the storage array we were using was simply being overworked, or had some other issue that manifested itself as load increased. We've had trouble with it before, so I don't know why I was surprised, but you'd think that an enterprise-class FC storage array that costs close to $1M wouldn't give you a problem like this. We weren't even oversubscribing our LUNs; only a few VMs per 500GB LUN!

VMware support passed us over to their storage team to try to get to the bottom of exactly what was going on, but unfortunately we had completed the migration by then and were unable to generate any failures for them to examine in the log files.

frangonzalez
Contributor

Thanks for the update, amurph. I opened a case with VMware, and my problem turned out to be related to HA. HA skips vMotion networks by default, so it was using my storage network (we use NFS datastores). The problem is described here: http://www.vmguru.nl/wordpress/2008/12/ha-problem-checklist/ Unfortunately, there's a typo on that page: das.allowNetwork[1] should instead read das.allowNetwork1.
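For anyone else who hits this, the corrected advanced option looks like the following (the port group name is just an example; use the name of the network you want HA to use):

```
# Cluster Settings -> VMware HA -> Advanced Options (VI Client)
# Correct key name, no square brackets:
das.allowNetwork1 = Service Console

# The typo'd form from the linked page, which HA will not recognize:
# das.allowNetwork[1] = Service Console
```

After changing HA advanced options, you typically need to disable and re-enable HA on the cluster (or reconfigure HA on each host) for the setting to take effect.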

So, long story short, this caused HA issues that dramatically slowed down vMotion. The support engineer's theory was that the HA issues were keeping vCenter very busy, leaving little resources to manage the vMotion process.

As a test, we disabled HA in the cluster, and vMotion got a turbo boost. The VMware support engineer helped me correct the typo, we reenabled HA, and all is now well.
