How much outage should one expect to see when 'high priority' migrating a VM?
I'm using an extended ping to measure and seeing about 15 seconds of outage:
64 bytes from 10.0.1.201: icmp_seq=61 ttl=127 time=0.399 ms
64 bytes from 10.0.1.201: icmp_seq=62 ttl=127 time=0.401 ms
64 bytes from 10.0.1.201: icmp_seq=64 ttl=127 time=3.129 ms
64 bytes from 10.0.1.201: icmp_seq=65 ttl=127 time=2.330 ms
64 bytes from 10.0.1.201: icmp_seq=66 ttl=127 time=0.382 ms
64 bytes from 10.0.1.201: icmp_seq=67 ttl=127 time=0.357 ms
64 bytes from 10.0.1.201: icmp_seq=68 ttl=127 time=14147.063 ms
64 bytes from 10.0.1.201: icmp_seq=69 ttl=127 time=13147.018 ms
64 bytes from 10.0.1.201: icmp_seq=70 ttl=127 time=12146.964 ms
64 bytes from 10.0.1.201: icmp_seq=71 ttl=127 time=11146.954 ms
64 bytes from 10.0.1.201: icmp_seq=72 ttl=127 time=10146.969 ms
64 bytes from 10.0.1.201: icmp_seq=73 ttl=127 time=9146.965 ms
64 bytes from 10.0.1.201: icmp_seq=74 ttl=127 time=8146.919 ms
64 bytes from 10.0.1.201: icmp_seq=75 ttl=127 time=7146.866 ms
64 bytes from 10.0.1.201: icmp_seq=76 ttl=127 time=6146.837 ms
64 bytes from 10.0.1.201: icmp_seq=77 ttl=127 time=5146.778 ms
64 bytes from 10.0.1.201: icmp_seq=78 ttl=127 time=4146.716 ms
64 bytes from 10.0.1.201: icmp_seq=79 ttl=127 time=3146.664 ms
64 bytes from 10.0.1.201: icmp_seq=80 ttl=127 time=2146.632 ms
64 bytes from 10.0.1.201: icmp_seq=81 ttl=127 time=1146.588 ms
64 bytes from 10.0.1.201: icmp_seq=82 ttl=127 time=146.615 ms
64 bytes from 10.0.1.201: icmp_seq=83 ttl=127 time=0.358 ms
64 bytes from 10.0.1.201: icmp_seq=84 ttl=127 time=0.274 ms
64 bytes from 10.0.1.201: icmp_seq=85 ttl=127 time=0.424 ms
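If anyone wants to quantify this from their own capture, here's a quick throwaway parser I'd use (my own sketch, not any VMware tooling; it assumes standard Linux ping output):

```python
import re

def max_rtt_and_missing(ping_lines):
    """Parse standard Linux ping output and report the worst RTT (ms)
    plus any icmp_seq numbers that never came back."""
    seqs, rtts = [], []
    for line in ping_lines:
        m = re.search(r"icmp_seq=(\d+).*time=([\d.]+) ms", line)
        if m:
            seqs.append(int(m.group(1)))
            rtts.append(float(m.group(2)))
    # Sequence numbers between the first and last reply that never showed
    # up are true drops; huge RTTs are replies that got queued instead.
    missing = sorted(set(range(seqs[0], seqs[-1] + 1)) - set(seqs))
    return max(rtts), missing
```

Fed the capture above, it reports a worst RTT of 14147.063 ms with only icmp_seq 63 missing -- in other words a ~15-second stall where replies queue up, rather than a string of drops.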
What do other people see?
My environment is as follows:
Hardware:
4x HP DL380 G5 (2x 2.33GHz Woodcrest CPUs + 8GB RAM + QLogic 2342 HBA each)
Cisco 6506 with GigE cards. (Running 2 port Etherchannel Trunk to ESX hosts, but several other configurations returned identical results.)
EMC CLARiiON array with 146GB drives
Cisco 9509 FC Switch
Software:
ESX 3.0.2, VC 2.0.2 (NTP is set up and running well, as is name resolution.)
Windows Server 2003 guest
Switch Config:
interface Port-channel2
description VMWARE-01
switchport
switchport trunk encapsulation dot1q
switchport mode trunk
switchport nonegotiate
mtu 9216
no ip address
spanning-tree portfast trunk
!
interface GigabitEthernet1/1
description VMWARE-01 ETH1
switchport
switchport trunk encapsulation dot1q
switchport mode trunk
switchport nonegotiate
mtu 9216
no ip address
spanning-tree portfast trunk
channel-group 2 mode on
!
interface GigabitEthernet1/2
description VMWARE-01 ETH2
switchport
switchport trunk encapsulation dot1q
switchport mode trunk
switchport nonegotiate
mtu 9216
no ip address
spanning-tree portfast trunk
channel-group 2 mode on
!
You should see no more than 1-2 dropped pings on average.
I'm no Cisco guy, but looking at your config and ours (we also trunk), here are the differences:
you specified an MTU -- we do not do that
you have switchport nonegotiate -- we do not do this either
One thing I don't see there is a native VLAN that's different from the VLANs you are allowing across the trunks.
--oops , replied to the wrong email
I'm not seeing pings drop per se. It's more like they get queued up for ~15 seconds and then go out in one big burst. That would probably be fine for most apps, but I'm testing a voicemail application that streams real-time audio as UDP (RTP) packets. If a packet doesn't arrive in time, it may as well not arrive at all.
I'm still not sure whether what I'm seeing is expected behavior with VMware. ?:|
I also have the same setup with XenSource Enterprise 4. It handles a live migration with virtually no interruption. We're talking less than a second.
I do know a fair bit about switches so I'll explain those settings:
MTU = 9216 <--- This allows frames larger than 1500 bytes (9216 bytes in this case) to traverse this interface. Frames larger than 1500 bytes are often referred to as 'Jumbo'. The advantage of Jumbo frames is that there is less protocol overhead: it's quicker to move large amounts of data in fewer, larger frames. VMware doesn't support Jumbo frames today, but having it turned on won't impair things. It's just like having extra lanes of highway on a quiet Sunday morning drive.
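To put a rough number on that 'less protocol overhead' point -- a back-of-the-envelope sketch, assuming plain Ethernet + IPv4 + TCP with no VLAN tag:

```python
def header_overhead(mtu):
    """Approximate fraction of a full-sized frame spent on headers:
    Ethernet header + FCS (18B) sit outside the MTU, IP + TCP (40B)
    come out of the MTU itself."""
    frame = mtu + 18
    payload = mtu - 40
    return 1 - payload / frame

print(f"1500-byte MTU: {header_overhead(1500):.2%} overhead")
print(f"9000-byte MTU: {header_overhead(9000):.2%} overhead")
```

So a jumbo frame cuts per-byte header overhead by roughly a factor of six, and it also means fewer frames (and fewer interrupts) for the same amount of data.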
switchport nonegotiate <--- This turns off Dynamic Trunking Protocol (DTP). DTP manages trunk negotiation. DTP should be turned off when one side of a trunk (in this case the VMware side) does not support DTP.
A native VLAN allows a device that does not send 802.1q tagged frames to operate on the port. Its frames are placed on the native VLAN by the switch. In my case, I have told VMware which VLANs to use specifically in the config, so all frames should have a tag.
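(For completeness: if you did want a distinct native VLAN, the usual IOS form on the trunk looks like the below. The VLAN numbers are placeholders, not taken from either of our configs.)
interface Port-channel2
switchport trunk native vlan 999
switchport trunk allowed vlan 100,200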
Thanks again for your reply. Any other info you have to offer will be much appreciated!
I can't advise how to fix your problem, but I can tell you this isn't normal behavior. Here's a sample of my VMotion between two Dell 6850s. We have a dedicated NIC and switch for VMotion. As you can see, I had one ping reach 23ms -- hardly a disruption.
64 bytes from 138.254.134.36: icmp_seq=54 ttl=128 time=0.967 ms
64 bytes from 138.254.134.36: icmp_seq=55 ttl=128 time=0.849 ms
64 bytes from 138.254.134.36: icmp_seq=56 ttl=128 time=0.565 ms
64 bytes from 138.254.134.36: icmp_seq=57 ttl=128 time=0.718 ms
64 bytes from 138.254.134.36: icmp_seq=58 ttl=128 time=0.493 ms
64 bytes from 138.254.134.36: icmp_seq=59 ttl=128 time=0.876 ms
64 bytes from 138.254.134.36: icmp_seq=60 ttl=128 time=0.756 ms
64 bytes from 138.254.134.36: icmp_seq=61 ttl=128 time=0.645 ms
64 bytes from 138.254.134.36: icmp_seq=64 ttl=128 time=23.4 ms
64 bytes from 138.254.134.36: icmp_seq=65 ttl=128 time=0.922 ms
64 bytes from 138.254.134.36: icmp_seq=66 ttl=128 time=0.432 ms
64 bytes from 138.254.134.36: icmp_seq=67 ttl=128 time=0.312 ms
64 bytes from 138.254.134.36: icmp_seq=68 ttl=128 time=0.441 ms
64 bytes from 138.254.134.36: icmp_seq=69 ttl=128 time=0.447 ms
64 bytes from 138.254.134.36: icmp_seq=70 ttl=128 time=0.818 ms
64 bytes from 138.254.134.36: icmp_seq=71 ttl=128 time=0.464 ms
64 bytes from 138.254.134.36: icmp_seq=72 ttl=128 time=0.335 ms
64 bytes from 138.254.134.36: icmp_seq=73 ttl=128 time=0.842 ms
64 bytes from 138.254.134.36: icmp_seq=74 ttl=128 time=0.482 ms
64 bytes from 138.254.134.36: icmp_seq=75 ttl=128 time=0.366 ms
64 bytes from 138.254.134.36: icmp_seq=76 ttl=128 time=1.74 ms
OK that's much closer to what I'm seeing with Xen.
When you say you have a dedicated NIC and switch -- is that a virtual switch or hardware switch?
Thank you!
Keith
We have a dedicated 1G physical NIC (with its own vSwitch) on each server just for VMotion. These all attach to an HP 2824 switch on a non-routable private VLAN; this switch is dedicated to VMotion. Can you post a screenshot of the network configuration page from VC?
I spent some time on this and I now believe the VI3 client is the root cause.
In my environment the console access can take a while to come up. Last night I noticed that while waiting it hangs the machine for a moment -- VMotion or not.
*If I open VI3 and never look at a VM console, I can VMotion with no problem. But if I look at the VM console and then try to VMotion, I'm screwed for the life of the VI3 client (unless I have the below workaround in place).
*Even if you browse away from the console, VI still tries to access it during a VMotion: once at 0%, twice at 94%, and once at 100%. It's the last one that hangs (because it's a fresh connection to a new ESX server, perhaps?) for the 15 seconds. That explains the outage I saw whenever I would VMotion.
To work around the issue I set the maximum console connections to 0. (That's how I can tell when it tries to connect: I get an error stating 'Console limit reached'.) But this is an annoying workaround -- so now I need to determine why the VI3 client is having problems.
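In case it helps anyone reproduce the workaround: I capped the console connections on the VM itself. If memory serves -- treat the exact option name as an assumption on my part -- the equivalent .vmx entry is:
RemoteDisplay.maxConnections = "0"
Removing the line (or raising the value) restores console access.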
Shouldn't this stuff be a little easier? Yes, yes it should.
I was able to resolve this issue. The root cause was related to the use of certificates in VI server. To resolve it I had to remove "Update Root Certificates" from "Add/Remove Windows Components" on the VI server.
A sniffer trace tipped me off and then I found this online:
http://communities.vmware.com/message/676215
I feel this issue is pretty serious: it makes the VM completely unavailable for 15 seconds, Windows 2003 has this service installed by default, and it's very difficult to troubleshoot. I imagine it affects a lot of installs. If you are reading this post, I'd double-check by connecting/disconnecting from the console of a specific VM a few times while running an extended ping.
Thanks for all the input on the issue. It's working like a champ now! :smileygrin: