14 Replies Latest reply on Mar 25, 2008 2:57 PM by vvarnell

    VM dissapears for 30 sec during vmotion

    mvaughn25 Enthusiast
    vExpert

      I recently rebuilt my lab servers with ESX 3.5. Since then, I have noticed a weird "hiccup" during the VMotion process. VMotion works and does not record any errors, but the VM will stop responding to pings for about 20-30 seconds in the middle of the process. Has anyone else seen this? Have I missed something in rebuilding my ESX servers, maybe a network setting? Prior to this, I never experienced more than one dropped ping, and it always recovered immediately, with no noticeable downtime.
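If you want hard numbers on the outage window, timestamping the ping output makes it easy to read off afterwards. A minimal sketch, assuming only a POSIX shell and `date`; the guest IP shown is a placeholder:

```shell
# Prefix every line read on stdin with a wall-clock timestamp, so the
# exact moment replies stop and resume during the VMotion is recorded.
stamp() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date +%H:%M:%S)" "$line"
    done
}

# Usage (172.22.34.99 is a placeholder for the guest's IP address):
#   ping -i 1 172.22.34.99 | stamp
```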

       

        • 1. Re: VM dissapears for 30 sec during vmotion
          uslacker99 Expert

          Can you do a ping from the VM and see if you lose any connectivity? How many NICs do you have in the VSwitch that the VM is on?
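The NIC question can be answered quickly from the ESX 3.x service console with the standard config-listing commands (run as root on the host):

```
# List vSwitches with their port groups and physical uplinks
esxcfg-vswitch -l
# List physical NICs with link state, speed and duplex
esxcfg-nics -l
```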

          • 2. Re: VM dissapears for 30 sec during vmotion
            mvaughn25 Enthusiast
            vExpert

            Can you do a ping from the VM and see if you lose any connectivity?

            -

            It shows that it loses one packet...no big deal.

            How many NICs do you have in the VSwitch that the VM is on?

            -

            One GB NIC.

            -

            I have also noticed a VERY long pause at 10%, and the dropped packets occur during a pause at 90%. All IPs can be pinged from my workstation (Service Console and VMkernel). From the ESX host, everything is visible via both ping and vmkping. DNS also works, so I am not sure where to look next. With the major changes to VMotion, is this part of the new behavior? It is still not an outage, but I really do not like the 30-second "hiccup".

            • 3. Re: VM dissapears for 30 sec during vmotion
              uslacker99 Expert

              Hard to say what the problem is. Any errors/warnings in /var/log/vmkernel?
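A quick way to do that check is to filter the log for the usual suspects. A minimal sketch, written as a function so it can be pointed at any log file (on ESX 3.x the live log is /var/log/vmkernel):

```shell
# Pull the most recent warning/error-looking lines out of a vmkernel log.
scan_vmklog() {
    grep -iE 'warn|error|fail|abort' "$1" | tail -n 20
}

# On the ESX host:
#   scan_vmklog /var/log/vmkernel
```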

              • 4. Re: VM dissapears for 30 sec during vmotion
                mvaughn25 Enthusiast
                vExpert

                 

                Sorry for the delay in responding; let's pick up where we left off.

                In this test, I was migrating the VM from ESX01 to ESX02 (named for reference). Here are the /var/log/vmkernel entries from the time in question on both servers, with comments. You will notice that a lot was going on on the destination host (ESX02) during the 31 seconds that the VM was unavailable on the network.

                -


                ########################
                ##VMotion Test
                ##Source host (ESX01)
                ########################
                Dec 17 20:32:05 esx1 vmkernel: 6:05:24:27.172 cpu1:1206)FS3: 1974: Checking if lock holders are live for lock [type 10c00001 offset 55209984 v 812, hb offset 3637248
                Dec 17 20:32:05 esx1 vmkernel: gen 64, mode 1, owner 475ecd09-8d792c58-33c0-0019bbcf7ed0 mtime 537115]
                Dec 17 20:32:10 esx1 vmkernel: 6:05:24:31.670 cpu1:1206)FS3: 1974: Checking if lock holders are live for lock [type 10c00001 offset 55209984 v 814, hb offset 3637248
                Dec 17 20:32:10 esx1 vmkernel: gen 64, mode 1, owner 475ecd09-8d792c58-33c0-0019bbcf7ed0 mtime 537120]

                #Progress paused at 10 percent

                Dec 17 20:32:30 esx1 vmkernel: 6:05:24:52.028 cpu1:1206)Migrate: vm 1207: 7332: Setting migration info ts = 1197952297078125, src ip = <172.22.34.13> dest ip = <172.22.34.14> Dest wid = 1154 using SHARED swap
                Dec 17 20:32:30 esx1 vmkernel: 6:05:24:52.028 cpu1:1206)World: vm 1214: 895: Starting world migSendHelper-1207 with flags 1
                Dec 17 20:32:30 esx1 vmkernel: 6:05:24:52.028 cpu1:1206)World: vm 1215: 895: Starting world migRecvHelper-1207 with flags 1
                Dec 17 20:32:30 esx1 vmkernel: 6:05:24:52.029 cpu3:1107)MigrateNet: vm 1107: 854: Accepted connection from <172.22.34.14>
                Dec 17 20:32:34 esx1 vmkernel: 6:05:24:56.238 cpu1:1207)Migrate: 7245: 1197952297078125: Stopping pre-copy: only 523 pages were modified

                ##Progress at 61 percent
                ##VM not available on network

                ##A LOT of traffic on ESX02 log during this timeframe

                ##Progress at 90 percent
                ##VM available on network (down for 31 seconds)

                Dec 17 20:33:05 esx1 vmkernel: 6:05:25:26.626 cpu3:1063)<4>lpfc0:0712:FPe:SCSI layer issued abort device Data: x1 x1
                Dec 17 20:33:06 esx1 vmkernel: 6:05:25:27.887 cpu3:1063)<4>lpfc0:0749:FPe:Completed Abort Task Set Data: x1 x1 x128
                Dec 17 20:33:11 esx1 vmkernel: 6:05:25:32.506 cpu2:1033)FS3: 5031: Waiting for timed-out heartbeat [HB state abcdef02 offset 4143104 gen 69 stamp 537928536083 uuid 475ec8fe-a382103d-2ff7-0019bbccf378 jrnl drv 4.31]
                Dec 17 20:33:12 esx1 vmkernel: 6:05:25:33.576 cpu0:1206)FS3: 5031: Waiting for timed-out heartbeat [HB state abcdef02 offset 4143104 gen 69 stamp 537933532738 uuid 475ec8fe-a382103d-2ff7-0019bbccf378 jrnl drv 4.31]
                Dec 17 20:33:12 esx1 vmkernel: 6:05:25:33.588 cpu3:1039)FS3: 4828: Reclaimed timed out heartbeat [HB state abcdef02 offset 4143104 gen 69 stamp 537910248336 uuid 475ec8fe-a382103d-2ff7-0019bbccf378 jrnl drv 4.31]

                 

                -


                ########################
                ##VMotion Test
                ##Destination host (ESX02)
                ########################
                ##Note: ESX02 clock is 7 seconds behind ESX01
                ########################
                Dec 17 20:31:57 esx2 vmkernel: 6:05:11:49.636 cpu2:1035)World: vm 1153: 895: Starting world vmware-vmx with flags 4

                ##Progress paused at 10 percent

                Dec 17 20:32:24 esx2 vmkernel: 6:05:12:16.516 cpu2:1153)World: vm 1154: 895: Starting world vmm0:Win2k3-01 with flags 8
                Dec 17 20:32:24 esx2 vmkernel: 6:05:12:16.516 cpu2:1153)Sched: vm 1154: 5333: adding 'vmm0:Win2k3-01': group 'host/user/pool1': cpu: shares=-3 min=0 minLimit=-1 max=-1
                Dec 17 20:32:24 esx2 vmkernel: 6:05:12:16.516 cpu2:1153)Sched: vm 1154: 5352: renamed group 46 to vm.1153
                Dec 17 20:32:24 esx2 vmkernel: 6:05:12:16.516 cpu2:1153)Sched: vm 1154: 5366: moved group 46 to be under group 19

                ##Progress at 61 percent
                ##VM not available on network

                Dec 17 20:32:29 esx2 vmkernel: 6:05:12:21.615 cpu2:1153)Swap: vm 1154: 2169: extending swap to 1048576 KB
                Dec 17 20:32:30 esx2 vmkernel: 6:05:12:22.247 cpu1:1153)Migrate: vm 1154: 7332: Setting migration info ts = 1197952297078125, src ip = <172.22.34.13> dest ip = <0.0.0.0> Dest wid = -1 using SHARED swap
                Dec 17 20:32:30 esx2 vmkernel: 6:05:12:22.247 cpu1:1153)World: vm 1155: 895: Starting world migSendHelper-1154 with flags 1
                Dec 17 20:32:30 esx2 vmkernel: 6:05:12:22.247 cpu1:1153)World: vm 1156: 895: Starting world migRecvHelper-1154 with flags 1
                Dec 17 20:32:30 esx2 vmkernel: 6:05:12:22.619 cpu2:1097)MigrateNet: vm 1097: 854: Accepted connection from <172.22.34.13>
                Dec 17 20:32:43 esx2 vmkernel: 6:05:12:35.203 cpu1:1116)<4>lpfc0:0712:FPe:SCSI layer issued abort device Data: x1 x1
                Dec 17 20:32:44 esx2 vmkernel: 6:05:12:36.464 cpu3:1116)<4>lpfc0:0749:FPe:Completed Abort Task Set Data: x1 x1 x128
                Dec 17 20:32:57 esx2 vmkernel: 6:05:12:49.314 cpu3:1155)Alloc: vm 1154: 15574: Regular swap file bitmap checks out.
                Dec 17 20:32:57 esx2 vmkernel: 6:05:12:49.342 cpu3:1156)Migrate: 3935: 1197952297078125: DONE paging in, status 0

                ##Progress at 90 percent
                ##VM available on network (down for 31 seconds)

                Dec 17 20:33:03 esx2 vmkernel: 6:05:12:55.051 cpu3:1153)VSCSI: 4059: Creating Virtual Device for world 1154 vscsi0:0 (handle 8204)
                Dec 17 20:33:03 esx2 vmkernel: 6:05:12:55.291 cpu3:1153)Migrate: 6394: 1197952297078125: No page at vpn 0x514 offset: 5324800 size: 16384 bytesLeft: 2040
                Dec 17 20:33:08 esx2 vmkernel: 6:05:13:00.319 cpu3:1153)World: vm 1157: 895: Starting world vmware-vmx with flags 44
                Dec 17 20:33:08 esx2 vmkernel: 6:05:13:00.319 cpu0:1157)World: vm 1158: 895: Starting world vmware-vmx with flags 44
                Dec 17 20:33:08 esx2 vmkernel: 6:05:13:00.320 cpu0:1157)World: vm 1159: 895: Starting world vmware-vmx with flags 44
                Dec 17 20:33:08 esx2 vmkernel: 6:05:13:00.320 cpu2:1154)Init: 1054: Received INIT from world 1154
                Dec 17 20:33:08 esx2 vmkernel: 6:05:13:00.372 cpu1:1154)World: vm 1160: 895: Starting world prefault-mig-1154 with flags 1
                Dec 17 20:33:08 esx2 vmkernel: 6:05:13:00.372 cpu0:1160)Swap: vm 1154: 7929: Starting opportunistic swapper for the migration swap file
                Dec 17 20:33:08 esx2 vmkernel: 6:05:13:00.376 cpu1:1154)Migrate: 7550: 1197952297078125: Waiting to receive all changed pages before resuming...
                Dec 17 20:33:08 esx2 vmkernel: 6:05:13:00.376 cpu1:1154)Migrate: 7571: 1197952297078125: Received all changed pages.
                Dec 17 20:33:08 esx2 vmkernel: 6:05:13:00.376 cpu1:1154)Migrate: 3828: 1197952297078125: Resume handshake successful
                Dec 17 20:33:08 esx2 vmkernel: 6:05:13:00.382 cpu1:1154)Net: 4203: unicastAddr 00:50:56:b2:57:7a;
                Dec 17 20:33:09 esx2 vmkernel: 6:05:13:01.195 cpu0:1160)Swap: vm 1154: 8133: Finish swapping in migration swap file. (faulted in 0 pages). Success.


                • 5. Re: VM dissapears for 30 sec during vmotion
                  Rumple Master

                   

                  The step that typically causes drops is the MAC address moving from one ESX server to the other. I believe this causes a few seconds where the same MAC address shows up on two different switch ports. If you do not have portfast enabled on the gear (if it's Cisco), that will cause that 30-second delay. Other managed switches also have settings that allow the MAC to move between ports quickly, but I can't provide much more information than that.

                   

                   

                  If you can ping out from the VM and only drop a packet, but pings from outside into the VM fail, that tells me it could also be a reverse ARP issue. The workstation's traffic is still being sent to the wrong port because the switch hasn't realized the MAC is now on a different port until it refreshes its tables...
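If it is the switch learning the new port, the trigger can often be seen on the wire: on completing a migration, ESX broadcasts a RARP frame from the VM's MAC so physical switches relearn its location. A capture along these lines should show it (the interface name is a placeholder, and this needs root plus visibility of the relevant traffic, e.g. via a SPAN/mirror port):

```
# Watch for the RARP notification sent when the VMotion completes;
# -e prints the Ethernet header so the VM's source MAC is visible.
tcpdump -n -e -i eth0 rarp
```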


                  • 6. Re: VM dissapears for 30 sec during vmotion
                    depping Champion
                    User ModeratorsVMware Employees

                    Rumple is right: enable portfast / disable spanning tree on these ports. That will probably solve this problem; there are numerous PDFs about this subject.
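For reference, on a Cisco switch the suggested change looks roughly like this (the interface name is hypothetical; on trunk ports the `spanning-tree portfast trunk` variant applies):

```
! Hypothetical switch port carrying an ESX uplink
interface GigabitEthernet0/1
 spanning-tree portfast
```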


                    Duncan

                    http://www.ictivity.nl

                    • 7. Re: VM dissapears for 30 sec during vmotion
                      mvaughn25 Enthusiast
                      vExpert

                       

                      Actually, I also noticed issues trying to ping out from the VM during the VMotion. We will look at the networking adjustments that you mentioned and update you on our results.

                       

                       

                      • 8. Re: VM dissapears for 30 sec during vmotion
                        TomHowarth Guru
                        vExpertUser Moderators

                        Please remember to award points for those helpful or correct answers.

                         

                         

                         

                         

                        Tom Howarth

                        VMware Communities User Moderator

                        • 9. Re: VM dissapears for 30 sec during vmotion
                          mvaughn25 Enthusiast
                          vExpert

                           

                          I am out of the office, but I asked a colleague to try the portfast and spanning tree recommendations from above. Here are the comments that I received back. To clarify, both ESX servers are on the same physical Cisco switch, and this environment did not show any issues until we rebuilt it with ESX 3.5 almost 2 weeks ago. All switch port settings were untouched before and after that change.

                           

                           

                          -


                           

                           

                          Alright... here's what I've come up with so far:

                           

                           

                          Spanning tree portfast makes no difference whether it's on or off (which I expected, because that only gets involved when a physical switch port goes down and comes back up. Outside of that, spanning tree takes roughly 50 seconds anyway, not 30, in a normal configuration).
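As a side note, the portfast state being toggled here can be confirmed per port on Cisco gear with something like the following (interface name is hypothetical); the output should include a line such as "The port is in the portfast mode" when it is active:

```
show spanning-tree interface GigabitEthernet0/1 detail
```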

                           

                           

                          Speed and duplex have been set on both the ESX servers and the switch, with counters cleared; no difference. I moved the test VM out of the resource pool... again, no difference.

                           

                           

                          I have noticed that roughly 2-3 minutes after a migration the latency spikes and comes back down (from 2 ms to as high as 1400 ms); not sure if it's related, though.


                          • 10. Re: VM dissapears for 30 sec during vmotion
                            jccoca Hot Shot

                            Disable port-security on the Cisco switch for the VM ports.
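On Cisco IOS that is roughly the following (the interface name is hypothetical; `show port-security interface` will confirm the current state first):

```
interface GigabitEthernet0/2
 no switchport port-security
```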

                            • 11. Re: VM dissapears for 30 sec during vmotion
                              BORGcube Novice

                              If the VM dissapears (sic) for 30 seconds, it sounds very much like spanning tree. Make sure you have spanning tree portfast enabled on all the network links attached to your ESX servers.

                              • 12. Re: VM dissapears for 30 sec during vmotion
                                vvarnell Enthusiast

                                 

                                I'm seeing this (or something similar) as well, but it is between 16 and 20 seconds (4-5 dropped pings).

                                 

                                 

                                My cluster is 16 nodes, eleven ESX 3.0.2 and five ESX 3.5. I noticed this in the process of upgrading to ESX 3.5.

                                 

                                 

                                Vmotion behavior:

                                 

                                • 3.5 to 3.5 -> 4-5 dropped pings

                                • 3.0.2 to 3.5 -> 4-6 dropped pings

                                • 3.0.2 to 3.0.2 -> 1 dropped ping (this has been the case since 2.5 and is what I consider "normal")

                                • 3.5 to 3.0.2  -> 2-3 dropped pings

                                 

                                Also, the startup of the vMotion seems to take several seconds longer than previously.

                                 

                                 

                                The only changes are the upgrades to ESX 3.5 (done with CD-based upgrade, ZIP-based upgrade, and new-install methods). No SAN or network changes, and the timing is wrong for spanning tree/portfast.

                                 

                                 

                                Thoughts? Ideas?

                                 

                                 

                                VwV

                                • 13. Re: VM dissapears for 30 sec during vmotion
                                  mvaughn25 Enthusiast
                                  vExpert

                                   

                                  Unfortunately, though all of the feedback was well informed and covers items that should be checked when troubleshooting an issue with these symptoms, none of the solutions here resolved the problem. Without changing the network settings, we completely rebuilt the lab environment (Win2k3/VC server and 2 ESX hosts) and all of the problems went away. VMotion now goes through with no problems. Not sure where the problem was, but it is gone now. I really hate problems that go away without a solid answer, because you don't know how to keep them from coming back. We even rebuilt on the same hardware, so it really had to be a configuration issue, though we pored over the configs again and again and compared them to our working environments with no luck.

                                   

                                   

                                  If I ever see this issue again, I will re-open this thread or link it to a new one. Thank you to everyone for their valued feedback.

                                   

                                   

                                  • 14. Re: VM dissapears for 30 sec during vmotion
                                    mvaughn25 Enthusiast
                                    vExpert

                                    vvarnell,

                                     

                                     

                                    Sorry to see you are having a similar issue. Rebuilding my ESX 3.5 servers resolved my problems. I also did a clean install of VC, though I doubt that had anything to do with it. I am back to one dropped packet during a VMotion, which is expected as the ARP change is broadcast. Your timing table definitely shows the ESX 3.5 servers as the suspects. Sorry I can't be more helpful.