mvaughn25
Enthusiast

VM disappears for 30 sec during VMotion

I recently rebuilt my lab servers with ESX 3.5. Since then, I have noticed a weird "hiccup" during the VMotion process. VMotion completes without recording any errors, but the VM stops responding to pings for about 20-30 seconds in the middle of the process. Has anyone else seen this? Have I missed something in rebuilding my ESX servers, maybe a network setting? Prior to this, I never experienced more than one dropped ping, and it always recovered immediately with no noticeable downtime.
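(For reference, a timestamped ping loop like the sketch below makes it easy to measure exactly how long the VM is off the network. A minimal sketch, assuming a Linux workstation with iputils ping; the target IP is a placeholder.)

#!/bin/bash
# Timestamp every ping attempt so the outage window can be measured.
# 10.0.0.50 is a placeholder for the VM's IP address.
TARGET=10.0.0.50
while true; do
  if ping -c 1 -W 1 "$TARGET" > /dev/null 2>&1; then
    echo "$(date '+%H:%M:%S') reply"
  else
    echo "$(date '+%H:%M:%S') TIMEOUT"
  fi
  sleep 1
done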

--------------------- Mark Vaughn VCP3, VCP4, vExpert 2009/2010 http://blog.mvaughn.us
14 Replies
uslacker99
Expert

Can you do a ping from the VM and see if you lose any connectivity? How many NICs do you have in the vSwitch that the VM is on?

mvaughn25
Enthusiast

"Can you do a ping from the VM and see if you lose any connectivity?"

It shows that it loses one packet...no big deal.

"How many NICs do you have in the vSwitch that the VM is on?"

One GB NIC.

I have also noticed a VERY long pause at 10%, and the dropped packets come during a pause at 90%. All IPs (VMConsole and VMKernel) can be pinged from my workstation. From the ESX host, everything is visible from both ping and vmkping. DNS also works, so I am not sure where to look next. With the major changes to VMotion, is this part of the new behavior? It is still not an outage, but I really do not like the 30-second "hiccup".
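(For completeness, these are the kinds of service-console commands behind those checks; a sketch using the standard ESX 3.x CLI, with the peer host's VMotion IP from this lab as the example target.)

# List vSwitches, their uplink NICs, and port groups
esxcfg-vswitch -l
# List VMkernel NICs (the VMotion interface) and their IPs
esxcfg-vmknic -l
# Ping the peer host's VMotion address over the VMkernel stack
vmkping 172.22.34.14
# Ordinary ping from the Service Console
ping -c 4 172.22.34.14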

--------------------- Mark Vaughn VCP3, VCP4, vExpert 2009/2010 http://blog.mvaughn.us
uslacker99
Expert

Hard to say what the problem is. Any errors/warnings in /var/log/vmkernel?
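(A sketch of how to do that from the service console, assuming the standard ESX 3.x log location:)

# Watch the VMkernel log live while running a test VMotion
tail -f /var/log/vmkernel
# Or scan for recent warnings/errors after the fact
grep -iE "warning|error" /var/log/vmkernel | tail -20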

mvaughn25
Enthusiast

Sorry for the delay in responding, let's pick up where we left off.

In this test, I was migrating the VM from ESX01 to ESX02 (named for reference). Here are the /var/log/vmkernel entries from the time in question on both servers, with comments. You will notice that a lot was happening on the destination host (ESX02) during the 31 seconds that the VM was unavailable on the network.
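(The excerpts below were trimmed to the migration window. A grep like this, run on each host, pulls out the same window; the timestamp pattern is specific to this test run.)

# Run on each host: pull the two minutes around the migration
grep "^Dec 17 20:3[23]" /var/log/vmkernel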

########################
## VMotion Test
## Source host (ESX01)
########################
Dec 17 20:32:05 esx1 vmkernel: 6:05:24:27.172 cpu1:1206)FS3: 1974: Checking if lock holders are live for lock [type 10c00001 offset 55209984 v 812, hb offset 3637248
Dec 17 20:32:05 esx1 vmkernel: gen 64, mode 1, owner 475ecd09-8d792c58-33c0-0019bbcf7ed0 mtime 537115]
Dec 17 20:32:10 esx1 vmkernel: 6:05:24:31.670 cpu1:1206)FS3: 1974: Checking if lock holders are live for lock [type 10c00001 offset 55209984 v 814, hb offset 3637248
Dec 17 20:32:10 esx1 vmkernel: gen 64, mode 1, owner 475ecd09-8d792c58-33c0-0019bbcf7ed0 mtime 537120]
## Progress paused at 10 percent
Dec 17 20:32:30 esx1 vmkernel: 6:05:24:52.028 cpu1:1206)Migrate: vm 1207: 7332: Setting migration info ts = 1197952297078125, src ip = <172.22.34.13> dest ip = <172.22.34.14> Dest wid = 1154 using SHARED swap
Dec 17 20:32:30 esx1 vmkernel: 6:05:24:52.028 cpu1:1206)World: vm 1214: 895: Starting world migSendHelper-1207 with flags 1
Dec 17 20:32:30 esx1 vmkernel: 6:05:24:52.028 cpu1:1206)World: vm 1215: 895: Starting world migRecvHelper-1207 with flags 1
Dec 17 20:32:30 esx1 vmkernel: 6:05:24:52.029 cpu3:1107)MigrateNet: vm 1107: 854: Accepted connection from <172.22.34.14>
Dec 17 20:32:34 esx1 vmkernel: 6:05:24:56.238 cpu1:1207)Migrate: 7245: 1197952297078125: Stopping pre-copy: only 523 pages were modified
## Progress at 61 percent
## VM not available on network
## A LOT of traffic in the ESX02 log during this timeframe
## Progress at 90 percent
## VM available on network (down for 31 seconds)
Dec 17 20:33:05 esx1 vmkernel: 6:05:25:26.626 cpu3:1063)<4>lpfc0:0712:FPe:SCSI layer issued abort device Data: x1 x1
Dec 17 20:33:06 esx1 vmkernel: 6:05:25:27.887 cpu3:1063)<4>lpfc0:0749:FPe:Completed Abort Task Set Data: x1 x1 x128
Dec 17 20:33:11 esx1 vmkernel: 6:05:25:32.506 cpu2:1033)FS3: 5031: Waiting for timed-out heartbeat [HB state abcdef02 offset 4143104 gen 69 stamp 537928536083 uuid 475ec8fe-a382103d-2ff7-0019bbccf378 jrnl drv 4.31]
Dec 17 20:33:12 esx1 vmkernel: 6:05:25:33.576 cpu0:1206)FS3: 5031: Waiting for timed-out heartbeat [HB state abcdef02 offset 4143104 gen 69 stamp 537933532738 uuid 475ec8fe-a382103d-2ff7-0019bbccf378 jrnl drv 4.31]
Dec 17 20:33:12 esx1 vmkernel: 6:05:25:33.588 cpu3:1039)FS3: 4828: Reclaimed timed out heartbeat [HB state abcdef02 offset 4143104 gen 69 stamp 537910248336 uuid 475ec8fe-a382103d-2ff7-0019bbccf378 jrnl drv 4.31]

########################
## VMotion Test
## Destination host (ESX02)
########################
## Note: ESX02's clock is 7 seconds behind ESX01's
########################
Dec 17 20:31:57 esx2 vmkernel: 6:05:11:49.636 cpu2:1035)World: vm 1153: 895: Starting world vmware-vmx with flags 4
## Progress paused at 10 percent
Dec 17 20:32:24 esx2 vmkernel: 6:05:12:16.516 cpu2:1153)World: vm 1154: 895: Starting world vmm0:Win2k3-01 with flags 8
Dec 17 20:32:24 esx2 vmkernel: 6:05:12:16.516 cpu2:1153)Sched: vm 1154: 5333: adding 'vmm0:Win2k3-01': group 'host/user/pool1': cpu: shares=-3 min=0 minLimit=-1 max=-1
Dec 17 20:32:24 esx2 vmkernel: 6:05:12:16.516 cpu2:1153)Sched: vm 1154: 5352: renamed group 46 to vm.1153
Dec 17 20:32:24 esx2 vmkernel: 6:05:12:16.516 cpu2:1153)Sched: vm 1154: 5366: moved group 46 to be under group 19
## Progress at 61 percent
## VM not available on network
Dec 17 20:32:29 esx2 vmkernel: 6:05:12:21.615 cpu2:1153)Swap: vm 1154: 2169: extending swap to 1048576 KB
Dec 17 20:32:30 esx2 vmkernel: 6:05:12:22.247 cpu1:1153)Migrate: vm 1154: 7332: Setting migration info ts = 1197952297078125, src ip = <172.22.34.13> dest ip = <0.0.0.0> Dest wid = -1 using SHARED swap
Dec 17 20:32:30 esx2 vmkernel: 6:05:12:22.247 cpu1:1153)World: vm 1155: 895: Starting world migSendHelper-1154 with flags 1
Dec 17 20:32:30 esx2 vmkernel: 6:05:12:22.247 cpu1:1153)World: vm 1156: 895: Starting world migRecvHelper-1154 with flags 1
Dec 17 20:32:30 esx2 vmkernel: 6:05:12:22.619 cpu2:1097)MigrateNet: vm 1097: 854: Accepted connection from <172.22.34.13>
Dec 17 20:32:43 esx2 vmkernel: 6:05:12:35.203 cpu1:1116)<4>lpfc0:0712:FPe:SCSI layer issued abort device Data: x1 x1
Dec 17 20:32:44 esx2 vmkernel: 6:05:12:36.464 cpu3:1116)<4>lpfc0:0749:FPe:Completed Abort Task Set Data: x1 x1 x128
Dec 17 20:32:57 esx2 vmkernel: 6:05:12:49.314 cpu3:1155)Alloc: vm 1154: 15574: Regular swap file bitmap checks out.
Dec 17 20:32:57 esx2 vmkernel: 6:05:12:49.342 cpu3:1156)Migrate: 3935: 1197952297078125: DONE paging in, status 0
## Progress at 90 percent
## VM available on network (down for 31 seconds)
Dec 17 20:33:03 esx2 vmkernel: 6:05:12:55.051 cpu3:1153)VSCSI: 4059: Creating Virtual Device for world 1154 vscsi0:0 (handle 8204)
Dec 17 20:33:03 esx2 vmkernel: 6:05:12:55.291 cpu3:1153)Migrate: 6394: 1197952297078125: No page at vpn 0x514 offset: 5324800 size: 16384 bytesLeft: 2040
Dec 17 20:33:08 esx2 vmkernel: 6:05:13:00.319 cpu3:1153)World: vm 1157: 895: Starting world vmware-vmx with flags 44
Dec 17 20:33:08 esx2 vmkernel: 6:05:13:00.319 cpu0:1157)World: vm 1158: 895: Starting world vmware-vmx with flags 44
Dec 17 20:33:08 esx2 vmkernel: 6:05:13:00.320 cpu0:1157)World: vm 1159: 895: Starting world vmware-vmx with flags 44
Dec 17 20:33:08 esx2 vmkernel: 6:05:13:00.320 cpu2:1154)Init: 1054: Received INIT from world 1154
Dec 17 20:33:08 esx2 vmkernel: 6:05:13:00.372 cpu1:1154)World: vm 1160: 895: Starting world prefault-mig-1154 with flags 1
Dec 17 20:33:08 esx2 vmkernel: 6:05:13:00.372 cpu0:1160)Swap: vm 1154: 7929: Starting opportunistic swapper for the migration swap file
Dec 17 20:33:08 esx2 vmkernel: 6:05:13:00.376 cpu1:1154)Migrate: 7550: 1197952297078125: Waiting to receive all changed pages before resuming...
Dec 17 20:33:08 esx2 vmkernel: 6:05:13:00.376 cpu1:1154)Migrate: 7571: 1197952297078125: Received all changed pages.
Dec 17 20:33:08 esx2 vmkernel: 6:05:13:00.376 cpu1:1154)Migrate: 3828: 1197952297078125: Resume handshake successful
Dec 17 20:33:08 esx2 vmkernel: 6:05:13:00.382 cpu1:1154)Net: 4203: unicastAddr 00:50:56:b2:57:7a;
Dec 17 20:33:09 esx2 vmkernel: 6:05:13:01.195 cpu0:1160)Swap: vm 1154: 8133: Finish swapping in migration swap file. (faulted in 0 pages). Success.
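(Side note on the 7-second clock skew: correlating the two logs is easier if both service consoles sync to NTP. A minimal sketch, assuming the standard RHEL-style service console on ESX 3.x; the NTP server name is a placeholder.)

# Allow outbound NTP through the service console firewall (ESX 3.x)
esxcfg-firewall --enableService ntpClient
# Point ntpd at a time source (placeholder hostname)
echo "server ntp.example.com" >> /etc/ntp.conf
# Start ntpd now and on every boot
service ntpd restart
chkconfig ntpd on
# Verify synchronization
ntpq -p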

--------------------- Mark Vaughn VCP3, VCP4, vExpert 2009/2010 http://blog.mvaughn.us
Rumple
Virtuoso

The step that typically causes drops is the MAC address being moved from one ESX server to the other. I believe this causes a few seconds where the same MAC address shows up on two different ports. If you do not have portfast enabled on the gear (if it's Cisco), that will cause that 30-second delay. Other managed switches also have settings that let the MAC move between ports quickly, but I can't provide much more information than that.

If you can ping out from the VM and only drop a packet, but pings from outside into the VM fail, that tells me it could also be a reverse ARP issue. The workstation is still sending traffic to the wrong port because the switch hasn't realized the MAC is now on a different port until it refreshes its MAC address table...
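(If you want to watch that happen, something along these lines on the Cisco side shows which port the switch currently has the VM's MAC on; a sketch in IOS syntax, using the VM MAC that appears in the vmkernel log above.)

! Show which port the switch has learned the VM's MAC on
show mac-address-table address 0050.56b2.577a
! Check how long dynamically learned MACs are cached
show mac-address-table aging-time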

depping
Leadership

Rumple is right: enable portfast / disable spanning tree on these ports. That will probably solve this problem; there are numerous PDFs about this subject.
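(In IOS terms that is roughly the following; the interface name is an example.)

! On each ESX-facing port: skip the STP listening/learning delay
interface GigabitEthernet0/1
 spanning-tree portfast
! Verify the setting took effect
show spanning-tree interface GigabitEthernet0/1 portfast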

Duncan

mvaughn25
Enthusiast

Actually, I also notice issues trying to ping out from the VM during the VMotion. We will look at the networking adjustments you mentioned and update you on our results.

--------------------- Mark Vaughn VCP3, VCP4, vExpert 2009/2010 http://blog.mvaughn.us
TomHowarth
Leadership

Please remember to award points for those helpful or correct answers.

Tom Howarth VCP / VCAP / vExpert
VMware Communities User Moderator
Blog: http://www.planetvm.net
Contributing author on VMware vSphere and Virtual Infrastructure Security: Securing ESX and the Virtual Environment
Contributing author on VCP VMware Certified Professional on VSphere 4 Study Guide: Exam VCP-410
mvaughn25
Enthusiast

I am out of the office, but asked a colleague to try the portfast and spanning tree recommendations from above. Here are the comments I received back. To clarify, both ESX servers are on the same physical Cisco switch, and this environment did not show any issues until we rebuilt it with ESX 3.5 almost two weeks ago. All switch port settings were untouched before and after that change.

Alright... here's what I've come up with so far:

Spanning tree portfast makes no difference whether it's on or off (which I expected, because portfast only comes into play when a physical switch port goes down and comes back up; besides, spanning tree takes roughly 50 seconds in a normal configuration, not 30).

Speed and duplex have been hard-set on both the ESX servers and the switch with counters cleared; no difference. I moved the test VM out of the resource pool... again, no difference.

I have also noticed that roughly 2-3 minutes after a migration the latency spikes and comes back down (from 2ms to as high as 1400ms); not sure if it's related, though.
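(For anyone following along, the switch-side checks behind those statements look roughly like this; IOS syntax, example interface name.)

! Confirm portfast state and STP role on the ESX-facing port
show spanning-tree interface GigabitEthernet0/1 detail
! Look for duplex-mismatch symptoms (CRC errors, late collisions)
show interfaces GigabitEthernet0/1 counters errors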

--------------------- Mark Vaughn VCP3, VCP4, vExpert 2009/2010 http://blog.mvaughn.us
jccoca
Hot Shot

Disable port-security on the Cisco switch for the VM ports.
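(Concretely, something along these lines in IOS; the interface name is an example. Port security pins or limits the MAC addresses allowed per port, so it can block a MAC that suddenly shows up on the destination host's port after a VMotion.)

interface GigabitEthernet0/1
 no switchport port-security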

BORGcube
Contributor

If the VM disappears for 30 seconds, it sounds very much like spanning tree. Make sure you have spanning tree portfast enabled on all the network links attached to your ESX Server.

vvarnell
Enthusiast

I'm seeing this (or something similar) as well, but it is between 16 and 20 seconds (4-5 dropped pings).

My cluster is 16 nodes, eleven ESX 3.0.2 and five ESX 3.5. I noticed this in the process of upgrading to ESX 3.5.

VMotion behavior:

  • 3.5 to 3.5 -> 4-5 dropped pings

  • 3.0.2 to 3.5 -> 4-6 dropped pings

  • 3.0.2 to 3.0.2 -> 1 dropped ping (this has been the case since 2.5 and what I consider "normal")

  • 3.5 to 3.0.2 -> 2-3 dropped pings

Also, the startup of the VMotion seems to take several seconds longer than before.

The only changes are the upgrades to ESX 3.5 (done with CD-based upgrade, ZIP-based upgrade and new methods). No SAN or network changes, and the timing is wrong for spanning tree/portfast.

Thoughts? Ideas?

VwV

mvaughn25
Enthusiast

Unfortunately, though all of the feedback was well informed and covers items that should be checked when troubleshooting an issue with these symptoms, none of the suggestions here resolved the problem. Without changing the network settings, we completely rebuilt the lab environment (Win2k3/VC server and 2 ESX hosts) and all of the problems went away. VMotion now goes through with no problems. Not sure where the problem was, but it is gone now. I really hate problems that go away without a solid answer, because you don't know how to keep them from coming back. We even rebuilt on the same hardware, so it really had to be a configuration issue, though we pored over the configs again and again and compared them to our working environments with no luck.

If I ever see this issue again, I will re-open this thread or link it to a new one. Thank you to everyone for their valued feedback.

--------------------- Mark Vaughn VCP3, VCP4, vExpert 2009/2010 http://blog.mvaughn.us
mvaughn25
Enthusiast

vvarnell,

Sorry to see you are having a similar issue. Rebuilding my ESX 3.5 servers resolved my problems. I also did a clean install of VC, though I doubt that had anything to do with it. I am back to one dropped packet during a VMotion, which is expected as ARP changes are broadcast. Your timing table definitely points to the ESX 3.5 servers as the suspects. Sorry I can't be more helpful.

--------------------- Mark Vaughn VCP3, VCP4, vExpert 2009/2010 http://blog.mvaughn.us