VMware Cloud Community
gusnole9399
Contributor

VMotion super slowdown after Update 3, at wit's end and VMware support forgot about us

I'll try to make this a quick summary. We have three ESX 3.5 servers. We recently upgraded server1 and server2 to Update 3, but not server3 (still on build 64607). Following the upgrades, VMotion to/from server2 became excruciatingly slow, 15-20 minutes per migration, with some timeouts. VMotion between server1 and server3 was just fine.

Anyway, we called VMware support. The second tech (more on VMware support at the bottom) did some work, including editing the hosts files on all servers and bumping the service console memory to 800 MB as recommended. We rebooted server2 and VMotion improved somewhat, down to 3-4 minutes, but still not what it should be. When VMware support stopped getting back to us, we called Dell's VMware support group. We provided them with a bunch of logs and they are still poring through them, but in the meantime we started fiddling around, thinking this was a physical network issue.

So we decided to move our VMotion ports from the two Intel quad NICs in server2 to the two built-in Broadcom NICs, and oddly enough VMotion started working right again. Hooray, for a minute: server3 then started doing exactly what server2 had been doing. VMotion to/from server1 or server2 (to server3) took 10 minutes or so. We touched nothing on server3 at all, so this baffles me completely.

I know we probably need to get Update 3 on server3, but we can't while VMotion to and from server3 is this slow, since we have lots of VMs on it. I'm at a complete loss; if anyone has any suggestions, I'd appreciate it. Everything worked just fine with all servers until we applied Update 3 to server1 and server2, and how server3 started having problems while server2 started working fine is beyond me. Very frustrated and no ideas left.

Our network config is as follows, which might be worth mentioning. We have two quad-port NICs in every server. On each quad we have one port for the service console, one port for VMotion and two ports for VMs; the layout is the same on both quads for redundancy purposes.

On a side note, VMware support has been awful on this. The original tech took logs on a Friday (two weeks ago), didn't get back to us until the following Wednesday, asked for more logs, and then never got back to us. The next tech was some help at the start of this week (Monday) but hasn't gotten back to us since, even after repeated voicemails from me; I still have not heard back, and it's now Thursday. That is why I ended up calling Dell's VMware support yesterday to see if they could help out. So far we've stumped them, but unlike VMware support they have been keeping in contact with us.

Anyway, that is why I am desperately seeking any kind of solid help on this. I really appreciate any suggestions.

Texiwill
Leadership

Hello,

Please provide a diagram of your vNetwork as well as the output of:

esxcfg-vmknic -l

esxcfg-nics -l

esxcfg-vswitch -l

Also, are you using VLANs for vMotion, and what hardware switches are you using?

VMotion should never take more than 20 or so seconds.


Best regards,
Edward L. Haletky
VMware Communities User Moderator, VMware vExpert 2009
====
Author of the book 'VMware ESX Server in the Enterprise: Planning and Securing Virtualization Servers', Copyright 2008 Pearson Education.
Blue Gears and SearchVMware Pro Blogs -- Top Virtualization Security Links -- Virtualization Security Round Table Podcast

--
Edward L. Haletky
vExpert XIV: 2009-2023,
VMTN Community Moderator
vSphere Upgrade Saga: https://www.astroarch.com/blogs
GitHub Repo: https://github.com/Texiwill
gusnole9399
Contributor

Many thanks for the response.

We seem to have gotten things working for the moment by using only one NIC for VMotion on all three ESX servers and plugging those NICs into only one of our physical switches. That's not exactly how we want it, since we have no redundancy. Everything seems to point to a physical network issue (probably the switch), but I'm not sure whether an underlying issue with our virtual network setup is causing this, since the problems only started after we updated two of our servers to Update 3, specifically after updating ESX03, the second one we updated.

Our vNetwork is described in the zip file. Unfortunately I don't have a diagram; I included text files for the three servers with the output of those commands, plus a quick Excel spreadsheet. I hope that gives you something to go on; if not, let me know what else I should provide.

Texiwill
Leadership

Hello,

You do have an oddity. You have two uplinks per VMotion vmknic, but only on ESX2 are both of them cabled to the pSwitch in question, i.e.:

ESX2 (Intel NICs): vmnic6 -> pSwitch; vmnic7 -> pSwitch
ESX3 (Broadcom NICs): vmnic4 (not connected); vmnic5 -> pSwitch
ESX4 (Broadcom NICs): vmnic8 (not connected); vmnic9 -> pSwitch

So ESX2 has two links to the pSwitch and the others have only one each. All are running at 1000 Mbps full duplex, which is good. If any were at 1000 half duplex, that would be a smoking gun.
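Checking link state, speed and duplex across all three hosts can be scripted from captured esxcfg-nics -l output. This is a minimal Python sketch, assuming the listing's columns are whitespace-separated in the order Name, PCI, Driver, Link, Speed, Duplex (check this against your own host's output before relying on it). The function names and the sample rows are hypothetical, for illustration only.

```python
# Sketch: flag pNICs that are not up at 1000Mbps full duplex, given text
# captured from `esxcfg-nics -l`.
# ASSUMPTION: whitespace-separated columns in the order
# Name, PCI, Driver, Link, Speed, Duplex, ... (description last).


def parse_nics(output: str) -> list[dict]:
    """Return one dict per NIC row; the header line is skipped."""
    rows = []
    for line in output.strip().splitlines():
        fields = line.split()
        # Skip the header and anything too short to be a NIC row.
        if len(fields) < 6 or fields[0] == "Name":
            continue
        rows.append({
            "name": fields[0],
            "link": fields[3],
            "speed": fields[4],
            "duplex": fields[5],
        })
    return rows


def suspect_nics(output: str) -> list[str]:
    """Names of NICs that are down, not at 1000Mbps, or not full duplex."""
    return [
        n["name"]
        for n in parse_nics(output)
        if n["link"] != "Up" or n["speed"] != "1000Mbps" or n["duplex"] != "Full"
    ]


# Hypothetical sample output, for illustration only.
SAMPLE = """\
Name    PCI      Driver  Link Speed    Duplex MTU  Description
vmnic4  04:00.00 bnx2    Up   1000Mbps Full   1500 Broadcom NIC
vmnic5  08:00.00 bnx2    Up   1000Mbps Half   1500 Broadcom NIC
"""

if __name__ == "__main__":
    print(suspect_nics(SAMPLE))  # ['vmnic5']
```

Running it against each host's saved listing makes a half-duplex or down uplink stand out immediately.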

That said, on ESX2 disconnect vmnic6 from the configuration. Since you are connecting everything to the same pSwitch it should not have any effect, but perhaps there is a switch packet-delivery issue when traffic goes to ESX2 from the other hosts. So let us make all the hosts at least look the same from a pSwitch/pNIC standpoint in terms of number of connections. This really should not make a difference, but it's worth trying. If it does, then it really does sound like the pSwitch is the issue. Do you have another switch you can use to replace this one? I also doubt it is a Broadcom vs. Intel thing, but you can easily test that as well.

Could you also post the output of:

esxcfg-route -l


Best regards,
Edward L. Haletky
admin
Immortal

I forwarded this thread to VMware Support. - Robert

Robert Dell'Immagine, Director of VMware Communities

gusnole9399
Contributor

On ESX02 we had some wrongly labeled cables, which I have just corrected, so now the appropriate NICs are on the right vSwitches. I was wrong about ESX02: both of its VMotion NICs are plugged in, as opposed to just one on each of the other two servers. I also forgot that we had switched VMotion to the built-in Broadcom NICs to rule out any Intel issues; I had seen some random issues with the Intel quad NICs, so we moved VMotion over to Broadcom. We've been troubleshooting this issue for two weeks, so we got a little frustrated and started trying various things to see what would work.

In any event, if we plug in both VMotion NICs on ESX03 or ESX04, things stop working (VMotion slows to a crawl).

This is the output of esxcfg-route -l; it's the same on all three servers. Thanks again for the help.

Network      Netmask        Gateway
10.1.117.0   255.255.255.0  Local Subnet
default      0.0.0.0        10.1.117.1
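If the VMotion vmknic addresses on all three hosts sit inside 10.1.117.0/24 (an assumption; the vmknic listing itself is not reproduced in this thread), this route table means VMotion traffic between hosts is delivered directly on the local segment and never goes through the 10.1.117.1 gateway. Below is a minimal Python sketch of that longest-prefix lookup using only the standard ipaddress module. The next_hop function and the sample destination addresses are hypothetical.

```python
import ipaddress

# Route table as posted (identical on all three hosts).
# "Local Subnet" in the gateway column means directly connected.
ROUTES = [
    ("10.1.117.0", "255.255.255.0", "Local Subnet"),
    ("default", "0.0.0.0", "10.1.117.1"),
]


def next_hop(dest: str, routes=ROUTES) -> str:
    """Return the gateway column of the most specific route matching dest."""
    addr = ipaddress.IPv4Address(dest)
    best_net, best_gw = None, "unreachable"
    for network, netmask, gateway in routes:
        # "default" is the 0.0.0.0/0 catch-all route.
        base = "0.0.0.0" if network == "default" else network
        net = ipaddress.IPv4Network(f"{base}/{netmask}")
        # Longest-prefix match: keep the route with the largest prefix length.
        if addr in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_gw = net, gateway
    return best_gw


if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    print(next_hop("10.1.117.25"))   # Local Subnet
    print(next_hop("192.168.10.5"))  # 10.1.117.1
```

With the VMotion interfaces on the local subnet, routing is not a likely culprit, which keeps the focus on the pNIC/pSwitch layer.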

Texiwill
Leadership

Hello,

Disconnect the second pNIC from VMotion on all hosts, or better yet make it the explicit failover (standby) pNIC on all hosts, so that it is not in use during ANY VMotion. Only one pNIC should ever be in use.


Best regards,
Edward L. Haletky
admin
Immortal

Hi gusnole9399,

Please send me the SR number for the support case as well as your email address via private message, and someone from Support will investigate to see why you didn't get a response.

Regards, Robert

Robert Dell'Immagine, Director of VMware Communities

gusnole9399
Contributor

Our case number was 1157532001. I think we only have Gold support, so I'm not sure whether that has any bearing on how our case is handled.
