juchestyle (Commander)

Significant networking degradation on 100 Full vSwitch

Ok guys, Steve Beaver and I are working through an issue and would love your ideas and suggestions.

Here is the background:

We are seeing a significant degradation of network throughput on vSwitches whose uplinks are forced to 100/Full. The transfer rate never stabilizes, and there are significant peaks and valleys when we transfer a 650 MB ISO file from a physical server to a VM.
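For anyone who wants to reproduce this without the file-copy variables, the measurement boils down to a raw TCP push with per-second rates printed on the receiving end. A minimal sketch (the host address and sizes are placeholders, not our actual setup):

```python
# Minimal raw-TCP throughput probe (a poor man's iperf); a sketch, not our exact test.
# Start "receive" inside the VM, then run "send" on the physical box.
import socket, sys, time

HOST, PORT = "10.0.0.42", 5001    # placeholder address of the receiving VM
TOTAL = 650 * 1024 * 1024         # roughly the size of our 650 MB ISO
CHUNK = 64 * 1024

def receive():
    """Run inside the VM; prints the per-second rate so the swings are visible."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("", PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    got = mark = 0
    start = last = time.time()
    while True:
        data = conn.recv(CHUNK)
        if not data:
            break
        got += len(data)
        now = time.time()
        if now - last >= 1:
            print("%.2f MB/s" % ((got - mark) / (now - last) / 1e6))
            last, mark = now, got
    print("average: %.2f MB/s" % (got / (time.time() - start) / 1e6))

def send():
    """Run on the physical box; pushes TOTAL bytes to the VM."""
    sock = socket.create_connection((HOST, PORT))
    buf = b"x" * CHUNK
    sent = 0
    while sent < TOTAL:
        sock.sendall(buf)
        sent += len(buf)
    sock.close()

if __name__ == "__main__":
    receive() if sys.argv[1:] == ["receive"] else send()
```

On a healthy 100/Full link the per-second figure should sit near 11 MB/s; what we see instead is exactly the peaks and valleys described above.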

The physical server is on the same physical switch as the ESX host, so we believe our physical network is not the issue. Several network analyzers confirm that throughput up to the ESX host is fine.

Our gigabit full-duplex vSwitches perform much closer to our physical servers, with smaller peaks and better stabilization.

We have opened a support ticket. We would like to move the VMs to the gigabit vSwitches, but can't yet because of IP addressing. We are under the gun to make this work or fix whatever the issue is.

VirtualCenter 2.0.1

ESX 3.0.1

ProLiant DL585 G1 (3 of these in a DRS/HA cluster, SAN-attached storage)

4 AMD CPUs @ 2.4 GHz, 24 GB of memory per host

Intel 82546EB gigabit Ethernet controller (HP NC7782)

Suggestions, ideas???

Respectfully,

Kaizen!
bister (Expert)

Hi,

With 100-full virtual switches you mean vSwitches that are bound to a 100/Full NIC which is connected to a 100/Full switch port, right?

I guess you already checked all of this:

- Components are correctly configured to 100/Full
- Network cable is OK
- Physical switch shows no CRC errors for the connection
- Physical NIC is OK (checked with another NIC)

Is the behaviour the same the opposite way, sending an ISO from the VM to the physical server?

How are you transferring the ISO? FTP? CIFS?

Sometimes Windows reacts strangely when copying files with Windows Explorer: you get different throughputs depending on whether you initiate the copy from the source or the target machine (sounds strange, I know).
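For the 100/Full check on the ESX side, something like this run in the service console can flag a NIC that silently fell back. A sketch: it assumes ESX 3's esxcfg-nics is on the PATH and its usual column order (name, PCI, driver, link, speed, duplex):

```python
# Flag physical NICs that are not at the expected speed/duplex.
# Assumes "esxcfg-nics -l" output with columns like:
#   vmnic0  02:04.00  tg3  Up  100Mbps  Full  ...
import subprocess

EXPECTED = ("100Mbps", "Full")   # what the switch port is hard-set to

out = subprocess.Popen(["esxcfg-nics", "-l"],
                       stdout=subprocess.PIPE).communicate()[0]
for line in out.decode().splitlines():
    cols = line.split()
    if not cols or not cols[0].startswith("vmnic"):
        continue  # skip the header line
    nic, speed, duplex = cols[0], cols[4], cols[5]
    status = "OK" if (speed, duplex) == EXPECTED else "MISMATCH"
    print("%-8s %-8s %-5s %s" % (nic, speed, duplex, status))
```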

Some of my ideas...

Regards,

Christian

wally (Enthusiast)

Maybe a strange question, but since when do you see this degradation? We're suffering from a similar issue (but with gigabit ports) which we are still investigating.

Since Feb 22 our Exchange backup runs at only 500-600 MByte/min instead of the 2400-3000 MByte/min it did up until Feb 21.

sbeaver (Leadership)

What we are seeing is a lot slower throughput and maybe some Tx hangs. We applied the patch for the Tx hang and it still did not help. The steps we followed so far:

I have a NIC on each ESX server in the farm attached to a vSwitch. This connection goes to my DMZ switch. All ports and the ESX configuration are set to 100/Full. The first test was a physical Linux box sending a 550 MB ISO file to another Linux box on the same switch. With the two servers on the same switch we got a full pipe, at about 100 Mb/s.

The next test was the same Linux client sending the same ISO to a physical box on the DMZ switch, and we got the speed we expected.

Next we did the same test to a VM on the same DMZ switch, and that is where things really slow down.

This issue seems to affect each ESX server in the cluster the same way.
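To make that three-tier comparison repeatable, the same pull can be scripted and timed against each target. A rough sketch (hostnames, login, and file path are placeholders for our lab, and it assumes plain FTP is reachable on each box):

```python
# Time the same FTP pull against each tier of the test and compare MB/s.
# Hostnames, credentials, and the file path are placeholders.
import time
from ftplib import FTP

TARGETS = {
    "same-switch physical": "10.1.1.10",
    "DMZ physical":         "10.2.1.10",
    "DMZ VM":               "10.2.1.50",
}
PATH = "/pub/test-550mb.iso"

for label, host in TARGETS.items():
    ftp = FTP(host, "anonymous", "test@")
    chunks = []
    start = time.time()
    ftp.retrbinary("RETR " + PATH, lambda c: chunks.append(len(c)))
    secs = time.time() - start
    ftp.quit()
    print("%-22s %.2f MB/s" % (label, sum(chunks) / secs / 1e6))
```

The first two rows should be roughly equal; the drop on the third row is what we are chasing.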

Steve Beaver

sbeaver (Leadership)

Wally,

I also see this on the gigabit ports, but to isolate things my DMZ vSwitch has only one NIC and no bond. The rest of my vSwitches use two NICs in a MAC-out configuration. I wanted to start with the most straightforward setup.

Steve Beaver

wally (Enthusiast)

We suffered from a Tx hang last Saturday and again this morning. I'm not sure if they are related (the Tx hang is not always on the server where we see the degraded performance).

Below is a picture that shows what we are looking at (the 'spikes' are the nightly backup job). So far we're unable to determine a cause for this drop. We don't use traffic shaping, we've vMotioned the VM three times by now, and we tried "ethernet0.features=0" in the .vmx, but nothing increases the speed. Looking at the picture you would say there's a hard limit installed somewhere.

juchestyle (Commander)

Hey Wally,

Thank you for your input. I am wondering if this is a hardware issue (NIC or other), a firmware issue, or a combination of the two in the right circumstances.

Wally, would you share what type of hardware you are using? Specifically, what type of NIC, and if you could, the firmware version?

We upgraded the firmware on everything in one of our hosts last night. I believe we are now at firmware 7.07 on the NICs running at 100/Full. Steve and I are waiting for our network guy to rerun the transfer tests on the new host with the new firmware to see if we get anything out of it. Just curious what type of hardware you are using!

Respectfully,

Kaizen!
wally (Enthusiast) - Accepted Solution

Our config consists of two farms. The first has 10x DL385 servers, each with 1x PCI-X Intel PRO/1000 GT quad-port adapter and 2x PCI-X QLA2340 HBAs. Our second farm is just being installed, with 5x DL385 G2, each with 2x PCI-X QLA2340 and 1x PCIe Intel PRO/1000 PT quad-port adapter. I will run some tests tomorrow on this new farm. All servers run on 2x dual-core 2.4 GHz Opterons.

Until now we never even bothered about firmware; this setup has been running since August last year without problems. The graph above is from a VM that has been running since Dec 4 last year.

Nothing was changed: no firmware upgrades, no VMware upgrades (only critical patches, installed in 2006), no network updates, no changes to the VM. The last Windows patches in this VM were installed Feb 18, and believe me, we triple-checked that nothing changed on Feb 21/22. Tests with FTP show the exact same throughput as the backup agent (Veritas 10.1d SP3), so we rule out a problem with the backup software.

This firmware info is from our DL385 G1 farm:

DL385 ROM A05, 03/01/2006

Smart Array v2.58 rev. B

iLO firmware 1.82

QLA2340: firmware 3.03.19, option ROM v1.43

Since the output since Feb 22 seems 'capped', I personally believe it is a software issue, not a hardware issue. Other non-VMware servers on the same switch (Cisco 6513) at this location don't seem to suffer from this drop, and the first tests (these still have to be confirmed) also don't show anything like this on Linux VMs (FC4-6). So my best guess at the moment is that it's a Windows or VMware issue, but we haven't triple-checked every single thing yet.

One of the things I still want to try is using one of the onboard NICs to rule out the quad-port NIC/driver.

juchestyle (Commander)

Wally,

Thank you for your input, great information. I was reading your post and thought maybe it's a Windows patch that screwed us, but then I realized that we are testing on Linux VMs, so I think that rules out a Windows patch. I agree with you that this doesn't look like a hardware issue, though I will keep thinking about everything in the mix. I do, however, think it is a VMware/ESX issue. I bet the majority of people use gig full whenever they can, so perhaps others are having this issue and don't know it, because they either haven't tested or aren't using 100 full.

Wally, we have received two suggestions so far; I wanted to share them with you so we can team up on this issue.

1. Switch the physical ports we are plugged into to see if the results change. I was also thinking of swapping out the switch itself, to see if that could be influencing the situation.

2. Change from forced 100 Full to auto-negotiate (we have changed to auto).

Let us know if you are able to try these; we will see what we can do on our side.

Respectfully,

Kaizen!
wally (Enthusiast)

You're welcome.

1. We did this by vMotioning the VM three times to other physical servers, which are by nature connected to other switch ports (we even tried one on another 6513 slot/blade).

2. With gigabit ports I believe auto-negotiate is a requirement.

3. We even put one ESX server in maintenance mode and gave it a reboot; this also didn't fix anything.

4. We also made sure that storage wasn't the limiting factor (since the storage graph shows the same drop). When running a disk benchmark the VC graphs easily climb to 170,000 KB/s, which is way faster than our highest backup peak ever.
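On point 4, a timed sequential write is enough for a crude in-guest check. This sketch is a stand-in for the benchmark we ran (path and size are arbitrary):

```python
# Crude sequential-write check to rule out storage as the bottleneck.
# Path and size are arbitrary; a stand-in, not the actual benchmark used.
import os, time

PATH = "/tmp/diskbench.bin"
CHUNK = b"\0" * (1024 * 1024)   # 1 MB writes
TOTAL_MB = 1024                 # write 1 GB in total

start = time.time()
f = open(PATH, "wb")
for _ in range(TOTAL_MB):
    f.write(CHUNK)
f.flush()
os.fsync(f.fileno())            # make sure it actually hit the disk
f.close()
secs = time.time() - start
os.remove(PATH)
print("%d MB in %.1f s = %.0f KB/s" % (TOTAL_MB, secs, TOTAL_MB * 1024 / secs))
```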

By now I think our headaches are comparable ;-)

wally (Enthusiast)

Well, over here the mystery is solved. We have an OPN between our datacenters of around 35 km (<1 ms latency), which had apparently been switched to a backup path of around 200 km (5 ms latency). It appears no one noticed this increased latency (and hence the lower transfer speeds).

The path was just switched back to the 'short' one and hey presto: 70 MByte/sec backup speeds again.
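For anyone wondering why 5 ms versus <1 ms matters that much: a single TCP stream is capped at window size divided by round-trip time. A quick back-of-the-envelope (the 64 KB window is an assumption, a common default without window scaling, not a measured value):

```python
# Throughput ceiling for one TCP stream = window size / round-trip time.
# The 64 KB window is an illustrative assumption, not a measured value.
WINDOW = 64 * 1024                     # bytes

for label, rtt in [("35 km path", 0.001), ("200 km backup path", 0.005)]:
    cap = WINDOW / rtt                 # bytes per second
    print("%-20s RTT %1.0f ms -> cap about %5.1f MB/s"
          % (label, rtt * 1e3, cap / 1e6))
```

That works out to roughly 65 MB/s versus 13 MB/s, the same shape as what we saw, so the latency alone explains the apparent 'cap'.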

So over here only the "Tx unit hang" problem is still open (for which I started another thread http://www.vmware.com/community/thread.jspa?messageID=584037)

juchestyle (Commander)

Hey Guys,

We are still struggling with this issue.

We are hoping someone may be able to try something on their network and report back the results. I am monitoring esxtop using n (for networking) while transferring a couple of big files over UNC (Start > Run > \\servername\c$), and watching what kind of bandwidth we get through our 100/Full vSwitch on a single physical NIC (Intel 82546EB).

In esxtop I am viewing the number 3 NIC, and I never get better than 0.58 MbTX/s and never better than 1134 PKTTX/s.
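If anyone wants to log this over time rather than watch the esxtop screen, batch mode dumps CSV that is easy to post-process. A sketch (the batch counter names are longer than the interactive PKTTX/s and MbTX/s labels and vary by build, so the filter may need adjusting):

```python
# Summarize transmit counters from an esxtop batch capture, e.g.:
#   esxtop -b -n 60 > esxtop.csv
# Batch counter names differ from the interactive labels and between builds;
# grep the CSV header and adjust WANTED as needed.
import csv

WANTED = ("Transmitted",)   # substring(s) of the counter names to report

f = open("esxtop.csv")
rows = list(csv.reader(f))
f.close()

header, samples = rows[0], rows[1:]
for i, name in enumerate(header):
    if not any(w in name for w in WANTED):
        continue
    vals = [float(r[i]) for r in samples if len(r) > i and r[i]]
    if vals:
        print("%-70s avg %10.2f  max %10.2f"
              % (name[-70:], sum(vals) / len(vals), max(vals)))
```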

Does anyone else get better than that?

Respectfully,

Kaizen!
oreeh (Immortal)

I'll try it and post the results.

Two questions regarding your setup: what switches are you using, and how are the switches/ports configured (flow control, LACP, ...)?

juchestyle (Commander)

Hey Oreeh,

2 things:

1. I haven't forgotten about the memory test; life just gets in the way of lab time sometimes. Hoping to try this weekend.

2. The tests above aren't accurate. My laptop was set to auto, and auto combined with the network switch I was on skewed the results. There are all sorts of interesting gotchas on the physical side of the network, and it seems our issues here are largely the result of irregularities in our own network.

2 more things:

1. Check the network settings on the NIC and on the switches in between, and don't use auto-negotiate.

2. Check the paths that your traffic takes through the network (see the sketch below).
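Point 2 is exactly what bit Wally above: a path change shows up as an RTT step. A quick way to watch for it, as a sketch using the stock Linux ping (target, baseline, and flags are placeholders; the output format differs on other platforms):

```python
# Baseline the round-trip time to a target and flag a jump; a rerouted
# path shows up as a latency step. Uses Linux ping flags and output.
import re, subprocess

TARGET = "10.2.1.50"      # placeholder: the far-end host
BASELINE_MS = 1.0         # what the RTT normally is
TOLERANCE = 2.0           # alarm if the RTT more than doubles

out = subprocess.Popen(["ping", "-c", "5", "-q", TARGET],
                       stdout=subprocess.PIPE).communicate()[0].decode()
m = re.search(r"= [\d.]+/([\d.]+)/", out)   # min/avg/max line; group 1 = avg
if m:
    avg = float(m.group(1))
    flag = "OK" if avg < BASELINE_MS * TOLERANCE else "LATENCY JUMP: check the path"
    print("avg RTT %.2f ms: %s" % (avg, flag))
```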

Respectfully,

Kaizen!
oreeh (Immortal)

Hey juchestyle,

> I haven't forgotten about the memory test; life just gets in the way of lab time sometimes. Hoping to try this weekend.

I totally understand.

Now some results from a few quick tests.

Setup:

HP DL380 G3, 8 GB RAM, dual Xeon 3.06 GHz, local VMFS (5x 72 GB 10k drives, RAID 5)

Intel dual-port (PILA8472) adapter (only 1 port used)

ESX 3.0.1 with all patches

File/NFS/FTP server: P4D 3 GHz, 1 GB RAM, SATA RAID

Switch: HP 2524

All network cards and switch ports set to 100/Full.

1. Copy from server to virtual XP (SMB protocol): 3020 PKTTX/s, 1.58 MbTX/s

2. FTP from server to virtual BSD VM (vmxnet, tools installed): 2312.66 PKTTX/s, 1.17 MbTX/s, 5.46 MB/s

3. FTP from server to virtual BSD VM (e1000, no tools installed): 1549.45 PKTTX/s, 0.78 MbTX/s, 7.16 MB/s

Interesting (to me) is the performance drawback of vmxnet in comparison to the e1000. Shouldn't it be the other way round?

Note: both BSD VMs are the same cloned system (apart from the installed tools).

Edit: forgot to mention that during the tests only the tested VM was powered on.

juchestyle (Commander)

Those transfer rates seem ridiculous to me!

I have also noticed that pulling versus pushing between the VM and the physical machine gives different results.

I have found that you get the steadiest rates (which is different from the highest rates) when you push data from a VM to a physical machine.

So I wonder if your rates would be the same, or steadier, if you went from your VMs back to the physical server?

Respectfully,

Kaizen!
oreeh (Immortal)

Same setup as before, but pushing files to the server:

BSD VM with vmxnet: 4303.99 PKTTX/s, 49.52 MbTX/s, 8.30 MB/s

BSD VM with e1000: 2323.46 PKTTX/s, 26.80 MbTX/s, 9.48 MB/s

This seems to be near the maximum you can get over a 100 Mbit connection using the FTP protocol.
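That ceiling is easy to sanity-check: 100 Mbit/s is 12.5 MB/s raw, and Ethernet, IP, and TCP overhead shave off about 5% with standard 1500-byte frames. A quick back-of-the-envelope (standard frame sizes assumed, no jumbo frames):

```python
# Theoretical single-stream goodput on 100 Mbit/s Ethernet, standard frames.
LINK = 100e6 / 8                       # 12.5e6 bytes/s raw
PAYLOAD = 1460.0                       # TCP payload per frame (typical MSS)
FRAME = 1460 + 40 + 14 + 4 + 8 + 12    # + TCP/IP headers, Ethernet header,
                                       #   FCS, preamble, inter-frame gap
print("max goodput: %.2f MB/s" % (LINK * PAYLOAD / FRAME / 1e6))
```

That gives roughly 11.9 MB/s as the hard ceiling, so 8-9.5 MB/s through a VM is at least in the right ballpark.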

And again a performance drawback using vmxnet.

Edit: the result for the Windows VM: 6280.29 PKTTX/s, 70.99 MbTX/s.

juchestyle (Commander)

So this time you moved the files from the VM to the physical machine, initiating the transfer inside the VM?

Looks like your numbers were higher this time, right?

Weird, huh?

Ok, enough for now, off to drink!

Cheers!!! ;{)

Kaizen!
oreeh (Immortal)

> Weird, huh?

Yes.

> Ok, enough for now, off to drink! Cheers!!! ;{)

Yeah, a nice German beer.

Prost!

acr (Champion)

This is something we are seeing at a customer's site, with 4x HP c-Class blades, each with 10 GB of RAM, and a gigabit HP switch module:

VM to VM (FTP or CIFS)

Physical to VM (FTP or CIFS)

ESX to ESX (SCP)

Transfer rate is always 7 MB/s to 9 MB/s.

It makes no difference whether the vNICs are auto or fixed (at 1 Gb). The traffic can be bursty. All the VMs are new builds with 2 GB of RAM.

But two strange points:

If we remove the CD-ROM as a device, the transfer rate goes up to 20 MB/s.

Also, when going from VM to physical it's always 20 MB/s.

It would be nice to get some deeper understanding of the networking. Again, what sort of figures is everyone else seeing?
