WAS-BrownHW
Contributor

ESX 3.5 network performance issues

I have an ESX 3.5 server with a VM running RHEL 5.2 on it. The VM is a mirror, in terms of logical configuration, of a physical server (actually 4 of them) which I am using for comparison.

VM (RHEL 5.2): 4 x 2.33 GHz vCPUs, 5 GB RAM

Physical (RHEL 5.2): 4 x 3.16 GHz CPUs, 16 GB RAM

The physical servers have 2 x 1 Gb NICs set up in a balance-alb channel bond.
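
In case the details matter, the bond on the physical boxes is the standard RHEL 5 setup, roughly like this (interface names and the address are just illustrative):

/etc/modprobe.conf:
alias bond0 bonding
options bond0 mode=balance-alb miimon=100

/etc/sysconfig/network-scripts/ifcfg-bond0:
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
IPADDR=172.16.135.21
NETMASK=255.255.255.0

/etc/sysconfig/network-scripts/ifcfg-eth0 (and the same for eth1):
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none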

When transferring an 88 MB file via scp between 2 identical physical servers, I get ~88 MB/sec.

When transferring the same file from VM to physical (and vice versa), I get ~29 MB/sec. The VM has one vNIC set up with the e1000 driver.

EDIT: All NICs and switch ports are set to autonegotiate (1000/full).
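
(For anyone wanting to verify the same thing, speed/duplex can be checked with the commands below; interface names are just examples.)

esxcfg-nics -l    # on the ESX service console, lists speed/duplex per vmnic
ethtool eth0      # inside the RHEL guest and on the physical servers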

I have set up a team with vmnic4 and vmnic5 and set the load balancing to IP hash on the vSwitch. I set up the corresponding switch ports as an EtherChannel with negotiation disabled. I then assigned the new team to my RHEL 5.2 VM and tested again. I got the same results.
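
For completeness, the uplinks were linked to the team along these lines (the vSwitch name here is only an example), with the load balancing changed to "Route based on IP hash" under the vSwitch's NIC Teaming tab in the VI Client:

esxcfg-vswitch -L vmnic4 vSwitch1
esxcfg-vswitch -L vmnic5 vSwitch1
esxcfg-vswitch -l    # verify both uplinks show up on the vSwitch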

I then set up the same channel bond scenario in the VM as on the physical servers. One channel (e1000 driver) of the bond goes to a single-homed vmnic and the other channel (e1000 driver) goes to the vmnic team. Same results. I have PDFs from VMware stating there is very little overhead associated with virtualized networking compared to physical networking.

What am I missing? I thought these changes would have made more of an impact (or any impact at all), but they haven't.

Thanks in advance.

15 Replies
gary1012
Expert

Have the latest VMware tools been installed on the VM?

Community Supported, Community Rewarded - Please consider marking questions answered and awarding points to the correct post. It helps us all.
WAS-BrownHW
Contributor

Yes they have.

WAS-BrownHW
Contributor

duplicate post

kjb007
Immortal

In a file copy, you are transferring the file from one source to one destination, so you will only be using one NIC out of the team; that isn't helping you here. It's not hurting you either, but it's not going to increase your transfer speed. Have you tried the vmxnet driver instead of the e1000 to see the difference there?
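
If you do end up trying vmxnet, the adapter type is set by the virtual device entry in the VM's .vmx file; with the VM powered off you would change something like the line below (ethernet0 is just an example, use whichever entry your vNIC has) and make sure VMware Tools is installed so the guest has the vmxnet driver:

ethernet0.virtualDev = "vmxnet"    # previously "e1000"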

Are the physical servers and ESX servers going through the same physical switches?

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
WAS-BrownHW
Contributor

Sorry, I failed to mention these are 64-bit RHEL servers. From researching, I found that the vmxnet driver isn't supported on the 64-bit OS.

So, no, I haven't tried that driver, and yes, it's the same switching infrastructure.

WAS-BrownHW
Contributor

Also, to add to this: I get the same lackluster performance to another RHEL 5.2 VM on the same vSwitch.

WAS-BrownHW
Contributor

bump.

jpoling
Enthusiast

We had some issues with network latency as well.

Ours were never completely resolved, but we did find some things from the community and from VMware support that helped. I am pretty sure that in our case a very tight consolidation ratio affects network throughput significantly (140 VMs spread across 4 physical hosts).

* How many VMs are running on your host?

* What is the service console memory set to?

* Are you running management agents, backup agents, etc. on the service console?

Jeff

rmrobert
VMware Employee

I would suspect storage more than networking. Most setups are easily capable of doing line rate (1000 Mbit/s), which would give you 125 MBytes/sec.

You can eliminate networking by running "iperf" or "netperf" (there are Windows and Linux versions freely available). If these get much better than 29 MB/s (as I suspect they will), then your problem may be your storage speed. Perhaps you are using a RAID array on the physical servers, but your VMs are on a single, possibly slower, disk?
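
A quick run looks something like this, using one of your hostnames as an example:

iperf -s                     # on the receiving box
iperf -c w-pre-tw01 -t 30    # on the sending box; 30 seconds gives a steadier number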

Another way to test this, if you are running Linux, is to copy the file to a RAM disk and do the transfer from there. That should eliminate the effect of your underlying storage speed.
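
Roughly like this, with the path and host as examples only; putting the file in /dev/shm on the receiving side as well keeps the destination disk out of the picture:

cp /path/to/testfile /dev/shm/
scp /dev/shm/testfile user@w-pre-tw01:/dev/shm/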

WAS-BrownHW
Contributor

Thanks Jeff. I read your thread.

I only have 7 VMs running off of local disk.

Service console memory is set at the default (272 MB, I believe).

I'm only running a local script on the ESX server for backups.

WAS-BrownHW
Contributor

Here are some quick iperf stats.

VM -> phy

# iperf -c w-pre-tw01 -t 1
------------------------------------------------------------
Client connecting to w-pre-tw01, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
local 172.16.135.20 port 60494 connected with 172.16.135.21 port 5001
0.0- 1.0 sec  112 MBytes  939 Mbits/sec

phy -> phy

[root@wpretw02 /]# iperf -c w-pre-tw01 -t 1
------------------------------------------------------------
Client connecting to w-pre-tw01, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
local 172.16.135.22 port 44303 connected with 172.16.135.21 port 5001
0.0- 1.0 sec  88.0 MBytes  736 Mbits/sec

The VM is installed on local disk (SAS 15k drives). This is also the case for the local disks on the physical servers.

When copying a file directly to /dev/shm and then copying from there, I see similar, slower results.
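
If it helps isolate the disk, I can also run a quick dd write test on both the VM and a physical box, something like the following (path and size are arbitrary; the sync is there so the page cache doesn't flatter the number):

time sh -c 'dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 && sync'
rm -f /tmp/ddtest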

Do you think the bus adapter driver on the VM could have something to do with it?

Thanks for your help thus far.

jpoling
Enthusiast

I do not want to distract from the original post; however, I would like to get some clarification. Rmrobert, you mentioned, "Most setups are easily capable of doing line rate (1000Mbit/s) which would give you 125MBytes/sec." If I run iperf in my environment between two VMs, I get 133 Mbits/sec, which is a far cry from what you say most setups should see. Am I missing something? Obviously there are a lot of variables...

Jeff

jpoling
Enthusiast

As was mentioned, storage may be a bottleneck. You mentioned the VM is running on SAS drives... in our environment, we are using a Fibre Channel SAN. Maybe some others can share their experiences using SAS?

WAS-BrownHW
Contributor

Here is another iperf nugget:

vm -> vm (same virtual switch)

# iperf -c w-pre-tt01 -t 1
------------------------------------------------------------
Client connecting to w-pre-tt01, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
local 172.16.135.20 port 57914 connected with 172.16.135.26 port 5001
0.0- 1.0 sec  138 MBytes  1.16 Gbits/sec

rmrobert
VMware Employee

jpoling, you should probably see more than 133 Mbit/s. Make sure your NICs are 1 Gbit, that they are set to autonegotiate properly, and that they come up as 1000BaseT (you can see this from esxcfg-nics -l). Although if this were the problem, you wouldn't see more than 100 Mbit, so I doubt that's it.

Things to try are:

Suspend other VMs; perhaps they are chewing up all your CPU at the time you ran the test. Or perhaps your hardware is really old (I doubt it). Anything Core 2 Duo/Xeon/Opteron > 2 GHz shouldn't have any problems, and I wouldn't even expect problems below 2 GHz, but still.

When you say between 2 VMs, do you mean on the same vSwitch or on separate ESX boxes?

The other thing to check would be that you use more than one thread in iperf. I've seen some OSes (for example Windows Server 2008) give anemic performance on a single connection, and this happens on physical as well as in a VM (I've noticed it only on Tx, meaning the client is Windows Server 2008). Presumably Windows has some "fairness" built in, such that it handicaps a single connection even in the presence of no other traffic. You can tell iperf to run multiple parallel sessions by running it like "iperf -c hostname -P 4". Then you should see some improvement.
