This thread is a follow-up to the following threads, since they all seem to be related:
http://www.vmware.com/community/thread.jspa?threadID=74329
http://www.vmware.com/community/thread.jspa?threadID=75807
http://www.vmware.com/community/thread.jspa?threadID=77075
A description of the issues and the "results" we have so far:
juchestyle and sbeaver saw a significant degradation of network throughput on virtual switches running at 100 Mbit full duplex.
The transfer rate never stabilizes and there are significant peaks and valleys when a 650 MB ISO file
is transferred from a physical server to a VM.
Inspired by this I did some quick testing, with some strange results:
The transfer direction had a significant impact on the transfer speed.
Pushing files from VMs to physical servers was always faster (by around 30%) than pulling files from the servers.
The assumption that this is related to the behaviour of Windows servers was wrong, since it happened
regardless of the OS and protocol used.
Another interesting result from these tests: the e1000 NIC always seems to be 10-20% faster than vmxnet,
and there is a big difference in PKTTX/s between vmxnet and e1000.
After that acr discovered really bad transfer speeds in a Gigabit VM environment.
The maximum speed was 7-9 MB/s, even when using ESX-internal vSwitches.
A copy from ESX to ESX also reached only 7-9 MB/s.
The weird discovery in this scenario: when the CD-ROMs in the VMs are disabled, the transfer speed goes up to 20 MB/s.
Any ideas regarding this?
I'll mark my question as answered and ask Daryll to lock the thread so we have everything in one thread.
I did another test instead.
In my first tests I used a 100 Mbit Intel dual-port card where only one port was used.
Now I did the same tests against localhost with the two ports bonded.
Against localhost I see an increase of 100% (487 Mbits/sec)!
The same test against the real IP increased by about 400% (226 Mbits/sec).
Is this a bug in the Intel drivers that only affects non-bonded NICs?
Now I'm totally confused.
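For anyone who wants to repeat the localhost-versus-real-IP comparison: I'm not claiming the script below is the tool behind the numbers in this thread (the MBytes / MBits/s output just looks iperf-like), but a throwaway Python sketch along these lines is enough to see the difference. The port number and transfer size are arbitrary, and nothing in it is ESX-specific.

# tcp_push.py - minimal one-way TCP throughput check (a sketch, not a benchmark tool)
# usage: python tcp_push.py server
#        python tcp_push.py client <host>
import socket
import sys
import time

PORT = 5001                       # arbitrary test port
TOTAL_BYTES = 512 * 1024 * 1024   # data pushed per run
CHUNK = 64 * 1024                 # bytes per send()

def server():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", PORT))
    srv.listen(1)
    conn, addr = srv.accept()
    received, start = 0, time.time()
    while True:
        data = conn.recv(CHUNK)
        if not data:
            break
        received += len(data)
    secs = time.time() - start
    conn.close()
    print("received %.0f MBytes in %.1fs -> %.0f MBits/s"
          % (received / 1e6, secs, received * 8 / 1e6 / secs))

def client(host):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((host, PORT))
    payload = b"\0" * CHUNK
    sent, start = 0, time.time()
    while sent < TOTAL_BYTES:
        sock.sendall(payload)
        sent += len(payload)
    sock.close()
    secs = time.time() - start
    print("sent %.0f MBytes in %.1fs -> %.0f MBits/s"
          % (sent / 1e6, secs, sent * 8 / 1e6 / secs))

if __name__ == "__main__":
    if sys.argv[1] == "server":
        server()
    else:
        client(sys.argv[2])

Run the server part on the target (or in the same VM for the localhost case), then point the client at 127.0.0.1 and at the real IP and compare the MBits/s figures.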
Message was edited by: oreeh
Interesting.
All my servers have bonds of 2 or 4 ports (with many VLANs per bond). I don't see why bonding should have any effect on localhost, but maybe...?
I don't see any relationship between bonding and localhost either - that's why I ran the test 10 times.
I have no bonding set up yet but will try it. I ran some more tests on an HP DL585 G2 with only the 2 on-board NICs (also NetXtreme II's) and got results similar to those from my NetXtreme PCI NICs in all the other systems. I am leaning towards this being a driver/NIC issue with Intel-based cards...
VM to VM
8k: 647 MBytes 543 MBits/s
64k: 864 MBytes 724 MBits/s
Interestingly, a 128k window size yields (purely for the heck of it):
1.26 GBytes and 1.08 GBits/s
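For reference, the 8k / 64k / 128k figures are the TCP window (socket buffer) sizes used for each run, which is why the throughput moves with them: the window caps how much data can be in flight at once. In case anyone wants to sweep them with a home-grown client rather than the usual tools, the buffers have to be set before connecting; a quick sketch (nothing here is ESX-specific, and the sizes are just the ones from the runs above):

# window_sweep.py - show what the kernel actually grants for the requested
# TCP socket buffer ("window") sizes; many kernels round or double the value.
import socket

for window in (8 * 1024, 64 * 1024, 128 * 1024):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # must be done before connect(), otherwise the negotiated window
    # won't reflect the requested size
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, window)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, window)
    print("requested %6d -> granted snd=%d rcv=%d"
          % (window,
             sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
             sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)))
    sock.close()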
OK, so here is what I got before, using a single NIC:
VM to VM
8k: 401 MBytes 337 MBits/s
64k: 958 MBytes 804 MBits/s
Now, after configuring two on-board Intel-based NICs for load balancing, the numbers basically double - so no real conclusive evidence here:
VM to VM
8k: 548 MBytes 460 MBits/s
64k: 1.20 GBytes 1.03 GBits/s
I am load balancing based on "originating virtual port ID" but this looks better than a single NIC by itself.
So oreeh, I agree this may be an issue with the Intel-based NICs.
Message was edited by: JonT
I had a chat with one of my networking guys and he brought up a couple of good ideas for this effort. I am going to do some more tests with Intel-based PCI NICs to see if the problem follows or if it is just localized to the on-board ones. I expect it will, but I want to verify. He also mentioned that we could set up some SNMP traps on a VM or two to have network information collected by Concordia (sp?), which I guess is an application they use for troubleshooting.
Ok now I am really confused. My original assumption about Intel-based NICs was not totally correct. I just tested using an HP NC 7170, which is an Intel-based dual-port NIC with the 82546EB chipset. Here are the results.
VM to VM
8k: 631 MBytes 529 MBits/s
64k: 1.20 GBytes 1.03 GBits/s
VM to Localhost
8k: 821 MBytes 689 MBits/s
64k: 1.30 GBytes 1.11 GBits/s
I now think the problem is just whatever chipset the onboard NIC (NC 7782) is based on. I will look into this further and may open up my own SR with VMware.
Jon,
If you open another SR, please reference the other SRs listed and please keep us all informed.
JonT, compared to the 4 environments I have, your results are 100% better with every NIC...
I have HP C-Class Blades as ESX Servers (4 off)
IBM Blades (HS21) as ESX Servers (2 off)
IBM xSeries as ESX servers (2 off)
I'm using Cisco switches and HP switches (during the tests).
And none of these gets more than 10 MB/s of transfer!
I received an email today from VMware support asking to close my SR since the parent SR (from Matthew) is already open and escalated.
Quote from the mail:
I see that the parent SR 374322 is still open and already escalated.
It would be best if we do not start a new investigation and provide any
contradicting views. It would take even more time to get to the point where
you already are at with the SR 374322. Hence I plan to close this SR as a
duplicate of SR 374322.
Let me know your thoughts on this.
I've mailed them the following:
I see that the parent SR 374322 is still open and already escalated.
I hope so - besides, this SR (374322) wasn't opened by me, therefore I can't
say anything about it.
It would be best if we do not start a new investigation ...
I totally understand these thoughts.
Let me know your thoughts on this.
My thoughts:
You might agree that the problems and discoveries in the mentioned thread are quite strange.
At the moment it seems that these problems are related to Intel NICs only - but we don't know for sure (yet).
I've opened the SR to get someone from the VMware Tech Support Team (in this case you)
to take a look at this thread and (hopefully) post some comments / thoughts.
Of course these comments / thoughts can only be personal views and in no way official.
You might close this SR but it would be nice if VMware could keep us up to date regarding
this issue and post to this thread too.
My impression:
At the moment JonT, acr and I are doing some (limited) debugging.
There's nothing wrong with that, but I expect that VMware at least honors this by using the
forum to communicate.
Many people in these forums supply free support to the VMware community in their spare time.
I don't know any numbers, but my guess is that the forums have saved VMware from thousands of support calls.
Daryll once stated "Thanks for using the forums to their fullest potential" - this should apply to VMware too.
I'm not upset - only a bit confused.
Regards,
Oliver
I haven't heard any answer to this yet (I don't know if this is a good sign, since this is the first SR I've had to file).
Agreed Oliver..
Ok now I am really confused.
me too
Does anybody know if 3COM cards still work (even though they aren't on the HCL)?
Reason: under /usr/lib/vmware/vmkmod/ the 3c90x.o module is still there.
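In case anyone else wants to see which driver modules still ship with their build, a trivial way to list them (the path is just the one mentioned above; whether a module actually loads on current hardware is a different question):

# list the driver modules shipped under the vmkmod directory
import glob
import os

for path in sorted(glob.glob("/usr/lib/vmware/vmkmod/*.o")):
    print(os.path.basename(path))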
Before opening an SR, take a look at the result from mine.
ACR,
I have all of the same hardware you do but not as ESX Hosts yet. Currently my Hosts consist of:
HP DL385 G1 (3)
HP DL585 G1 (2)
HP DL585 G2 (1)
HP DL380 G4 (1)
IBM x3950 (1)
I haven't had a chance to use ESX on blades simply because our infrastructure isn't set up for Fiber to the chassis yet, whether it's the IBM or HP c-Class blades.
None of my hosts is very overworked as far as network traffic goes, so that might be why the numbers look better. I have all Cisco-based switching gear (Catalyst 6500 series) but no HP.
Duly noted, no SR will be opened with the hope that we hear from VMware through the original SR or this thread.
Not sure about the 3Com cards. I will check to see if I have any, no guarantee though.
Message was edited by: JonT
I have some 3COM cards at hand but I'm not sure if they still work.
I'm now sending Ken a PM to take a look at this thread (if he didn't already).
Maybe HP already has some knowledge regarding this.
JonT, we may have similar H/W, but that's where the similarity ends...
Can I buy your ESX servers? They're outperforming mine!
One of the mottos at the company I work for is:
Explain what is happening.
I feel like we are in the dark, and I feel like there is absolutely a problem here! It would be nice to get a simple post, phone call, email, or smoke signal as to what VMware is thinking on this issue. Are they trying to replicate the issue with the hardware we are using? Are they still troubleshooting this issue as if it were our setup and not a coding issue? Are they looking at the code to see if there is a problem?
It would be great to just get an FYI!
Respectfully,
ACR, sure you can have them. I will order new ones??? haha. I am now building an HS21 blade as a host to see what results I can get on it. Just so everyone knows, I visually verified the chipset for the onboard NICs in the HP DL385 G1. It is in fact a Broadcom chipset, not Intel as I previously assumed, so if this has thrown anyone off I am sorry.
Scratch the HS21, it's a DOA blade... I only had one anyhow. All my others are HS20 and LS20 blades. I will build one of each of those and let you all know the results on them.
Message was edited by: JonT
I would be happy if they at least showed some sign of reading this thread.
JonT, I'll be very interested in seeing the results with the HS21 blades. If they're not the same as mine (which are very poor), I'm going to be finding the nearest bridge...
Thx..