Weird Connectivity Issue - Could it be VMware related?

Hello,

For the past few months I have been trying to figure out a connectivity issue that I keep running into. I am running SAP on Windows. SAP is running on a physical box on Windows Server 2003 Enterprise R2 and SQL 2005. There is also an SAP app server running on another physical Windows Server 2003 R2 machine. In a VM, I am running SAP's WebConsole, which allows remote RF devices to run SAP transactions over the web. The WebConsole VM (Windows Server 2003 Standard R2) runs on ESX 3.01. During moderate to heavy usage, I get a ton of dropped connections between the app servers and the VM. The ESX box isn't incredibly busy from a network perspective, and the ESX host is plugged into the same physical Cisco Catalyst 2960G-48 as the app servers. I have replaced the network cards in the app servers and all of the cables. I have 4 different app servers, and no matter which one I point the web box at, I get the same result. Is there any way to see if the NIC in my ESX box is having issues, or any other way to monitor this situation?


Not sure if you can, but have you looked into installing ntop on your host to get a better view of the traffic? I have never used it; I just ran across it the other day.


Have you tried using ethereal?

That should let you know what the traffic is and the source/destination for the packets.

HTH


Are the speed and duplex settings matched between the pSwitch and the physical Windows boxes?

If the switch ports are hard-set to 100/1000 full duplex and the Windows boxes are set to auto/auto, the Windows machines will pick up the speed (100 or 1000, whatever you have there) but fall back to half duplex. A Cisco port that is hard-set to full duplex does not advertise its duplex setting to the server, so Windows interprets the link as half duplex. This duplex mismatch can cause problems.


How is your ESX Network setup? How many NICs do you have and what are they sharing?


All of the Nics and switchports on all servers are hard coded to 1000/Full.

I have run ethereal and have not seen any dropped packets. The only clue I get is a "Connection lost to terminal (Name of Web Server)" message on the SAP app servers.

ESX is set up with about 12 hosts sharing a single NIC that is plugged into the switch. Should I try to add a second NIC to the vSwitch? Can I tell it to load balance? What would I have to do on the switch side to make this work?


Is your Service Console / VMotion sharing the same NIC? If so, that is your problem. You should have at least two vSwitches set up: one for SC/VMotion and one for network traffic. They should all be gig connected.


They are separate.


You would have to trunk the ports on the pswitch side. The thing with  the load balancing is that ESX doesn’t do it on the fly. It only does it  when a VM is rebooted. At that time it checks the load on the pNIC and  assigns the VM to the lowest utilized one at that time. You may also be  able to get the same effect if you create a vSwitch that has no pNIC  assigned to it and move the VM to that port group briefly and then back  again. ESX doesn't continuously load balance.

Here is a good pdf on it.

www.vmware.com/files/pdf/virtual_networking_concepts.pdf


You would have to trunk the ports on the pswitch  side. The thing with the load balancing is that ESX doesn’t do it on  the fly. It only does it when a VM is rebooted. At that time it checks  the load on the pNIC and assigns the VM to the lowest utilized one at  that time. You may also be able to get the same effect if you create a  vSwitch that has no pNIC assigned to it and move the VM to that port  group briefly and then back again. ESX doesn't continuously load  balance.

This is completely false: we (at least in the current version of ESX) never look at the load of the physical NICs in order to make a teaming decision, whether statically (when a VM is connected to a vSwitch) or dynamically ("on the fly"). This is something we're trying to improve, but it's a much more complicated problem than it looks.


Just get that Virtual Cisco switch up and running!!!

Duncan
My virtualisation blog:
http://www.yellow-bricks.com


I know that you have to assign a VM to a port group as well as a vSwitch that is going to have one or more pNICs assigned to it. What I was talking about is that if there are multiple pNICs attached to a vSwitch, then the VM's virtual NIC is assigned to one of these physical NICs once it is booted. If this is incorrect then that is great, but you should tell your tech support to stop giving out false information. When we put in a ticket to discuss this very subject with your tech support, this is exactly how it was explained to me. So if I am incorrect, then that is good to know. I am always looking to learn and help if I can; that is why I am participating in this forum.

Maybe you can help me understand how it is actually done. This is important to a lot of us. Do the packets jump between the pNICs depending on the current load being put on that pNIC as long as they have been attached to the same vSwitch? Any info that you can offer would be great.

Take a look at page 8 and 9 of the pdf that I linked earlier. It should give you a good description of what you are looking for.


If you have more than one pNIC attached to the vSwitch, the way the pNICs are used by the VMs is down to the three possible methods that can be selected on the vSwitch or portgroup.

Port ID and MAC Hash.

When a VM starts up and network traffic livens up a port on the vSwitch/portgroup, an association is made between that port and one of the pNICs.

Until the VM is rebooted or the pNIC fails, that pNIC will always be used for that VM. No account is taken of the traffic generated by the VM.

The pNICs are allocated on a simple round-robin basis as VMs are started. A slight variation on this is source MAC hash, where the MAC address of the vNIC on the VM is used instead of the port ID. Port ID has a slight advantage, as the host does not have to open the packet and read the source MAC, and is therefore lighter on the CPU. If you have 4 VMs on a vSwitch/portgroup with 2 pNICs and you use either of these options, the following could occur.

VM1 starts - high Network load -> pNIC 1

VM2 starts - low Network load -> pNIC 2

VM3 starts - high Network load -> pNIC 1

VM4 starts - low Network load -> pNIC 2

So as you can see, this is not really load balancing, but it is simple and needs no special config on the core switch attached to the pNICs.
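The four-VM example above can be sketched in a few lines of Python. This is only an illustration of the round-robin idea described in this post, not ESX code: the function names and the per-VM load figures (in Mbit/s) are made up for the example.

```python
from itertools import cycle

def assign_uplinks(vms, pnics):
    """Bind each VM to a pNIC in start order, round-robin.

    The VM's load is deliberately ignored, mirroring the behaviour
    described in the thread (illustrative model only).
    """
    uplinks = cycle(pnics)
    return {name: next(uplinks) for name, _load in vms}

def load_per_pnic(vms, assignment):
    """Sum the (hypothetical) load that ends up on each pNIC."""
    totals = {}
    for name, load in vms:
        pnic = assignment[name]
        totals[pnic] = totals.get(pnic, 0) + load
    return totals

# Hypothetical loads in Mbit/s: VM1 and VM3 are busy, VM2 and VM4 are idle.
vms = [("VM1", 400), ("VM2", 20), ("VM3", 400), ("VM4", 20)]
assignment = assign_uplinks(vms, ["pNIC1", "pNIC2"])
totals = load_per_pnic(vms, assignment)

print(assignment)
print(totals)
```

Both busy VMs end up on pNIC1 purely because of the order in which they started, which is the uneven spread the example is warning about.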

IP Hash.

If, however, you choose IP Hash, the association is made between the pNIC and source-destination IP pairs.

What this means is that if a VM with an IP of 10.0.0.1 talks to 10.0.0.2, it would use pNIC 1. If the same VM then opens a channel with 10.0.0.3, it could be directed to pNIC 2. A third connection to 10.0.0.4 might be back to pNIC 1.

The downside to this is that the core switch will see the MAC address of the VM on at least two of its ports. With the same MAC address on two ports, which one does it use for the return packets? Unless your switch is configured to expect this behaviour (port aggregation), it's going to get very confused.

Of course, if your VM talks primarily to one server (back-end database, iSCSI device, default gateway), how much load balancing do you get anyway?

Well, none. As the man says, it is not really load balancing; it is more fault tolerance.
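To make the IP Hash idea concrete, here is a rough Python sketch. The XOR-then-modulo hash is an assumption chosen for illustration; it is not claimed to be the exact function ESX uses, and the pNIC it picks for a given pair need not match the "could/might" guesses in the example above. The function name is hypothetical.

```python
import ipaddress

def ip_hash_uplink(src_ip, dst_ip, pnics):
    """Pick a pNIC from the source/destination IP pair.

    Illustrative hash only: XOR the two addresses as integers and take
    the result modulo the number of uplinks. pNIC load is never consulted.
    """
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return pnics[(src ^ dst) % len(pnics)]

pnics = ["pNIC1", "pNIC2"]

# One VM (10.0.0.1) talking to three different peers.
for dst in ("10.0.0.2", "10.0.0.3", "10.0.0.4"):
    print("10.0.0.1 ->", dst, "uses", ip_hash_uplink("10.0.0.1", dst, pnics))

# The same pair always maps to the same pNIC, so a VM that talks mostly
# to one peer (for example its database server) stays on a single uplink.
print(ip_hash_uplink("10.0.0.1", "10.0.0.2", pnics))
```

Different peers can land on different pNICs, but every packet between the same two addresses takes the same uplink, which is why a VM with one dominant peer sees no spreading at all.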


I know that you have to assign a VM to a port group as well as a vSwitch that is going to have one or more pNICs assigned to it. What I was talking about is that if there are multiple pNICs attached to a vSwitch, then the VM's virtual NIC is assigned to one of these physical NICs once it is booted. If this is incorrect then that is great, but you should tell your tech support to stop giving out false information. When we put in a ticket to discuss this very subject with your tech support, this is exactly how it was explained to me. So if I am incorrect, then that is good to know. I am always looking to learn and help if I can; that is why I am participating in this forum.

Hey, sorry if I sounded a bit rude, I just don't want to see false knowledge spread on these forums. I've been browsing them for some time and I'm always amazed to find people convinced of something which sounds horribly wrong to me, and I always end up wondering where it came from (answer: the same forum). And I certainly don't want to see people writing scripts to periodically connect/disconnect a VM from a portgroup thinking it will load-balance traffic; I wouldn't be able to sleep at night knowing that :)

If you precisely wrote down what the VMware support told you, then I personally apologize and we should make sure none of our support folks think this way.

Maybe you can help me understand how it is actually done. This is important to a lot of us. Do the packets jump between the pNICs depending on the current load being put on that pNIC as long as they have been attached to the same vSwitch? Any info that you can offer would be great.

It actually depends on the teaming policy you choose.

With the PortID-based policy, there is indeed a "bind this VM to this pNic" logic going on, but it is only based on the numerical ID the VM port is given: we don't look at the real-time load of each pNic in order to determine which pNic to choose. This policy is actually pretty similar to a round-robin one where we assign the pNics one after the other. In the majority of cases it works well, but there are pathological cases where the load isn't evenly spread.

With the srcMAC and IP-Hash based policies, it depends on the content of  each packet. ESX quickly hashes a packet and the result determines  which pNic will output it. Again, we don't look at the load of each pNic  when making this decision, this is purely semi-random.
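As a rough picture of what "hash the packet, pick the pNic" means for the srcMAC policy, here is a small Python sketch. The use of CRC32 and the function name are assumptions made for illustration, not the hash ESX actually uses, and the MAC addresses are invented.

```python
import zlib

def src_mac_uplink(src_mac, pnics):
    """Pick a pNIC from the packet's source MAC address.

    Illustrative only: CRC32 stands in for whatever hash the real
    implementation uses. The choice depends on the packet content
    alone; no per-pNIC load counter is read anywhere.
    """
    digest = zlib.crc32(src_mac.lower().encode("ascii"))
    return pnics[digest % len(pnics)]

pnics = ["pNIC1", "pNIC2"]

# Hypothetical vNIC MAC addresses.
for mac in ("00:50:56:00:00:01", "00:50:56:00:00:02", "00:50:56:00:00:03"):
    print(mac, "->", src_mac_uplink(mac, pnics))
```

Because a vNIC's MAC does not change, every packet it sends hashes to the same pNic, so the spread across uplinks depends only on which addresses happen to be in use.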

Note that the IP-Hash policy is the only one theoretically allowing a VM  to get more bandwidth than what you get with one uplink.

Also note that the srcMAC & IP-Hash policies have a slightly bigger  CPU overhead compared to the PortID-based one, because ESX has to look  at the content of each packet individually.

Again, don't think ESX will stay that way: we know that you're interested in leveraging your pNics :)


Thanks to both of you for the information and for clearing things up for me. I most definitely do not want to give out false information.


I have added some additional NICs to the vSwitch. I will report back on whether this fixes the issue. I also upgraded the host to 3.5.

Last update: 05-07-2008 09:54 PM