Weird Connectivity Issue - Could it be VMware related?

Hello,

For the past few months I have been trying to figure out a connectivity issue that I keep running into. I am running SAP on Windows. SAP is running on a physical box on Windows Server 2003 Enterprise R2 and SQL 2005. There is also an SAP app server running on another physical Windows Server 2003 R2 machine. In a VM, I am running SAP's WebConsole, which allows remote RF devices to run SAP transactions over the web. The WebConsole VM (Windows Server 2003 Standard R2) runs on ESX 3.01. During moderate to heavy usage times in the system, I get a ton of dropped connections between the app servers and the VM. The ESX box isn't incredibly busy from a network perspective, and the ESX host is plugged into the same physical Cisco Catalyst 2960G-48 as the app servers. I have replaced the network cards in the app servers, and all of the cables. I have 4 different app servers, and no matter which one I point the web box at, I get the same result. Is there any way to see if my NIC in the ESX box is having issues, or any other way to monitor this situation?

Not sure if you can, but have you looked into installing ntop on your host to get a better view of the traffic? I have never used it; I just ran across it the other day.

Have you tried using ethereal?

That should let you know what the traffic is and the source/destination for the packets.

HTH

Are the speed and duplex matched between the pSwitch and the physical Windows boxes?

If the switch ports are hard-set to 100/1000 full duplex and the Windows boxes are set to auto/auto, the Windows machines will come up at 100/1000 (whatever you have there) but at half duplex. A Cisco port that is hard-set to full duplex does not send duplex information to the server, so Windows interprets this as half duplex. The resulting duplex mismatch can cause problems.

How is your ESX network set up? How many NICs do you have and what are they sharing?

All of the NICs and switch ports on all servers are hard coded to 1000/Full.

I have run ethereal and have not seen any dropped packets etc. The only clue I get is a "Connection lost to terminal (Name of Web Server)" message in the SAP app servers.
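Since a packet capture shows nothing obvious, a periodic TCP connect probe run from a machine on the same switch is one way to timestamp when connections start failing. The following is only a minimal sketch: the function names `probe` and `monitor` are made up for illustration, and the host and port you point it at (for example an app server and its SAP dispatcher port) are assumptions to fill in for your own environment.

```python
import socket
import time
from datetime import datetime


def probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def monitor(host: str, port: int, interval: float = 1.0, count: int = 60) -> list:
    """Probe `count` times, `interval` seconds apart; return failure timestamps."""
    failures = []
    for _ in range(count):
        if not probe(host, port):
            failures.append(datetime.now().isoformat(timespec="seconds"))
        time.sleep(interval)
    return failures
```

Note that this only tests whether new connections can be opened; it will not by itself show why an already established session is dropped. The failure timestamps can still be lined up against the "Connection lost" messages in the SAP logs.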

ESX is setup with about 12 hosts sharing a single NIC that is plugged into the switch. Should I try to add a second nic to the Vswitch? Can I tell it to load balance it? What would I have to do on the switch side to make this work?

Is your Service Console / VMotion sharing the same NIC? If so, that is your problem. You should have at least two vSwitches set up: one for SC/VMotion and one for network traffic. They should all be gig connected.

They are separate.

You would have to trunk the ports on the pswitch side. The thing with the load balancing is that ESX doesn’t do it on the fly. It only does it when a VM is rebooted. At that time it checks the load on the pNIC and assigns the VM to the lowest utilized one at that time. You may also be able to get the same effect if you create a vSwitch that has no pNIC assigned to it and move the VM to that port group briefly and then back again. ESX doesn't continuously load balance.

Here is a good pdf on it.

www.vmware.com/files/pdf/virtual_networking_concepts.pdf

You would have to trunk the ports on the pswitch side. The thing with the load balancing is that ESX doesn’t do it on the fly. It only does it when a VM is rebooted. At that time it checks the load on the pNIC and assigns the VM to the lowest utilized one at that time. You may also be able to get the same effect if you create a vSwitch that has no pNIC assigned to it and move the VM to that port group briefly and then back again. ESX doesn't continuously load balance.

This is completely false: we (at least in the current version of ESX) never look at the load of the physical NICs in order to make a teaming decision, whether statically (when a VM is connected to a vswitch) or dynamically ("on the fly"). This is something we're trying to improve, but it's a much more complicated problem than it looks.

Just get that Virtual Cisco switch up and running!!!

Duncan
My virtualisation blog:
http://www.yellow-bricks.com

I know that you have to assign a vm to a port group as well as a vswitch that is going to have one or more pNIC assigned to it. What I was talking about is if there are multiple pNICs attached to a vSwitch then the VMs virtual NIC is assigned to one of these physical NICs once it is booted. If this is incorrect then that is great, but you should tell your tech support to stop giving out false information. When we put in a ticket to discuss this very subject with your tech support this is exactly how it was explained to me. So if I am incorrect then that is good to know. I am always looking to learn and help if I can that is why I am participating in this forum.

Maybe you can help me understand how it is actually done. This is important to a lot of us. Do the packets jump between the pNICs depending on the current load being put on that pNIC as long as they have been attached to the same vSwitch? Any info that you can offer is great?

Take a look at page 8 and 9 of the pdf that I linked earlier. It should give you a good description of what you are looking for.

If you have more than one pNic attached to the vSwitch the way the pNics are used by the VM's is down to the three possible methods that can be selected as a function of the Vswitch or portgroup.

Port ID and MAC Hash.

When a VM starts up and network traffic livens up a port on the vSwitch/portgroup, an association is made between that port and one of the pNICs.

Until the VM is rebooted or the pNic fails that pNic will always be used for that VM. No account is made of the traffic generated by the VM.

The pNICs are allocated on a simple round robin basis as VMs are started. A slight variation on this is source MAC hash, where the MAC address of the vNIC on the VM is used instead of the Port ID. Port ID has a slight advantage, as the host does not have to open the packet and read the source MAC, and is therefore lighter on the CPU. If you have 4 VMs on a vSwitch/portgroup with 2 pNICs and you use either of these options, the following could occur.

VM1 starts - high Network load -> pNIC 1

VM2 starts - low Network load -> pNIC 2

VM3 starts - high Network load -> pNIC 1

VM4 starts - low Network load -> pNIC 2

So as you can see, this is not really load balancing, but it is simple and needs no special config on the core switch attached to the pNICs.
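The four-VM case above can be sketched in a few lines of Python. This is an illustration only, not VMware's actual code: the modulo rule, the function names, and the idea that VMs receive consecutive port IDs as they power on are all assumptions made to mirror the round-robin description.

```python
def uplink_by_port_id(port_id: int, num_uplinks: int) -> int:
    """Pick an uplink index from the virtual port ID alone (assumed rule)."""
    return port_id % num_uplinks


def uplink_by_src_mac(mac: str, num_uplinks: int) -> int:
    """Pick an uplink index from the source MAC (here: its last byte, assumed)."""
    last_byte = int(mac.split(":")[-1], 16)
    return last_byte % num_uplinks


# Four VMs powered on in order; assume they get port IDs 0..3.
vms = [("VM1", "high"), ("VM2", "low"), ("VM3", "high"), ("VM4", "low")]

# Map each VM to a pNIC number (1-based, to match the example above).
placement = {
    name: uplink_by_port_id(port_id, 2) + 1
    for port_id, (name, _load) in enumerate(vms)
}

# Group the traffic profiles per pNIC to show the imbalance.
load_per_pnic = {1: [], 2: []}
for name, load in vms:
    load_per_pnic[placement[name]].append(load)

print(placement)
print(load_per_pnic)
```

Neither function takes the traffic level as an input, which is the whole point: both "high" VMs land on pNIC 1 purely because of the order in which they started.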

IP Hash.

If, however, you choose IP Hash, the association is made between the pNIC and source-destination IP pairs.

What this means is that if a VM with an IP of 10.0.0.1 talks to 10.0.0.2, it would use pNIC 1. If the same VM then opens a channel with 10.0.0.3, it could be directed to pNIC 2. A third connection to 10.0.0.4 might be back on pNIC 1.

The downside to this is that the core switch will see the MAC address of the VM on at least two of its ports. With the same MAC address on two ports, which one does it use for the return packets? Unless your switch is configured to expect this behaviour (port aggregation), it's going to get very confused.

Of course if your VM talks primarily to one server - back end database, iscsi device, default gateway, how much load balancing do you get anyway?

Well, none. As the man says, it is not really load balancing; it's more fault tolerance.
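To make the "one peer, one pNIC" point concrete, here is a rough Python sketch of a source-destination IP hash. The XOR-and-modulo rule and the function name `uplink_by_ip_hash` are assumptions chosen for illustration; the exact hash ESX uses may differ, so the specific pNIC each pair lands on here should not be read as what ESX would pick.

```python
import ipaddress


def uplink_by_ip_hash(src: str, dst: str, num_uplinks: int) -> int:
    """Pick an uplink index from the source/destination IP pair (assumed hash)."""
    s = int(ipaddress.ip_address(src))
    d = int(ipaddress.ip_address(dst))
    return (s ^ d) % num_uplinks


vm_ip = "10.0.0.1"

# One VM talking to several peers: the pairs can spread across both pNICs.
many_peers = ["10.0.0.2", "10.0.0.3", "10.0.0.4"]
spread = {peer: uplink_by_ip_hash(vm_ip, peer, 2) for peer in many_peers}

# The same VM talking only to one back-end server: always the same pNIC.
single_peer = {uplink_by_ip_hash(vm_ip, "10.0.0.2", 2) for _ in range(1000)}

print(spread)
print(single_peer)
```

With several peers the VM's traffic is split between uplinks, but a thousand packets to the same database server all hash to one uplink, so the second pNIC only helps as a failover path for that conversation.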

I know that you have to assign a vm to a port group as well as a vswitch that is going to have one or more pNIC assigned to it. What I was talking about is if there are multiple pNICs attached to a vSwitch then the VMs virtual NIC is assigned to one of these physical NICs once it is booted. If this is incorrect then that is great, but you should tell your tech support to stop giving out false information. When we put in a ticket to discuss this very subject with your tech support this is exactly how it was explained to me. So if I am incorrect then that is good to know. I am always looking to learn and help if I can that is why I am participating in this forum.

Hey, sorry if I sounded a bit rude, I just don't want to see false knowledge spread on these forums. I've been browsing them for some time and I'm always amazed to find people convinced of something which sounds horribly wrong to me, and I always end up wondering where it came from (answer: the same forum). And I certainly don't want to see people writing scripts to periodically connect/disconnect a VM from a portgroup thinking it will load-balance traffic. I wouldn't be able to sleep at night knowing that.

If you precisely wrote down what the VMware support told you, then I personally apologize and we should make sure none of our support folks thinks this way.

Maybe you can help me understand how it is actually done. This is important to a lot of us. Do the packets jump between the pNICs depending on the current load being put on that pNIC as long as they have been attached to the same vSwitch? Any info that you can offer is great?

It actually depends on the teaming policy you choose. With the PortID-based policy, there is indeed a "bind this VM to this pNic" logic going on, but it is only based on the numerical ID the VM port is given: we don't look at the real-time load of each pNic in order to determine which pNic to choose. This policy is actually pretty similar to a round-robin one where we assign the pNics one after the other. In the majority of the cases it works well, but there are pathological cases where the load isn't evenly spread.

With the srcMAC and IP-Hash based policies, it depends on the content of each packet. ESX quickly hashes a packet and the result determines which pNic will output it. Again, we don't look at the load of each pNic when making this decision, this is purely semi-random.

Note that the IP-Hash policy is the only one theoretically allowing a VM to get more bandwidth than what you get with one uplink.

Also note that the srcMAC & IP-Hash policies have a slightly bigger CPU overhead compared to the PortID-based one, because ESX has to look at the content of each packet individually.

Again, don't think ESX will stay that way: we know that you're interested in leveraging your pNics.

Thanks to the both of you for the information and clearing things up for me. I most definitely do not want to give out false information.

I have added some additional Nics to the Vswitch. I will report if this fixes the issue. I also upgraded the host to 3.5.
