NuggetGTR
VMware Employee

NLB issues

Hi All,

This is a weird one and I can't really find anything when searching. I just wanted to know whether other people have seen this, and whether there is a workaround?

OK, the virtual environment I administer is quite large, and the developers love Windows NLB, whereas I prefer F5 or other hardware load balancing. So there are quite a lot of Windows NLB clusters in the virtual environment. The issue I have come across is when the critical path for an application goes from one NLB cluster to another (so they are talking to each other): if a node from each of those clusters lands on the same physical host, NLB falls over.

Why?

Well, with Windows NLB, when a request is sent to the VIP, all the nodes of that cluster must see it before it is actioned by one of the nodes. From the testing I've done, what I have found is that if a node from each cluster is on the same host, the request seems to stay internal to the ESX host: the vSwitch effectively goes "I know where that IP is" and passes it to that one node. Meanwhile that node is waiting for the other node(s) in the cluster to acknowledge the request, but they never get it, so the whole thing halts and times out. And from the packet capturing done by the networks team, it seems the traffic doesn't even leave the host.
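To make that concrete, here's a toy model of the forwarding decision that seems to be happening. This is my own sketch, not VMware's actual code, and the VIP and node names are made up:

    # Toy model of a vSwitch forwarding decision (illustrative only).
    # In unicast mode Windows NLB gives every node the same cluster MAC
    # (02:bf followed by the VIP's octets in hex). Because the vSwitch sees
    # that MAC on a local port, it delivers the frame to that one VM instead
    # of flooding it to the physical network, so the cluster nodes on other
    # hosts never see the request.
    CLUSTER_MAC = "02:bf:0a:00:00:64"          # unicast NLB MAC for an assumed VIP 10.0.0.100
    local_ports = {CLUSTER_MAC: "nlb-node-1"}  # one cluster node happens to run on this host

    def vswitch_forward(dst_mac: str) -> str:
        if dst_mac in local_ports:
            # Known local MAC: deliver to that port only; nothing leaves the host.
            return f"delivered to {local_ports[dst_mac]} only"
        # Unknown MAC: send the frame out the uplink to the physical switch.
        return "sent out the uplink to the physical network"

    print(vswitch_forward(CLUSTER_MAC))  # -> delivered to nlb-node-1 only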

OK, a few anti-affinity rules would fix this, but I'm talking about an 8-node cluster talking to a 4-node cluster, which in turn talks to another 4-node cluster, and another 50 setups like that. We'd get to the point where DRS can't move anything, and the rules are an administrative nightmare, especially since only two machines can be in one rule. A rough count is sketched below.
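Back-of-the-envelope (the cluster sizes are from the example above; the arithmetic and the assumption that every cross-cluster pair needs its own 2-VM rule are mine):

    # 2-VM anti-affinity rules needed so that no node of one cluster shares
    # a host with any node of the cluster(s) it talks to, for a single
    # 8 -> 4 -> 4 application chain (illustrative sizes).
    cross_pairs = 8 * 4 + 4 * 4   # every node paired with every node of the next cluster
    print(cross_pairs)            # 48 rules for one application chain
    print(cross_pairs * 50)       # ~2400 rules across 50 similar setups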

All hosts are running ESX 4 Update 1 on HP blades. Unfortunately we are running in unicast mode: due to the size of the environment, the networks team don't want to, or can't, add the static ARP entries to all the switches/routers that multicast mode would need. It is set up as per the recommendations for unicast.
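For reference, multicast mode derives the cluster MAC from the VIP, and that MAC is what the static entries would have to map. A quick sketch of the derivation (the VIP is made up; check your platform's documentation for the actual ARP command syntax):

    # Sketch: derive the multicast NLB cluster MAC for a given VIP.
    # Multicast NLB uses 03:bf followed by the VIP's four octets; routers
    # that refuse to resolve a multicast MAC via ARP need a static mapping,
    # e.g. on Cisco IOS something like: arp 10.0.0.100 03bf.0a00.0064 ARPA
    def nlb_multicast_mac(vip: str) -> str:
        octets = [int(o) for o in vip.split(".")]
        return ":".join(f"{b:02x}" for b in [0x03, 0xBF] + octets)

    print(nlb_multicast_mac("10.0.0.100"))  # -> 03:bf:0a:00:00:64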

I can reproduce this issue every time, even in testing. Would it be because of unicast? I don't see how.

Also, I should add it doesn't have to be two NLB clusters: if a client server trying to hit the VIP of a Windows NLB cluster is on the same host as one of the nodes, it will time out, as only that one node gets the requests. F5 load balancing works flawlessly, and when the machines are on separate hosts Windows NLB works fine too.

I only came across this because jobs would come in about the application not responding: trying to hit the VIP on the correct port from the requesting server wouldn't connect. The support guys would vMotion a node, which would usually fix the issue, and if it didn't, they would move all the nodes onto the same ESX host and it would work every time (dodgy, I know, but until full-time VMware resources like me came on board, no one had time to really look at it).

Any ideas or input would be great. I hope I explained the issue clearly enough.

Cheers

________________________________________ Blog: http://virtualiseme.net.au VCDX #201 Author of Mastering vRealize Operations Manager
2 Replies
AndreTheGiant
Immortal

See: http://www.vmware.com/files/pdf/implmenting_ms_network_load_balancing.pdf

You have to use multicast.

Andre

Andrew | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
NuggetGTR
VMware Employee

Haha, I'll give you the points there Andre. I know that document soooo well... well, I thought I did; I must have read it 100 times.

After you linked it I thought, I'm sure that document says both modes are supported — which they are — but then, re-reading it for the 101st time, this stood out under unicast:

"Because the virtual switch operates with complete data about the underlying MAC addresses of the

virtual NICs inside each virtual machine, it always correctly forwards packets containing a MAC address

matching that of a running virtual machine. As a result of this behavior, the virtual switch does not

forward traffic destined for the Network Load Balancing MAC address outside the virtual environment

into the physical network, because it is able to forward it to a local virtual machine."

This is exactly the issue: when a node from each cluster is on the same host, the vSwitch doesn't send the traffic outside the host, so the other nodes — which would need the traffic to go out to the physical network — never get it.

Well, it's time I go talk to comms and see if they can handle going multicast. I'm sure their routers will support it; they're just being lazy.

Cheers!!!

________________________________________ Blog: http://virtualiseme.net.au VCDX #201 Author of Mastering vRealize Operations Manager