rmontyq (Contributor)

Best possible networking configuration for HA and DRS - any networking gurus like to comment?

Evening:

Until recently, I had the following networking configuration on most of my ESX servers (DL580s, shared storage on NetApp 3020s, six gigabit NICs per host):

vSwitch0:

Service Console | vmnic0 physically connected to Cisco Switch0

VMkernel | vmnic1 physically connected to Cisco Switch1

vmnic0 is the standby NIC for vmnic1, and vmnic1 is the standby NIC for vmnic0

Service Console properties: Security and Traffic Shaping at defaults, NIC Teaming set to:

Route based on source MAC hash, beacon probing, notify switches, rolling failover

vSwitch1:

2 port groups, pg_01 and pg_02.

pg_01:

vmnic2 physically connected to Cisco Switch0

vmnic3 physically connected to Cisco Switch1

Port group properties: Security and Traffic Shaping at defaults, NIC Teaming set to:

Route based on source MAC hash, beacon probing, notify switches, rolling failover

pg_02:

vmnic4 physically connected to Cisco Switch0

vmnic5 physically connected to Cisco Switch1

Port group properties: Security and Traffic Shaping at defaults, NIC Teaming set to:

Route based on source MAC hash, beacon probing, notify switches, rolling failover

vmnic2 and vmnic3 act as the backups for pg_02, and vmnic4 and vmnic5 as the backups for pg_01.
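To make the MAC-hash policy above a bit more concrete, here is a minimal Python sketch of the idea. It is a simplification, not ESX's actual hashing code, and the MAC addresses are just examples; the uplink names are the ones from pg_01 above. The point is that each source MAC gets pinned to one active uplink, so in steady state a given Service Console, VMkernel or VM MAC should be seen on only one physical switch; a failover or failback event (including a beacon-probing false positive) moves that MAC to the uplink on the other switch, which the upstream network may well report as the same MAC flapping between switches.

```python
# Illustrative model of "route based on source MAC hash" teaming
# (a simplification, not ESX's actual algorithm): each source MAC
# is hashed and pinned to one active uplink, so a given MAC should
# normally appear on only one physical switch at a time.

def uplink_for_mac(mac: str, active_uplinks: list[str]) -> str:
    """Pick an uplink for a frame based on its source MAC address."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    return active_uplinks[mac_bytes[-1] % len(active_uplinks)]

# pg_01 from the post: vmnic2 -> Cisco Switch0, vmnic3 -> Cisco Switch1
pg_01_uplinks = ["vmnic2", "vmnic3"]

for mac in ["00:50:56:a1:00:01", "00:50:56:a1:00:02"]:  # example VM MACs
    print(mac, "->", uplink_for_mac(mac, pg_01_uplinks))
```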

So, I had 7 servers configured this way for about 3 months when, this week, we began experiencing issues with Cisco Switch0. The networking group saw the same MAC addresses on BOTH switches, traced back to the Service Console and VMkernel of all 7 servers, and insisted that I 1) split the VMkernel and Service Console onto separate virtual switches, 2) route based on virtual port ID, and 3) turn off beacon probing.

Since they cannot explain why there were no issues until this week (remember, I had these servers configured this way for about 3 months), can anyone else attempt to explain to me where the above configuration is incorrect such that it would cause ONE switch to begin acting flaky and shutting down? Keep in mind that Cisco Switch0, and NOT Cisco Switch1, began having issues.

Any enlightenment would be appreciated.

Thanks!

9 Replies
mike_laspina (Champion)

Hi,

You have done a great job describing the ESX side, but what is happening on the physical switch side?

How is it configured?

http://blog.laspina.ca/ vExpert 2009
rmontyq (Contributor)

Well, I am not a networking engineer and therefore do not have access to any configurations/logs on the physical side, hence my clearer explanation of the VMware side <g>. What I have been told is that there is "flapping" going on between the two physical switches, where both switches see traffic from the same MAC address. Those MAC addresses trace back to the VMkernel and Service Console. They attribute this to the "beaconing" being turned on on my side.

Why I configured the VMware side as I have: starting 8 months ago, one of the switches would flake out, showing link status as connected but not passing any traffic. This took down any VM/ESX server that relied on link status only for failover detection.

After about 2 incidents like this, and additional reading and discussion with other VMware users, I settled on the above configuration. Until the network group finally figured out what was going on with that one switch, I never again suffered an outage when the switch did flake out.

Now the OTHER switch is doing the same thing, a code update did not help it this time, and the networking group is claiming that my ESX configuration is causing the switch to "flap" and therefore go offline. It is interesting to note that, again, I never suffered an outage on the ESX/VM side (physical servers with non-redundant links or DBs went down hard) while configured as above, and only the second switch would go offline, not both.

So my query to the community at large is to better understand what is going on.

mike_laspina (Champion)

Ok,

I can see why there are issues. The networking group does not see what is needed on the ESX side, and you likewise are not sure of what's required on their side.

It's actually a bit of both worlds. The switch config can impact the ESX environment just as the ESX config can impact the physical network.

There are two very elemental components: 802.1Q VLAN trunking and port aggregates (LACP or PAgP). It can run with or without them, but VLANs are a must if you want to make the most of your ESX environment.

Regardless of where the separate functional elements (Service Console, VMkernel, VM networks) sit, those two fundamentals need to be done correctly to maintain a stable network.

I can only give you a left or right on this one without the switch details.

If you are only using 802.1Q VLAN trunks, then you should run the team settings as follows (the defaults):

Route based on the originating virtual port ID

Link status only

Notify - Yes

Failback - Yes

With aggregates the possibilities expand to a wider range.

The more common approach is to use IP hash routing with EtherChannel port bundles on Cisco switches (FEC), and it must use static channel configuration settings only (no negotiation).
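As a rough illustration of why the static channel matters, here is a small Python sketch of IP-hash uplink selection. It is a simplified model, not ESX's exact algorithm, and the addresses and vmnic bundle are hypothetical. Because the uplink is chosen per source/destination IP pair, a single VM's MAC will legitimately show up on every port in the bundle, so the Cisco side has to treat those ports as one EtherChannel rather than as independent access ports.

```python
# Simplified model of "route based on IP hash" (not ESX's exact code):
# the uplink is chosen from the source and destination IP of each flow,
# so the same MAC can appear on every port in the bundle -- hence the
# requirement for a single static EtherChannel on the switch side.

import ipaddress

def uplink_for_flow(src_ip: str, dst_ip: str, uplinks: list[str]) -> str:
    """Pick an uplink by XOR-ing the source and destination addresses."""
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return uplinks[(src ^ dst) % len(uplinks)]

bundle = ["vmnic2", "vmnic3", "vmnic4", "vmnic5"]  # hypothetical 4-port channel

print(uplink_for_flow("10.0.0.15", "10.0.1.20", bundle))
print(uplink_for_flow("10.0.0.15", "10.0.1.21", bundle))  # same VM, different uplink
```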

http://blog.laspina.ca/ vExpert 2009
rmontyq (Contributor)

Great so far. I have left a query with one of our network engineers asking exactly what Cisco hardware we are running, as well as the IOS version and any configurations that may help. As additional info, of the 7 ESX servers, 4 of them are workhorses with 4 dual-core CPUs and 36 GB RAM, where I have spec'd them out to run upwards of 48 VMs each in an emergency and 24 VMs each day to day under normal circumstances. Three are low-end DL360s used for test/QA/PC-class VMs with 4 NICs and 4 GB RAM, normally running 6-8 VMs concurrently.

Until I get more info about the switches, which is the best way to divide my 6 connections up (and I am not above adding up to 4 more per server!)?

1. Combine the Service Console and VMkernel with 2 NICs configured as you've stated, and 4 NICs for the VMs under two port groups split with 2 NICs each, each backing the other, also configured the way you've suggested

2. Split the Service Console onto a vSwitch of its own with 2 NICs, and have the VMkernel share the 4 NICs with the VMs (one as primary, the other three for failover)

3. Split the Service Console, VMkernel and VMs onto separate vSwitches with 2 NICs each

4. Split the Service Console with 1 NIC, the VMkernel with 1 NIC, and the VMs with 4 NICs

5. Add 4 additional NICs and go with #3, using the four new NICs for the VMs. (I say 4 because it is more cost effective to buy a single quad-port add-on card than a dual- or single-port card.)

6. Some other configuration you would suggest.

BTW, thanks for all the info and assistance. I really believe in the VMware product and need to ensure it works at its peak performance.

mike_laspina (Champion), Accepted Solution

What you have now is very close to best practice, with a few exceptions and one unknown for me.

To better meet best practice, we would need to create redundancy across all service classes (VM networks, Service Console and VMkernel) while keeping them separate: add one dual-port card and you are there. This way you can survive more fault events, most commonly human error. You may not even need to add the dual-port card, depending on the load; you may be able to do three pairs. Two 1-gigabit adapters can carry a lot of traffic.

The last part is the unknown: are you running public access on any of the VMs? If so, the last component of best practice would be to physically separate the public VLAN on both the NICs and the physical switches.

I like using two Intel quad-port cards; with the 2 motherboard ports you usually end up with 10 ports.

You can place the Service Console and VMkernel together with no serious issues; it just has some packet-eavesdropping security concerns, if that's even relevant.

The highest priority is fault tolerance; it's the highest-risk item in the security domain for every system.

Of your list, options 3 and 5 will provide the highest availability, and that's what I would focus on. Whether you need option 5 comes down to the load capacity requirements; VMware's performance stats should help with that.

The reason I would use the dual physical-switch method is that the network devices need patches too; this way you can tolerate one switch going down at a time and keep the systems alive.

The drawback is that it will be more complex to deal with, and you need to test the failure scenarios to verify it works before a real event occurs.
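For what it's worth, one way the six existing ports could be laid out under option 3 (three separate pairs, each pair split across the two physical switches) is sketched below in Python just to make the mapping explicit. The vmnic-to-switch assignments are illustrative only, not a prescription.

```python
# One possible "three separate pairs" layout for the six existing ports,
# each pair split across the two physical switches so a single switch
# outage never isolates a traffic class. Assignments are illustrative.

layout = {
    "vSwitch0 (Service Console)":  {"vmnic0": "Cisco Switch0", "vmnic1": "Cisco Switch1"},
    "vSwitch1 (VMkernel/VMotion)": {"vmnic2": "Cisco Switch0", "vmnic3": "Cisco Switch1"},
    "vSwitch2 (VM networks)":      {"vmnic4": "Cisco Switch0", "vmnic5": "Cisco Switch1"},
}

for vswitch, uplinks in layout.items():
    pairs = ", ".join(f"{nic} -> {switch}" for nic, switch in uplinks.items())
    print(f"{vswitch}: {pairs}")
```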

http://blog.laspina.ca/ vExpert 2009
rmontyq (Contributor)

Really just one more clarification if you would: how do you survive a link-status failure when the switch itself reports that the link is up but it is not passing traffic? That is why I chose beacon probing to begin with; we were hit too many times with that exact scenario in the past year and would not like to suffer it again. Is this something that can be configured on the switch side (I know you would like more info on the switches)? But then I am trusting the switch setup to be correct.

Thanks for the assist - I will mark this answered for now until I can get more info.

java_cat33 (Virtuoso)

One thing to be aware of is that if you have a port channel configured for, say, 4 of your ESX NICs and you have beacon probing enabled, when you ping an address you will receive a standard ping response and 3 duplicates. I found that if I use link status only and turn off beacon probing, the problem no longer exists.

I've got no technical explanation for this as I'm not a network dude, but just something to be aware of.

mike_laspina (Champion)

To beacon or not to beacon, that is the question.

Beacons, when they work correctly, are great: they can detect failures in the physical switch, like VLAN misconfigurations and layer 2 failures, beyond the port-based RX/TX carrier (aka link) detection.

The problem with them is that they don't always behave as we expect them to.

For beacons to work, we must be very diligent in the physical switch configuration and testing. You can't just turn it on and hope for the best.

Beacons send out a packet that needs to cross the physical switch and come back in on a targeted port other than the source port.

Switches do not always conform to this flow when features like flow control, QoS, etc. are in play, and eventually one mishandled event creates a false positive and the complexity kicks you.
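A rough Python sketch of that beacon logic follows (a simplification, not VMware's implementation; the vmnic names are just examples). Each uplink in the team expects to hear the beacons sent by its teammates, so a missing beacon flags a problem that plain link state cannot see, but anything on the switch that drops or reroutes those frames produces a false positive. Note too that with only two uplinks a lost beacon is ambiguous about which side is actually at fault, which is one reason link-status-only is often preferred for two-NIC teams.

```python
# Sketch of the beacon-probing idea (a simplification, not VMware's code):
# every uplink broadcasts a beacon and expects to hear the beacons sent
# by its teammates; a missing beacon marks the uplink as suspect.

def uplinks_with_beacon_loss(team: list[str], heard: dict[str, set[str]]) -> list[str]:
    """Return uplinks that did not hear a beacon from every teammate."""
    suspect = []
    for uplink in team:
        expected = set(team) - {uplink}
        if heard.get(uplink, set()) != expected:
            suspect.append(uplink)
    return suspect

team = ["vmnic2", "vmnic3"]

# Healthy case: each uplink hears the other's beacon.
print(uplinks_with_beacon_loss(team, {"vmnic2": {"vmnic3"}, "vmnic3": {"vmnic2"}}))

# Dropped beacon: vmnic2 hears nothing, so it is flagged -- but with only
# two uplinks it is ambiguous whether vmnic2's receive path or vmnic3's
# transmit path is really at fault.
print(uplinks_with_beacon_loss(team, {"vmnic2": set(), "vmnic3": {"vmnic2"}}))
```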

Link status is detected by both the switch and the NIC; one cannot occur without the other. It's a specific RF signal established between transmitter/receiver pairs, so it's not just the switch generating it.

http://blog.laspina.ca/ vExpert 2009
rmontyq (Contributor)

Link status is a specific RF signal established between transmitter and receiver pairs. I can understand that the switch is not the one generating it; I guess my observation should be that we have experienced failures at a higher level/layer than link status that were not detectable until systems went down.

That is why I liked beaconing, but I now have a better understanding of how it actually works and some of the pros and cons. I need to establish a dialog with our Network Director (a Cisco freakin' guru) and see where we can go from here.

I do believe I am going to place a request for 4 quad-port NIC cards in the near future to beef up the networking. While the observation was made that 2 gigabit NICs can handle a lot of VMs, we experienced dropped packets when 18 or more VMs utilized those 2 cards, hence the config with 4 NICs and split port groups to balance the VM load. My understanding was that TCP/IP contention was the culprit.
