rmontyq (Contributor)

Best possible networking configuration for HA and DRS - any networking gurus like to comment?

Evening:

Until recently, I had the following networking configuration on most of my ESX servers (DL580s, shared storage on NetApp 3020s, six gigabit NICs per host):

vSwitch0:

Service Console | vmnic0 physically connected to Cisco Switch0

VMkernel | vmnic1 physically connected to Cisco Switch1

vmnic0 is the standby NIC for vmnic1, and vmnic1 is the standby NIC for vmnic0

Service Console properties: Security and Traffic Shaping at defaults, NIC Teaming set to:

Route based on source MAC hash, beacon probing, notify switches, rolling failover

vSwitch1:

2 port groups, pg_01 and pg_02.

pg_01:

vmnic2 physically connected to Cisco Switch0

vmnic3 physically connected to Cisco Switch1

Port group properties: Security and Traffic Shaping at defaults, NIC Teaming set to:

Route based on source MAC hash, beacon probing, notify switches, rolling failover

pg_02:

vmnic4 physically connected to Cisco Switch0

vmnic5 physically connected to Cisco Switch1

Port group properties: Security and Traffic Shaping at defaults, NIC Teaming set to:

Route based on source MAC hash, beacon probing, notify switches, rolling failover

vmnic2 and vmnic3 act as the backups for pg_02, and vmnic4 and vmnic5 as the backups for pg_01.
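To make the MAC-hash policy above a bit more concrete, here is a minimal Python sketch of the idea. It is a simplification, not ESX's actual hashing code, and the MAC addresses are just examples; the uplink names are the ones from pg_01 above. The point is that each source MAC gets pinned to one active uplink, so in steady state a given Service Console, VMkernel or VM MAC should be seen on only one physical switch; a failover or failback event (including a beacon-probing false positive) moves that MAC to the uplink on the other switch, which the upstream network may well report as the same MAC flapping between switches.

```python
# Illustrative model of "route based on source MAC hash" teaming
# (a simplification, not ESX's actual algorithm): each source MAC
# is hashed and pinned to one active uplink, so a given MAC should
# normally appear on only one physical switch at a time.

def uplink_for_mac(mac: str, active_uplinks: list[str]) -> str:
    """Pick an uplink for a frame based on its source MAC address."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    return active_uplinks[mac_bytes[-1] % len(active_uplinks)]

# pg_01 from the post: vmnic2 -> Cisco Switch0, vmnic3 -> Cisco Switch1
pg_01_uplinks = ["vmnic2", "vmnic3"]

for mac in ["00:50:56:a1:00:01", "00:50:56:a1:00:02"]:  # example VM MACs
    print(mac, "->", uplink_for_mac(mac, pg_01_uplinks))
```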

So, I had 7 servers configured this way for about 3 months when, this week, we began experiencing issues with Cisco Switch0. The networking group saw the same MAC addresses on BOTH switches, traced back to the Service Console and VMkernel of all 7 servers, and insisted that I 1) split the VMkernel and Service Console onto separate virtual switches, 2) route based on virtual port ID, and 3) turn off beacon probing.

Since they cannot explain why there were no issues until this week (remember, I had these servers configured this way for about 3 months), can anyone else attempt to explain to me where the above configuration is incorrect such that it would cause ONE switch to begin acting flaky and shutting down? Keep in mind that Cisco Switch0, and NOT Cisco Switch1, began having issues.

Any enlightenment would be appreciated.

Thanks!

9 Replies
mike_laspina (Champion)

Hi,

You have done a great job describing the ESX side, but what is happening on the physical switch side?

How is it configured?

http://blog.laspina.ca/ vExpert 2009
rmontyq (Contributor)

Well, I am not a networking engineer and therefore do not have access to any configurations/logs on the physical side, hence my clearer explanation of the VMware side <g>. What I have been told is that there is "flapping" going on between the two physical switches, where both switches see traffic from the same MAC address. Those MAC addresses trace back to the VMkernel and Service Console. They attribute this to the "beaconing" being turned on on my side.

Why I configured the VMware side as I have: starting 8 months ago, one of the switches would flake out, showing link status as connected but not passing any traffic. This took down any VM/ESX server that relied on link status only for failover detection.

After about 2 incidents like this, and additional reading and discussion with other VMware users, I settled on the above configuration. Until the network group finally figured out what was going on with that one switch, I never again suffered an outage when the switch did flake out.

Now the OTHER switch is doing the same thing, a code update did not help it this time, and the networking group is claiming that my ESX configuration is causing the switch to "flap" and therefore go offline. It is interesting to note that, again, I never suffered an outage on the ESX/VM side (physical servers with non-redundant links or DBs went down hard) while configured as above, and only the second switch would go offline, not both.

So my query to the community at large is to better understand what is going on.

mike_laspina (Champion)

Ok,

I can see why there are issues. The networking group does not see what is needed on the ESX side, and you likewise are not sure of what's required on their side.

It's actually a bit of both worlds. The switch config can impact the ESX environment just as the ESX config can impact the physical network.

There are two very elemental components: 802.1Q VLAN trunking and port aggregates (LACP or PAgP). It can run with or without them, but VLANs are a must if you want to make the most of your ESX environment.

Regardless of where the separate functional elements (Service Console, VMkernel, VM networks) sit, those two fundamentals need to be done correctly to maintain a stable network.

I can only give you a left or right on this one without the switch details.

If you are only using 802.1Q VLAN trunks, then you should run the team settings as follows (the defaults):

Route based on the originating virtual port ID

Link status only

Notify - Yes

Failback - Yes

With aggregates the possibilities expand to a wider range.

The more common approach is to use IP hash routing with EtherChannel port bundles on Cisco switches (FEC), and it must use static channel configuration settings only (no negotiation).
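As a rough illustration of why the static channel matters, here is a small Python sketch of IP-hash uplink selection. It is a simplified model, not ESX's exact algorithm, and the addresses and vmnic bundle are hypothetical. Because the uplink is chosen per source/destination IP pair, a single VM's MAC will legitimately show up on every port in the bundle, so the Cisco side has to treat those ports as one EtherChannel rather than as independent access ports.

```python
# Simplified model of "route based on IP hash" (not ESX's exact code):
# the uplink is chosen from the source and destination IP of each flow,
# so the same MAC can appear on every port in the bundle -- hence the
# requirement for a single static EtherChannel on the switch side.

import ipaddress

def uplink_for_flow(src_ip: str, dst_ip: str, uplinks: list[str]) -> str:
    """Pick an uplink by XOR-ing the source and destination addresses."""
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return uplinks[(src ^ dst) % len(uplinks)]

bundle = ["vmnic2", "vmnic3", "vmnic4", "vmnic5"]  # hypothetical 4-port channel

print(uplink_for_flow("10.0.0.15", "10.0.1.20", bundle))
print(uplink_for_flow("10.0.0.15", "10.0.1.21", bundle))  # same VM, different uplink
```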

http://blog.laspina.ca/ vExpert 2009
rmontyq (Contributor)

Great so far. I have left a query with one of our network engineers asking exactly what Cisco hardware we are running, as well as the IOS version and any configurations that may help. As additional info, of the 7 ESX servers, 4 of them are workhorses with 4 dual-core CPUs and 36 GB RAM, where I have spec'd them out to run upwards of 48 VMs each in an emergency and 24 VMs each day to day under normal circumstances. Three are low-end DL360s used for test/QA/PC-class VMs with 4 NICs and 4 GB RAM, normally running 6-8 VMs concurrently.

Until I get more info about the switches, which is the best way to divide my 6 connections up (and I am not above adding up to 4 more per server!)?

1. Combine the Service Console and VMkernel with 2 NICs configured as you've stated, and 4 NICs for the VMs under two port groups split with 2 NICs each, each backing the other, also configured the way you've suggested

2. Split the Service Console onto a vSwitch of its own with 2 NICs, and have the VMkernel share the 4 NICs with the VMs (one as primary, the other three for failover)

3. Split the Service Console, VMkernel and VMs onto separate vSwitches with 2 NICs each

4. Split the Service Console with 1 NIC, the VMkernel with 1 NIC, and the VMs with 4 NICs

5. Add 4 additional NICs and go with #3, using the four new NICs for the VMs. (I say 4 because it is more cost effective to buy a single quad-port add-on card than a dual- or single-port card.)

6. Some other configuration you would suggest.

BTW, thanks for all the info and assistance. I really believe in the VMware product and need to ensure it works at its peak performance.

mike_laspina (Champion), Accepted Solution

What you have now is very close to best practice, with a few exceptions and one unknown for me.

To better meet best practice, we would need to create redundancy across all service classes (VM networks, Service Console and VMkernel) while keeping them separate: add one dual-port card and you are there. This way you can survive more fault events, most commonly human error. You may not even need to add the dual-port card, depending on the load; you may be able to do three pairs. Two 1-gigabit adapters can carry a lot of traffic.

The last part is the unknown: are you running public access on any of the VMs? If so, the last component of best practice would be to physically separate the public VLAN on both the NICs and the physical switches.

I like using two Intel quad-port cards; with the 2 motherboard ports you usually end up with 10 ports.

You can place the Service Console and VMkernel together with no serious issues; it just has some packet-eavesdropping security concerns, if that's even relevant.

The highest priority is fault tolerance; it's the highest-risk item in the security domain for every system.

Of your list, options 3 and 5 will provide the highest availability, and that's what I would focus on. Whether you need option 5 comes down to the load capacity requirements; VMware's performance stats should help with that.

The reason I would use the dual physical-switch method is that the network devices need patches too; this way you can tolerate one switch going down at a time and keep the systems alive.

The drawback is that it will be more complex to deal with, and you need to test the failure scenarios to verify it works before a real event occurs.
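For what it's worth, one way the six existing ports could be laid out under option 3 (three separate pairs, each pair split across the two physical switches) is sketched below in Python just to make the mapping explicit. The vmnic-to-switch assignments are illustrative only, not a prescription.

```python
# One possible "three separate pairs" layout for the six existing ports,
# each pair split across the two physical switches so a single switch
# outage never isolates a traffic class. Assignments are illustrative.

layout = {
    "vSwitch0 (Service Console)":  {"vmnic0": "Cisco Switch0", "vmnic1": "Cisco Switch1"},
    "vSwitch1 (VMkernel/VMotion)": {"vmnic2": "Cisco Switch0", "vmnic3": "Cisco Switch1"},
    "vSwitch2 (VM networks)":      {"vmnic4": "Cisco Switch0", "vmnic5": "Cisco Switch1"},
}

for vswitch, uplinks in layout.items():
    pairs = ", ".join(f"{nic} -> {switch}" for nic, switch in uplinks.items())
    print(f"{vswitch}: {pairs}")
```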

http://blog.laspina.ca/ vExpert 2009
rmontyq (Contributor)

Really just one more clarification if you would: how do you survive a link-status failure when the switch itself reports that the link is up but it is not passing traffic? That is why I chose beacon probing to begin with; we were hit too many times with that exact scenario in the past year and would not like to suffer it again. Is this something that can be configured on the switch side (I know you would like more info on the switches)? But then I am trusting the switch setup to be correct.

Thanks for the assist - I will mark this answered for now until I can get more info.

java_cat33 (Virtuoso)

One thing to be aware of is that if you have a port channel configured for, say, 4 of your ESX NICs and you have beacon probing enabled, when you ping an address you will receive a standard ping response and 3 duplicates. I found that if I use link status only and turn off beacon probing, the problem no longer exists.

I've got no technical explanation for this as I'm not a network dude, but just something to be aware of.

mike_laspina (Champion)

To beacon or not to beacon, that is the question.

Beacons, when they work correctly, are great: they can detect failures in the physical switch, like VLAN misconfigurations and layer 2 failures, beyond the port-based RX/TX carrier (aka link) detection.

The problem with them is that they don't always behave as we expect them to.

For beacons to work, we must be very diligent in the physical switch configuration and testing. You can't just turn it on and hope for the best.

Beacons send out a packet that needs to cross the physical switch and come back in on a targeted port other than the source port.

Switches do not always conform to this flow when features like flow control, QoS, etc. are in play, and eventually one mishandled event creates a false positive and the complexity kicks you.
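A rough Python sketch of that beacon logic follows (a simplification, not VMware's implementation; the vmnic names are just examples). Each uplink in the team expects to hear the beacons sent by its teammates, so a missing beacon flags a problem that plain link state cannot see, but anything on the switch that drops or reroutes those frames produces a false positive. Note too that with only two uplinks a lost beacon is ambiguous about which side is actually at fault, which is one reason link-status-only is often preferred for two-NIC teams.

```python
# Sketch of the beacon-probing idea (a simplification, not VMware's code):
# every uplink broadcasts a beacon and expects to hear the beacons sent
# by its teammates; a missing beacon marks the uplink as suspect.

def uplinks_with_beacon_loss(team: list[str], heard: dict[str, set[str]]) -> list[str]:
    """Return uplinks that did not hear a beacon from every teammate."""
    suspect = []
    for uplink in team:
        expected = set(team) - {uplink}
        if heard.get(uplink, set()) != expected:
            suspect.append(uplink)
    return suspect

team = ["vmnic2", "vmnic3"]

# Healthy case: each uplink hears the other's beacon.
print(uplinks_with_beacon_loss(team, {"vmnic2": {"vmnic3"}, "vmnic3": {"vmnic2"}}))

# Dropped beacon: vmnic2 hears nothing, so it is flagged -- but with only
# two uplinks it is ambiguous whether vmnic2's receive path or vmnic3's
# transmit path is really at fault.
print(uplinks_with_beacon_loss(team, {"vmnic2": set(), "vmnic3": {"vmnic2"}}))
```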

Link status is detected by both the switch and the NIC; one cannot occur without the other. It's a specific RF signal established between transmitter/receiver pairs, so it's not just the switch generating it.

http://blog.laspina.ca/ vExpert 2009
rmontyq (Contributor)

Link status is a specific RF signal established between transmitter and receiver pairs. I can understand that the switch is not the one generating it; I guess my observation should be that we have experienced failures at a higher level/layer than link status that were not detectable until systems went down.

That is why I liked beaconing, but I now have a better understanding of how it actually works and some of the pros and cons. I need to establish a dialog with our Network Director (a Cisco freakin' guru) and see where we can go from here.

I do believe I am going to place a request for 4 quad-port NIC cards in the near future to beef up the networking. While the observation was made that 2 gigabit NICs can handle a lot of VMs, we experienced dropped packets when 18 or more VMs utilized those 2 cards, hence the config with 4 NICs and split port groups to balance the VM load. My understanding was that TCP/IP contention was the culprit.
