bmekler
Contributor

Please validate my planned setup

Preparing a migration + expansion, here's the list of what I have to work with:

  1. 3x Dell R710 servers with 8xGbE interfaces each (4xBCM5716C onboard, 4x Intel PRO/1000 add-on card)
  2. 2x Dell 2950 servers with 4xGbE interfaces each (2x onboard, 2x add-on card)
  3. 2x NetApp FAS2040 (one chassis with two clustered heads, 4xGbE interfaces per head) with 1xDS4243 disk shelf (total 36x300GB 15k drives)
  4. 4x 24-port gigabit switches (three are HP ProCurve E2510-24G, one is a 3COM 3824)
  5. 2x FortiGate 200B (active/passive cluster)
  6. vSphere Essentials Plus 4.1 bundle

The setup is going to run several fairly high-load websites on IIS + MS SQL (both production and development), Exchange 2007 for about a hundred users, and SQL back-ends for several internal applications. It will reside in a colocation facility; all users are remote, accessing it either via the public internet or site-to-site VPN links. IIS runs on several load-balanced servers and uses NetApp CIFS shares for shared storage.

Current plan is:

  • Two switches are designated for application traffic, named LAN1 and LAN2
  • Two switches are designated for storage traffic, named SAN1 and SAN2
  • On SAN1 and SAN2, define the following VLANs:
    • VLAN2 - NFS
    • VLAN3 - CIFS
    • VLAN21 - iSCSI1
    • VLAN22 - iSCSI2
  • On LAN1 and LAN2, define the following VLANs:
    • VLAN4 - VMotion
    • VLAN5 - DMZ1
    • VLAN6 - DMZ2
    • VLAN7 - DMZ3
    • VLAN8 - LAN
    • VLAN9 - Management
  • On SAN1 and SAN2, trunk ports 23-24 on each switch, assign VLANs 2 and 3 to the trunk, and run two cables between the switches
  • On LAN1 and LAN2, trunk ports 21-22 on each switch and assign VLAN 4 to that trunk; trunk ports 23-24, tag them, and assign VLANs 5-9 to that trunk; run four cables between the switches
  • On each NetApp head, configure networking as follows:
    • vif0 - Single mode VIF on e0a and e0b
      • VLAN2 on vif0
      • VLAN3 on vif0
    • VLAN21 on e0c
    • VLAN22 on e0d
    • Plug e0a and e0c into SAN1, e0b and e0d into SAN2
    • Assign e0c and e0d to one target portal group
  • On each vSphere host, configure networking as follows:
    • vSwitch0 - vmnic0 and vmnic4, use explicit failover order, vmnic0 active, vmnic4 standby
      This is something I'm not completely clear on: with two non-stacked switches and a link between them, should I use an explicit failover order with active/standby uplinks (one per switch), leave the default "Route based on originating virtual port ID" with both uplinks active, or something else entirely? Can I, and should I, use beacon probing for failover detection? (The full per-host layout is also sketched in code right after this list.)
      • Port group DMZ1, tagged 5
      • Port group DMZ2, tagged 6
      • Port group DMZ3, tagged 7
      • Port group LAN, tagged 8
      • VMkernel port Management, tagged 9
      • Port group Management, tagged 9
    • vSwitch1 - vmnic1 and vmnic5, use explicit failover order, vmnic1 active, vmnic5 standby
      • VMkernel port VMotion, tagged 4
    • vSwitch2 - vmnic2 and vmnic6, use explicit failover order, vmnic2 active, vmnic6 standby
      • VMkernel port NFS, tagged 2
      • Port group CIFS, tagged 3
    • vSwitch3 - vmnic3
      • Port group iSCSI1, tagged 21
    • vSwitch4 - vmnic7
      • Port group iSCSI2, tagged 22
    • Plug vmnic0 and vmnic1 into LAN1, vmnic2 and vmnic3 into SAN1, vmnic4 and vmnic5 into LAN2, vmnic6 and vmnic7 into SAN2
  • FortiGate 1 plugs into LAN1 and SAN1, FortiGate 2 plugs into LAN2 and SAN2; each trio shares a power feed, so if a feed goes down, a complete path is left intact
  • vSphere uses NFS for VMDK access
  • Every VM that needs CIFS access gets a vNIC connected to CIFS port group
  • Every VM that needs iSCSI access (SQL, Exchange) gets one vNIC connected to the iSCSI1 port group and one vNIC connected to the iSCSI2 port group, with MCS (multiple connections per session) configured across the two links
  • MSSQL is configured as a two-node cluster with one instance, nodes are kept on different vSphere hosts
  • IIS hosts are grouped into web farms with nodes on different vSphere hosts; the FortiGate is used as the load balancer and SSL proxy
  • One 2950 server (with 6x2TB SATA drives in RAID5) runs vCenter, an SMB share for VMDK backups with PHD Virtual, and an SMB share for replicating the contents of the NetApp CIFS share with robocopy; its two onboard Broadcom NICs are configured with BACS3 into an active/standby SLB team and plugged into LAN1 and LAN2, and the two add-on Intel NICs do the same with SAN1 and SAN2
  • One 2950 server (with 6x300GB SAS drives in RAID5) is a domain controller (two more virtual DCs are also present), runs a SQL Server instance used purely as a mirror target for production SQL, and runs a Microsoft iSCSI target serving LUNs for Exchange LCR; networking is the same as on the vCenter server
  • The 2950 servers have DRAC5 cards, but DRAC5 does not support VLAN tagging. I'm going to try configuring it to use the onboard BCM5708 ports (NIC selection: shared with failover) and set the LAN1 and LAN2 ports assigned to these servers to carry VLAN9 both tagged and untagged, but I'm not sure whether that will work. If it doesn't, I can fall back to the dedicated DRAC port, though that will reduce redundancy. Still, DRAC is not a critical production function. The R710s running vSphere have iDRAC6 Enterprise cards, which do support VLAN tagging.
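
To keep the uplink plan straight (and as a reference for the failover question above), here's a quick sketch in plain Python that just encodes the per-host layout from this list as data and checks that each teamed vSwitch has its active and standby uplinks on different physical switches. It doesn't touch vCenter or ESX; the dictionary names are mine, but the vmnic, vSwitch, port group and VLAN values are exactly the ones listed above.

    # Plain-Python sketch of the per-host layout above; it only encodes the
    # plan as data and sanity-checks it, it does not talk to vCenter or ESX.

    # Which physical switch each vmnic is cabled to (per the plan).
    UPLINK_TO_SWITCH = {
        "vmnic0": "LAN1", "vmnic1": "LAN1", "vmnic2": "SAN1", "vmnic3": "SAN1",
        "vmnic4": "LAN2", "vmnic5": "LAN2", "vmnic6": "SAN2", "vmnic7": "SAN2",
    }

    # vSwitch layout: explicit failover order plus port groups and VLAN tags.
    VSWITCHES = {
        "vSwitch0": {
            "active": ["vmnic0"], "standby": ["vmnic4"],
            "portgroups": {"DMZ1": 5, "DMZ2": 6, "DMZ3": 7, "LAN": 8,
                           "Management vmkernel": 9, "Management VMs": 9},
        },
        "vSwitch1": {
            "active": ["vmnic1"], "standby": ["vmnic5"],
            "portgroups": {"VMotion vmkernel": 4},
        },
        "vSwitch2": {
            "active": ["vmnic2"], "standby": ["vmnic6"],
            "portgroups": {"NFS vmkernel": 2, "CIFS": 3},
        },
        # Single-uplink iSCSI vSwitches; redundancy comes from the guest-side
        # initiator using both port groups, not from NIC teaming.
        "vSwitch3": {"active": ["vmnic3"], "standby": [], "portgroups": {"iSCSI1": 21}},
        "vSwitch4": {"active": ["vmnic7"], "standby": [], "portgroups": {"iSCSI2": 22}},
    }

    def check_layout():
        """Warn if a teamed vSwitch has both uplinks on the same physical switch."""
        for vswitch, cfg in sorted(VSWITCHES.items()):
            switches = {UPLINK_TO_SWITCH[nic] for nic in cfg["active"] + cfg["standby"]}
            if cfg["standby"] and len(switches) < 2:
                print(f"WARNING: {vswitch} has all uplinks on {switches.pop()}")
            else:
                print(f"{vswitch}: uplinks on {sorted(switches)}, "
                      f"port groups {sorted(cfg['portgroups'])}")

    if __name__ == "__main__":
        check_layout()

If the answer to the failover question ends up being "leave both uplinks active with the default port-ID policy", only the active/standby lists change; the cross-switch check stays the same.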

Possible failures that I'm accounting for:

  • If one power feed dies, one FortiGate and two switches drop and their counterparts pick up the load; CIFS and NFS connections are quickly re-established and iSCSI loses one path. All the servers, the filer chassis and the disk shelf have dual PSUs, and each power feed is sufficient to run all the gear (120V/30A)
  • If a switch or a FortiGate dies, same as above but more limited in scope
  • If a port or an entire NIC dies, same thing
  • If a host dies, MSCS fails over the SQL instance, the FortiGate load balancer detects dead members and stops sending traffic their way, and VMware HA restarts the affected VMs on the two surviving hosts
  • If a filer head or an IOM dies, the other head takes over and picks up CIFS and NFS connections as well as iSCSI sessions
  • If the entire storage eats itself for whatever reason, I have backups of all VMDKs valid to within the last 24 hours (PHD), a copy of all CIFS data valid to within the last 6 hours (robocopy), and a copy of all SQL and Exchange databases valid to within the last few seconds (mirroring/LCR); return to operations is going to take a while, though, mostly limited by the time needed to fix or procure and install new hardware
  • If the vCenter host dies, management is impacted until I spin up a new one, but production is not affected; I also lose my VMDK and file backups until they're recreated, but again, this does not affect production
  • If the physical DC host dies, I lose my SQL mirrors and Exchange LCR copies until it's replaced, but production is not affected (virtual DCs are used)
  • If I need to cold-start the entire system (say, if there was a facility power outage), I bring up physical DC first, then vCenter and NetApp, then vSphere hosts, then all the VMs
  • If I need to reboot a host (patches, hardware maintenance, etc), I use MSCS to vacate running SQL instances, shut down and migrate DCs, and VMotion everything else
  • If I need to reboot or replace an active switch, I use vif favor commands on the NetApp and rearrange active/standby adapters on the vSphere hosts prior to taking it down (a small sketch of that checklist follows this list)
  • If application data is deleted or corrupted, it can be recovered from NetApp snapshots (FAS2040 is purchased with a complete bundle, so I can use SMSQL, SME, SMBR, FlexClone, etc)
  • One more failure point that I haven't touched on yet: the facility provides only one external feed. A fifth switch is used exclusively for splitting that feed between the two FortiGates' WAN ports, and that switch (and its power feed) is a single point of failure that can take down access to the entire cluster. There is no budget for a redundant second feed (they want over $1k/month for that feature), so this is a known risk. If it does go down, the plan is to use "remote hands" at $150/incident to restore connectivity via a different path (plug the feed directly into a FortiGate, move the switch's power feed to another PDU, or move the WAN cables to another switch)
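
To go with the switch-maintenance bullet above, here's the same kind of plain-Python sketch: given the switch I'm about to take down, it prints which uplinks to flip to active on the hosts and which vif member to favor on the filer heads. It's only a checklist generator I put together from the plan; the actual changes are still made in the vSphere client and with vif favor on the NetApp.

    # Plain-Python checklist for taking one switch down for maintenance.
    # It only prints what to flip beforehand; the flips themselves are done
    # in the vSphere client (failover order) and on the filer ("vif favor").

    UPLINK_TO_SWITCH = {
        "vmnic0": "LAN1", "vmnic1": "LAN1", "vmnic2": "SAN1", "vmnic3": "SAN1",
        "vmnic4": "LAN2", "vmnic5": "LAN2", "vmnic6": "SAN2", "vmnic7": "SAN2",
    }

    # Normally-active uplink per teamed vSwitch, and its standby partner on
    # the other switch of the pair (per the plan above).
    ACTIVE = {"vSwitch0": "vmnic0", "vSwitch1": "vmnic1", "vSwitch2": "vmnic2"}
    PARTNER = {"vmnic0": "vmnic4", "vmnic1": "vmnic5", "vmnic2": "vmnic6"}

    # Single-mode vif0 members per filer head: e0a goes to SAN1, e0b to SAN2.
    VIF0_MEMBER_ON = {"SAN1": "e0a", "SAN2": "e0b"}

    def maintenance_checklist(switch_going_down):
        print(f"Before taking {switch_going_down} down:")
        for vswitch, nic in sorted(ACTIVE.items()):
            if UPLINK_TO_SWITCH[nic] == switch_going_down:
                print(f"  - on each host, make {PARTNER[nic]} active on {vswitch}"
                      f" (currently {nic})")
        if switch_going_down in VIF0_MEMBER_ON:
            surviving = "SAN2" if switch_going_down == "SAN1" else "SAN1"
            print(f"  - on each filer head, 'vif favor {VIF0_MEMBER_ON[surviving]}'"
                  f" so vif0 leaves {VIF0_MEMBER_ON[switch_going_down]}")
            print("  - expect iSCSI to run on one path until the switch is back")

    if __name__ == "__main__":
        maintenance_checklist("SAN1")

Running it for LAN1 or LAN2 covers the application-side switches the same way.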

There is no budget for a second site or a tape library with offsite tape storage to protect against site failures; this is a known risk that management is aware of.

arturka
Expert

bmekler wrote:

Artur wrote:

Looks good, just one question - the Management port group - what type of traffic will it be used for?

Some applications running in VMs need to access the management network - IPSentry for OS monitoring, MRTG to keep an eye on network load, Dell IT Assistant to process SNMP traps from hardware, that sort of thing.

So in general that type of traffic shouldn't occupy too much bandwidth, but maybe it would be better to use a different vmnic order, something like this:

    • VMkernel port VMotion, tagged 4, use explicit failover order, vmnic1 active, vmnic5 standby
    • VMkernel port Management, tagged 9, use explicit failover order, vmnic5 active, vmnic1 standby
    • Port group Management, tagged 9, use explicit failover order, vmnic1 active, vmnic5 standby

Just to make sure that if something suddenly hammers your VM mgmt network (for whatever reason), your ESX mgmt network will not be affected. What do you think?
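
In the same plain-Python notation as the sketch in the first post, the idea is just a per-port-group failover override on the vmnic1/vmnic5 pair (nothing here is new configuration, it only restates the list above):

    # Restating the suggestion: the ESX management VMkernel port rides the
    # opposite uplink from VMotion and the VM management traffic.
    FAILOVER_ORDER = {
        "VMotion vmkernel (VLAN 4)":    {"active": "vmnic1", "standby": "vmnic5"},
        "Management vmkernel (VLAN 9)": {"active": "vmnic5", "standby": "vmnic1"},
        "Management VMs (VLAN 9)":      {"active": "vmnic1", "standby": "vmnic5"},
    }

    for portgroup, order in FAILOVER_ORDER.items():
        print(f"{portgroup}: active {order['active']}, standby {order['standby']}")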

Cheers
Artur

Please don't forget to award points for "helpful" and/or "correct" answers.
VCDX77 My blog - http://vmwaremine.com
bmekler
Contributor

But then a VMotion has the potential to overload the NIC used for the management VMs; monitoring checks will fail and false-positive alerts will start flying every which way.

arturka
Expert

bmekler wrote:

But then a VMotion has the potential to overload the NIC used for the management VMs; monitoring checks will fail and false-positive alerts will start flying every which way.

Right. In your place I would do tests: provision a few VMs with full monitoring etc., and run VMotion several times with different vmnic orders.

Cheers
Artur

Please don't forget to award points for "helpful" and/or "correct" answers.

VCDX77 My blog - http://vmwaremine.com
bmekler
Contributor

They've been running for a few months now (since I P2V'ed the ancient monitoring server) on my current management network (ESX 3.5, 1 NIC management, 1 VMotion, 1 LAN, 1 DMZ, 2 storage) without any issues, so I consider it tested.
