My environment:
5 HP DL380 G6 servers, each with 60 GB RAM and 8 network ports
- 4 onboard NICs
- vmnic0, vmnic1, vmnic2, vmnic3
- All NICs go to switch 1
- PCI card w/ 4-port NIC
- vmnic4, vmnic5, vmnic6, vmnic7
- All NICs go to switch 2
Fibre Channel-attached SAN through Emulex HBAs
My network config is as follows:
vswitch0: Service Console Only
Contains vmnic0 and vmnic7
vswitch1: vmotion network, service console 2
Contains vmnic1 and vmnic6
vdistributed switch: Production Server Network
Contains vmnic2 and vmnic5
So, as you can see, I have redundant network connections from each vSwitch to each physical switch. The physical switch layout is:
Cisco 2350 in each server rack (hence switch 1 and switch 2 etc...)
uplink to a Cisco 6506 core via a 20 Gb EtherChannel from each 2350
One ESX server, esx10, is connected to switch 1 and switch 2.
The other ESX servers, esx01, esx02, esx03, and esx04, are connected to switches 3, 4, 5, and 6.
I recently had one of the 2350 switches that connects esx10 go down. Instead of esx10 simply failing over to its other NICs, the rest of the cluster saw esx10 as down and performed an HA failover of the guests that were on it, resulting in downtime as those servers rebooted on the other ESX hosts.
My question, maybe problem, is: why didn't the connections fail over instead of HA restarting the guests? The failure was in the switch's EtherChannel uplink, not the switch itself, so the physical NIC links were still up. Shouldn't there be some kind of heartbeat to account for downstream network failures? I put in all this network redundancy precisely for failures like this, but it all seems moot when I still get downtime from an HA restart anyway.
Any ideas are greatly appreciated. Thanks!
By default, only the link state is tracked. Depending on how many NICs are involved, this issue may be avoided by configuring either Beacon Probing on the ESX host or Link State Tracking on the physical switch (see http://kb.vmware.com/kb/1005577).
André
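For reference, Beacon Probing is a per-vSwitch NIC teaming setting (the checkBeacon flag in the teaming failure criteria), and it can be enabled programmatically rather than through the vSphere Client. A minimal pyVmomi sketch, assuming a reachable vCenter and ESX host; the hostnames, credentials, and the vSwitch name "vSwitch0" are placeholders, not values from this thread:

```python
# Sketch: enable Beacon Probing on a standard vSwitch via pyVmomi.
# All hostnames/credentials below are illustrative assumptions.
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret")
content = si.RetrieveContent()

# Look up the ESX host by DNS name.
host = content.searchIndex.FindByDnsName(dnsName="esx10.example.com",
                                         vmSearch=False)
net_sys = host.configManager.networkSystem

for vss in net_sys.networkInfo.vswitch:
    if vss.name == "vSwitch0":
        spec = vss.spec
        # checkBeacon=True switches failure detection from link-state-only
        # to beacon probing across the teamed uplinks, so a dead upstream
        # path is detected even while the local link stays up.
        spec.policy.nicTeaming.failureCriteria.checkBeacon = True
        net_sys.UpdateVirtualSwitch(vswitchName="vSwitch0", spec=spec)

Disconnect(si)
```

Note that VMware's guidance is that beacon probing works best with three or more uplinks in the team; with only two, a failed probe cannot tell which of the two paths is broken, which is presumably why the answer above says "depending on how many NICs are involved."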
Discussion moved from VMware ESX™ 4 to Availability: HA & FT
