Hi folks,
Wondering if anyone on the boards has experienced this particular issue. A quick run-through of our environment:
7-node cluster running VirtualCenter 2.5 Update 3
Hardware: 7 x DL585 G2s (BIOS up to date)
Storage: NetApp FAS3070c - NFS mounts used for storage
Each host running ESX 3.5 Update 3 (plus 4 critical patches)
150 virtual machines running
5 vSwitches per host (each with 2 pNICs patched to 2 separate physical network switches, 2 x Catalyst 6509)
vSwitch Configuration
Load Balancing: Route based on the Originating Virtual Port ID
Network Failover Detection: Link Status only
Notify Switches: Yes
Failback: Yes
1 vSwitch SC
1 vSwitch VMotion (private VLAN)
1 vSwitch VM Network
1 vSwitch NFS (separate VLAN)
1 vSwitch VM Network (redundant)
We have run repeated physical cable checks to ensure the vmnics are patched properly; all check out and are running in their proper VLANs.
HA/DRS/VMotion all running fine.
We logged a ticket with VMware to verify our storage configuration (including the settings made from the NetApp Best Practices guide for ESX); they confirmed we are running best practices.
VMs generally running fine (no disk errors reported).
Issue:
When a single network switch goes down, around a quarter of the VMs lose access to their VMDKs (effectively, if you go on the console of an affected VM, it displays a PXE boot message in DOS). Failover of traffic from one switch to the other takes over 15 minutes on average.
To replicate the issue, both the "active" NFS vSwitch vmnic and the 10Gb fibre connection running to the NetApp filer need to be unplugged from one physical switch; the issue does not occur if just one of them is unplugged.
Tried:
Setting Failback to No
Active/Standby for the NFS vSwitch vmnics
Neither made any difference.
We have also tested in a lab environment using a single ESX host and 30 dummy VMs.
There the VMs hang for roughly 2 minutes before coming back to life; during the hang period a series of disk errors (symmpi) is reported in the event logs.
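In case it helps anyone reproduce this, here is a rough sketch of how the hang could be timed from inside a test guest rather than by eye. It is only an illustration and assumes Python is available in the guest; the function names, the file path and the 5-second threshold are all made up for the example. The idea is to write a timestamp to the virtual disk once a second, forcing it to disk each time, and then look for gaps between consecutive timestamps.

```python
import os
import time


def find_stalls(timestamps, threshold=5.0):
    """Return (start, gap) pairs where two consecutive heartbeats are
    more than `threshold` seconds apart (threshold is an arbitrary choice)."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > threshold:
            stalls.append((prev, gap))
    return stalls


def write_heartbeats(path, count, interval=1.0):
    """Append `count` timestamps to `path`, syncing each one to disk.

    If the datastore behind the guest's disk stalls, the synced write
    blocks, so the next timestamp is taken late and shows up as a gap.
    """
    stamps = []
    with open(path, "a") as f:
        for _ in range(count):
            now = time.time()
            f.write(f"{now:.3f}\n")
            f.flush()
            os.fsync(f.fileno())  # force the write through to the virtual disk
            stamps.append(now)
            time.sleep(interval)
    return stamps
```

Usage would be something like running `find_stalls(write_heartbeats("heartbeat.log", 600))` in a guest while the cables are pulled; each pair returned gives the time the stall began and how long it lasted in seconds.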
Our networking group has confirmed that PortFast is enabled on the ports and port security is disabled.
Any suggestions gratefully received - forgive the rather lengthy submission!