Lost Access to Volume - inconsistent behavior across 4 ESXi servers

We have a small setup dedicated to a specific purpose:

Cluster 1 - two Dell R815 servers each with 6 x 1 GbE NICs (2 for management/vMotion, 2 for iSCSI and 2 for VM traffic)

Cluster 2 - two Dell R820 servers each with 4 x 10 GbE NICs (2 for management/vMotion/iSCSI, 2 for VM traffic)

SAN - Dell PowerVault MD3200i - shared by all 4 servers

The servers had been running ESXi 4.0 (Cluster 1) and ESXi 5.1 (Cluster 2) and were all upgraded to ESXi 5.5 several months ago.  The hypervisor is installed on embedded SD cards and all the datastores are on iSCSI LUNs.

The switches that this (and other gear) connects to are a pair of Cisco Nexus 5010 10 GbE switches along with a pair of Cisco Nexus 2148T 1 GbE fabric extenders.

The initial behavior was that VMs on Servers A and B (in cluster 1) were behaving slowly and eventually the VMs on server A became impossible to connect to.

While looking at the Event tabs on these servers we saw messages about various datastores losing connectivity and then recovering connectivity around 90-120 seconds later.  We eventually looked at servers C and D as well (in cluster 2) and found they were having similar issues.  Some datastores were having issues on multiple servers, while others were having issues on only 1-2 servers.

The messages in the Event tab looked like:

   Lost access to volume <UUID> (<Volume Name>) due to connectivity issues.  Recovery attempt is in progress and outcome will be reported shortly.

   Successfully restored access to volume <UUID> (<Volume Name>) following connectivity issues.
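For reference, the gap between each loss and recovery can be pulled out of /var/log/vmkernel.log on each host, which is where these events also appear. This is a minimal sketch of the correlation we did by hand; the log lines below are made-up samples mimicking the ESXi 5.5 message wording, not actual output from our hosts:

```python
import re
from datetime import datetime

# Made-up sample lines standing in for /var/log/vmkernel.log entries,
# matching the "Lost access" / "Successfully restored access" wording.
LOG = """\
2014-06-01T10:00:00.000Z cpu2:33290)Lost access to volume 4f8e12ab-DS01 (DS01) due to connectivity issues.
2014-06-01T10:01:45.000Z cpu2:33290)Successfully restored access to volume 4f8e12ab-DS01 (DS01) following connectivity issues.
"""

PATTERN = re.compile(
    r"^(?P<ts>\S+Z) .*?(?P<event>Lost access|Successfully restored access) "
    r"to volume \S+ \((?P<vol>[^)]+)\)"
)

def outage_durations(log_text):
    """Return {volume name: [outage length in seconds, ...]}."""
    lost = {}       # volume -> timestamp of the most recent loss
    durations = {}  # volume -> list of loss-to-recovery gaps in seconds
    for line in log_text.splitlines():
        m = PATTERN.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%fZ")
        vol = m.group("vol")
        if m.group("event") == "Lost access":
            lost[vol] = ts
        elif vol in lost:
            durations.setdefault(vol, []).append(
                (ts - lost.pop(vol)).total_seconds())
    return durations

print(outage_durations(LOG))  # the 105-second gap here is within the 90-120 s we observed
```

Running this per host made it easy to see which datastores were flapping on which servers and how long each outage lasted.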

We started looking at the virtual disks on the MD3200i and saw that some of them were not on their preferred controller.  Because most of the VMs on cluster 2 were off at this time and we had shut down most of the VMs on cluster 1 while troubleshooting, we went ahead and used the "Change -> Ownership/Preferred Path -> RAID Controller Module 1 (Preferred)" menu option to move these virtual disks back to their preferred RAID controller.  After doing so, the lost-access and restored-access messages continued to occur on all four servers.

We eventually tried to reboot, and then power cycled, the problematic server A.  Once it shut down, the lost-access and restored-access messages stopped occurring on the other three servers and remained absent once server A rebooted and was back online.  This would seem to indicate that something about the state of server A was impacting iSCSI access for the other three servers.

After getting it back online and powering on VMs, we looked at the SAN interface again and saw that those virtual disks had moved back to Controller 0 on a non-preferred path.  At this point we realized that half of the virtual disks were on non-preferred paths and all of them were being handled by Controller 0.  Checking the status of the physical components, Controller 1 showed as Online.  Refreshing the Event Log showed messages like this shortly after we moved virtual disks from Controller 0 to Controller 1:

    Enclosure 0, Slot 0:   Virtual Disk not on preferred path due to AVT/RDAC failover.

I’m looking into what that message indicates and contacting Dell Support to see if additional diagnostics need to be run to determine whether Controller 1 needs replacing, but I was wondering if anyone had any ideas about how one misbehaving server could have been impacting iSCSI LUN access for the other three servers?

I checked the switch ports for all four ESXi servers' iSCSI connections and the switch ports for all eight 1 GbE connections from the MD3200i: there were zero errors, overruns, underruns, collisions, or any other kind of bad-packet counts on any of the involved switch ports.


