VMware Communities
DONROPER
Contributor

Multiple hosts losing connection to multiple VNX LUNs after creating a new LUN on the VNX and attaching it to the hosts

We have put in a new VMware 5.5 infrastructure with three new Dell R530 servers with QLogic 2662 dual-port FC cards. The servers have 256 GB of RAM, and we boot VMware 5.5 U3 from spinning disk. We use Cisco 16 Gb Fibre Channel switches, which have redundant links to a new EMC VNX5200 and to each host. We use the vCenter 6 appliance to manage VMware. The appliance resides inside the cluster on a datastore.

Here is the problem: We put in this new infrastructure, tested it, and everything looked great. It was fast and appeared stable. We created five 3 TB LUNs on the VNX and attached them to the hosts. HA is enabled. We also enabled encryption on the VNX, as well as dedupe. The LUNs we created were thin. We moved over 85 VMs to this infrastructure and everything was great. No problems for 2 weeks.

Then we needed to add an additional LUN and attach it to the hosts. We created a new 3 TB LUN and attached it to the hosts. On Host 1 and 2 we could see the new LUN. On Host 3, all the LUNs disappeared, even the new LUN, immediately when we did a refresh of storage. Nothing showed except the built-in disks. We freaked. This is not supposed to happen. We then noticed that the servers started running extremely slowly. We checked events and noticed that the LUNs were disconnecting and then reconnecting constantly, every 6 to 7 seconds. We could no longer get into vCenter because vCenter was on a LUN on Host 3. We then noticed that servers located on Host 3 became "inaccessible". We logged in as root on each individual host and still could not see the LUNs on Host 3. We shut down the vCenter server and removed it from inventory. We added it to Host 2 and vCenter came back up. That was when we noticed that servers were being migrated automatically to Host 1 and 2.

We shut down Host 3 and rebooted. Nothing. We thought it might be a bad HBA card and ordered a replacement. When it came in, we installed the new card and configured it in the Cisco and the VNX. We restarted the server. The LUNs still were not there. We moved Host 3 out of the cluster and tried to attach a new LUN to it. It worked. We moved some servers to the new LUNs. It worked. We then tried to create a new LUN and attach it to Host 1 and 2. The LUNs on Host 1 were still visible, but they disappeared on Host 2.

We got VMware, Cisco and EMC support on the phone. They all pointed at each other as the cause. We escalated to level 3. Each could prove that it was not their problem. We requested that an EMC CE be onsite the next morning and EMC obliged. The CE came in, gathered logs and sent them to a senior engineer. The senior engineer could not find a problem. The onsite CE rebooted both SPs and all the hosts could again see all the LUNs. Amazing. That was the problem.

I reattached Host 3 to the cluster. I then created a new LUN and attached it to the cluster. It worked. I then deleted that LUN and created another. No go. Host 3 immediately lost all LUNs when I did a scan. I disconnected Host 3 from the cluster immediately because servers were starting to gray out and get marked as inaccessible. This time Host 2 also lost connections to the LUNs, as we saw when we did a scan. We also noticed that scans now took several seconds to finish. The whole time, we were getting the "Lost Connection to" and "Reconnected to the" messages every 5-6 seconds. We rebooted the SPs again, and when we did the scans again the LUNs were there. We stopped at this point. Any ideas on what is going on?

bwuser2014
Enthusiast

Hello Donroper, I suggest verifying the following items:

1. Fix the speed of each port on the Cisco SAN switch; I assume it is 8 Gb.

2. Fix the speed of each front-end port on each VNX5200 SP to 8 Gb. I assume an 8 Gb FC module is installed on each SP; by default, each port is set to auto.

3. Fix the speed of the port on each HBA to 8 Gb as well.

4. Make sure the FC zoning is single-initiator zoning, e.g. each zone has only two members.

  • HBA1 -> VNXSPA0, HBA1 -> VNXSPB1 on SW1
  • HBA2 -> VNXSPA1, HBA2 -> VNXSPB0 on SW2

5. Map each VNX storage group to one ESXi host only; putting all three ESXi hosts into one storage group is not recommended.

  • Storage Group1 - ESXi host 1 -> Access LUN1, LUN2..LUNx
  • Storage Group2 - ESXi host 2 -> Access LUN1, LUN2..LUNx
  • Storage Group3 - ESXi host 3 -> Access LUN1, LUN2..LUNx

6. Make sure the HBA driver/firmware on each ESXi host is the EMC-recommended version.

7. Make sure the VNX code on the VNX5200 is 05.33.009.5.155 (latest version); it can fix issues with VMware VAAI features on ESXi 5.x hosts.
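
If it helps, items 4 and 5 can be sanity-checked on paper with a short script before touching the fabric. This is only a rough sketch in Python: the zone names, the "HBA" prefix used to spot initiators, and the storage group contents are all hypothetical, copied from the examples above rather than read from a real switch or array.

```python
# Sketch only. Zone names, member labels and storage group contents are
# hypothetical, taken from the examples in items 4 and 5 above. Nothing
# here talks to a real switch or VNX.

zones = {
    "sw1_hba1_spa0": ["HBA1", "VNXSPA0"],
    "sw1_hba1_spb1": ["HBA1", "VNXSPB1"],
    "sw2_hba2_spa1": ["HBA2", "VNXSPA1"],
    "sw2_hba2_spb0": ["HBA2", "VNXSPB0"],
}

storage_groups = {
    "Storage Group1": {"hosts": ["ESXi host 1"], "luns": {"LUN1", "LUN2"}},
    "Storage Group2": {"hosts": ["ESXi host 2"], "luns": {"LUN1", "LUN2"}},
    "Storage Group3": {"hosts": ["ESXi host 3"], "luns": {"LUN1", "LUN2"}},
}


def is_initiator(member):
    # Assumption: initiators are labelled with an "HBA" prefix.
    return member.startswith("HBA")


def bad_zones(zones):
    """Return names of zones that are not exactly one initiator plus one target."""
    bad = []
    for name, members in zones.items():
        initiators = [m for m in members if is_initiator(m)]
        targets = [m for m in members if not is_initiator(m)]
        if len(initiators) != 1 or len(targets) != 1:
            bad.append(name)
    return bad


def shared_storage_groups(groups):
    """Return names of storage groups that do not hold exactly one host."""
    return [name for name, g in groups.items() if len(g["hosts"]) != 1]


def lun_mismatches(groups):
    """Return names of groups whose LUN set differs from the first group's."""
    items = list(groups.items())
    if not items:
        return []
    reference = items[0][1]["luns"]
    return [name for name, g in items[1:] if g["luns"] != reference]


print("Zones to fix:", bad_zones(zones))
print("Storage groups with more than one host:", shared_storage_groups(storage_groups))
print("Storage groups with a different LUN set:", lun_mismatches(storage_groups))
```

Fill the two dictionaries in by hand from the switch zoning and the VNX storage group configuration; any name that gets printed is worth a second look.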

VMware vExpert 2014 to 2023 and Influencer 100