VMware Cloud Community
Kimbie
Contributor

Lost access to volume following 4.1 upgrade

Setup

1 x HP c7000 Blade enclosure

3 x HP BL460cG6 with dual 5540 Xeons and 48Gb RAM, QLogic ISCSI HBA cards

3 x HP P4300 Lefthands

4 x Cisco 3020 blade switches in the back of the c7000, 2 x dedicated for iSCSI traffic

vSphere Server running 4.1

The Problem

We have just gone through the process of upgrading our vSphere server from 4.0 to 4.1 to manage a standalone ESXi 4.1 system, so our attention turned to our three blades running ESXi 4.0 U1. Using the built-in Update Manager, we downloaded the 4.0-to-4.1 upgrade file and upgraded our first blade; we did not notice any issues there, as the servers on it were low-use ones. We then upgraded the second blade and moved our primary mail server onto it.

It was when we did this that we started to get errors where people were losing their connection to the Exchange server. After some investigation of the event views, we were seeing the error:

"Lost Access to volume due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly"

Then approx 10 seconds later we get the message:

"Successfully restored access to volume following connectivity issues.

This error only occurs on the 4.1 blades; we rolled a blade back to 4.0 and the errors were not displayed, and no errors were reported for the servers on that blade. So as far as we can tell it is not a networking issue, since the iSCSI traffic for all blades flows over the same switches to the LeftHands, and we were losing connection to volumes on both LeftHands.
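
In case it is useful for anyone comparing notes, a quick way to see the corresponding messages from the host side should be something like the following (only a rough sketch; log locations vary between ESX and ESXi builds, and the grep pattern is just a guess at what to look for):

# From Tech Support Mode / SSH on an affected 4.1 host
# (assumption: vmkernel messages end up in /var/log/messages on ESXi 4.x,
#  and in /var/log/vmkernel on classic ESX)
grep -i iscsi /var/log/messages | tail -n 50

# List each storage device's paths and their current state
esxcfg-mpath -l | more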

We have a call logged with HP on this, but as of yet we cannot determine what is causing this issue, nor how to resolve it.

So any help is greatly appreciated.

Thanks

Dave

hyien
Contributor

I noticed that when the problem happens on my end, I get a bunch of

6179    13:30:51.258786    D-Link_xx:xx:xx    Spanning-tree-(for-bridges)_00    STP    60    RST. Root = 32768/0/00:xx:xx:xx:xx:xx  Cost = 2000  Port = 0x801d

frames on the switch port that my ESXi iSCSI connection is on.
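
If those RST frames line up with the disconnects, it may be worth checking that the iSCSI-facing ports are treated as edge ports so the switch doesn't put them through spanning tree transitions. On a Cisco IOS switch that would look roughly like this (the port name is a placeholder, and D-Link syntax will differ):

! placeholder port facing an ESXi iSCSI vmknic
interface GigabitEthernet0/1
! go straight to forwarding after a link flap instead of listening/learning
spanning-tree portfast
! stop sending or processing BPDUs on this edge port
spanning-tree bpdufilter enable
exit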

fcarballo
Contributor

This week I upgraded my environment from vSphere 4 to 5 and ESXi 4.1 to 5. Nothing has changed.

In my SAN environment I have three virtual disks in a single host group. Besides my four ESXi hosts, I have a CentOS host used as a mail server, but not in production. It works more like a "backup"; the end users don't access it directly.

Dell told me that this was "probably" the cause of my connectivity issues, because the CentOS host was locking the LUNs and causing the problem. The suggested solution was to create another host group just for my CentOS host and the virtual disk it uses, so that the ESXi hosts could no longer "see" this virtual disk, just as CentOS could not see the virtual disks used by the ESXi hosts as storage.

Unfortunately it has not solved my problem. Dell was wrong, again...

I'm waiting for the next chapter of this drama.

Regards,

Felipe Carballo - VCP5 - VCP-Cloud
kiwijj
Contributor

Hi Guys,

Dell have given me the configuration for our switches (Dell 5424s) and I will apply it this weekend.

They said to run the delete startup-config command to reset the switch back to factory defaults, and then run the following commands to configure the switch:

(Our iSCSI gear is plugged into ports 1 to 12; you can use any VLAN number, and you need to change the IP address to your own switch address.)

spanning-tree mode rstp
interface range ethernet g(1-12)
spanning-tree portfast
exit
interface range ethernet g(1-12)
flowcontrol on
exit
port jumbo-frame
interface range ethernet g(1-12)
switchport mode general
exit
vlan database
vlan 11
exit
interface ethernet g1
switchport general pvid 11
exit
interface ethernet g2
switchport general pvid 11
exit
interface ethernet g3
switchport general pvid 11
exit
interface ethernet g4
switchport general pvid 11
exit
interface ethernet g5
switchport general pvid 11
exit
interface ethernet g6
switchport general pvid 11
exit
interface ethernet g7
switchport general pvid 11
exit
interface ethernet g8
switchport general pvid 11
exit
interface ethernet g9
switchport general pvid 11
exit
interface ethernet g10
switchport general pvid 11
exit
interface ethernet g11
switchport general pvid 11
exit
interface ethernet g12
switchport general pvid 11
exit
interface range ethernet g(1-12)
switchport general allowed vlan add 11 untagged
exit
iscsi target port 860 address 0.0.0.0
iscsi target port 3260 address 0.0.0.0
no iscsi enable
interface vlan 1
ip address 192.168.1.10 255.255.255.0
exit
username admin password 5f4dcc3b5aa765d61d8327deb882cf99 level 15 encrypted
snmp-server community Dell_Network_Manager rw view DefaultSuper

Josh26
Virtuoso

This configuration basically says:

Enable portfast

Enable jumbo frames

Enable flow control

Use a trunk port, where the only allowed VLAN is the native VLAN.

There isn't anything in this that's likely to correct the issue, or that hasn't been tried before, although the last item is a bit unusual.

I would also make sure you change the password and SNMP community string from the ones that have just been posted on a public forum.
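
For what it's worth, the more conventional way to get the same untagged VLAN 11 behaviour on a 5424 would be plain access ports, roughly like this (same port range and VLAN as the posted config, so treat it as a sketch rather than Dell's recommendation):

interface range ethernet g(1-12)
switchport mode access
switchport access vlan 11
exit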

Darin777
Contributor

Kiwijj: I'm curious whether you were able to make the switch change, and what the outcome was?

Skyward-willw: I'm curious whether things are still fixed since your last post?

kiwijj
Contributor

Hi Darin,

I have been off sick for a few days, so I have not managed to make the changes yet. The other issue I have is that I cannot get a serial connection to the switch to work, and as Dell want me to wipe the configuration, I need that working. These switches do not have a default IP address, so you need the serial connection to work. It may be the cable; I will bring in a multimeter and check whether the cable is a crossover or not. I will update you once it has been done.

cheers,

JJ

Skyward-willw
Contributor

Been out of the office for the last week.  Checked the Events and I don't see any issues - haven't been getting any complaints either.  Everything looks good for us.

The really odd thing was that we had two separate arrays, and only the machines connecting to the MDL drives had this issue. The servers connecting to the SAS array didn't have any issues.

My issue is definitely resolved at this point...we'll see what happens when I upgrade to vSphere 5 this week.

somethingelse4
Contributor

I had an issue with the same symptoms, so I am posting my solution; I hope this helps people.

I have a Dell PE R610 with Intel X520-T2 NIC (dual 10Gb) running ESXi 5, connecting to a Dell PV MD3620i through a Dell PC8024 (10Gb, L3).

I set up 2 separate iSCSI VLANs on the PC8024 switch. My mistake was assigning the VLANs an IP address and then configuring the gateways on the MD3620i SAN to point at the iSCSI VLAN IPs. VMware does not set a gateway for iSCSI traffic (as far as I can tell).

So basically what happened was that VMware and the SAN were trying to send iSCSI traffic through different routes, hence the disconnections. The solution is simply to not assign any IP to the VLAN interfaces on the switch and to remove the gateway addresses from the iSCSI adapters on the SAN. This worked for me.

Longer explanation for those who care...

Keep in mind that the PC8024 is a layer 3 switch, so it will attempt to route your traffic if you have configured the SAN to use the VLAN IP as its gateway. I'm pretty sure that at least a few people who haven't found a solution to this problem yet may have made this mistake. VMware and Dell can't reproduce this because they know better than to route iSCSI traffic through different gateways :smileysilly:

Before I made this discovery, I tried a lot of things: enabling/disabling jumbo frames, flow control, RSTP, ensuring storm control for unicast was disabled...nothing helped. I was almost certain the Intel X520 was causing the issue, since it's not listed on the VMware HCL as an iSCSI adapter, so I rewired to the R610's 1Gb Broadcom adapters, which are on the HCL...same behaviour. As a last-ditch effort, I quickly configured the same iSCSI VLANs on our 1Gb 3Com 2952 switch (which supports static routing), and I DID NOT ASSIGN AN IP TO THE iSCSI VLANs...and everything worked great! Knowing the only difference in the switch configs was the IP on the VLAN interfaces, I wired everything back to the 10Gb switch/NIC, removed the VLAN IP, and now everything runs well and stable :smileycool: ... 10Gb, jumbo frames, multipath round robin and all :smileygrin:
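
If anyone wants to confirm from the ESXi side whether their iSCSI traffic is staying on the local subnet rather than being routed, a quick check from the host shell would be something like the following (the IP below is just a placeholder for one of your SAN's iSCSI ports):

# Ping a SAN iSCSI port over the VMkernel network stack and watch the latency
vmkping 192.168.50.20

# Show the VMkernel routing table; the iSCSI subnet should appear as
# "Local Subnet" rather than being reached via the default gateway
esxcfg-route -l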

fcarballo
Contributor

After all the time spent with Dell, my team and I realized that every time we had a problem, a specific virtual machine was always involved. So we moved this virtual machine to another environment and the issue stopped.
I don't know why this virtual machine was causing the connectivity issue; maybe a problem during deployment, or something else.
We upgraded our environment to vSphere 5 in an attempt to solve the issue, but the real solution was migrating the problematic virtual machine.

Felipe Carballo - VCP5 - VCP-Cloud
kiwijj
Contributor

Hi,

I finally made the changes to the iSCSI switches suggested by Dell over the weekend. I will now monitor and see if I still get iSCSI disconnects in ESX.

cheers,

JJ

Krede
Enthusiast

DDunaway-> Did you manage to solve this on your vSphere + EVA environment?

I think that we have the same problem.

kiwijj
Contributor

Hi,

I made the changes to the switch configurations suggested by Dell, but it has not made a difference. In fact, last Friday night one of the controllers decided to reboot itself, and this is also starting to happen on a regular basis. I went back to Dell, who want the firmware on the MD3000i RAID controllers updated, and they have also sent the issue up to engineering. As usual, we will play the waiting game.

JJ

somethingelse4
Contributor

Can you guys check and ensure there is no gateway IP address set on the SAN? You are not routing the iSCSI traffic, so the gateway IP should be blank (or just 0.0.0.0).

Also, remove the VLAN settings from the SAN controllers if you set any. The controllers don't need this setting, since they are plugged directly into access ports and are not trunking to other VLANs.

This was still causing issues for me even after I removed the VLAN interface IPs from the PC8024 switch. After removing the gateway IP and the VLAN tags on the MD3620i controllers, everything started to work and I haven't had an issue since.

kiwijj
Contributor

Hi,

I just checked the SAN and there are no VLANs set. Now, as for the gateway, there are two places to set a gateway on the MD3000i.

Under the iSCSI tab, Configure iSCSI Host Ports, we have no gateway set, i.e. it is 0.0.0.0,

and under Configure Ethernet Management Ports there is a gateway set. Are you saying to change this to 0.0.0.0?

(attachment: Capture1.JPG)

Under the Tools tab, Configure Ethernet Management Ports, the RAID Controller Module has a gateway set.

(attachment: Capture.JPG)

somethingelse4
Contributor

Hi kiwijj,

I meant only on the iSCSI tab. You need the gateway on the management tab, since you are likely to manage the SAN from other VLANs, so your settings on the SAN are correct.

I found there is a bug in the MD3620i firmware where the network settings were not applied correctly (I discovered this while messing around with the iSCSI gateway and VLAN settings). If I changed only the gateway IP address on an iSCSI controller, the change did not get applied properly; for all the changes to be applied, I needed to change the VLAN value as well. So can you try this quickly: leave the iSCSI IP settings as they are now, but apply a VLAN (any VLAN, it doesn't have to be your actual one) to the iSCSI controllers. Then go back, unset VLAN support on the iSCSI interfaces, and apply the settings again.

I discovered this on the newest MD3620i firmware; I'm really not sure how similar the MD3000i firmware is, as I've never used it.

kiwijj
Contributor

Hi,

Thanks for that. I will change the VLAN setting this weekend while I am upgrading the firmware, as it requires a controller port reboot for the VLAN change to take effect.

cheers,

JJ

fcarballo
Contributor

I think the gateway doesn't make any difference, because the hosts have an IP in the same subnet. The ESXi host has no reason to send the packets to the gateway.

Felipe Carballo - VCP5 - VCP-Cloud
somethingelse4
Contributor

Logically you are correct, and I agree with you.

But in my case, this is what solved the problem. When I had the gateway set on the controllers and an IP on the VLAN interfaces on the switch, my ping times were 1.2-1.4 ms and I was experiencing the random disconnection issues. After I removed the VLAN IP addresses on the switch and the gateways from the iSCSI controllers on the SAN, ping times dropped to 0.2-0.4 ms and there have been no more disconnections.

Note that VMware wasn't at fault in my case; everything on the VMware side was fine. It was the adjustments I made on the switch and on the SAN's iSCSI controllers that made the difference. And as I mentioned, there are definitely some bugs in the MD3620i firmware that I am able to reproduce, so it appears that having the gateway set on the Dell iSCSI controllers was tripping up the connection one way or another (I haven't tried analyzing the traffic with Wireshark or the like, but that might have shed some light on the issue).

Darin777
Contributor

An update on my findings thus far. R710 hosts, PowerConnect 5424 switches, MD3000i on the latest firmware. We were experiencing lost access to LUNs with 4.1, and even tested 5.0 with the same results. I rebuilt both hosts with 4.1 U2; with U2, the events include additional messages that the path was degraded before access to the LUN is lost (access is then restored within 10 seconds or so).

We found that the issue only occurs during periods of low I/O with Round Robin paths; it never occurs with MRU. Both Dell and VMware reviewed the config and said everything was set up correctly. After some troubleshooting I finally tried setting the iSCSI devices' path selection to type IOPS with IOPS=1. This post is probably the best reference I can pass on: http://www.yellow-bricks.com/2010/03/30/whats-the-point-of-setting-iops1/

With type IOPS and IOPS=1 or 10 (I tried both), the issue has yet to occur while testing over the last month. If I change the devices back to the default policy, the issue starts occurring again, up to a dozen times a day. I'm going to open a new ticket with VMware support with my findings, but I wanted to pass this along in case it helps someone experiencing the same issue.
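
In case it helps anyone else try this, the IOPS change can be made per device from the CLI. A rough sketch (the naa ID below is just a placeholder for your own LUN, and the device must already be using the Round Robin policy):

# ESX/ESXi 4.1: list devices to find the naa ID, then switch paths on every I/O
esxcli nmp device list
esxcli nmp roundrobin setconfig --device naa.6001234567890abcdef --type iops --iops 1

# ESXi 5.x equivalent
esxcli storage nmp psp roundrobin deviceconfig set --device naa.6001234567890abcdef --type iops --iops 1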

flash600
Contributor

Hi,

I had the same problem with the storage.

We have an HP EVA 4400 and the storage is connected over FC. Whenever I tried to migrate a VM from eva-storage1 to eva-storage2, the error occurred. Now I have replaced the FC cable and the error is gone (I've migrated 2 VMs without an error). I will keep testing, but for the moment the error is gone.
