VMware Cloud Community
Kimbie
Contributor

Lost access to volume following 4.1 upgrade

Setup

1 x HP c7000 Blade enclosure

3 x HP BL460c G6 with dual Xeon 5540s and 48 GB RAM, QLogic iSCSI HBA cards

3 x HP P4300 LeftHand nodes

4 x Cisco 3020 blade switches in the back of the c7000, 2 x dedicated for iSCSI traffic

vSphere Server running 4.1

The Problem

We have just gone through the process of upgrading our vSphere server from 4.0 to 4.1 so it could manage a standalone ESXi 4.1 system, so our attention turned to our 3 blades running ESXi 4.0 U1. Using the built-in Update Manager, we downloaded the 4.0-to-4.1 upgrade file and upgraded our first blade, and we did not notice any issues, as the servers running on it were low-use ones. We then upgraded the second blade and moved our primary mail server onto it.

It was when we did this that we started to get errors where people were losing their connection to the Exchange server. After some investigation we checked the Events view and were seeing the error:

"Lost Access to volume due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly"

Then approx 10 seconds later we get the message:

"Successfully restored access to volume following connectivity issues.

This error only occurs on the 4.1 blades. We rolled back one blade to 4.0 and the errors were no longer displayed, and no problems were reported with the servers on that blade. So as far as we can tell it is not a networking issue, since the iSCSI traffic for all blades flows over the same switches to the LeftHands, and we were losing connection to volumes on both LeftHands.
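
If it helps anyone looking at the same thing, the quickest checks we know of are run from the ESXi console (Tech Support Mode) while the events are firing; just a rough sketch, adjust for your own setup:

# compact listing of every device and its paths (-l gives the long form with full details)
esxcfg-mpath -b

# follow the vmkernel messages live while the "lost access" events appear
# (on ESXi 4.x these end up in /var/log/messages)
tail -f /var/log/messages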

We have a call logged with HP on this, but as of yet we cannot determine what is causing this issue, nor how to resolve it.

So any help is greatly appreciated.

Thanks

Dave

86 Replies

MaximZ
Contributor

Hi,

Dell specialists reviewed the configuration and found nothing wrong.

However, tonight I had the outage again 😞

At this moment Dell suspect we have too many iSCSI sessions, so the errors may be related to the disk array (MD3000i) dropping connections.

We have another conference call today to check this idea.

-- Maxim

AlexLudwig
Contributor

Update:

We updated our blade enclosure and all blades. Nothing better. We went back to version 4.1 build 260247, still no better. Don't know what to do next... maybe we'll try ALUA or Fixed to access our LUNs, because we updated our CX4 to FLARE 29.
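
If we do try Fixed, I think the first step is checking which SATP and path policy the LUNs are currently claimed with; a rough sketch of what I'd run (the naa ID below is just a placeholder, and switching the CX4 to ALUA would also mean changing its failover mode on the array side):

# show every device with its storage array type plugin (SATP) and path selection policy (PSP)
esxcli nmp device list

# switch one LUN to the Fixed policy for testing (replace the naa ID with a real one)
esxcli nmp device setpolicy --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_FIXED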

AlexLudwig
Contributor

We switched to ALUA like we were told to by VMware. Now it's worse than ever... 😞 And not only on our blade enclosure; we get the same problems on our Fujitsu RX300 S4 as well. So the failure has to be with VMware or EMC, and now both say that everything is green on their side... hate that...

GeorgeVolentir
Contributor

We also are experiencing a similar problem:

ESXi 4.1 Update 1 on two Dell m610 blade servers

EqualLogic PS6010XV and PS6010E iSCSI storage arrays

Dell CX300 FC storage array

Dell PowerConnect 8024F switches

Firmware on all devices is current

Our system successfully operated for more than a year with the FC array and ESX 4.0.  The iSCSI component was added in January 2011, and we have had a ticket open with Dell ever since.  We have sent countless series of logs to Dell and VMware without success.  We have essentially dedicated an FTE to the problem for 3 months now.

We made some progress a few weeks ago by removing the FC from the equation, but the network flapping ultimately returned.

One firm workaround that we have found is to disable jumbo frames in ESXi. In our configuration:

# drop the iSCSI vSwitch back to the standard 1500-byte MTU, then list to confirm
esxcfg-vswitch -m 1500 vSwitch2
esxcfg-vswitch -l

# drop both iSCSI vmkernel ports back to 1500 as well, then list to confirm
esxcfg-vmknic -m 1500 iSCSI1
esxcfg-vmknic -m 1500 iSCSI2
esxcfg-vmknic -l

This isn't preferred as we truly believe that there is still an underlying issue, but it does calm down traffic.
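
If we ever turn jumbo frames back on, this is roughly how I'd first confirm that 9000-byte frames actually make it end to end (the address below is just a placeholder for one of the array's iSCSI portals):

# 8972 bytes of payload = 9000-byte MTU minus 28 bytes of IP/ICMP headers;
# -d sets the don't-fragment bit so the ping fails if any hop can't pass a full jumbo frame
vmkping -d -s 8972 192.168.10.10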

kiwijj
Contributor

Hi,

Just wanted to say that we have the same problem. The problem occurred with ESX 4.0 Update 2. All hosts have since been upgraded to ESXi 4.1 Update 1, as I understand that iSCSI is better with 4.1.

Connecting to a Dell MD3000i iSCSI storage array.

Multipathing has been set up using round robin: connected targets = 4, devices = 3, paths = 12 for each of the 5 hosts.

I have opened support calls with both VMware and Dell and these have been open for two months now. VMware think it's storage and Dell says they can't find anything wrong with the storage. So basically they are both saying they can't see anything.

The problem is very random: different hosts at different times, different LUNs. Sometimes access is restored within the timeout and nothing is affected; at other times random VMs lose their disk lock and power off. I have checked the switch logs and nothing shows up there.
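
When it next happens I plan to grab the path state from the affected host straight away; roughly something like this from the console:

# shows each LUN with its path selection policy (should be VMW_PSP_RR for round robin)
# and the list of working paths at that moment
esxcli nmp device list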

VMware have now requested a conference with Dell so shall see what happens with that.

Having our production VMs randomly power off is causing us big headaches.

regards,

John

Maynard94
Contributor

We have a very similar setup. Did you find a resolution, or are you still running a 1500 MTU? Any recommendations or successful configuration changes would be helpful. -Thanks.

GeorgeVolentir
Contributor

No resolution yet.  The Dell engineers have now seen this a few times and are building a lab with identical equipment.  In the meantime, they've recommended that we bump the MTU up in 500-byte increments until the problem recurs, but we're holding at 1500 until a firm resolution is found.
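
For anyone following along, bumping the MTU is just the same pair of commands we used to drop to 1500, paired with a vmkping test at each step; a rough sketch using our names (the target address is a placeholder):

# step the vSwitch and both iSCSI vmkernel ports up to the next MTU value
esxcfg-vswitch -m 2000 vSwitch2
esxcfg-vmknic -m 2000 iSCSI1
esxcfg-vmknic -m 2000 iSCSI2

# confirm frames of that size pass with the don't-fragment bit set
# (payload = MTU minus 28 bytes of IP/ICMP headers)
vmkping -d -s 1972 192.168.10.10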

AlexLudwig
Contributor

Am I the only one with these problems on Fibre Channel?! It seems to me that there is a huge bug in VMware's ESXi, but the only thing they'll tell me is that the problem is with the storage vendor...

hyien
Contributor

I'm experiencing a similar issue too, after upgrading to ESXi 4.1 Update 1 (from ESXi 4.0 Update 2) with patch ESXi410-201104001 applied. I'm on an iSCSI connection using SANmelody 3.0 PSP4 Update 2.

fcarballo
Contributor

I'm having the same issue, and it started after the last update to ESXi 4.1. We have Dell R610 servers and an MD3200i array.

Felipe Carballo - VCP5 - VCP-Cloud
kiwijj
Contributor

Hi,

Our issues with this are continuing and have been for 6 months. Both Dell and VMware have been looking at this with no resolution so far. We have tried numerous settings changes suggested by Dell and VMware, upgraded this and that, tried this and that, and sent so many copies of logs. Very frustrating that this cannot be resolved. It looks like they do not have a clue, and I am getting sick of coming in after hours to implement their latest "here, try this".

John

fcarballo
Contributor

It's really frustrating. I did everything that Dell told me to do: enabled flow control on the switch, disabled spanning tree on the switch ports that are connected to the storage, and even did a firmware upgrade on my switches.
The issue persists.
My hope is that an upgrade to ESXi 5 and vSphere 5 may help me solve this problem.
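
In case it's useful to anyone, this is roughly how I checked that flow control actually ended up enabled on the ESXi side as well (ethtool is available in Tech Support Mode on 4.1 as far as I know; the vmnic names are just examples from my hosts):

# show the current pause (flow control) settings on the iSCSI uplinks
ethtool -a vmnic2
ethtool -a vmnic3

# enable rx/tx pause frames on an uplink if they show as off
ethtool -A vmnic2 rx on tx on

As far as I can tell the ethtool change does not survive a reboot, so the switch-side configuration still has to be the real fix.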

Thank you so much for your feedback.

Felipe Carballo - VCP5 - VCP-Cloud
Maynard94
Contributor

Our environment consists of EqualLogic PS4000 arrays, Dell R710 servers, and Dell 6224 iSCSI switches. While running firmware 4.x on our arrays we had been getting these errors and had some guest VMs actually become unresponsive because of the error. After updating to version 5.0.x of the array firmware we have had no issues in 180+ days.

That being said, we are on an extremely old version of the switch firmware and we are running ESXi 4.1 build 260247 on the hosts. I would suspect that your issues are with your storage firmware or your version of ESX.

Do you have the ability to deploy a previous version of ESX or ESXi in your environment for testing purposes? If so it may help your Support Engineers at Dell or VMware narrow down the cause of the errors.

hyien
Contributor

Update: Have applied ESXi410-201107001 and still getting the same problem.

kiwijj
Contributor

Hi,

Right, after having a call open for 6 months with Dell and VMware, the result is:

VMware says it's the storage and will not take it any further.

Dell says it's VMware, but at least they are still looking at it.

Thanks VMware, must be time to start looking at Hyper-V.

John

fcarballo
Contributor

I have a ticket open with Dell to solve this.

What I can't understand is that we have two MD3200i arrays and four R610 hosts. All four hosts access both MD3200i arrays, but the hosts only lose connection to the LUN on one of the MD3200i arrays.

My team and I have investigated, and it is not due to a network fault.

We will wait to hear what Dell has to say.

Felipe Carballo - VCP5 - VCP-Cloud
kiwijj
Contributor

We have one MD3000i with an attached MD1000, and 5 hosts (2950s) running ESXi 4.1.

When I ran everything through one controller on the MD3000i it affected only host 2.

When I ran everything through the other controller it only affected host 4.

Three servers (Production) are in vCenter, two are standalone. No problems with the standalone hosts.

I was at a Dell conference yesterday and talked with a senior technical guy who suggested the issue was most likely the switch configuration. I said Dell support had checked the switch and okayed it. He said he would send me what it should be configured as in the real world. I will try that when I get it (he is still doing conferences) and let you know.

fcarballo
Contributor

Dell support instructed me to enable flow control on my PowerConnect switches and disable spanning tree on the ports that are connected to the MD3200i controllers.

But yesterday they told me that the switch's log has some spanning tree errors, and they will investigate that.

Felipe Carballo - VCP5 - VCP-Cloud
Josh26
Virtuoso

If the SAN expects flow control, having it on the switch is a good idea.

Disabling spanning tree anywhere, however, is a terrible idea.

fcarballo
Contributor

Disabling spanning tree on the switch ports that are connected to the storage array is a recommendation from Dell technicians.

Felipe Carballo - VCP5 - VCP-Cloud