Kimbie
Contributor

Lost access to volume following 4.1 upgrade

Setup

1 x HP c7000 Blade enclosure

3 x HP BL460c G6 with dual Xeon 5540s, 48 GB RAM, and QLogic iSCSI HBA cards

3 x HP P4300 Lefthands

4 x Cisco 3020 blade switches in the back of the c7000, 2 x dedicated for iSCSI traffic

vSphere Server running 4.1

The Problem

We have just gone through the process of upgrading our vSphere server from 4.0 to 4.1 in order to manage a standalone ESXi 4.1 system, so our attention turned to our 3 blades running ESXi 4.0u1. Using the built-in Update Manager, we downloaded the 4.0-to-4.1 upgrade file and upgraded our first blade, and we did not notice any issues, as the servers placed on it were low-use ones. We then upgraded the second blade and moved our primary mail server onto it.

It was when we did this that we started to get reports of people losing connection to the Exchange server. After some investigation we checked the event views and were seeing the error:

"Lost Access to volume due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly"

Then, approximately 10 seconds later, we get the message:

"Successfully restored access to volume following connectivity issues.

This error only occurs on the 4.1 blades; we rolled one blade back to 4.0 and the errors were no longer displayed, and no problems were reported with the servers on that blade. As far as we can tell it is not a networking issue, since the iSCSI traffic for all blades flows over the same switches to the Lefthands, and we were losing connection to volumes on both Lefthands.
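
(Side note: these vCenter events can be correlated with the host-side vmkernel log. A rough sketch of the kind of check I mean, assuming an ESXi 4.x host where vmkernel messages go to /var/log/messages, the default:

# look for iSCSI session drops or paths marked dead around the event timestamps
grep -i iscsi /var/log/messages | tail -50

)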

We have a call logged with HP on this, but as of yet we cannot determine what is causing this issue, nor how to resolve it.

So any help is greatly appreciated.

Thanks

Dave

86 Replies
Skyward-willw
Contributor

Sorry, I noticed this was started for a different SAN model - post deleted.

fcarballo
Contributor

I once had the same issue. Try pulling out the controller that is presenting the problem, waiting a minute, and putting it back in.

However, my connectivity loss problem persists. I'm starting to think that my MD3200i has a hardware problem.

Felipe Carballo - VCP5 - VCP-Cloud
kiwijj
Contributor

Hi WillW,

I don't think it matters what SAN model you have; the problem is still the same. A virtual disk not being on its preferred path is a common issue and is easily resolved by using the MDSM and selecting the Support tab, Manage RAID Controller Modules, Redistribute Virtual Disks.

Hi fcarballo,

Dell have got back to me and asked me to turn on Spanning Tree mode (RSTP). This is after they asked me to turn it off a few months ago!

Another person from Dell sent me a document which says to turn it off. I have gone back to them and asked, well, which is it, on or off? I am still waiting.

VMware have said it's nothing to do with them and have washed their hands of it. At least Dell keep going. I am starting to think it is something to do with the switch configuration. I just wish Dell could give a definitive answer on how the switches should be configured.

Flow Control is enabled on the switches and is enabled by default on the NICs in VMware. Dell have also said to turn Jumbo Frames on. I have turned them on on the switches and on the SAN, but I need to delete and re-add the iSCSI VMkernel ports on the ESXi hosts to turn on Jumbo Frames, which involves taking the hosts down to redo all the iSCSI networking. I will do that over the next few days and see how it goes.

I am trying to talk Dell into sending someone onsite to check that the configuration of all the equipment is correct, to try and isolate the issue.

This just goes on and on ... into the seventh month and still counting

Skyward-willw
Contributor

I'm aware of that, but the SAN just moves it again within a couple of minutes. No idea why, other than the error in the Event Log about it not being on the preferred path. I'm not very impressed with this SAN so far, for more reasons than this. Let us know if you make any progress; it sounds like we aren't alone in this issue (which surprises me, given that Dell doesn't have any resolution for it).

Darin777
Contributor

Hi kiwijj,

A VMware tech used the following commands to change the MTU without needing to rebuild the networking. It saves time, so I hope that helps.

  1. Putty into host
  2. Run esxcfg-vswitch -l
  3. Run esxcfg-vswitch -m 9000 vSwitchName
  4. Confirm switches at expected MTU using esxcfg-vswitch -l
  5. Run esxcfg-vmknic -l
  6. Run esxcfg-vmknic -m 9000 PortName
  7. Confirm using esxcfg-vmknic -l

(vSwitch and Port names are case sensitive)

We are trying to switch from Hyper-V to ESXi 4.1 and are experiencing the same "Lost access to volume" errors when using Round Robin. It has yet to occur when using MRU. VMware also says everything is good and that it's a Dell problem. We've been working with Dell for months with no resolution in sight. We're using 5424 switches and have tried both Intel and Broadcom NICs. I expect it might be the switch. I'm willing to compare notes with anyone interested in doing so.
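
(If anyone wants to flip a test datastore between the two policies without the GUI, ESX/ESXi 4.1 can set the path selection policy per device from the CLI. A sketch; the naa ID below is a placeholder for your own device:

# list devices and their current path selection policy
esxcli nmp device list

# switch a device to MRU, or back to Round Robin
esxcli nmp device setpolicy --device naa.6001ec9... --psp VMW_PSP_MRU
esxcli nmp device setpolicy --device naa.6001ec9... --psp VMW_PSP_RR

)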

kiwijj
Contributor

Hi Darin,

Thanks for that, but the command "esxcfg-vmknic -m 9000 PortName" does not work if the NIC is already configured; it just comes up with an error.

I had to delete the VMkernel port from the vSwitch and then re-add it using the following commands:

esxcfg-vswitch -A iSCSI1 vSwitch1

esxcfg-vmknic -a -i xxx.xxx.xxx.xxx -n 255.255.255.0 -m 9000 iSCSI1
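
(For reference, the delete step before those two commands looks something like this; a sketch, assuming the port group is iSCSI1 on vSwitch1:

# remove the VMkernel NIC first, then the port group it lived on
esxcfg-vmknic -d iSCSI1
esxcfg-vswitch -D iSCSI1 vSwitch1

)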

So Jumbo Frames are now set on the SAN, switches, and ESXi hosts.
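
(A quick way to confirm jumbo frames work end to end is vmkping with a jumbo-sized payload; a sketch, with the SAN's iSCSI portal IP as a placeholder, and the -d don't-fragment flag, where available, making the test stricter:

# 8972 bytes of payload + 28 bytes of ICMP/IP headers = a 9000-byte frame
vmkping -d -s 8972 xxx.xxx.xxx.xxx

If that fails while a plain vmkping succeeds, something in the path is still at 1500.)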

Still waiting on Dell to get back to me regarding the proper switch configuration.

Are you saying that the problem goes away if you use Most Recently Used (VMware) instead of Round Robin (VMware)?

I'd be interested in comparing notes. Which country are you in?

cheers,

John

Skyward-willw
Contributor

I have a hard time believing that this is a switch config issue. We had a different iSCSI solution in place before the MD3200i, on the same switches, and never had this issue. This *has* to be something related to the MD3200i. I think I might try a direct-connect test today to see if it resolves the issue. Ours has the Lost Access issue no matter what the pathing method is (Round Robin, MRU, Fixed, etc.); they all cause it.

fcarballo
Contributor

In my case I have two Dell MD3200i arrays connected with multipathing through two Dell PowerConnect switches.

All four of my hosts have access to both arrays, but they only ever report connectivity issues with the same one. Both arrays were bought together, and both have the same firmware version.

Yesterday I convinced Dell to send me two more controllers, so I can change the controllers in my problematic array and see if it is a hardware problem.

By the way, the Dell analyst tells me that there is a firmware update for the Dell MD3200i that corrects some iSCSI timeout issues.

Felipe Carballo - VCP5 - VCP-Cloud
Darin777
Contributor

I should have posted my hardware: two R710s, PowerConnect 5424s, an MD3000i (dual controller), and ESXi 4.1. Everything is on the latest update and latest firmware. We tried 9000 and 1500 MTU on the hosts and the issue remains. However, I believe someone said 1500 was working for them. I have tried 1500, but only on the ESXi hosts; I did not set the switches and SAN down to 1500 as well, so maybe it wasn't a true test. Does anyone have experience with this?

If I switch ALL paths on a given host to MRU (VMware), it has never reported the issue. If a host is on Round Robin (VMware), we experience the issues: roughly 12 events posted at random during a 24-hour period. The only time I've seen a lost-access event on an MRU volume is when another volume on that same host was using Round Robin. VMware support closed the case since MRU was working; however, MRU is not acceptable for our production use.

I also have doubts about the switch config, since we've been using Hyper-V for 3 years without issue on this same SAN and through the same switches. The problem only exists on the ESXi hosts. For troubleshooting purposes, I installed StarWind SAN on a spare server and added its volumes as additional volumes to an ESXi host with Round Robin. The StarWind-connected volumes never reported the issue, while at the same time the MD3000i volumes did. All of this was from the same host through the same switches, but to two different SAN targets.

Skyward-willw: please keep us posted on the direct-connect test. It's something I cannot easily test, since our SAN is in production use by the Hyper-V hosts.

Kiwijj: We are in the USA (PST).

fcarballo
Contributor

After reading the recommendations that Dell passed on to Kiwijj, I changed the MTU from 9000 back to 1500, but the problem persists.

Felipe Carballo - VCP5 - VCP-Cloud
Skyward-willw
Contributor

I did the same: switch, SAN, and VMware all at 1500 MTU, and no change.

fcarballo
Contributor

Is everybody using a Broadcom network adapter? Dell support told me today about a driver for ESX/ESXi hosts released on 2011/07/21.

They told me to apply this driver update and install the MD3200i firmware update released on 8/3/2011. I'll do it on Monday.

Broadcom network adapter driver

http://downloads.vmware.com/d/details/dt_esxi40_broadcom_bcm_netxtreme/dHdlYnQlZXBiZGhwZA

Dell MD32x0i iSCSI RAID Controller

http://support.dell.com/support/downloads/download.aspx?c=us&cs=04&l=en&s=bsd&releaseid=R310104&Syst...
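
(On ESX/ESXi 4.1, a driver offline bundle like that one is typically applied with vihostupdate from the vSphere CLI, with the host in maintenance mode; a sketch, where the host name and bundle filename are placeholders:

# install the offline bundle, then reboot the host
vihostupdate --server esxi-host-01 --install --bundle offline-bundle.zip

)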

Felipe Carballo - VCP5 - VCP-Cloud
fcarballo
Contributor

Well, I applied the latest firmware/NVRAM update available for the Dell MD3200i, which did not solve my connectivity issues.

Now I'll install the Broadcom driver update for ESXi 4.1. I hope this helps in some way.

Felipe Carballo - VCP5 - VCP-Cloud
Skyward-willw
Contributor

My setup includes Intel and Broadcom NICs, ESX 4.1 and ESXi 4.1 in various states of patching, and the latest firmware and NVRAM on the MD3200i, and I still have the issue on all servers. I can't see how this is anything but a flaw in the MD3200i, and I just don't have time to call Dell and ream them about it. Thankfully this is a test environment that so far has been able to tolerate this kind of issue.

fcarballo
Contributor

I don't know if this is a problem with the Dell MD3200i in general, because in my environment I have two MD3200i arrays but the issue only happens on one of them.

Felipe Carballo - VCP5 - VCP-Cloud
Skyward-willw
Contributor

My guess is that a certain percentage of MD3200i units are affected by this issue; too many others have the same issue on the same SAN hardware.

kiwijj
Contributor

Hi Felipe,

Did you do the update to the Broadcom NICs?

cheers,

John

hyien
Contributor

I'm not on Dell hardware but am also experiencing the same thing.

DDunaway
Enthusiast

I am also having the issue, using all HP gear running ESXi 4.1 U1:

HP c7000 enclosure with BL460c G6 blades.

HP EVA 6400 Fibre Channel SAN.

With over 120 disks, it should be able to keep up with our demand.

Something as small as a Storage vMotion will cause this issue for other LUNs and other hosts.

HP’s “master technologists” have looked at our environment and could not figure out the issue.

Hopefully this is something that vSphere 5 will fix.

Skyward-willw
Contributor

I just went through and updated all of my ESXi/ESX servers to the latest patches via Update Manager. Knock on wood, so far the issue has not come back. Round Robin enabled, 1500 MTU (will try 9000 soon). vMotion/DRS/HA are all working as they should now. Not sure what the issue really was... not convinced it won't come back either. :)
