In most cases the connection is lost and restored within the same second, so there is no real problem.
However, a few times it took a long time (>60 sec) and as a result VMs froze/became unresponsive.
We installed the latest updates, upgraded the BIOS/firmware, and tried different path policies: Round Robin, Most Recently Used, Fixed.
Hopefully no more real outages, but we still see these messages, which scare us 😞
What else can we do?
Thank you in advance,
-- Maxim
Lost access to volume 4a152e6b-8bc7843f-8500-001b21351ec0 (raid5_1) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
26/02/2011 7:34:04 AM
Successfully restored access to volume 4a152e6b-8bc7843f-8500-001b21351ec0 (raid5_1) following connectivity issues.
26/02/2011 7:33:47 AM
Check out http://kb.vmware.com/kb/1001577. Even though the problem is not exactly the same, resignaturing the LUN helps most of the time. If this still does not help, I would highly suggest logging a call with VMware support, as any more trial and error could result in permanent data loss.
We found we had missed binding the iSCSI initiator to the vmkXXX interfaces, which is a core change in the iSCSI software initiator since ESX 3.x (more details at http://virtualgeek.typepad.com/virtual_geek/2009/09/a-multivendor-post-on-using-iscsi-with-vmware-vs...).
Applied this to a few servers. Unfortunately, the behavior is still the same 😞
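For anyone following along, on ESX/ESXi 4.x the binding described above is done with the `esxcli swiscsi` namespace. This is only a sketch: the adapter name `vmhba33` and the vmkernel ports `vmk1`/`vmk2` are examples, so check your own names first.

```shell
# List vmkernel interfaces to find the iSCSI-dedicated vmk ports
esxcfg-vmknic -l

# Bind each iSCSI vmkernel port to the software iSCSI adapter
# (vmhba33, vmk1, vmk2 are examples from a typical setup)
esxcli swiscsi nic add -n vmk1 -d vmhba33
esxcli swiscsi nic add -n vmk2 -d vmhba33

# Verify the binding took effect
esxcli swiscsi nic list -d vmhba33
```

A rescan of the software iSCSI adapter is normally needed afterwards for the extra paths to appear.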
-- Maxim
Hi,
Dell specialists reviewed the configuration again and found nothing wrong.
However, tonight I had another outage 😞
At this point Dell suspects we have too many iSCSI sessions, so the errors may be caused by the disk array (MD3000i) dropping connections.
We have another conf call today to check this idea.
-- Maxim
Do let us know the progress. All the best.
Hi,
Please let me know! We are running this configuration with exactly the same problems and the same behavior:
R610, ESXi 4.1 (348481) with an iSCSI MD3200i SAN, and these messages appear sometimes on some LUNs, with the occasional crash...
I've tested Round Robin, MRU... no change.
Hindisvik
Hi,
Just finished a meeting with Dell.
They confirmed the recommendation to limit the number of iSCSI sessions to 32 for the MD3000i and to 64 for the MD3200i.
That is very small, since each host needs 4-8 sessions for a redundant configuration.
Anything above this limit may lead to unexpected session termination.
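To put rough numbers on that limit (a back-of-the-envelope sketch using only the figures from this thread: 8 sessions per host with vmk binding, 4 without, against the 32/64 caps):

```python
def max_hosts(session_limit, sessions_per_host):
    """How many hosts fit under an array's iSCSI session cap."""
    return session_limit // sessions_per_host

# MD3000i cap is 32 sessions, MD3200i cap is 64 (per Dell, above).
print(max_hosts(32, 8))  # MD3000i, hosts with binding  -> 4
print(max_hosts(32, 4))  # MD3000i, binding removed     -> 8
print(max_hosts(64, 8))  # MD3200i, hosts with binding  -> 8
```

So a cluster of even 5-8 bound hosts can blow past the MD3000i's cap, which would fit the random session drops described here.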
Hindisvik, how many iSCSI sessions do you have in your case?
Regards,
Maxim
Hi MaximZ,
Really, really interesting. I think it is a very good idea to investigate the iSCSI sessions, because those symptoms began when we added more ESXi hosts (and therefore more iSCSI sessions) to our MD3200i.
Today we have 8 ESXi hosts with several paths to the MD3200. On each ESXi host we have:
- Connected targets : 8
- Devices : 5
- Paths: 29
What do you think is the best way to decrease the number of sessions? What should we do?
Thank you very much
regards
Hindisvik
Wow, the MD3200i indicates we currently have about 80 iSCSI sessions...
Do you know how to tell VMware to limit the number of iSCSI sessions? (We have a minimum of 4 sessions for each host...)
That is what I have as well.
Each host uses up to 8 sessions; if you remove the binding, it will use only 4.
I'm looking into how to segment the environment / create isolated groups of 4-8 servers per disk array.
If you find another way, please let me know.
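For the binding removal, something like the following should do it on ESX/ESXi 4.x (again only a sketch; `vmhba33`/`vmk2` are example names, so list your own bindings first):

```shell
# Show which vmkernel ports are currently bound to the software iSCSI adapter
esxcli swiscsi nic list -d vmhba33

# Remove one of the bound vmk ports; this should drop the per-host
# session count from 8 back to 4, as described above
esxcli swiscsi nic remove -n vmk2 -d vmhba33
```

Existing sessions may not drop until the adapter is rescanned or the host is rebooted, so count sessions on the array side afterwards to confirm.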
Regards,
-- Maxim
Hello Maxim,
I unconfigured 2 cables on the MD3200 so that we only have 2 paths (instead of 4). Consequently, the number of iSCSI sessions has been divided by 2, and for now I have only 40 iSCSI sessions on the MD3200. The goal is to see whether I still get the "lost access to volume..." messages. I will let you know if that is the case.
The "problem" I have now is that, before, on each datastore I had :
Round Robin activated with :
Active I/O
Active I/O
Standby
Standby
Now I have
Active I/O
Standby.
I suppose the traffic is halved with that kind of configuration...
Let me know,
Thanx
Hindisvik
Hi,
We removed the binding and reduced the number of hosts connected to the disk array so that there are no more than 32 iSCSI sessions.
As a result, we don't see any new lost/restored messages on most hosts.
However, we still have 2 hosts where this solution doesn't work 😞
Regards,
-- Maxim
Hi,
Just wanted to say that we have the same problem. The problem occurred with ESX 4.0 Update 2. All hosts have since been upgraded to ESXi 4.1 Update 1, as I understand iSCSI support is better in 4.1.
Connecting to a Dell MD3000i iSCSI storage array.
Multipathing has been set up using Round Robin. Connected targets = 4, Devices = 3, Paths = 12 for each of the 5 hosts.
I have opened support calls with both VMware and Dell, and these have now been open for two months. VMware thinks it's the storage, and Dell says they can't find anything wrong with the storage. So basically both are saying they can't see anything.
The problem is very random: different hosts at different times, different LUNs. Sometimes access is restored within the timeout and nothing is affected; at other times random VMs lose their disk lock and power off. I have checked the switch logs and nothing shows up there.
VMware have now requested a conference with Dell so shall see what happens with that.
Having our production VMs randomly powering off is causing us big headaches.
regards,
John
I know this thread is getting stale, but kiwijj, did you or anyone else ever solve this problem?
Hi,
Still no resolution to this. The problem is still occurring. Dell has asked me to set up different subnets on the MD3000i's controllers, which I will do this weekend.
I do not think this will resolve the issue, as we are currently using Round Robin and there are two Active (I/O) paths to each controller anyway, and the current setup was the Dell-recommended setup when the SAN was installed, though they have now said the recommended setup is to use different subnets. We will see.
We cannot upgrade VMware to version 5 because the MD3000i is not on the VMware HCL. It is only 2.5 years old, and Dell have already told me it is an end-of-life product as far as they are concerned and that they will only release bug fixes, not feature releases, for it.
cheers,
JJ
We first noticed this problem when we moved all our VMs to datastores on an nSeries (NetApp) box. VMware told us it is not an ESX issue but a storage-system issue. So we moved the VMs back to our DS8300. Now we see the 'Lost access' errors occurring on the DS8300 storage. So it does not seem to matter which storage system is hosting the datastores; the problem exists wherever we place the heaviest load. We have even seen these disconnects occur on local storage.
I think we have had the 'Lost access' issues all along; we just didn't notice them until we migrated to the nSeries box. In our situation, I am convinced these are ESX issues and/or configuration issues, not problems with the storage system. By ESX issues I mean things like timeout settings, queue depths, number of LUNs per port, number of HBAs per port, etc.
I also think these are ESXi issues, but VMware said it was nothing to do with ESXi and closed the call, so I am going down the Dell route now. It has been almost one year since this call was opened with Dell, and it's "try this" and "try that", which means I have had lots of weekends where I have to shut everything down and try this and that, and it never resolves the issue. I think both VMware and the storage vendors have put it in the too-hard basket. Now I am in a catch-22 situation where I cannot upgrade to vSphere 5 because the Dell MD3000i is not on the VMware HCL, and as the SAN is only 2.5 years old it is not being upgraded anytime soon. So it looks like we are stuck with the errors.