VMware Cloud Community
MaximZ
Contributor

Lost access to volume after upgrade from ESXi 3.5 to ESXi 4.1.0

Setup
A number of Dell PE1950/R610 servers connected to an MD3000i disk array via dedicated Cisco 3750 iSCSI switches.
Problem
After the upgrade from ESXi 3.5 to ESXi 4.1.0, hosts occasionally lose and then restore the connection to one of the LUNs.

In most cases the connection is lost and restored within the same second, so there is no real problem.

However, a few times it took much longer (>60 sec) and as a result VMs froze/became unresponsive.

We installed the latest updates, upgraded BIOS/firmware, and tried different path policies: Round Robin, Most Recently Used, Fixed.
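
(For anyone checking the same thing: the path policy can also be listed and changed from the CLI on 4.1, roughly like this; the naa ID below is just a placeholder, use your own device ID.)

esxcli nmp device list                                               # list devices and the current path selection policy
esxcli nmp device setpolicy --device naa.xxxxxxxx --psp VMW_PSP_RR   # switch one LUN to Round Robin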

So far there have been no more real outages, but we still see these messages, which worry us 😞

What else can we do?

Thank you in advance,

-- Maxim

Lost access to volume 4a152e6b-8bc7843f-8500-001b21351ec0 (raid5_1) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

26/02/2011 7:34:04 AM

Successfully restored access to volume 4a152e6b-8bc7843f-8500-001b21351ec0 (raid5_1) following connectivity issues.

26/02/2011 7:33:47 AM

idle-jam
Immortal

Check out http://kb.vmware.com/kb/1001577. Even though the problem is not exactly the same, resignaturing a LUN helps most of the time. If that still does not help, I would highly suggest logging a call with VMware support, as any more trial and error could result in permanent data loss.
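
If you do try the resignature route, on ESX(i) 4.x it can also be done from the command line, roughly like this (the label below is only an example; read the KB first and double-check which volume you are touching):

esxcfg-volume -l            # list VMFS volumes detected as snapshots/replicas
esxcfg-volume -r raid5_1    # resignature the listed copy (use the label or UUID from the output above)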

MaximZ
Contributor

We found that we had missed binding the iSCSI initiator to the vmkXXX ports, which is one of the core changes in the software iSCSI initiator since ESX 3.x (more details at http://virtualgeek.typepad.com/virtual_geek/2009/09/a-multivendor-post-on-using-iscsi-with-vmware-vs...).
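
For anyone hitting the same thing, the binding is done from the CLI; on 4.1 it looked roughly like this on our hosts (vmhba33, vmk1 and vmk2 are just the names from my setup, yours will differ):

esxcli swiscsi nic add -n vmk1 -d vmhba33    # bind the first iSCSI vmkernel port to the software initiator
esxcli swiscsi nic add -n vmk2 -d vmhba33    # bind the second iSCSI vmkernel port
esxcli swiscsi nic list -d vmhba33           # verify the bindings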

Applied it to a few servers. Unfortunately, the behavior is still the same 😞

-- Maxim

MaximZ
Contributor

Hi,

Dell specialists reviewed the configuration again and found nothing wrong.

However, tonight I had another outage 😞

At this moment Dell suspects we have too many iSCSI sessions, so the errors may be caused by the disk array (MD3000i) dropping connections.

We have another conference call today to check this idea.

-- Maxim

idle-jam
Immortal

Do let us know the progress. All the best.

CieNum
Contributor

Hi,

Please let me know! We are running this configuration with exactly the same problems and the same behavior:

R610 hosts on ESXi 4.1 (348481) with an iSCSI MD3200i SAN, and we get these messages from time to time on some LUNs, and sometimes crashes...

I've tested Round Robin, MRU... no changes.

Hindisvik

MaximZ
Contributor

Hi,

Just finished the meeting with Dell.

They confirmed the recommendation to limit the number of iSCSI sessions to 32 for the MD3000i and 64 for the MD3200i.

That is very small, since each host needs 4-8 sessions for a redundant configuration.

Anything above this limit may lead to unexpected session termination.
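
Rough math, assuming port binding is in place: sessions per host is roughly (bound vmkernel ports) × (target portals), so 2 vmk ports × the 4 iSCSI ports on the array = 8 sessions per host. With 8 hosts that is already about 64 sessions, twice the MD3000i limit; without the binding it drops to about 4 per host, or roughly 32 in total.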

Hindisvik, how many iSCSI sessions do you have in your case?

Regards,

Maxim

CieNum
Contributor

Hi MaximZ,

Really interesting. I think investigating the iSCSI sessions is a very good idea, because these symptoms began when we added more ESXi hosts (and therefore more iSCSI sessions) to our MD3200i.

Today we have 8 ESXi hosts with several paths to the MD3200i. On each ESXi host we have:

- Connected targets: 8
- Devices: 5
- Paths: 29

What do you think is the right way to decrease the number of sessions? By doing what?

Thank you very much

regards

Hindisvik

CieNum
Contributor

Wow, the MD3200i indicates we currently have about 80 iSCSI sessions...

Do you know how to tell VMware to limit the number of iSCSI sessions? (We have a minimum of 4 sessions for each host...)

MaximZ
Contributor

That is what I have as well.

Each host is using up to 8 sessions; if you remove the binding it will use only 4.
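
In case it helps, removing the binding is basically the reverse of adding it; on my hosts it was along these lines (vmhba33/vmk2 are just my names, check yours with the list command first):

esxcli swiscsi nic list -d vmhba33            # show which vmkernel ports are bound to the software initiator
esxcli swiscsi nic remove -n vmk2 -d vmhba33  # unbind one of them to drop the extra sessions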

I'm looking into how to segment the environment / create isolated groups of 4-8 servers per disk array.

If you find another way, please let me know.

Regards,

-- Maxim

CieNum
Contributor

Hello Maxim,

I disconnected 2 cables on the MD3200i in order to have only 2 paths (instead of 4). Consequently, the number of iSCSI sessions has been divided by 2, and for now I have only 40 iSCSI sessions on the MD3200i. The goal is to see whether I still get the "lost access to volume..." messages. I will let you know if that is the case.

The "problem" I have now is that, before, on each datastore I had:

Round Robin activated with:

Active I/O

Active I/O

Standby

Standby

Now I have

Active I/O

Standby.

I think the traffic now only has half the active paths with that kind of configuration...

Let me know,

Thanx

Hindisvik

MaximZ
Contributor

Hi,

We removed the binding and reduced the number of hosts connected to the disk array so that there are no more than 32 iSCSI sessions.

As a result, we don't see any new lost/restore messages on most hosts.

However, we still have 2 hosts where this solution doesn't work 😞

Regards,

-- Maxim

kiwijj
Contributor

Hi,

Just wanted to say that we have the same problem. The problem occurred with ESX 4.0 Update 2. All hosts have since been upgraded to ESXi 4.1 Update 1, as I understand iSCSI handling is better in 4.1.

Connecting to a Dell MD3000i iSCSI storage array.

Multipathing has been set up using Round Robin. Connected targets = 4, Devices = 3, Paths = 12 for each of the 5 hosts.
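
For what it's worth, a quick way to snapshot the path state when the errors appear is something like the following (just the standard CLI on 4.1; the device IDs in the output will be your own):

esxcfg-mpath -b            # brief listing of paths per device
esxcli nmp device list     # shows the PSP in use (e.g. VMW_PSP_RR) and the working paths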

I have opened support calls with both VMware and Dell, and these have been open for two months now. VMware thinks it's storage, and Dell says they can't find anything wrong with the storage. So basically they are both saying they can't see anything.

The problem is very random: different hosts at different times, different LUNs. Sometimes access is restored within the timeout and nothing is affected; at other times random VMs lose their disk lock and power off. I have checked the switch logs and nothing shows up there.

VMware has now requested a conference with Dell, so we shall see what happens with that.

Having our production VMs randomly power off is causing us big headaches.

regards,

John

oldiemotors
Contributor

I know this thread is getting stale, but kiwijj, did you or anyone else ever solve this problem?

kiwijj
Contributor

Hi,

Still no resolution to this. The problem is still occurring. Dell has asked me to set up different subnets on the MD3000i's controllers, which I will do this weekend.

I do not think this will resolve the issue, as we are currently using Round Robin and there are two Active (I/O) paths to each controller anyway. The current setup was the Dell-recommended setup when the SAN was installed, though they have now said that the recommended setup is to use different subnets. We will see.
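
On the ESXi side the per-subnet layout should just be one iSCSI vmkernel port per subnet, something like this (port group names, IPs and masks below are only placeholders for my plan, not Dell's exact instructions):

esxcfg-vswitch -A iSCSI-A vSwitch1                            # port group for the first iSCSI subnet
esxcfg-vmknic -a -i 192.168.130.11 -n 255.255.255.0 iSCSI-A   # vmkernel port on subnet A
esxcfg-vswitch -A iSCSI-B vSwitch1                            # port group for the second iSCSI subnet
esxcfg-vmknic -a -i 192.168.131.11 -n 255.255.255.0 iSCSI-B   # vmkernel port on subnet B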

Cannot upgrade VMware to version 5 as the MD3000i is not on the VMware HCL. It is only 2.5 years old, and Dell have already told me it is an end-of-life product as far as they are concerned and that they will only release bug fixes, not feature releases, for it.

cheers,

JJ

oldiemotors
Contributor

We first noticed this problem when we moved all our VMs to datastores on an nSeries (NetApp box). VMware told us it is not an ESX issue, it is a storage system issue. So we moved the VMs back to our DS8300. Now we see the 'Lost access' errors occurring on the DS8300 storage. So it does not seem to matter which storage system is hosting the datastores; the problem exists wherever we place the heaviest load. We have even seen these disconnects occur on local storage.

I think we have had the 'Lost access' issues all along; we just didn't notice them until we migrated to the nSeries box. In our situation, I am convinced that these are ESX issues and/or configuration issues, not problems with the storage system. By ESX issues I mean things like timeout settings, queue depths, number of LUNs per port, number of HBAs per port, etc.
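
To give an idea of the kind of knobs I mean, for the software iSCSI case they look roughly like this (values are only examples, not recommendations; check with your storage vendor before changing anything):

esxcfg-module -s iscsivmk_LunQDepth=64 iscsi_vmk   # software iSCSI LUN queue depth (takes effect after a reboot)
esxcfg-advcfg -s 64 /Disk/SchedNumReqOutstanding   # outstanding requests per LUN when multiple VMs share it
esxcfg-advcfg -g /Disk/SchedNumReqOutstanding      # read back the current value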

kiwijj
Contributor

I also think these are ESXi issues, but VMware said it was nothing to do with ESXi and closed the call, so I am going down the Dell route now. It has been almost a year since the call was opened with Dell, and it's try this and try that, which means I have lots of weekends where I have to shut everything down to try something that never resolves the issue. I think both VMware and the storage vendors have put it in the too-hard basket. Now I am in a catch-22 situation where I cannot upgrade to vSphere 5 because the Dell MD3000i is not on the VMware HCL. And as the SAN is only 2.5 years old, it is not being replaced anytime soon. So it looks like we are stuck with the errors.
