Problem with ESX 3.01 using LUNS on an IBM DS8100 disk array

dinny · 07-17-2007

Hiya,

I have been using ESX 3.01 across two sites with a HA/DRS cluster in each site for several months with no problems.

I have nineteen LUNs configured - 10 in one site and 9 in the other.

All ESX 3.01 servers are zoned to see all 19 LUNs - just for DR purposes.

Over the weekend something seemed to happen (no idea what yet...) that caused all of my ESX servers to effectively lose contact with three of the LUNs.

Two in one site and one in the other.

Nothing has changed to prompt this that I'm aware of.

The LUNs still appear under VI client configuration/storage adapters (even after several rescans).

But the VMFS volumes do not show under VI client configuration/storage.

The vmkernel and vmkwarning logs on all the ESX servers show SCSI reservation and SCSI lock issues for each of the three LUNs.

Fortunately none of the three LUNs are hosting vital VMs...

My other VMs all seem to be running OK - but the SCSI reservations issue seems to be causing lots of timeouts and hangs in the VI client - so it is now awkward to troubleshoot the VMware environment.

I had hoped that our SAN team might have tools that would show me what was causing the reservations on the three LUNs - but no luck so far.

They were also unable to tell me which disks in their IBM disk allocation GUI corresponded to the LUN IDs displayed on ESX.

Eventually I got them to allocate the LUNs in turn to a test ESX server - then by checking the vmkwarning logs I was able to find out which three LUNs were causing the problem - and they now know which LUNs are affected in the IBM GUI.
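
For anyone wanting to do the same log check less manually, here is a rough Python sketch of the idea: count the log lines that mention a reservation, grouped by device path. It assumes the log has been copied somewhere with a recent Python, that the log lives at /var/log/vmkwarning, and that device paths appear as vmhbaA:T:L (adapter:target:LUN). The function names are my own and the matching is deliberately loose, so treat it as a starting point rather than a tested tool.

```python
import re
from collections import Counter

# Assumption: device paths show up in the log as vmhba<adapter>:<target>:<lun>.
PATH_RE = re.compile(r"vmhba\d+:\d+:\d+")


def count_reservation_hits(lines):
    """Count lines that mention a reservation, per vmhba path found on the line."""
    hits = Counter()
    for line in lines:
        if "reservation" in line.lower():
            # set() so a path repeated on one line is only counted once
            for path in set(PATH_RE.findall(line)):
                hits[path] += 1
    return hits


def scan_log(path="/var/log/vmkwarning"):
    """Scan a log file (default path is an assumption) and return the counts."""
    with open(path, errors="replace") as fh:
        return count_reservation_hits(fh)
```

Calling scan_log() on a copy of the log and printing the result of .most_common() should put the noisiest paths at the top, which is roughly what I was doing by eye.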

(Hopefully IBM support can now do some further troubleshooting on these LUNs....)

I had initially planned to ask them to remove these three LUNs temporarily from all of my production ESX servers - and just allocate them to my test ESX server.

The idea being that I could then do a rescan on each server and my production ESX environment would then be free of the timeout errors - and hence be manageable again.

I was then hoping to troubleshoot the SCSI reservation problems on the test ESX server in my own time (unless the locks had already been freed up by unallocating the LUNs from the other ESX servers).

I have since realised that because (as far as I'm told?) there is no way of setting a permanent LUN ID on the IBM DS8100 GUI - the LUN IDs of the other LUNs will change when the three problem LUNs are removed.

(unless by luck they were the highest LUNs - which they aren't)
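
To make the worry concrete, here is a small Python sketch of the renumbering. It assumes the array simply repacks the remaining LUN IDs contiguously in the same order once some are removed - that is only my reading of what I've been told, not confirmed DS8100 behaviour - and the IDs in the usage note are made up.

```python
def renumber_after_removal(lun_ids, removed):
    """Map old LUN ID -> new LUN ID, assuming the array repacks IDs contiguously."""
    removed = set(removed)
    remaining = [i for i in sorted(lun_ids) if i not in removed]
    base = min(lun_ids)  # assume numbering restarts from the lowest original ID
    return {old: base + idx for idx, old in enumerate(remaining)}


def changed_luns(lun_ids, removed):
    """Return only the LUNs whose ID would change under that assumption."""
    mapping = renumber_after_removal(lun_ids, removed)
    return {old: new for old, new in mapping.items() if old != new}
```

With 19 LUNs numbered 0 to 18, changed_luns(range(19), [16, 17, 18]) returns an empty dict (the lucky case), whereas removing IDs from the middle shifts every LUN above the lowest removed ID.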

I imagine that this would cause resignaturing issues and that by default ESX would presume that the LUNs whose IDs had changed were now snapshots and not mount them.

I appreciate that there are switches in ESX that I can set to prevent this default behaviour (the LVM.EnableResignature and LVM.DisallowSnapshotLun advanced settings) - but I was hoping that maybe someone else used IBM DS8100 disk arrays - and knew of a way to preserve the LUN IDs presented to ESX via them?

Has anyone else experienced similar issues - and found an effective way of dealing with them?

Dinny
