HP StorageWorks MSA1x00 LUNs on VMware - Excessive SCSI Bus Reservations

btok · ‎07-30-2009

HP has finally released a document on the MSA1500cs lockup problems; it is copied below.

Has anyone followed these steps and confirmed that the controller no longer locks up and crashes ESX and the VMs?

HP StorageWorks MSA1x00 LUNs on VMware - Excessive SCSI Bus Reservations

Issue

The HP StorageWorks MSA1000 or MSA1500cs may lock up every three to four weeks. Customers are required to power-cycle the whole environment to regain access to the storage.

Timeouts and excessive SCSI reservation conflicts are logged in the /var/log/vmkwarning file:

WARNING: Migrate: 1346: 1229224338134649: Migration considered a failure by the VMX. It is most likely a timeout, but check the VMX log for the true error.

WARNING: Migrate: 1243: 1229224338134649: Failed: Migration determined a failure by the VMX (0xbad0091) @0xa148e5

WARNING: MigrateNet: 323: 1229224338134649: 9-0x3501f8b8:Received only 0 of 68 bytes: Timeout

WARNING: SCSI: 119: Failing I/O due to too many reservation conflicts
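As a quick way to see how often a host is hitting the last warning above, the matching lines can be counted from the service console. This is only a minimal sketch: the function name count_reservation_conflicts is made up for illustration, and it assumes the warning text contains the phrase "reservation conflicts" exactly as quoted in the advisory.

```shell
# Count vmkernel warnings that mention SCSI reservation conflicts.
# Hypothetical helper; the default path is the log file named in the advisory.
count_reservation_conflicts() {
    log_file="${1:-/var/log/vmkwarning}"
    # grep -c prints the number of matching lines (0 if there are none)
    grep -c 'reservation conflicts' "$log_file"
}
```

Running count_reservation_conflicts with no argument reads /var/log/vmkwarning; pass another path to check a rotated or copied log.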

Memory allocation errors are logged on the MSA CLI debug serial console:

~~44d 06:38:21.9~~ Error allocating persistent memory data set handle-table full

~~44d 06:38:22.0~~ parse/pr_aptpl.c: Failed to persist reservation type 10 for ITL 6:0:1 (PDLA). APTPL = 0

~~44d 06:38:22.0~~ (Err=0, IDH=0x0, TPDH=0x0, IH=0x9, TPH=0xFC00)

~~44d 06:38:23.3~~ Error allocating persistent memory data set handle-table full

~~44d 06:38:23.5~~ Error allocating persistent memory data set handle-table full

~~44d 06:38:23.5~~ Error allocating persistent memory data set handle-table full

~~44d 06:38:23.5~~ parse/pr_aptpl.c: Failed to persist reservation type 10 for ITL 5:1:1 (PDLA). APTPL = 0

~~44d 06:38:23.6~~ (Err=0, IDH=0x0, TPDH=0x0, IH=0x10000, TPH=0xFC10)

~~44d 06:38:23.9~~ Error allocating persistent memory data set handle-table full

~~44d 06:38:23.9~~ parse/pr_aptpl.c: Failed to persist reservation type 10 for ITL 15:0:1 (PDLA). APTPL = 0

~~44d 06:38:24.0~~ (Err=0, IDH=0x0, TPDH=0x0, IH=0xD, TPH=0xFC00)

Solution

NOTE: Before you start resolving the issue, make a full verified backup of all MSA data.

1. Update the MSA1000 or MSA1500cs controller firmware to v5.30 (active/passive) or v7.10 (active/active), either with the corresponding MSAFlash utility or from the CLI using the corresponding binary file.

2. Make sure that the connection profile of every FC connection coming from the VMware hosts is set to Linux rather than left as the default, and that each connection is given a name. To do this, boot one of the servers from the MSA Support Software CD v7.76, which can be downloaded from the MSA1500 Support Software CD (ISO) webpage (http://h20000.www2.hp.com/bizsupp...).

Run the Array Configuration Utility, and choose Selective Storage Presentation to set the host connection profile to the correct operating system name.

3. Disable the HP SIM Fibre Agent:

a. Log in to the ESX Server host service console.

b. Open the file /opt/compaq/cma.conf in a text editor.

c. Add exclude cmahostd to the top of the file.

d. Save the file, and exit the editor.

e. Restart the management agents on the host by running the following three commands:

service hpasm restart
service hpsmh restart
service cmanic restart
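Steps 3b to 3d can also be scripted so that the line is only added once, even if the script runs again. The sketch below is an illustration rather than HP's procedure: the function name exclude_cmahostd is hypothetical, and it assumes that no exclude line for cmahostd exists yet in any other form (for example, combined with other agents on one line). The agents still have to be restarted as in step 3e.

```shell
# Add "exclude cmahostd" as the first line of the agent configuration
# unless a line starting with it is already present.
# Hypothetical helper; the default path is the file named in step 3b.
exclude_cmahostd() {
    conf_file="${1:-/opt/compaq/cma.conf}"
    if grep -q '^exclude cmahostd' "$conf_file"; then
        return 0    # already excluded, nothing to do
    fi
    cp "$conf_file" "$conf_file.bak" || return 1    # keep a backup copy
    # Write the new first line, then the original contents
    { echo 'exclude cmahostd'; cat "$conf_file.bak"; } > "$conf_file"
}
```

The original file is kept next to it with a .bak suffix, so the change can be reverted by copying the backup over the edited file.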

4. Double-check the multipath policy in use and make sure it is set to mru as the storage is running active/passive firmware v5.30. You can use the vmkmultipath command, or the VMware Management User Interface (MUI) to set the multipathing policy for a LUN. For example, the following command sets the multipathing policy on the fly for all LUNs on the SAN vmhba0:0:1 to MRU:

vmkmultipath -s vmhba0:0:1 -p mru

5. Make sure that the customer serializes any backups they might be doing (as opposed to running simultaneous or parallel backup jobs).

6. VMware ESX has advanced settings that make it use a LUN reset instead of a bus reset. The parameter UseDeviceReset should be set to 0 and UseLunReset should be set to 1. With these settings, a reset affects only a single LUN (the one that the reset was issued against) instead of all LUNs. This should greatly reduce the number of check conditions generated by bus resets, which gives the controller far fewer items to handle.

The setting is also documented in the VMware "SAN Configuration Guide" (in the appendix on page 118): http://www.vmware.com/pdf/vi3_301_201_san_cfg.pdf

The default settings will be listed in the logs as:

UseDeviceReset (Use device reset (instead of bus reset) to reset a SCSI device) 0-1: default = 1: 1

UseLunReset (Use LUN reset (instead of device/bus reset) to reset a SCSI device) 0-1: default = 1: 1

The setting can be changed either through the VirtualCenter server or from the command line on the ESX service console.

Page 78 of the "Version 2.5 VMware ESX Server SAN Configuration Guide" describes how to change the parameter using VirtualCenter (http://www.vmware.com/p...).

To get and set the value from the command line:

[]$ esxcfg-advcfg -g /Disk/UseDeviceReset <== to GET the current value

Value of UseDeviceReset is 1 <== Current value

[]$ esxcfg-advcfg -s 0 /Disk/UseDeviceReset <== To SET value to 0

Value of UseDeviceReset is 0 <== New value

Whichever way you change the parameter, you need to restart the vmware-hostd daemon to activate the change by running this service console command:

service mgmt-vmware restart

Alternatively, just reboot the ESX server.

Once the settings have been modified, they will appear in the logs as:

UseDeviceReset (Use device reset (instead of bus reset) to reset a SCSI device) 0-1: default = 1: 0

UseLunReset (Use LUN reset (instead of device/bus reset) to reset a SCSI device) 0-1: default = 1: 1
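To confirm from the service console that a host matches the recommendation in step 6, the two values can be read back and compared. This is a rough sketch under assumptions: the function names and the ADVCFG variable are made up for illustration, and the parsing relies on esxcfg-advcfg -g printing a line of the form "Value of UseDeviceReset is 1", as in the example output above.

```shell
# Command used to query advanced settings; overridable so the check can be tried off-host.
ADVCFG="${ADVCFG:-esxcfg-advcfg}"

# Print the numeric value of one /Disk advanced option, parsed from
# output of the form "Value of UseDeviceReset is 1".
get_disk_option() {
    $ADVCFG -g "/Disk/$1" | sed -n 's/^Value of .* is \([0-9][0-9]*\).*$/\1/p'
}

# Succeed only if the host matches the recommendation:
# UseDeviceReset=0 and UseLunReset=1.
reset_settings_ok() {
    [ "$(get_disk_option UseDeviceReset)" = "0" ] &&
    [ "$(get_disk_option UseLunReset)" = "1" ]
}
```

For example, `reset_settings_ok && echo "reset settings OK"` prints the message only when both values are as recommended.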

7. In VMware, be sure to set the SCSI controller sharing mode correctly for the system configuration, because this also reduces SCSI bus reservation conflicts.

Here is the configuration guideline as listed in the HP VMware Best Practices document:

Sharing LUNs: To share LUNs between VMs within a single ESX server, set the SCSI controller to Virtual mode. To share LUNs across multiple ESX servers or in a virtual-to-physical configuration, set the SCSI controller to Physical mode.

erickmiller · ‎10-17-2009

I'm just bumping this thread since I'd also love to know whether anyone has had 100% success in not crashing the MSA1500cs controllers. I'm pretty sure we've tried everything on their list with no success, but I'll review again.

Eric K. Miller, Genesis Hosting Solutions, LLC http://www.genesishosting.com/ - Lease part of our ESX cluster!

jhanekom · ‎06-04-2010

Hi Eric

You've probably long since moved on from the MSA1500, but I've just noticed that HP finally posted the v7.20 firmware for public download on their web site.

One of the fixes is supposedly:

"Resolved issue of MSA persistent reservation table becoming full when used with various supported levels of VMware."

Maybe that is (finally) the silver bullet that will resolve the "timebomb" facility the array has...

If you still have an MSA1500 available, maybe you can give it a bash. I still have an MSA1000 in production and will be applying the update in the next week or two. Hopefully I can then stop scheduling a 3-monthly reboot (I know, ours isn't all that busy, so we can stretch longer between reboots.)

The firmware is available here: http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=12...

(wanted to post this to your MSA1500cs web site as well, but the registration process is broken...)
