I have an IBM DS3524 Storage Subsystem connected through a single Fibre Channel adapter to an IBM server running ESXi 5.0.
I am seeing some large numbers when I run esxtop. My DAVG/cmd spikes to 90 or 168 ms every minute or so. Most of the time it hovers around 14.75 ms.
I am seeing in the logs:
cpu20:4116)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x28 (0x412440385100) to dev "naa.60080e50002364aa000002bc4eb181de" on path "vmhba5:C0:T0:L6" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x94 0x1.Act:FAILOVER
cpu6:4102)WARNING: NMP: nmp_DeviceRetryCommand:133:Device "naa.60080e50002364aa000002b44eb14998": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
2013-09-05T13:49:12.526Z cpu8:4282)WARNING: vmw_psp_fixed: psp_fixedSelectPathToActivateInt:464:Selected current STANDBY path vmhba5:C0:T0:L6 for device naa.60080e50002364aa000002bc4eb181de to activate. This may lead to path thrashing.
In Events I see that I lose the connection to the storage device, immediately followed by a reconnect.
I logged on to my storage device and noticed that the write cache is off. Could the write cache being off cause this level of I/O trouble?
I am not familiar with IBM storage, but in general a disabled write cache should not cause the VM to lose connectivity (unless it causes very high disk latency). Please check the fabric side: connectivity and the path policies assigned for disk access. Make sure you are following the vendor's best practices.
Is this DS3524 a multiple controller unit or just a single controller? Directly connected to the host, or connected through a switch?
I wonder if your DS3524 may be trespassing the LUN to the other controller. What host OS type have you set on the DS3524 side?
Also, check this article; it matches the sense codes you are getting:
VMware KB: Troubleshooting LUN connectivity issues due to IBM DS3500 using an incorrect multipathing... (it is for 4.1, but may still apply)
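If it helps when comparing your log against the KB, here is a rough Python sketch that pulls the status fields out of an NMP failure line like the one posted above. The function name decode_nmp_failure and the lookup tables are my own for illustration and are not part of any VMware tool. The tables cover only a few standard SCSI values. ASC values of 0x80 and above are vendor-specific, so the sketch only flags them and leaves the meaning to be looked up in the array documentation.

```python
import re

# Small subsets of standard SCSI values; extend as needed.
OPCODES = {0x00: "TEST UNIT READY", 0x12: "INQUIRY", 0x28: "READ(10)", 0x2A: "WRITE(10)"}
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION", 0x8: "BUSY", 0x18: "RESERVATION CONFLICT"}
SENSE_KEYS = {0x2: "NOT READY", 0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR",
              0x5: "ILLEGAL REQUEST", 0x6: "UNIT ATTENTION", 0xB: "ABORTED COMMAND"}

HEX = r"0x[0-9a-fA-F]+"
# Pattern written against the log line format posted in this thread.
NMP_RE = re.compile(
    rf'Cmd (?P<cmd>{HEX}) \({HEX}\) to dev "(?P<dev>[^"]+)" on path "(?P<path>[^"]+)" '
    rf'Failed: H:(?P<h>{HEX}) D:(?P<d>{HEX}) P:(?P<p>{HEX}) '
    rf'Valid sense data: (?P<key>{HEX}) (?P<asc>{HEX}) (?P<ascq>{HEX})\.Act:(?P<act>\w+)'
)


def decode_nmp_failure(line):
    """Return a dict of decoded fields, or None if the line does not match."""
    m = NMP_RE.search(line)
    if m is None:
        return None
    cmd = int(m["cmd"], 16)
    dev_status = int(m["d"], 16)
    key = int(m["key"], 16)
    asc = int(m["asc"], 16)
    return {
        "device": m["dev"],
        "path": m["path"],
        "command": OPCODES.get(cmd, hex(cmd)),
        "host_status": int(m["h"], 16),
        "device_status": DEVICE_STATUS.get(dev_status, hex(dev_status)),
        "plugin_status": int(m["p"], 16),
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "asc": asc,
        "ascq": int(m["ascq"], 16),
        # ASC 0x80-0xFF is reserved for vendor-specific codes.
        "vendor_specific_asc": asc >= 0x80,
        "action": m["act"],
    }


if __name__ == "__main__":
    sample = ('cpu20:4116)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x28 (0x412440385100) '
              'to dev "naa.60080e50002364aa000002bc4eb181de" on path "vmhba5:C0:T0:L6" '
              'Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x94 0x1.Act:FAILOVER')
    for field, value in decode_nmp_failure(sample).items():
        print(f"{field}: {value}")
```

For the line in the original post this reports a READ(10) that ended in CHECK CONDITION with sense key ILLEGAL REQUEST and a vendor-specific ASC, followed by a FAILOVER action. In other words, the array rejected the read on that path rather than the link dropping, which is why the multipathing setup is worth checking.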