We experienced this problem (NetApp, Cisco UCS, VMware 6.x). As someone else pointed out, the solution is to change the storage multipath policy from "Round Robin" to "Most Recently Used (VMware)".

The issue seems to be that the Windows Failover Cluster service running on each node of the cluster periodically checks disk ownership by sending a SCSI-3 command to set a "persistent reservation". This is part of how the storage failure detection mechanism works. Normally the owning node gets a SCSI acknowledgement. With Round Robin, however, the reservation set/check command can go out on one channel and the reply come back on the other, in which case the owning cluster node never receives a response and assumes the cluster is down. Other nodes in the cluster also send SCSI commands to check whether the LUN's persistent reservation (ownership) is set, and they may or may not receive a response. The result is that none of the cluster nodes knows whether any particular node has suffered a storage access failure. It's all documented in the Microsoft Failover Clustering storage management information.

This seems to be an issue only in virtualized environments like VMware. In a physical multi-server Windows Failover Cluster, where Windows is installed on real servers with shared disks, one would install a Windows multipath I/O driver provided by the storage vendor to solve the problem of SCSI commands going out on one channel and replies coming back on another.
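To make the difference between the two policies concrete, here is a toy Python model of how each one picks a path for successive commands. It is only a sketch: the path names are made up, the class and function names are my own, and a real Round Robin policy switches paths after a batch of I/Os rather than after every single command. It does not model SCSI or ESXi itself; it just shows why, under "Most Recently Used", every reservation command ends up on the same path.

```python
from itertools import cycle


class RoundRobinSelector:
    """Toy model: rotate to the next path for every command (simplified)."""

    def __init__(self, paths):
        self._cycle = cycle(list(paths))

    def next_path(self):
        return next(self._cycle)


class MostRecentlyUsedSelector:
    """Toy model: keep using the current path until it is marked failed."""

    def __init__(self, paths):
        self._paths = list(paths)
        self._current = self._paths[0]

    def next_path(self):
        return self._current

    def fail_current(self):
        # Drop the failed path and move to the next one still available.
        remaining = [p for p in self._paths if p != self._current]
        if not remaining:
            raise RuntimeError("no paths left")
        self._paths = remaining
        self._current = remaining[0]


def paths_used(selector, n_commands):
    """Return the path chosen for each of n_commands successive commands."""
    return [selector.next_path() for _ in range(n_commands)]


# Hypothetical path names, for illustration only.
PATHS = ["path-A", "path-B"]

rr_paths = paths_used(RoundRobinSelector(PATHS), 4)
mru_paths = paths_used(MostRecentlyUsedSelector(PATHS), 4)

print("Round Robin:       ", rr_paths)
print("Most Recently Used:", mru_paths)
```

With Round Robin the four commands alternate between the two paths, while with Most Recently Used they all go down the same one, so a reservation command and whatever follows it cannot end up on different channels unless the path actually fails.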