We have a number of clusters that each contain about 15 hosts. We also use RDMs quite heavily for Microsoft failover clusters in our environment - up to 70 RDMs. Our SAN array is a VNX 7500. All hosts within each cluster are defined in a host group on the array.
The ESXi hosts are Dell M620s, M630s and R730s, running ESXi 5.5 Update 3.
Everything works well on a day-to-day basis; however, we have been having issues with random clusters experiencing a failure/failover whenever we add a new host to the host group on the SAN array. It appears that when the host is added to the storage group it automatically kicks off a storage scan (I can see this because the datastores start appearing on the host automatically). Some time after the host is added to the storage group, sometimes 15 minutes, sometimes up to 5 hours, some of the clusters start failing because the physical disks they use become unavailable. These are the errors we are seeing in the event log:
Cluster resource 'INST01_Log' of type 'Physical Disk' in clustered role 'SQL Server (clustername\INST01)' failed.
Ownership of cluster disk 'INST02_Data' has been unexpectedly lost by this node. Run the Validate a Configuration wizard to check your storage configuration.
In most cases the cluster will successfully fail over to the passive node. In other instances I need to manually bring the disk resource back online if it hasn't recovered automatically.
The reason it takes so long before an issue is seen is that, as the RDMs are being scanned for the first time, there is a SCSI reservation on them which does not allow them to be read. The scan waits until each device times out before moving on to the next one. As good practice we mark all of our cluster RDMs as perennially reserved; however, it's not possible to do this until the disk has been presented to the host for the first time. If we happen to reboot a host that hasn't had the disks perennially reserved yet, it can take up to 6 hours for it to start responding.
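For reference, this is roughly how the perennial reservation step looks once the LUNs are visible to a new host. It is only a sketch: the NAA IDs are made-up placeholders and the helper name is hypothetical. It just prints the esxcli commands to run on the host for each cluster RDM, one to set the flag and one to list the device so you can confirm it reports "Is Perennially Reserved: true".

```python
def perennial_reservation_commands(naa_ids, reserved=True):
    """Build esxcli commands that set and then verify the
    perennially-reserved flag for each device ID in naa_ids.

    Hypothetical helper for illustration; it only builds strings
    and does not talk to a host.
    """
    flag = "true" if reserved else "false"
    commands = []
    for naa in naa_ids:
        if not naa.startswith("naa."):
            raise ValueError(f"not an NAA device ID: {naa!r}")
        # Mark the RDM LUN so the host skips it during boot/rescan.
        commands.append(
            f"esxcli storage core device setconfig -d {naa} "
            f"--perennially-reserved={flag}"
        )
        # List the device so the flag can be checked in the output.
        commands.append(f"esxcli storage core device list -d {naa}")
    return commands


if __name__ == "__main__":
    # Placeholder IDs only; substitute the real NAA IDs of the cluster RDMs.
    example_ids = [
        "naa.EXAMPLE00000000000000000000000001",
        "naa.EXAMPLE00000000000000000000000002",
    ]
    for cmd in perennial_reservation_commands(example_ids):
        print(cmd)
```

The flag is set per host and per device, so the output has to be run on every new host for every cluster RDM, which is why it cannot be done before the first scan has discovered the devices.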
We logged a job with VMware; however, they came back saying that the issue is being caused by the array and that we should contact EMC. I don't necessarily agree with this, as things usually operate fine. It is only when a host is added for the first time and a scan takes place that it seems to cause some sort of lock on the RDM, which prevents the MSCS cluster from being able to read from or write to it. We have not seen any issue with the VMFS datastores themselves.
Has anyone else seen this, or does anyone know what could be causing the issue? Should a host performing a scan on an RDM that is in use by an MSCS cluster cause the cluster to fail?