I have a two-node Windows Failover Cluster using Pure FC SAN storage. I receive multiple errors on every RDM disk set to physical compatibility mode, across multiple hosts.
Error - Failure issuing call to Persistent Reservation READ Reservation on Test Disk 3 from node "server name" when the node has successfully reserved. It is expected to succeed. The request could not be performed because of an I/O device error.
This occurs on all my RDM disks, and I only have this issue on Windows Server 2016, not 2008 or 2012 R2.
Any suggestions would help; see the attached screenshot.
This is what Pure support said:
When running validation tests on RDM volumes presented directly to a Windows Server guest VM running on ESXi, the validation test fails with SCSI-3 persistent reservation failures. There are many different failure modes, but as of this writing, we are aware of two.
1. When running this test, the array sends a unit attention back down a path to let the path and host know that there is a change; we fail the next operation, forcing the Windows Server host to retry the command 4 times, but down 4 different paths. If the customer has a round robin multipathing policy set up on ESXi to the array and has more than 4 paths to the array, no IO will succeed down any path for the duration of the test. This is not normally a concern because there is almost always other IO down a path during regular operations. Because the Windows disk is offline and there is no other IO on it, only the test IO goes down each of the 4+ paths, so the Windows Server sees nothing but failures down every path.
2. If the customer is using vVols and running the Cluster Validation tests, there is a known issue with ESXi versions earlier than 6.7 U2 (6.7 U1 and lower) that prevents the test from completing successfully.
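To make the first failure mode easier to follow, here is a toy Python model of it. It is only a sketch under stated assumptions, not how ESXi or the array is implemented: it assumes a unit attention is pending on every path, that it fails exactly one command on that path and is then cleared, and that the host gives up after 4 attempts in total. The function name and parameters (`run_command`, `ios_per_path`, `max_attempts`) are made up for illustration.

```python
def run_command(num_paths, ios_per_path, max_attempts=4):
    """Toy model of one test command retried over round robin paths.

    Returns the 1-based attempt number on which the command succeeds,
    or None if every attempt fails.

    Assumptions (not taken from any real implementation):
      - every path starts with one pending unit attention (UA);
      - a pending UA fails the next command on that path, then clears;
      - round robin sends `ios_per_path` IOs down a path before moving on;
      - the host makes at most `max_attempts` attempts in total.
    """
    pending_ua = [True] * num_paths
    for attempt in range(max_attempts):
        # Round robin: stay on a path for ios_per_path IOs, then advance.
        path = (attempt // ios_per_path) % num_paths
        if pending_ua[path]:
            pending_ua[path] = False  # UA reported once, command fails
            continue
        return attempt + 1  # no UA pending on this path, command succeeds
    return None


if __name__ == "__main__":
    # 8 paths, 1 IO per path: each attempt lands on a fresh path with a UA.
    print(run_command(num_paths=8, ios_per_path=1))
    # 8 paths, 2 IOs per path: the retry reuses the path whose UA was cleared.
    print(run_command(num_paths=8, ios_per_path=2))
```

In this model, with 1 IO per path and more than 4 paths, every attempt lands on a path that still has a unit attention pending, so the command never succeeds. With 2 IOs per path, the first retry goes down the same path as the failed attempt, where the unit attention has already been cleared.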
1. If the customer is not using vVols and is getting this error, they might need to change the number of IOs that Round Robin sends down each path from 1 to 2 and run the test again. If this is a test they'll be running often, they are OK to leave it as 2. If this is a test they only plan to run once, they can change this value back to 1 or leave it as 2. Please open a Jira and validate they are having the same issue as we see in PURE-140182.
2. We recommend that the customer be on ESXi 6.7 U2 or higher.
3. In 5.3.6+, we've also fixed PURE-148458: