Last night we had some maintenance done on our Clariion SAN, on which our ESX 3.02 hosts have VMFS volumes. One storage processor was rebooted at a time, and the hosts appeared to handle the path swapping well. However, our Windows guest VMs got several event log messages:
The device, \Device\Scsi\symmpi1, is not ready for access yet.
The driver detected a controller error on \Device\Harddisk1.
Our only Red Hat Linux VM actually had its file system remounted read-only and required a reboot. So this event is having at least some effect on the guests, which surprised me. The Windows guests do have the TimeOutValue registry value set to 60 seconds, so perhaps that's what prevented them from crashing outright (?). I suppose I understand that ESX needs a SCSI request to time out before it knows to fail over, so any I/O a VM had in flight must also time out, and the guest would understandably be aware of that.
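For reference, the disk timeout on the Windows guests was applied via the standard Disk class registry key. Something like the following (run inside each guest; the 60 is in seconds, matching the value mentioned above):

```shell
# Set the Windows SCSI disk timeout to 60 seconds inside the guest.
# HKLM\SYSTEM\CurrentControlSet\Services\Disk\TimeOutValue is the standard
# location for this setting; the guest must be rebooted for it to take effect.
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Disk" /v TimeOutValue /t REG_DWORD /d 60 /f
```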
Is this expected behavior? Any way to prevent it?
There are a few advanced options worth knowing about, including Disk.PathEvalTime and Disk.MaxVCNotReadyTime, both of which can improve failover times when paths go down during SP reboots. Bear in mind that these settings cause the COS to poll the bus more frequently, which increases the load on it. Additionally, understanding the layout of the disks (which SP they are on, which paths are active, and which FC switch/fabric they are accessed via) can all minimize the effect of SP maintenance.
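On the ESX 3.x service console these advanced options can be read and set with esxcfg-advcfg. A sketch, assuming the options live under /Disk as named above; the value 90 is just an illustrative choice, so check your current values and vendor guidance before changing anything:

```shell
# Read the current path-evaluation interval (seconds) on the ESX host.
esxcfg-advcfg -g /Disk/PathEvalTime

# Lower the interval so dead paths are detected sooner during SP reboots.
# Trade-off: the COS polls the bus more often, which adds load on it.
esxcfg-advcfg -s 90 /Disk/PathEvalTime
```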
As a Clariion and VMware guy, I try to limit the I/O levels during maintenance by shutting down any unnecessary VMs (test/dev ones), making sure we are outside the backup window, and avoiding periods of heavy processing.