I have servers with only one Fibre Channel HBA each. If that 2-port HBA fails, all VMs on the server lose connectivity to storage. How can I estimate the risk of data loss if a VM is in the middle of a write to storage when the HBA fails? Thanks to HA the VM would be restarted, and backups run nightly. I'm concerned about data corruption or data loss affecting the current day's transactions, and how likely that is if the HBA fails.
VMinator, I am sorry to say that you will not be able to predict the exact amount of data loss, either in bytes or in number of WRITE I/O commands. It depends on the number of outstanding WRITE commands at the time of failure. In an All Paths Down (APD) condition, if your VM's storage access recovers within the timeout value that the guest operating system (GOS) can tolerate (refer to the OS documentation for exact timing details), the GOS will take care of retrying the I/O operations. If you hit a Permanent Device Loss (PDL) with I/O on the wire, i.e. not yet committed, you will see data loss. But again, as I said, you will not be able to predict the exact I/O loss.
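To make the guest timeout point concrete, here is a minimal sketch for a Linux guest, assuming its disks are presented as SCSI devices so that the per-disk command timeout (in seconds) shows up under /sys/block/<disk>/device/timeout. The function names and the 60-second outage figure are made up for illustration and are not VMware recommendations; Windows guests keep their disk timeout elsewhere, so check the OS documentation there.

```python
from pathlib import Path


def read_disk_timeouts(sys_block="/sys/block"):
    """Return {disk_name: timeout_seconds} for disks exposing a SCSI timeout.

    Assumes the Linux sysfs layout /sys/block/<disk>/device/timeout.
    Devices without that file (loop devices, for example) are skipped.
    """
    timeouts = {}
    root = Path(sys_block)
    if not root.is_dir():
        return timeouts
    for dev in sorted(root.iterdir()):
        timeout_file = dev / "device" / "timeout"
        if timeout_file.is_file():
            try:
                timeouts[dev.name] = int(timeout_file.read_text().strip())
            except (OSError, ValueError):
                continue  # unreadable or non-numeric value: skip this disk
    return timeouts


def disks_below(timeouts, expected_outage_s):
    """List disks whose timeout is shorter than an assumed outage length."""
    return [name for name, t in sorted(timeouts.items()) if t < expected_outage_s]


if __name__ == "__main__":
    found = read_disk_timeouts()
    for name, seconds in found.items():
        print(f"{name}: {seconds}s")
    # 60 seconds is a hypothetical outage length, not a measured value.
    print("shorter than assumed outage:", disks_below(found, 60))
```

This only tells you how long the guest will keep retrying before it reports I/O errors to the application. It does not tell you how many writes were in flight when the paths went down.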
PS: Mark the thread as Answered if this helps to clarify your question.
Supposing I did know the number of outstanding write commands, how would you use that data to determine the practical impact on an application such as SharePoint losing its connection to storage? How likely is it that, when the VM is restarted on another host, it will be able to resume where it left off, versus needing to be restored from last night's backup and thus losing all data since that backup?
Has anyone experienced this and wants to comment?
Evening,
Great question. I assume you are trying to create a justification for two HBAs or for a lower RPO. I have had hundreds of storage failures, even on redundant arrays, and rarely have I run into machines that were 100% hosed due to this kind of interruption. This is due to the store-and-forward method that storage arrays use. That being said, the problems I have seen were 100% on Windows. The file system below Windows is not journaling based and thus suffers badly from these types of crashes. I have seen a Windows machine have to return to a backup due to this type of crash, but it is rare. I have never had a problem with Linux.
What it really comes down to is designing for failure and knowing your requirements. Is your business willing to return to a backup from last night due to a driver or single card failure? If the answer is that this is too costly or unreasonable, then more availability options should be funded via CapEx. It's not an issue of how much data is lost; it's whether you are willing to lose data.
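If it helps to put a number on "return to a backup from last night", here is a rough sketch of the arithmetic, assuming a single nightly backup that starts at a fixed hour and is fully restorable. The function name and the 01:00 backup hour are assumptions for illustration, not values from your environment.

```python
from datetime import datetime, timedelta


def data_at_risk(failure_time, backup_hour=1):
    """Time elapsed between the most recent nightly backup and the failure.

    backup_hour is a hypothetical schedule (01:00 by default). Everything
    written in this window is lost if the VM has to be restored from backup.
    """
    last_backup = failure_time.replace(
        hour=backup_hour, minute=0, second=0, microsecond=0
    )
    if last_backup > failure_time:
        # Tonight's backup has not run yet, so fall back to yesterday's.
        last_backup -= timedelta(days=1)
    return failure_time - last_backup


if __name__ == "__main__":
    # Example: HBA fails mid-afternoon, backups assumed to run at 01:00.
    print(data_at_risk(datetime(2024, 1, 10, 15, 30)))
```

With nightly backups the window is always somewhere between zero and 24 hours, which is the figure to put in front of the business when asking whether a second HBA is worth the money.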
I hope it helps,
J
Ok great input - thanks
If anyone else has experience with this, feel free to chime in.