VMware Cloud Community
TheVMinator
Expert

Impact of lost storage connection

I have servers with only one Fibre Channel HBA. If that 2-port HBA fails, all VMs on the server lose connectivity to storage. How can I estimate the risk of data loss if a VM is in the middle of a write to storage when the HBA fails? Because of HA the VM would be restarted, and backups occur nightly. I'm concerned about data corruption or data loss affecting the current day's transactions, and how likely that is if the HBA fails.

5 Replies
admin
Immortal

VMinator, I am sorry to say that you will not be able to predict the exact amount of data loss, whether in bytes or in number of WRITE I/O commands. It depends on the number of outstanding WRITE commands at the time of failure. In an All Paths Down (APD) condition, if connectivity recovers within the timeout that the guest operating system (GOS) can tolerate (refer to the OS documentation for exact timings), the GOS will take care of retrying the I/O. If you hit a Permanent Device Loss (PDL) with I/O on the wire, i.e., not yet committed, you will see data loss. But again, as I said, you will not be able to predict the exact I/O loss.
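
The two cases above (paths return within the guest timeout, or the device is permanently lost) can be sketched as a small decision function. This is only an illustrative model, not how ESXi or any guest OS implements it: the `Outage` type, its field names, and the 60-second timeout in the test values are hypothetical, and the "APD longer than the guest timeout" branch is an assumption about typical guest behavior rather than something stated above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outage:
    # Seconds until paths return; None models a PDL (the device never comes back).
    paths_restored_after_s: Optional[float]
    # How long the guest OS keeps retrying before failing the I/O (hypothetical value).
    guest_disk_timeout_s: float
    # WRITE commands in flight when the HBA failed.
    outstanding_writes: int


def classify(outage: Outage) -> str:
    """Return a coarse outcome for the writes in flight at failure time."""
    if outage.outstanding_writes == 0:
        return "nothing in flight, nothing to lose"
    if outage.paths_restored_after_s is None:
        return "PDL: uncommitted writes are lost"
    if outage.paths_restored_after_s <= outage.guest_disk_timeout_s:
        return "APD within timeout: guest retries the writes"
    # Assumption: once its timeout expires, the guest reports I/O errors.
    return "APD beyond timeout: guest fails the writes"
```

Note that the function only says which writes are at risk; it cannot say how many bytes an application will actually lose, which is the point made above.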

PS: Mark the thread as Answered if this helps clarify your question.

TheVMinator
Expert

Supposing I did know the number of outstanding write commands, how would you use that data to determine the practical impact on an application such as SharePoint losing its connection to storage? How likely is it that when the VM is restarted on another host it will be able to resume where it left off, versus needing to be restored from last night's backup and thus losing all data since that backup?

Has anyone experienced this and wants to comment?

Gortee
Hot Shot

Evening,

Great question. I assume you are trying to build a justification for two HBAs or for a lower RPO. I have had hundreds of storage failures, even on redundant arrays, and rarely have I run into machines being 100% hosed by this kind of interruption. This is due to the store-and-forward method that storage arrays use. That being said, the problems I have seen are 100% on Windows. The file system under Windows, NTFS, journals only metadata, not file data, and so it suffers badly from this type of crash. I have seen a Windows machine have to return to a backup due to this type of crash, but it is rare. I have never had a problem with Linux.

What it really comes down to is designing for failure and knowing your requirements. Is your business willing to return to last night's backup because of a driver or single card failure? If the answer is that this is too costly or unreasonable, then more availability options should be funded via CapEx. It's not an issue of how much data is lost; it's whether you are willing to lose data.
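
To put a number on "returning to last night's backup", here is a minimal sketch of the exposure window, assuming nightly backups as described in the question. The function name and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def exposure_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Work that would be lost if the VM had to be restored from last_backup."""
    if failure < last_backup:
        raise ValueError("failure time precedes the last backup")
    return failure - last_backup


# Hypothetical example: backup finished at 01:00, HBA fails at 16:30 the same day.
backup = datetime(2024, 1, 10, 1, 0)
failure = datetime(2024, 1, 10, 16, 30)
print(exposure_window(backup, failure))  # 15:30:00
```

With a nightly schedule the window approaches 24 hours just before the next backup, and that worst case is the figure the business has to accept or reject.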

I hope it helps,

J

Joseph Griffiths http://blog.jgriffiths.org @Gortees VCDX-DCV #143
TheVMinator
Expert

OK, great input. Thanks.

TheVMinator
Expert

If anyone else has experience with this, feel free to chime in.
