Re: Dealing with a disk loss

ITServ_GmbH · ‎02-07-2018

Hello community members.

I'm new to VMware and am evaluating the free ESXi as a proof of concept for moving to VMware Essentials.

My server has eight SATA disks, each configured as a separate datastore. I create a guest OS (in my case, Solaris) and assign it two virtual disks, each on a different datastore and thus on a different physical disk.

Next, I mirror the disks using Solaris Volume Manager. Now I run disk I/O and can clearly see both disks are busy.

Next, I simulated a fatal disk error by removing a disk, assuming the volume manager would take care of it. To my surprise, VMware halted the VM. To my further surprise, ESXi restarted the VM after exactly four minutes, which is great for me.

In the VM log, I see the following lines:

2018-02-07T16:33:29.439Z| vmx| I125: Msg_Question:

2018-02-07T16:33:29.439Z| vmx| I125: [msg.hbacommon.askonpermanentdeviceloss] The storage backing for virtual disk '/vmfs/volumes/5638ff99-591db485-df48-00259032e410/s10/s10_1.vmdk' has been permanently lost. You may be able to hot remove this virtual device from the virtual machine and continue after clicking _Retry. Click Cancel to terminate this session.

2018-02-07T16:33:29.439Z| vmx| I125: ----------------------------------------

2018-02-07T16:37:29.705Z| vmx| I125: Timing out dialog 25919

2018-02-07T16:37:29.705Z| vmx| I125: MsgQuestion: msg.hbacommon.askonpermanentdeviceloss reply=0
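
The four-minute gap can be read directly from the timestamps of the question and the timeout lines above. A minimal Python sketch, assuming every log line starts with a UTC timestamp followed by `|` as in the excerpt (the helper names are made up for illustration):

```python
from datetime import datetime


def parse_log_timestamp(line: str) -> datetime:
    """Parse the timestamp in front of the first '|' of a VM log line."""
    stamp = line.split("|", 1)[0].strip()
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")


def dialog_wait_seconds(question_line: str, timeout_line: str) -> float:
    """Seconds between the question being raised and the dialog timing out."""
    delta = parse_log_timestamp(timeout_line) - parse_log_timestamp(question_line)
    return delta.total_seconds()


# The two lines are copied from the log excerpt above.
asked = "2018-02-07T16:33:29.439Z| vmx| I125: Msg_Question:"
timed_out = "2018-02-07T16:37:29.705Z| vmx| I125: Timing out dialog 25919"

print(dialog_wait_seconds(asked, timed_out))
```

For this excerpt the difference is about 240 seconds, which matches the four minutes observed before the VM was restarted.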

Question 1: How can I influence the time the VM becomes suspended, or, to be more precise, the timeout of this dialog?

I also want to prepare for read/write errors that occur without a completely failed disk. For that reason,

Question 2: can I assume a read/write error on a datastore is forwarded to the VM?

Thank you very much.

daphnissov · ‎02-07-2018

Just out of curiosity, why would you be interested in creating a configuration with separate drives providing separate datastores and performing in-guest mirroring? This solution creates additional levels of abstraction and increases complexity. You could reduce that complexity and satisfy your goals for some resiliency by using either a shared storage array or a local RAID solution to present a single datastore object.

------------------
How to Ask for Help on Tech Forums
https://neonmirrors.net

ITServ_GmbH · ‎02-07-2018

Because it would be financial overkill in terms of infrastructure (the need for a redundant RAID server) and power (more devices to run). I have two identical servers here. If one fails, I'll grab the disks and put them into the other.

I have no requirements for high availability, but I'd like to mirror the disks for data redundancy and protection against the most normal error: disk failure.

I know I could add a RAID card, but this would increase the complexity, too. So why solve a problem with hardware when it can be solved in software?

However, that discussion does not get me any closer to an answer to my questions, even though I'm grateful for your input.

continuum · ‎02-07-2018

Hi Peter
> I have no requirements for high availability, but I'd like to mirror the disks for data redundancy and protection against the most normal error: disk failure.
I completely understand your needs - but I do not agree with one of your assumptions!
> protection against the most normal error: disk failure.
This is NOT the "most normal" error. In small single-host setups like the one you have, the most frequent error is not a hard disk failure but a VMFS failure.
So the first thing to do is to make sure that you are immune to VMFS failures.
This can easily be achieved by using eager-zeroed thick provisioned vmdks allocated in one piece. After you have created these vmdks, write down their allocation on the datastore and keep this text file in a safe location. If you do this consistently for all your VMs - and I believe this should be quite easy with your setup - even a complete VMFS corruption will not harm you.
I am aware that this does not answer your original question - but I highly recommend that you prepare against VMFS failures before you think about hard disk problems.

Feel free to call via Skype for detailed instructions - I also speak German
Ulli

________________________________________________
Do you need support with a VMFS recovery problem ? - send a message via skype "sanbarrow"
I do not support Workstation 16 at this time ...

continuum · ‎02-07-2018

> Question 1: How can I influence the time the VM becomes suspended, or, to be more precise, the timeout of this dialog?

First of all make sure that you assign

scsi0 to your first disk of the VM and
scsi1 to your mirrored second disk.
Then you can influence the behaviour either with

scsi0.returnNoConnectDuringAPD = "TRUE" / "FALSE"

scsi0.returnBusyOnNoConnectStatus = "TRUE" / "FALSE"

scsi1.returnNoConnectDuringAPD = "TRUE" / "FALSE"

scsi1.returnBusyOnNoConnectStatus = "TRUE" / "FALSE"

or by answering the question automatically with

msg.autoAnswer = "TRUE"

answer.msg.hbacommon.askonpermanentdeviceloss = "retry"

Sorry - never tried this myself - so please report if it helps in this special case.
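
To avoid typos when adding these entries by hand, the lines for both controllers can be generated. This is only a sketch in Python: the option names are the ones listed above, the function name `render_apd_options` is made up for illustration, and which TRUE/FALSE combination is the right one is left open here, so the caller has to choose it.

```python
def render_apd_options(controllers, return_no_connect, return_busy):
    """Build the per-controller option lines in key = "value" form.

    controllers       -- controller names such as "scsi0" and "scsi1"
    return_no_connect -- value for <controller>.returnNoConnectDuringAPD
    return_busy       -- value for <controller>.returnBusyOnNoConnectStatus
    """
    def as_flag(value):
        return "TRUE" if value else "FALSE"

    lines = []
    for name in controllers:
        lines.append(f'{name}.returnNoConnectDuringAPD = "{as_flag(return_no_connect)}"')
        lines.append(f'{name}.returnBusyOnNoConnectStatus = "{as_flag(return_busy)}"')
    return lines


# Example values only; the effect of each combination has to be tested.
for line in render_apd_options(["scsi0", "scsi1"], True, False):
    print(line)
```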
> Question 2: can I assume a read/write error on a datastore is forwarded to the VM?

No! It depends.
If there is an I/O error in the VMFS metadata area, the VMDK can disappear and the VM will stop.
- If your VM is thin provisioned, it would be lost in this case.
- If your VM is eager-zeroed thick and you are prepared as explained in my last post, you can copy the VM to another datastore and carry on.
If the I/O error is located inside a vmdk in an area that is actively used by the guest, the guest will notice an I/O error and may crash.

If the I/O error is located inside a vmdk but outside of an actively used partition, neither the guest nor ESXi will notice it.
This means that when Solaris runs in a VM, you may miss all the early warnings that you would expect from Solaris running on physical hardware.
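
The cases in this post can be summarised as a small lookup. The Python sketch below only restates this post; the function name and the labels for error location and provisioning type are invented for illustration, and provisioning types not mentioned here are reported as not covered.

```python
def expected_outcome(error_location, provisioning="thin"):
    """Return the outcome this post describes for an I/O error.

    error_location -- "vmfs_metadata", "vmdk_in_use" or "vmdk_unused"
    provisioning   -- "thin" or "eager_zeroed_thick" (only used for metadata errors)
    """
    if error_location == "vmfs_metadata":
        if provisioning == "thin":
            return "VMDK can disappear, VM stops, VM is lost"
        if provisioning == "eager_zeroed_thick":
            return "VMDK can disappear, VM stops, copy VM to another datastore"
        return "not covered in this post"
    if error_location == "vmdk_in_use":
        return "guest notices an I/O error and may crash"
    if error_location == "vmdk_unused":
        return "neither guest nor ESXi notices"
    raise ValueError(f"unknown error location: {error_location}")


print(expected_outcome("vmfs_metadata", "eager_zeroed_thick"))
print(expected_outcome("vmdk_unused"))
```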

ITServ_GmbH · ‎02-07-2018

Hi Ulli,

Thank you for your patient advice. I'll take special care of these files. I'll send you a private mail for further discussion.

However, my questions are not answered yet, so back to the community:

Q1: is it possible to reduce the time a VM is suspended on disk loss?

Q2: will I/O errors on a disk be forwarded to the VM, or do they result in a loss of the VMFS on that disk?

Thank you for your thoughts.

Finikiez · ‎02-07-2018

> scsi0.returnNoConnectDuringAPD = "TRUE" / "FALSE"
> scsi0.returnBusyOnNoConnectStatus = "TRUE" / "FALSE"
> scsi1.returnNoConnectDuringAPD = "TRUE" / "FALSE"
> scsi1.returnBusyOnNoConnectStatus = "TRUE" / "FALSE"
>
> or by answering the question automatically with
> msg.autoAnswer = "TRUE"
> answer.msg.hbacommon.askonpermanentdeviceloss = "retry"

I've tried these options myself. They work only for Windows VMs.

I didn't manage to make them work for Linux VMs. Linux VMs worked until the first reboot; after that, they hung on a black screen during boot.

In my personal opinion, mirroring/striping configurations inside the guest OS that use several vmdks from different datastores should be avoided.

ITServ_GmbH · ‎02-08-2018

OK, thank you everybody. I'll try these in the next few days.

A last question: are these settings "per guest" or for the entire ESXi host?

Thank you!

ITServ_GmbH · ‎02-09-2018

OK, found the settings. They all have to be set individually for each VM.

For further documentation, read this article in the VMware Knowledge Base
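
Because the settings are per VM, each VM's configuration has to be changed separately. Here is a rough Python sketch that updates or appends `key = "value"` entries in the text of a VM configuration file. It assumes one entry per line, exact key matching, and that the file is edited while the VM is powered off; the function name `set_vmx_options` and the sample `displayName` entry are only for illustration.

```python
def set_vmx_options(vmx_text, options):
    """Return vmx_text with each option set, replacing existing keys or appending new ones."""
    remaining = dict(options)
    result = []
    for line in vmx_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if "=" in line and key in remaining:
            # Key already present: overwrite its value in place.
            result.append(f'{key} = "{remaining.pop(key)}"')
        else:
            result.append(line)
    # Keys that were not present yet are appended at the end.
    for key, value in remaining.items():
        result.append(f'{key} = "{value}"')
    return "\n".join(result) + "\n"


sample = 'displayName = "s10"\nmsg.autoAnswer = "FALSE"\n'
wanted = {
    "msg.autoAnswer": "TRUE",
    "answer.msg.hbacommon.askonpermanentdeviceloss": "retry",
}
print(set_vmx_options(sample, wanted))
```

Working on the text rather than on the file keeps the sketch easy to test; reading the file, keeping a backup copy and writing the result back are left to the caller.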