VMware Cloud Community
wheelz311
Contributor

Possibility of Limiting Damage from Storage Failures

I have worked in multiple VMware environments where, unfortunately, the back-end storage was not as reliable as one would hope.  When the storage failed, fixing all the systems and problems that arose was miserable, even once the storage was back up and working.  This got me thinking.  I don't know whether this is already a feature I'm unaware of, something that could be scripted today, or something that would require an enhancement request, but I wanted to throw it out here.

Instead of just ripping the storage out from underneath the VMs while they are running, could the host pause them?  I am thinking it would go like this.  As soon as the host detects that a storage device is inaccessible, it would pause all VMs (perhaps configurably) that are using that storage.  Then it would immediately dump those VMs' memory to local storage on the host.  Ideally the memory dumps could be saved to disk in a deduplicated way (or in the same form as they are held in memory).  That would mean you would just have to keep local storage in each host >= the amount of memory in the host.  Once the storage was back up and healthy, you could trigger a resume, and the host would load the memory back in and the VMs would keep going where they left off.  What do you think?  Is this possible?  Could the pausing function at least work today?
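To make the idea concrete, here is a rough sketch of the workflow as a plain Python simulation. It does not call any VMware API; the `VM` and `Host` classes, the datastore names, and the sizes are all made up for illustration, and a real implementation would depend on what the hypervisor actually exposes.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    datastore: str          # hypothetical datastore label
    memory_gb: int
    state: str = "running"  # "running" or "paused"


@dataclass
class Host:
    local_free_gb: int                          # free local disk for memory dumps
    vms: list = field(default_factory=list)
    dumped: dict = field(default_factory=dict)  # VM name -> GB written locally

    def on_datastore_lost(self, datastore):
        """Pause each running VM on the failed datastore, then dump its memory."""
        for vm in self.vms:
            if vm.datastore == datastore and vm.state == "running":
                vm.state = "paused"  # step 1: stop the VM issuing I/O
                if vm.memory_gb <= self.local_free_gb:
                    # step 2: account for the dump on local disk
                    self.local_free_gb -= vm.memory_gb
                    self.dumped[vm.name] = vm.memory_gb

    def on_datastore_restored(self, datastore):
        """Resume paused VMs and release the local space their dumps used."""
        for vm in self.vms:
            if vm.datastore == datastore and vm.state == "paused":
                self.local_free_gb += self.dumped.pop(vm.name, 0)
                vm.state = "running"


# Illustrative run: two VMs on shared storage, one on local storage.
host = Host(
    local_free_gb=64,
    vms=[VM("web01", "san-lun1", 16), VM("db01", "san-lun1", 32), VM("util01", "local-ds", 8)],
)
host.on_datastore_lost("san-lun1")
states_during_outage = {vm.name: vm.state for vm in host.vms}
free_during_outage = host.local_free_gb
host.on_datastore_restored("san-lun1")
states_after_restore = {vm.name: vm.state for vm in host.vms}
```

The sketch only tracks state and local disk accounting. It ignores deduplication, so the local space check is the worst case where every dump is stored at full size.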

mcowger
Immortal

You are looking for a stun feature, which doesn't exist today for storage failures.

You can contact your VMware rep to submit a feature request, though.

--Matt VCDX #52 blog.cowger.us