VMware Cloud Community
whitetimc
Contributor

Data-stores filled up causing virtual machine crashes

Good morning,

I have one datastore (LUN) that has 4 virtual machines residing on it. All of the virtual machines have thick provisioned disks. This morning a snapshot was taken of one of the servers which caused the datastore to fill up. Once this happened, all of the virtual machines on this datastore stopped working.

I shut down 2 of the virtual machines and migrated them over to another datastore and restarted them. They began working again as well as the 2 remaining virtual machines on the original LUN.

Management is asking me to explain why all of the virtual machines stopped performing when none of them had thin provisioned disks. They all had fixed-size disks.

Does anyone have a clue?

Thanks,

Tim

6 Replies
jhague
VMware Employee

When you take a snapshot, the primary disks (vmdks) are effectively frozen and changes start being written to a delta file. For example, say you have a 100GB VM that hosts a 50GB database. You take a snapshot of the VM, then immediately do a full backup of your database to a file. The footprint of that VM will now be 150GB, because you have the original 100GB vmdk file plus a delta file of around 50GB (the database backup file is a change from the original baseline). If you upgraded the OS as well, the delta file could be nearer 100GB, so you need to be careful and ensure you have adequate free space to cover the work you are undertaking.

A general recommendation with snapshots is to keep about 25% of the space on your datastores free to accommodate them, but this may need to be more or less depending on your workloads. If you have one VM on a datastore and are doing the kind of activity above, you might need 100% free space.
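
To put numbers on the example above, here is a rough sketch of the arithmetic in Python. The function names and the 2000GB datastore are made up for illustration; only the 100GB/50GB figures and the 25% guideline come from this post.

```python
def snapshot_footprint_gb(base_disk_gb, changed_gb):
    """Datastore space used by a VM with a snapshot: base vmdk plus delta file."""
    return base_disk_gb + changed_gb


def has_headroom(capacity_gb, used_gb, reserve_fraction=0.25):
    """True if free space still meets the reserve guideline (25% by default)."""
    free_gb = capacity_gb - used_gb
    return free_gb >= capacity_gb * reserve_fraction


# 100GB VM, snapshot taken, then a 50GB database backup written to disk.
footprint = snapshot_footprint_gb(100, 50)
print(f"{footprint} GB")  # 150 GB

# Hypothetical 2000GB datastore with 1600GB already in use: 400GB free,
# which is below the 500GB that a 25% reserve would call for.
print(has_headroom(2000, 1600))  # False
```

The reserve fraction is a parameter because, as noted, the right figure depends on the workload.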

When you have a datastore with a larger number of VMs, the snapshot activity and usage will often average out, but if you have fewer, larger VMs, snapshots are more likely to take up a larger proportion of the datastore, so you need to factor these things in. The other recommendation is not to leave snapshots hanging around: 72 hours is the general guideline, but again this can vary depending on usage.

Large DB servers are something I've often seen catch people out in the past, for all the reasons above. Once the disk is full, everything is going to halt because it can't commit I/O operations.

John Hague http://linkedin.com/in/john-hague | twitter @jhague10 VCIX-DCV | VCP-DCV 3/4/5/6 | VCP6-NV | VCP7-CMA | VCAP7-CMA Design
whitetimc
Contributor

Thanks Jhague.

This makes sense somewhat. I understand that a snapshot can grow very quickly and fill up a LUN, but what I'm confused about is why the other servers on this LUN stopped working when they had more than enough space available to them within the guest OS. Example below:

  1. Server 1 is a Windows 2012 server with one 100 GB disk. When I looked inside the guest OS, there was 25 GB of free space.
  2. Server 2 is a Windows 2012 server with two 150 GB disks. The first disk has 50 GB of free space and the second disk has 100 GB of free disk space.
  3. Server 3 is a Windows 2012 server with two 125 GB disks. The first disk has 65 GB of free space and the second disk has 150 GB of free disk space.

So the puzzling question is: why did these servers stop working when the LUN filled up, given that they had thick provisioned disks?

Thanks in advance

jhague
VMware Employee

When you say stopped working, do you mean they were paused? Was there an error message on them? How big is your datastore? Normally they will pause when they request new blocks. This could potentially be for VM logs (remember there are a bunch of files associated with a VM). Once you are out of space you are prone to unpredictable results, as ESXi is not designed to work with zero free space.
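
As a very rough illustration of the "requesting new blocks" point, here is a toy model in Python. It is only a sketch with made-up sizes and a hypothetical `Datastore` class, not how VMFS actually allocates space. The idea it shows is that free space inside a guest OS sits within an already reserved thick disk, while any new file or growing file has to come out of the datastore's shared free space.

```python
class Datastore:
    """Toy model: one shared pool of space for every file on the datastore."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.used_gb = 0

    def allocate(self, size_gb):
        """Reserve new blocks; return False if the pool cannot supply them."""
        if self.used_gb + size_gb > self.capacity_gb:
            return False
        self.used_gb += size_gb
        return True


ds = Datastore(capacity_gb=500)  # hypothetical size

# Thick disks are reserved in full up front, whatever the guest has used.
ds.allocate(100)  # VM A, thick disk
ds.allocate(300)  # VM B, thick disk

# A snapshot delta on VM A grows until the remaining 100GB is gone.
ds.allocate(100)

# VM B now needs a little new space (a log file, say) and cannot get it,
# even though its guest OS may report plenty of free space on its disk.
ok = ds.allocate(1)
print(ok)  # False
```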

whitetimc
Contributor

The servers appeared to have just frozen once the datastore filled up. The datastore was 2TB.

jhague
VMware Employee

Usually when a VM is unable to write to the disk you will get an 'i' icon on the VM in vCenter with a question asking if you want to retry or cancel, but not always. If you look in the vmware.log file in the VM's directory on the datastore, it will tell you what is trying to write, so you could maybe see if there is more info in there for the VMs.
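
If you have copied a vmware.log off the datastore, something like the Python sketch below can pull out the lines that mention running out of space. The search phrases and the sample line are assumptions for illustration, not a definitive list of what ESXi writes, so adjust them to whatever your logs actually contain.

```python
import re
from pathlib import Path

# Assumed phrases; the exact wording in your logs may differ.
DEFAULT_PATTERNS = ("no more space", "no space left")


def find_space_errors(log_text, patterns=DEFAULT_PATTERNS):
    """Return (line_number, line) pairs for lines matching any phrase."""
    regex = re.compile("|".join(re.escape(p) for p in patterns), re.IGNORECASE)
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if regex.search(line)
    ]


def scan_log_file(path):
    """Read a log file copied from the VM's directory and scan it."""
    text = Path(path).read_text(errors="replace")
    return find_space_errors(text)


# Made-up sample text standing in for a real log.
sample = "line one\nThere is no more space for virtual disk example.vmdk\nline three"
for number, line in find_space_errors(sample):
    print(number, line)
```

For a real file you would call `scan_log_file("vmware.log")` with the path to your copy.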

Do you have any backup tools or anything running that could have been taking snapshots of other VMs?

whitetimc
Contributor

Thanks JHague,

It appears that a backup job via Commvault kicked off causing the datastore to fill up.