VMware Cloud Community
freakyocr
Contributor

ESXi 4.1 - Ran Out of Space Disaster

We're running ESXi 4.1 (upgraded from 4.0U1) with the HP specific tools on a DL380G6.

We have a RAID 5 array of 405GB on a p410i w/256MB. We installed SBS 08 & Server 08 and they were running fine with little problem.

We had 70GB free on the datastore. For some reason the space usage was creeping up, but it seemed like no big deal. We had a small issue with not being able to shrink the thin allocation, but with that much free space we weren't worried.

Two days later SBS stops due to 0 bytes of free space. Bad news. We're dumbfounded as to why 70GB disappeared in 2 days, but that investigation will have to wait until after we get the server back up.

Plan A:

We run out and grab both a BBWC and another drive and expand the array. Reports say that we should be able to just expand the datastore, no problem. No dice. Not even an option. We later read that several people recommended creating it as a new partition instead of expanding the existing one. Array expansion took 8 hours. Downtime thus far: 24 hours.

Okay, Plan B:

We grab another drive (we are using 2.5" SAS, so we don't exactly have many lying around). This time we see the controller in the enlarge box. We choose it, click Next, and vSphere throws an error and halts. We try to get around this; no dice. ESXi will not co-operate. This is really frustrating at this point. Downtime: 28 hours.

Okay, Plan C:

We transfer the VMs off ESXi, trash and re-create the datastore, and transfer them back, hopefully with the datastore including the extra 146GB we added. Except that this is so painfully slow I want to die. It started off at 21MB/s and is now trailing off at 6.7MB/s. We have 500GB worth of VMs and we've been running two transfers to two different machines for almost 9 hours... just to transfer them off. We just used the browser download because vSphere would tell us how long it was taking, and after 5 minutes, it had only done 800MB. SCP wouldn't work; it seems that we had to enable the unsupported mode to use it? But from what I've read, SCP transfers are slow. Man, this is frustrating. Downtime: 36 hours and counting.
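For anyone wanting to sanity-check numbers like these, here is a rough sketch of the arithmetic in Python. The function name is made up for illustration, and it assumes a single sustained rate and 1 GB = 1024 MB; the real copy ran as two parallel transfers at a varying rate, so treat the result as a ballpark only.

```python
def transfer_hours(size_gb: float, rate_mb_s: float) -> float:
    """Hours needed to move size_gb at a sustained rate of rate_mb_s.

    Assumes 1 GB = 1024 MB and a constant rate (a simplification).
    """
    if rate_mb_s <= 0:
        raise ValueError("rate must be positive")
    seconds = size_gb * 1024 / rate_mb_s
    return seconds / 3600


# The two rates quoted in the post, applied to the full 500GB of VM files.
for rate in (21.0, 6.7):
    print(f"{rate:>5} MB/s -> {transfer_hours(500, rate):.1f} h for 500 GB")
```

At the starting rate the full 500GB works out to a bit under 7 hours; at the trailing rate it is over 21 hours, which is why splitting the copy across two machines still took most of a working day.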

We have full backups, but they are a day old and we'd prefer to keep the files because, well we should be able to.

Are these just limitations of ESXi? Should I have spec'd more storage than I thought they could reasonably use? Was there something I could have done to make the transfers off and on ESXi faster? I've never set up an NFS share, but I have a CentOS 5.4 box sitting there available.

Why did the datastore not enlarge? Why did ESXi not want to create a new disk, and why did it throw an error when we tried to add a SAS drive with a logical partition? ESXi really opens some great doors, but it has some serious limitations when it comes to production recovery. We aren't a datacenter; it's an SMB installation, so we don't have spare installations to move stuff to. Is it just not a good idea to use ESXi in the SMB space? Did I get burned because I'm using 4.1, which was only released a few weeks ago? Why is nothing working like it's supposed to? And insult to injury, why the crap does it transfer so slowly?

Downtime: 39 hours. Transfer complete off of datastore... wish me luck!

Updated: vSphere threw the same errors trying to delete the datastore. We deleted the VMs off the datastore and tried again; it threw the error, but worked. We re-created the datastore at the larger size, and are currently transferring them back on. ETA: 47 minutes. What? Not 3 hours turned into 9 hours? That's acceptable. 9 hours, not so much. Looks like we'll be resolved within the hour. (Knock-on-wood) What a process!

freakyocr
Contributor

Yes, 500GB worth of VMs, because the flat file includes unallocated space. They are both 250GB. That's what started the problem, I assume, because the client reversed the server allocation after we had provisioned it and stuck us with more space thinly provisioned than we actually had. Not best practice, I'll be the first to admit, but ESXi doesn't let you reduce it after it's been increased, even if it's not used.
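To put a number on the overcommit, here is a minimal sketch in Python using only the figures from this thread: a 405GB array and two 250GB thin disks. The function name is hypothetical, and it treats the full 405GB as usable, which overstates things a little since the formatted datastore is smaller than the raw array.

```python
def overcommit_report(capacity_gb, provisioned_gb):
    """Compare total thin-provisioned disk size against datastore capacity."""
    total = sum(provisioned_gb)
    return {
        "provisioned_gb": total,
        # How far the disks could grow past what physically exists.
        "shortfall_gb": max(0, total - capacity_gb),
        # Above 1.0 means the datastore is overcommitted.
        "ratio": total / capacity_gb,
    }


report = overcommit_report(405, [250, 250])
print(report)
```

With those inputs the disks can grow 95GB past the array, a ratio of roughly 1.23, so the datastore was always able to fill up completely once the guests wrote enough data.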

DSTAVERT
Immortal

I believe that virtualization is probably more beneficial to smaller organizations than large ones.

I am not trying to kick you when you are down but your situation probably indicates just how important it is to understand the tools. You may have had the same problem in the physical world as well. I will guess that the rapid consumption of disk space was somehow related to ignored snapshots.

Once you are back on your feet you should start looking at your disaster recovery plan. If you don't have one, plan on making one. If you had had a secondary network-attached datastore on the local network, you could potentially have been back up and running in minutes. Immediately after an issue is the time to lobby for additional budget.

Good luck getting up and going again.

-- David -- VMware Communities Moderator
freakyocr
Contributor

Trial by fire, that's for sure. I thought I had a decent understanding and successfully ran an ESXi 4.0U1 in a sandbox environment for several months. Obviously, without the wildcard of user data, it's tough to get a real, usable example unless you run through scenarios. But then again, it would have been helpful if things worked like they were supposed to... and from the number of posts floating around about similar issues, I suppose I'm not the only one having difficulties. I guess it was just the perfect storm of things that came together that led to this. Better disaster planning would have saved us some downtime, that's for sure.

I'm sure they weren't snapshots, as thin-provisioned VMs can't have snapshots, can they? Plus it only affected one VM, not the other. Very odd. I'll let you know what I find once they are back up.

What would you recommend for NFS... just a ReadyNAS box or the like? Or any old box with a drive large enough to handle the VMs?

DSTAVERT
Immortal

There are lots of NFS devices in a range of prices. Although it's not as critical as with iSCSI, I would find something on the HCL. I believe some versions of the ReadyNAS are on the list. Something with more drives, say 8, would be better than 4. Have a look at http://communities.vmware.com/docs/DOC-8760 or, if you have vCenter, you can schedule a clone.

-- David -- VMware Communities Moderator
DSTAVERT
Immortal

"I'm sure they weren't snapshots, as thin-provisioned VMs can't have snapshots, can they? Plus it only affected one VM, not the other. Very odd. I'll let you know what I find once they are back up."

Snapshots are available with thick or thin disks. How do you do your backups?
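One way to check whether snapshots are eating the space is to look for snapshot delta disks in the VM folders. Below is a rough Python sketch, not an official tool: the function names are made up, and it assumes the delta files follow the usual "-delta.vmdk" naming and that the datastore (or a copy of it) is readable from wherever the script runs.

```python
import os


def snapshot_delta_usage(root):
    """Return {path: size_in_bytes} for files ending in '-delta.vmdk' under root."""
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Assumption: snapshot delta disks end in '-delta.vmdk'.
            if name.endswith("-delta.vmdk"):
                path = os.path.join(dirpath, name)
                found[path] = os.path.getsize(path)
    return found


def summarize(root):
    """Print delta files largest first and return their combined size in bytes."""
    deltas = snapshot_delta_usage(root)
    for path, size in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{size / 1024**3:8.2f} GiB  {path}")
    return sum(deltas.values())
```

Calling summarize() with the path to the datastore folder would list any delta files, largest first. If one VM has a large and growing delta file and the other has none, that would match the symptom of only one VM being affected.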

-- David -- VMware Communities Moderator
freakyocr
Contributor

Acronis Backup & Recovery 10 from within the OS. It creates a VMware Workstation image (OVF?) for import back into VMware. The only issue is that it's on an RDX, which is USB. Would have been better off with a NAS, I suppose.

What's best practice for backup in the ESXi world? Acronis sold me on their suite, and it's supposed to connect to ESXi directly, but I haven't been able to get it to connect just yet.

DSTAVERT
Immortal

I don't know whether it uses VMware snapshots or not, although a quick look mentions VCB, which does use snapshots.

-- David -- VMware Communities Moderator