VMware Cloud Community
nh4vm
Contributor

Snapshot disappeared from Snapshot Manager during failed remediation, but is still on the datastore

Hi Guys

I have been contracted to sort out a vSphere 4.1 environment. There are a couple of VMs with up to 4 snapshots on them, some more than a year old. I have already remediated a few of them with "Delete All", following the vSphere documentation, and all have gone fine.

I have just tried to remediate one more. It has only one virtual disk and one snapshot, which is only a few months old. The remediation went wrong, and I got the following error: "File <unspecified filename> is larger than the maximum size supported by datastore '<unspecified datastore>'" (task details as shown in the client: root 25/05/2012 16:20:07 25/05/2012 16:20:07 25/05/2012 16:20:09).

The datastore has a block size of 1 MB and should be able to support file sizes up to 256 GB. The VM's disk is only 84 GB; the snapshot's actual size is 77 GB, with a provisioned size of 84 GB, the same as the VM's original disk.

The server is a SQL Server 2008 R2 box, so it's rather important that it doesn't fail. Also, the company used to have vCenter installed, but at some point they stopped using it and now only log in directly to the ESXi hosts themselves.

Does anyone have any idea how to resolve this? The server seems to be working fine, but the snapshot has gone from the Snapshot Manager while its files still exist on the datastore. For now I will make sure we don't turn the server off.

When comparing the .vmx file of this VM with that of another VM that still has old snapshots on it, I also found the following difference:

hostCPUID.0 = "0000000b756e65476c65746e49656e69"
hostCPUID.1 = "000206c220200800029ee3ffbfebfbff"
hostCPUID.80000001 = "0000000000000000000000012c100800"
guestCPUID.0 = "0000000b756e65476c65746e49656e69"
guestCPUID.1 = "000206c200010800829822030febfbff"
guestCPUID.80000001 = "00000000000000000000000128100800"
userCPUID.0 = "0000000b756e65476c65746e49656e69"
userCPUID.1 = "000206c220200800029822030febfbff"
userCPUID.80000001 = "00000000000000000000000128100800"

The information above doesn't exist in the .vmx file of the VM that failed the snapshot remediation.

Kind regards

Niels

john23
Commander

Can you check from the ESX command line whether it shows the snapshot or not?

vim-cmd vmsvc/get.snapshot <vmid>

Thanks -A Read my blogs: www.openwriteup.com

a_p_
Leadership

From the sizes you mentioned this error doesn't make any sense. What other files do you see in the datastore browser? Are there any "...ctk..." (Changed Block Tracking) files? In this case you could try to delete these ctk files (they will be recreated with the next snapshot), create another snapshot from the Snapshot Manager and run "Delete All" again. If this does not work, I would go ahead and clone the virtual disk using

vmkfstools -i <current-snapshot>.vmdk <target-disk-name>.vmdk

The CPUID entries in the VM's configuration file should not have anything to do with the snapshots.

André

nh4vm
Contributor

Hi John23

Yes, I have checked the datastore both from "Browse Datastore" in the client and from the command line, which confirms what I see in the client: the snapshot file is present, it just doesn't show up in the Snapshot Manager. I have read somewhere that this often happens if the remediation goes wrong, I just can't remember where I saw it.

Kind regards

Niels

nh4vm
Contributor

Hi Andre

Since it's the weekend now and I don't have a remote login to the client, I can't see whether there are any ctk files, but I will check that first thing Monday morning. Can I delete the ctk files while the server is running?

I have just found some info from a vExpert somewhere else which confirms what you write: create another snapshot from the Snapshot Manager and run "Delete All" again. If this does not work, clone the disk using vmkfstools. What worries me is that, according to his post, I will need to shut down the VM while I do the cloning with vmkfstools, and the client wants this server up and running all the time.

I took a copy of the .vmx file so that I could do some digging around this weekend. The other info I found explained how to see whether the VM is running off the snapshot, which it is, as confirmed by these lines:

scsi0:0.present = "TRUE"
scsi0:0.fileName = "SQL (reporting)-000001.vmdk"

The SQL server is set up with its normal C: drive as a regular virtual disk, plus three additional virtual disks, because SQL has been given separate disks for .ldf, .mdf, and some reporting data in order to optimize performance. A snapshot has only been taken of the first disk, the C: drive. Below is the info from the .vmx file for the three additional SQL disks:

scsi0:1.present = "TRUE"
scsi0:1.fileName = "/vmfs/volumes/4fa16720-63a7952a-89f2-78e3b515ecf0/SQL (reporting)/SQL (reporting)_1.vmdk"
scsi0:1.deviceType = "scsi-hardDisk"
scsi0:3.present = "TRUE"
scsi0:2.present = "TRUE"
scsi0:1.redo = ""
scsi0:2.fileName = "/vmfs/volumes/4fa16898-6723fd08-fdfe-78e3b515ecf0/SQL (reporting)/SQL (reporting)_2.vmdk"
scsi0:2.deviceType = "scsi-hardDisk"
scsi0:3.fileName = "/vmfs/volumes/4fa168d1-72d87174-2b70-78e3b515ecf0/SQL (reporting)/SQL (reporting)_3.vmdk"
scsi0:3.deviceType = "scsi-hardDisk"
scsi0:2.redo = ""
scsi0:3.redo = ""
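
For readers who want to run the same check on a copy of a .vmx file, here is a minimal Python sketch, not an official VMware tool. It assumes the usual naming convention in which a snapshot (delta) descriptor ends in a six-digit suffix such as -000001.vmdk; the function names are made up for illustration, and the sample text is taken from the lines quoted in this post.

```python
import re

# Sample entries taken from the .vmx lines quoted in this post.
VMX_TEXT = '''
scsi0:0.present = "TRUE"
scsi0:0.fileName = "SQL (reporting)-000001.vmdk"
scsi0:1.present = "TRUE"
scsi0:1.fileName = "/vmfs/volumes/4fa16720-63a7952a-89f2-78e3b515ecf0/SQL (reporting)/SQL (reporting)_1.vmdk"
scsi0:2.fileName = "/vmfs/volumes/4fa16898-6723fd08-fdfe-78e3b515ecf0/SQL (reporting)/SQL (reporting)_2.vmdk"
scsi0:3.fileName = "/vmfs/volumes/4fa168d1-72d87174-2b70-78e3b515ecf0/SQL (reporting)/SQL (reporting)_3.vmdk"
'''

# key = "value" lines, restricted to SCSI disk file names
FILE_NAME = re.compile(r'^\s*(scsi\d+:\d+)\.fileName\s*=\s*"(.*)"\s*$')
# Assumption: snapshot descriptors end in a six-digit suffix, e.g. -000001.vmdk
DELTA = re.compile(r'-\d{6}\.vmdk$')


def disk_files(vmx_text):
    """Return {device: file name} for every scsiX:Y.fileName entry."""
    disks = {}
    for line in vmx_text.splitlines():
        match = FILE_NAME.match(line)
        if match:
            disks[match.group(1)] = match.group(2)
    return disks


def runs_on_snapshot(file_name):
    """True if the disk points at a snapshot descriptor rather than a base disk."""
    return bool(DELTA.search(file_name))


for device, name in sorted(disk_files(VMX_TEXT).items()):
    state = "snapshot" if runs_on_snapshot(name) else "base disk"
    print(f"{device}: {state}")
```

With the lines above, only scsi0:0 is reported as running on a snapshot; the three data disks point at their base files.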

Kind regards

Niels

a_p_
Leadership

Without knowing the details it's hard to say what's causing the issue. I assume you don't have a current vmware.log file from this virtual machine, do you?

Often virtual disks for SQL, Exchange or other applications which are backed up by agents in the OS itself are excluded from snapshots by setting the virtual disks to "independent-persistent". Do you see such entries in the .vmx file for the 3 data disks?

... and the client want this server up and running all the time.

Although I can understand this, there are sometimes situations where this cannot be achieved. If the "Delete All" does not work and there are no helpful entries in the vmware.log file (after running "Delete All"), cloning might be the only safe solution to resolve the issue and therefore the virtual disks must not be in use (i.e. the VM has to be powered off). After cloning the disk you have to reconfigure the VM and replace the current disk with the cloned disk.
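
One practical detail with the clone: the file names in this VM contain spaces and parentheses, so they must be quoted on the ESXi command line. The following is only a sketch (Python 3.8 or later, run on any workstation) that builds the vmkfstools -i command with safe quoting. The source name is the one from the .vmx file above; the target name is a hypothetical placeholder.

```python
import shlex


def clone_command(source_vmdk, target_vmdk):
    """Build the vmkfstools clone command as an argument list."""
    return ["vmkfstools", "-i", source_vmdk, target_vmdk]


# Source: the snapshot descriptor the VM currently runs on.
# Target: a made-up name for the new, consolidated disk.
cmd = clone_command("SQL (reporting)-000001.vmdk", "SQL (reporting)-clone.vmdk")

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The printed line can be pasted into the ESXi shell from within the VM's folder once the VM is powered off.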

Anyway, let's first see what "Delete All" can do and - if necessary - take a look at the log files, before doing the next steps.

André

nh4vm
Contributor

Hi André

No, I don't have a current log file right now, but when I get in in the morning I will look through the log files and see if I can find something useful.

As far as I remember, there is only a snapshot of disk 0, the operating system itself, but I doubt they have made the three SQL data disks "independent-persistent", since they have a serious lack of virtualization knowledge. They got it installed somehow, I don't know by whom, but they don't know much about it. There are no "independent-persistent" entries in the .vmx files.

To be fair, even though they haven't had the best start with their new virtualized environment, their CTO is a very nice and pragmatic guy. So far, whenever I have been able to explain to him what best practice is and what would be best for them in the long run, he has listened and has not been afraid to buy new hardware if needed. He is also aware that he needs to improve his staff's virtualization training. So I'm sure I will be able to convince him if we have to turn off the server for cloning.

Does the cloning actually remediate the snapshot as well as clone the disk?

In the morning I will first have a good look through the log files, then see if I can create a new snapshot and run "Delete All" again. If that doesn't do the trick, I will do the cloning out of hours as soon as we can.

Thank you very much for your help André, I will let you know how this pans out.

Kind regards

Niels

a_p_
Leadership

Ok, let's put everything together:

... there is a couple of VM's some with up to 4 snapshots on them, and more than a year old

... The datastore has a block size of 1MB, and should be able to support up to 256GB file size

..."File <unspecified filename> is larger than the maximum size supported by datastore

-> virtual disks _1, _2 and _3 on different datastores (LUNs)

Thinking again about what could cause this error message, the only reason I can think of is that the other virtual disks were added after the snapshot was taken and at least one of them is larger than 254GB (see "Calculating the overhead required by snapshot files" at http://kb.vmware.com/kb/1012384).

When you delete a snapshot with the VM powered on, ESXi creates a "consolidate helper snapshot". If this helper snapshot cannot be created, you will receive this error message. If this is the case you should at least be able to delete the snapshot with the VM powered off.

The next question would be how to resolve this situation, in order to be able to create snapshots in the future. Well, the easiest way - in case your hardware is supported - would be to upgrade to ESXi 5 and VMFS-5 which supports file sizes of ~2TB with its unified 1MB block size. In case an upgrade is not an option you'd either have to create smaller virtual disks and migrate/copy the data or create new datastores with a larger block size and migrate the virtual disks.

An alternative to the above mentioned options - which however adds complexity to the setup - could be to redirect the snapshots to a datastore with the appropriate block size. See http://kb.vmware.com/kb/1002929

André

nh4vm
Contributor

Hi André

After looking at your latest post I had a word with the CTO, and he confirms that the three additional disks were added after the snapshot was taken. So I guess the system gets a little confused: the VM was first created with one 80 GB disk and a snapshot was taken of it, and only afterwards were three additional disks of 900 GB each added. It doesn't seem to be able to remediate the first disk only (the one the snapshot was taken of), but seems to try to involve the other three disks in the process somehow.

What I would like to do is first stop all the SQL services running on the VM, then remove the three additional 900 GB disks from the VM without deleting them from the datastore. Then create a new snapshot and try "Delete All" again. If that succeeds, I can re-attach the disks to the VM and start the SQL services again.

Does that sound to you like a sensible way of trying to resolve the issue?

Kind regards

Niels

a_p_
Leadership

In this case the issue is that the VM's base folder is located on a datastore with a 1MB block size. ESXi (until version 4.x) creates snapshots for all virtual disks in the VM's base folder by default, which in this case does not work due to the 900GB virtual disks.
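
The arithmetic behind this can be sketched in a few lines of Python. The 256 GB file limit and the 254 GB snapshot-safe disk size for a 1 MB block size are the figures quoted in this thread; scaling both linearly with the block size (2, 4, 8 MB) is an assumption based on how KB 1012384 reads, so verify it against the KB before relying on it. Device names and sizes are the ones mentioned in the thread; the function names are made up for illustration.

```python
# VMFS-3 block sizes in MB.
VALID_BLOCK_SIZES_MB = (1, 2, 4, 8)

# Disk sizes in GB as mentioned in the thread (system disk plus three data disks).
DISKS_GB = {"scsi0:0": 84, "scsi0:1": 900, "scsi0:2": 900, "scsi0:3": 900}


def max_file_size_gb(block_size_mb):
    """Maximum file size on VMFS-3: 256 GB per MB of block size."""
    if block_size_mb not in VALID_BLOCK_SIZES_MB:
        raise ValueError("VMFS-3 block size must be 1, 2, 4 or 8 MB")
    return 256 * block_size_mb


def max_snapshot_disk_gb(block_size_mb):
    """Largest disk that still leaves room for snapshot overhead.

    Assumption: about 2 GB of overhead per 256 GB, i.e. 254 GB at 1 MB.
    """
    return max_file_size_gb(block_size_mb) - 2 * block_size_mb


def too_large_for_snapshot(disks_gb, block_size_mb):
    """Return the devices whose snapshot files would not fit on the datastore."""
    limit = max_snapshot_disk_gb(block_size_mb)
    return [dev for dev, size in disks_gb.items() if size > limit]


print(too_large_for_snapshot(DISKS_GB, 1))
```

With a 1 MB block size the three 900 GB data disks are flagged, while the 84 GB system disk is fine. Under the same assumption, a 4 MB block size (limit 1016 GB) would accommodate all four disks.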

Do you need snapshots for the 3 data disks at all? If not, I'd recommend you power off the VM and configure the 3 data disks as "Independent-Persistent" which excludes them from snapshots. This should allow you to resolve the issue without removing the disks from the VM.
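
The disk mode is normally changed in the vSphere Client (Edit Settings, select the hard disk, Mode: Independent, Persistent) while the VM is powered off. For reference, the Python sketch below only illustrates which entries end up in the .vmx file; the helper name is hypothetical and the sample text is abbreviated from the .vmx lines posted earlier.

```python
SAMPLE_VMX = '''scsi0:0.fileName = "SQL (reporting)-000001.vmdk"
scsi0:1.fileName = "SQL (reporting)_1.vmdk"
scsi0:2.fileName = "SQL (reporting)_2.vmdk"
scsi0:3.fileName = "SQL (reporting)_3.vmdk"
'''


def set_disk_mode(vmx_text, devices, mode="independent-persistent"):
    """Return vmx_text with '<device>.mode = "<mode>"' set for each device."""
    wanted = {f"{dev}.mode": mode for dev in devices}
    kept = []
    for line in vmx_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key in wanted:
            continue  # drop an existing mode entry; it is re-added below
        kept.append(line)
    kept.extend(f'{key} = "{value}"' for key, value in wanted.items())
    return "\n".join(kept) + "\n"


# Only the three data disks are excluded from snapshots; scsi0:0 is left alone.
print(set_disk_mode(SAMPLE_VMX, ["scsi0:1", "scsi0:2", "scsi0:3"]))
```

The system disk keeps its default mode, so it can still be snapshotted.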

André

nh4vm
Contributor

Hi André

No, we do not need snapshots of the 3 data disks; they just do a regular backup of those. I will follow your suggestion and set the disks to "Independent-Persistent". I talked to them a little while ago about changing them to "Independent-Persistent" disks, so I already have them convinced.

Thank you very much for your help, I will let you know if it works.

Niels

nh4vm
Contributor

Hi André

Sorry but we haven't tried it yet. I have just expanded the SAN, and we have had some issues with the SQL server that needed to be sorted first, which has nothing to do with VMware.

We will probably try tomorrow, if nothing else starts to play up.

I will let you know whether it worked.

Niels

nh4vm
Contributor

Hi André

After I had expanded and reconfigured the SAN, I finally got to do it. I shut the SQL server down, removed the 3 additional hard disks, created a new snapshot and then consolidated the snapshots, and it all went fine. The only thing was that it took an extremely long time to finish, but as long as it was successful I don't mind. The server seems to be working perfectly now.

I even managed to move it and its 3 separate hard disks to other datastores, and it all worked perfectly.

Thank you very much for your help, it is very much appreciated.

Niels
