VMware Cloud Community
Sandor_Fabian
Contributor
Contributor

ESX server VM state problem after power failure

We have 2 ESX servers, at the weekend we had a longer power cut and the ESX servers had been restarted automatically. Today morning all virtual machines were in power-off state. We restarted them. But we have a very serious problem with one of these virtual machines.

This specific virtual machine is also restarted, but the VM is in a very old state (end of November 2007), we miss a complete database from this VM. We started to check the logs, the guest OS (Windwos 2003 Server) last eventlog entry (Security, Application, etc...) is dated to 27th of Nov last year... I checked the log files in Virtual Center as well, there's no event in this virtual machine's event log after this data (?!). The second very strange thing is that all machine on our ESX server have performance data in the Virtual Center, if I check this specific VM's performance data, all graphs are cut at 27th of Nov, and we do not have any performance data after this, although I know that we used this VM a few days ago...

The VM doesn't have any snapshot, and we didn't make snapshots for this machine in the past as well. I checked the settings of the virtual disk, it is not an independent disk, so it is not the case when a disk is set to non-persistent and after a power-off it gets back to its previous state.

The ESX hosting the VM is: VMware ESX Server, 3.0.2 build 61618

The Virtual Center server is: Virtual Infrastructure Server, 2.0.2 build 61426

Some more information: the size of the vmdk files in the VMFS file system are very strange, we have 2 vmdk files for this VM:

000045-WinSrv2003Std-0-0.vmdk 332 bytes

000045-WinSrv2003Std-0-0-delta.vmdk 496.0 Mbytes

I don't understand how the VM can run, when the vmdk file is less then 0.5GBytes...Only the operating system itself is larger than this size... And the VM is running (although in an earlier state...).

So, I imported this VM with VMware Converter to my notebook, and converted it to Workstation format. After the conversion I got a 3.2 GB large vmdk file on my notebook...

I cannot mount this disk file, cause the mount utility crashes with a runtime error and displays the vmdk file is corrupted. But I can start and run and VM in VMware Workstation... It is running. But the state of files, eventlogs, etc... the whole disk is the same, end of November 2007.

I appreciate any help, thank you all in advance.

Message was edited by: Sandor.Fabian

0 Kudos
11 Replies
jhanekom
Virtuoso
Virtuoso

The fact that your VMs didn't power on automatically is, unfortuantely, expected behaviour if your entire cluster goes down. The "Virtual Machine Startup/Shutdown" feature was previously automatically disabled when you added a host to VirtualCenter and after subsequent VMotion migrations. This has reportedly been corrected in VirtualCenter 2.0.2, but still has to be manually enabled on each host.

My guess at your "out of date" VM problem would be that someone cloned the VM at some point, and that the cloned VM with updated content is still around somewhere on the host. (Since you manually powered on VMs, I'm guessing that the original VM was previously in a powered off state, and that the cloned VM is now in a powered off state - or has a conflicting IP address if it is powered on.)

0 Kudos
Sandor_Fabian
Contributor
Contributor

Thank you for your suggestions, we have never cloned that VM. I updated my post, I think this can be some kind of VMFS file system corruption.

Is there any way for check and repair a corrupted VMFS file system? It is not normal that the vmdk file of this VM is less than 0.5GB. It must be more, cause the operating system is running...

0 Kudos
jhanekom
Virtuoso
Virtuoso

In the unlikely event that something did get corrupted, you wouldn't be seeing "out of date" data, but more than likely "no data". The symptoms you describe are symptomatic of either a reverted snapshot or an old, cloned VM. Bytes on disk cannot magically roll themselves back to a consistent point in time due to corruption.

There is unfortunately no public "fsck"-like tool to analyze a VMFS volume. If you do suspect data corruption (again, this is extremely unlikely - I've certainly never encountered it myself), you will need to open a support case with VMware for help.

There are two VMDK files to each virtual disk. One, the header file, is tiny (a few hundred bytes) and the other (usually with a "-flat.vmdk" portion in the name) contains the actual data. Could this be what you're seeing?

To try to get to the bottom of what happened to your VM, may I suggest looking at the vmware.log files in the directory that houses the vmx file for the VM?* If there are no log files containing entries between August and now, you're dealing with a cloned VM that was dormant during that time. If not, try to get to the point in time when the VM was powered on again to see what happened.

jhanekom
Virtuoso
Virtuoso

I've just noticed that you've edited your original post, so my previous post didn't take that into account.

The virtual disk you're describing most definitely has (or had) a snapshot associated with it. The "-delta" file is proof of that. The 332-byte .vmdk file you mention is just the header file that points to the delta file. Unless this is an RDM (raw device mapping), there should be two more files called 000045-WinSrv2003Std.vmdk and 000045-WinSrv2003Std-flat.vmdk that contains the actual data. Bear in mind that these may be on a different LUN from the LUN where the vmx file lives. (By default, all snapshots are stored with the vmx.)

It will be quite a task explaining how your Virtual Machine pointed back at the source disk in stead of the snapshot. This is not something that should happen, and I haven't seen any reports of that happening on the forums (though I only read so many posts myself...)

Can you examine the VM's properties (or the VMX file) to see which file(s) it points to for its disk(s)? Are these the header files for the "-delta.vmdk" files, or the "-flat.vmdk" files? You can examine the header files by using for example "cat 000045-WinSrv2003Std.vmdk". (Note: don't try this on the -flat or -delta files! This will try to dump the contents of the entire virtual disk through your SSH session.)

Sandor_Fabian
Contributor
Contributor

I have copied the VM files from the storage to my notebook (with WinSCP), I checked the log files (vmware.log, vmware-1.log, vmware-2.log, etc...). There is no event between 27th of November 2007 and 7th January 2008! Absolutely nothing, same as the performance data what is completely missing for this VM at the specified time interval. What is very interesting, I heard that the performance data is stored in an MSSQL Express database on the Virtual Center machine. If it is the case, how it is possible that the VM went back to an older state, and the performance data on a completely differnet virtual machine (Virtual Center is also deployed to a VM) cleared as well from the MSSQL database. I haven't yet checked this database.

Yes, you are right, it is also very strange and unlikely for me that the result of a filesystem corruption is an older, but still consistent state of the virtual machine.

I asked the developers who worked on that VM, cause I couldn't find any trace in the log files that shows somebody made a clone or a snapshot in the last months, and in the current state of the VM I cannot see any snapshot in Virtual Center for this VM.

If it is a cloned machine that was dormant as you mentioned, is there any possiblity to restore the original one? We used the services on that VM last time 5th January 2008...

0 Kudos
Sandor_Fabian
Contributor
Contributor

The corresponding part of the vmx file is:

scsi0:0.filename = "000045-WinSrv2003Std-0-0.vmdk"

The contents of this vmdk file is:

  1. Disk DescriptorFile

version=1

CID=eac530a3

parentCID=e65e0b37

createType="vmfsSparse"

parentFileNameHint="/vmfs/volumes/46b1b053-e6e2b3c7-273e-0013d34e03c6/VM/44/000044-WinSrv2003Std-0-0.vmdk"

  1. Extent description

RW 67108864 VMFSSPARSE "000045-WinSrv2003Std-0-0-delta.vmdk"

  1. The Disk Data Base

#DDB

ddb.toolsVersion = "7202"

There's something what I do not understand, the vmx file and the -delta.vmdk file is located in /vmfs/volumes/....../VM/45. I thought all files corresponding to this virtual machine is stored in this folder. But the header file posted above contains a reference to a folder .../VM/44.

0 Kudos
jhanekom
Virtuoso
Virtuoso

Well, that VM definitely has at least one snapshot - that much is clear. It's unfortunately a fairly common occurrence for the snapshot database to "lose track" of snapshots, making them invisible from the VI client. Usually, the cure is quite simple - just take another snapshot in the VI client and commit that - this process will commit all snapshots "below" that one as well.

In your case, that's probably undesirable, since committing the snapshots will just make permanent the changes that you're seeing now (i.e. you don't want them.)

It might be a start to try and track all the vmdk files this VM references and try to establish some sort of map of them, noting their child and parent vmdk's as well as their last modified time.

I would still try to hunt for another virtual machine that might have been running. The fact that there are no events from November for this virtual machine is somewhat suspicious. You can view the entire list of VMs registered on the host by running "vmware-cmd -l". You can view a list of all the vmx files on the host by running "find /vmfs/volumes -name *.vmx -print".

0 Kudos
Sandor_Fabian
Contributor
Contributor

I checked the vmdk header files, I have found a lot of references in the vmdk header files for other vmdk header files in other folders. The reference "queue" is the following (with last change time of files):

/vmfs/volumes/46b1b053-e6e2b3c7-273e-0013d34e03c6/VM/45/000045-WinSrv2003Std-0-0.vmdk 08/01/2007 (cause we started it after power failure)

/vmfs/volumes/46b1b053-e6e2b3c7-273e-0013d34e03c6/VM/44/000044-WinSrv2003Std-0-0.vmdk 10/10/2007

/vmfs/volumes/46b1b053-e6e2b3c7-273e-0013d34e03c6/VM/43/000043-WinSrv2003Std-0-0.vmdk 13/09/2007 16:48

/vmfs/volumes/46b1b053-e6e2b3c7-273e-0013d34e03c6/VM/42/000042-WinSrv2003Std-0-0.vmdk 13/09/2007 14:23

/vmfs/volumes/46b1b053-e6e2b3c7-273e-0013d34e03c6/VM/41/000041-WinSrv2003Std-0-0.vmdk 13/09/2007 14:12

/vmfs/volumes/46b1b053-e6e2b3c7-273e-0013d34e03c6/VM/21/000021-WinSrv2003Std-0-0.vmdk 07/09/2007

/vmfs/volumes/46b1b053-e6e2b3c7-273e-0013d34e03c6/VM/7/000007-WinSrv2003Std-0-0.vmdk 17/08/2007

All of these folders contain -delta.vmdk and .vmss (snaphot??) files, the only one -flat.vmdk file is 000007-WinSrv2003Std-0-0-flat.vmdk located in ...VM/7/.

None of the mentioned .vmdk, -delta.vmdk, -flat.vmdk, .vmss files is dated after 27th Nov. I do not have any trace of activites in December. This VM contained a MySQL database what used frequently in December. What do you thin is there any possiblity to restore data from the disk or snapshot files on the ESX server?

Yes, even if I couldn't find any trace in log files for snapshot generation, I can accept that a -delta.vmdk file is an obvious proof for this event. There is another thing what we done and it might related to this issue. We exported the virtual machine with Converter to Workstation format, an external developer worked on this, and later we imported back the machine to ESX, this event happend more than once during the development phase.

Thank your for your suggestions.

0 Kudos
jhanekom
Virtuoso
Virtuoso

Importing the VM several times with Converter should not cause what you're seeing, but it would be interesting if the times tied up with the snapshot times you've uncovered.

It's unusual for the delta and snapshot files to be in so many different folders. Any chance you were using a third party product to manage your VMs, or something like VMware Lab Manager? I'm not sure if that would cause this type of structure, so I'm more asking for my own curiosity.

I'm still hoping that there is another VM registered on the ESX host somewhere that is the real machine you're looking for. The fact that you imported the VM several times increases the possibility that this happened.

0 Kudos
Sandor_Fabian
Contributor
Contributor

We didn't use any third party management tool, but in the past we used LabManager for checking its features, etc... (currently it is powered off, and we do not plan to use it further, we manage our infrastructure with Virtual Center). This VM was imported to LabManager at the past. I will check all VMs on that host, and identify them one by one, maybe one of them is a more up to date state of our VM.

0 Kudos
jhanekom
Virtuoso
Virtuoso

Hmm the Lab Manager linked-clone capabilities might explain the delta's in various folders. I'm hoping someone more familiar with Lab Manager can comment on that...

0 Kudos