Our vSphere environment was somehow using a shared volume from another server, and that volume was recently formatted as NTFS. When I discovered this, the ESXi 4.1 host still seemed okay, until one day the datastore browser stopped showing the files (the .vmdk files, etc.). Even in an SSH session, the folder shows nothing at all. The VMs are still running, but they have issues: for example, we can no longer update their contents.
I guess the conflict on the SAN volume damaged the file system used by ESXi 4.1.
Is there any way we can still recover the VMs, or salvage the virtual disks so they can run on another ESXi host?
I'm not a Linux person, so I'm sorry if I'm not able to provide all the details. I'm hoping someone can guide me.
Thanks in advance!
Your VMs are still running? - good
Still using ESXi 4.1? - good
That's the end of the good news.
The risk of making the situation much worse than it already is is significant.
I ASSUME THE FOLLOWING SCENARIO:
The NTFS format has at least partly overwritten the section of the VMFS headers that stores the allocation of each vmdk fragment.
The allocation of the VMs that are still running resides in RAM only.
The VMFS metadata for the VMs that are not running at the moment may already be gone.
Write attempts into the VMs that are still running need to be stopped as soon as possible.
The current state has to be considered unstable. Soon the datastore will fail to use resources from RAM, and a process will probably crash or hang.
We need to avoid a reboot or power-off until we have acquired the data that at the moment lives only in RAM and will probably be lost when ESXi reboots.
Assuming the worst case, we have to expect that any command that actively interacts with the datastore can result in a hung process, which can then be the final killer.
Here is a list that should immediately reach every user who may still be working with any of the running VMs:
- immediately stop any write attempts from inside the guest
- instead, collect some data about those VMs that may be necessary later - the partition table is very valuable, for example, and directory listings of data disks are also very valuable
- users should write this down on paper
- next, all non-invasive methods to stop activity on vmdks that are not eagerzeroed thick provisioned should be tried (for example, for a webserver writing into a delta vmdk, stopping the traffic at the firewall can reduce the need to allocate new fragments on the datastore)
The other, much more important thing that someone should do as soon as possible is to collect a header dump.
Do NOT use ESXi via PuTTY to do that - instead use a Linux LiveCD from the outside; a physical notebook in the admin network is a good idea.
From that Linux system, connect to the ESXi host via sshfs in READ-ONLY mode:
mkdir /esxi
sshfs -o ro root@esxi-ip:/ /esxi
dd if=/esxi/dev/disks/<naa-number-of-the-datastore> bs=1M count=1500 of=/tmp/dagohoy.1500
Next, download dagohoy.1500 and store it away safely.
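Before relying on the dump, it may be worth confirming on the Linux side that it has the expected size and recording a checksum, so that later copies can be compared against the original. The following is only a sketch: the function name check_dump is made up for illustration, and it assumes the LiveCD provides the usual coreutils tools (wc, sha256sum).

```shell
#!/bin/sh
# Hypothetical helper (not part of the original instructions):
# check that a dump file has the expected size and store its SHA-256 checksum.
# Usage: check_dump FILE EXPECTED_MIB
check_dump() {
    file=$1
    expected_mib=$2
    expected_bytes=$((expected_mib * 1024 * 1024))
    # Size of the dump in bytes; tr strips any padding some wc versions add.
    actual_bytes=$(wc -c < "$file" | tr -d ' ')
    if [ "$actual_bytes" -ne "$expected_bytes" ]; then
        echo "size mismatch: got $actual_bytes bytes, expected $expected_bytes" >&2
        return 1
    fi
    # Write the checksum next to the dump so copies can be verified later.
    sha256sum "$file" > "$file.sha256"
    echo "ok: $actual_bytes bytes, checksum stored in $file.sha256"
}
```

For the dump above, the call would be check_dump /tmp/dagohoy.1500 1500. A size mismatch would suggest that dd stopped early, in which case the dump should not be trusted as complete.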
Once we have that dump, we can relax a bit and start to handle the still-running VMs one by one.
When the VMs are powered down, we should try to get the mapping of all flat.vmdks and all delta.vmdks using vmkfstools commands.
Whatever you do - follow the instructions of someone with a complete plan.
Do not execute any suggestion that starts with "Maybe we should try ...". In this case the highest priority is getting the VMFS header dump with a minimally invasive procedure.
Do you think these instructions are way more radical than really required?
Yes - that's possible - please explain then.
In my experience, handling such a case is 3 to 10 times as work-intensive if no fresh header dump is available.
So I recommend getting the dump ASAP.
It's late at night - if you want to talk, you can catch me on Skype for the next 15 minutes or tomorrow.
Good night and good luck
Ulli
Maybe continuum will be able to help. You'll find contact details on his profile page.
André
Hi all,
I have a problem with one of my datastores: certain folders are missing files, and when I try to access them through SSH I get an error, while other folders I can access without a problem.
Please assist.
I may be able to assist in recovering the missing files if you provide a VMFS header dump - see
https://vm-sickbay.com/create-a-vmfs-header-dump-using-an-esxi-host-in-production