We shut down one of our production VMware hosts yesterday in order to upgrade the NIC, and when it came back up, one of the VMs on it was overrun with disk errors at boot. We shut the host down by shutting down each individual VM, putting the server in maintenance mode, and then shutting the server down.
We have run this software (Linux-based) on hundreds of machines, including probably 30+ on VMware. This is the THIRD machine and the THIRD VM (one on each host) that this has happened on in the past few weeks. We gave up and rebuilt the other two systems, but we can't afford to lose weeks of work to a problem that never should have occurred.
There are snapshots for this VM, but I'm afraid if I manipulate them or run filesystem check from the Linux OS (it boots but with major errors) that the data will be lost for good.
Please help, or help explain why this is happening and what we can do about it. We can't afford to lose production servers every time we take a host down for maintenance.
We are running open-vm-tools on these guests.
> There are snapshots for this VM, but I'm afraid if I manipulate them or run filesystem check from the Linux OS (it boots but with major errors) that the data will be lost for good.
Before I go into details, I need to know whether you really powered off that host cleanly or whether it went down because of a power failure.
Don't feel offended by that question - it has to be asked.
Next question that has to be asked:
Can you still boot the VMs, or do you see messages about corrupt redo logs or I/O errors before they even start?
For VMs with snapshots that are startable, the next steps are:
- Don't run filesystem checks or try to fix any problems manually!
- Instead, create a new snapshot while the VM is powered off (that's important!).
- Then, once the new snapshot is active, you can run filesystem checks and repair the filesystem. The repair writes then land in the new delta disk, so you can revert to the snapshot if the repair makes things worse.
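
The snapshot-first procedure above can be sketched as a small Python helper that only builds and prints the ESXi shell commands; it does not run anything. It assumes you have shell access to the host and use the `vim-cmd vmsvc` commands. The VM ID `12` and the snapshot name `pre-fsck` are placeholders (look up the real ID with `vim-cmd vmsvc/getallvms`). Treat it as a sketch, not a tested procedure for your environment.

```python
import shlex


def offline_snapshot_commands(vmid, name="pre-fsck",
                              description="taken before filesystem repair"):
    """Return ESXi shell commands (as strings) for a powered-off snapshot.

    Nothing is executed here; review the commands and run them by hand.
    """
    vmid = int(vmid)  # reject anything that is not a numeric VM ID
    return [
        # 1. Confirm the VM really is powered off before snapshotting.
        f"vim-cmd vmsvc/power.getstate {vmid}",
        # 2. Create the snapshot; the trailing "0 0" means no memory
        #    image and no quiescing (neither applies to a powered-off VM).
        "vim-cmd vmsvc/snapshot.create {} {} {} 0 0".format(
            vmid, shlex.quote(name), shlex.quote(description)),
        # 3. List the snapshot tree to confirm the new snapshot exists.
        f"vim-cmd vmsvc/snapshot.get {vmid}",
    ]


if __name__ == "__main__":
    for command in offline_snapshot_commands(12):  # 12 is a placeholder ID
        print(command)
```

Only after the third command shows the new snapshot would you boot the guest and start the filesystem repair.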
Which Linux filesystems do you use?
In my experience, several of the newer Linux filesystems such as ReiserFS, XFS or btrfs don't work as reliably as expected on thin-provisioned vmdks.
And even if your base disks use thick-provisioned vmdks, the use of snapshots effectively turns them into thin vmdks, because the snapshot delta disks grow on demand.
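
To answer the filesystem question, one option is to read the guest's mount table. The sketch below parses text in the format of `/proc/mounts` and marks the filesystems named in this reply. The set `NAMED_IN_THIS_REPLY` and the sample mount table are illustrative assumptions, not an authoritative list of problematic filesystems.

```python
# Filesystems mentioned in this reply as troublesome on thin vmdks.
# This is an illustrative set, not an authoritative list.
NAMED_IN_THIS_REPLY = {"reiserfs", "xfs", "btrfs"}


def filesystem_types(mounts_text):
    """Map each /dev/... device to (mount point, filesystem type)."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip short lines and pseudo-filesystems such as proc or sysfs.
        if len(fields) < 3 or not fields[0].startswith("/dev/"):
            continue
        device, mount_point, fs_type = fields[:3]
        result[device] = (mount_point, fs_type)
    return result


def named_devices(mounts_text):
    """Return the devices whose filesystem type is in NAMED_IN_THIS_REPLY."""
    return sorted(
        device
        for device, (_, fs_type) in filesystem_types(mounts_text).items()
        if fs_type in NAMED_IN_THIS_REPLY
    )


# Made-up sample; on the guest you would pass open("/proc/mounts").read().
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
proc /proc proc rw,nosuid 0 0
/dev/sdb1 /data xfs rw,noatime 0 0
"""

if __name__ == "__main__":
    for device, (mount_point, fs_type) in filesystem_types(SAMPLE_MOUNTS).items():
        print(device, mount_point, fs_type)
```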
Hopefully your problems are caused by some fixable issue with your hardware, or by some hardware acceleration feature that would be better disabled.
Anyway - as a first step I would highly recommend using eagerzeroed thick vmdks and avoiding snapshots.
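
One way to get an eagerzeroed thick disk is to clone an existing vmdk with `vmkfstools` while the VM is powered off. The sketch below only builds the command string so it can be reviewed first. The datastore paths are hypothetical, and it assumes you clone to a new file and repoint the VM afterwards rather than converting in place.

```python
import shlex


def eagerzeroed_clone_command(src_vmdk, dst_vmdk):
    """Build a vmkfstools command that clones src_vmdk to a new
    eagerzeroed thick disk at dst_vmdk. Nothing is executed here."""
    if src_vmdk == dst_vmdk:
        raise ValueError("source and destination must differ")
    return "vmkfstools -i {} -d eagerzeroedthick {}".format(
        shlex.quote(src_vmdk), shlex.quote(dst_vmdk))


if __name__ == "__main__":
    # Hypothetical paths; substitute your own datastore and VM folder.
    print(eagerzeroed_clone_command(
        "/vmfs/volumes/datastore1/myvm/myvm.vmdk",
        "/vmfs/volumes/datastore1/myvm/myvm-ezt.vmdk"))
```

Keep the original vmdk until the VM has booted cleanly from the clone.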
If you have seriously damaged VMs that do not start at all and report I/O errors, I can help you with that.
Call me via Skype if you need help.