Hi, I'm a newbie in VMware ESX implementation services, and although
I have covered all the basics, there are certain aspects that are still unknown to me.
I was wondering if anyone could help me with this. Right now I'm working with a client
that had a major SAN failure (the storage system suddenly went down). After we helped
get the storage back up, he found that one of his virtual machines had been corrupted, either
by the process he used to bring it up after the failure or by the failure itself.
In short, he tried to start the virtual machine and it failed, saying that the disk wasn't found.
He then located the disk manually and started the VM, and although it started okay,
it didn't match his latest snapshot state. He then tried to recover an older
snapshot, which turned out worse because the snapshot wouldn't start, so he ended up
losing two days' worth of business data.
So after all this, my question is: best-practice-wise, what is the right procedure
to follow when facing this kind of problem?
May I say firstly, welcome to the family...
As far as your query goes, I'd start by implementing a backup policy for the VMs: analyse the RTO and RPO requirements and build around these. I wouldn't rely solely on crash-consistent snapshots to roll back to, although they are usually good enough. It seems strange that one machine was affected and the others were OK, although a SAN failure is pretty rare and it is hard to say what this particular VM was doing at the time. It may have tried to VMotion as the SAN went down.
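To make the RPO point concrete, here is a minimal sketch in Python of the kind of check a backup policy should pass. The function name `meets_rpo` and all the dates and targets are made up for illustration; only the "two days of lost data" figure comes from this thread.

```python
from datetime import datetime, timedelta

def meets_rpo(last_good_backup, failure_time, rpo):
    """Return True if restoring last_good_backup loses no more data than the RPO allows."""
    data_loss_window = failure_time - last_good_backup
    return data_loss_window <= rpo

# Hypothetical failure time; the last usable restore point was two days older.
failure = datetime(2024, 1, 10, 9, 0)
last_backup = failure - timedelta(days=2)

# Against an assumed 4-hour RPO, a two-day gap is a violation.
print(meets_rpo(last_backup, failure, rpo=timedelta(hours=4)))

# Against an assumed 3-day RPO, the same gap would be acceptable.
print(meets_rpo(last_backup, failure, rpo=timedelta(days=3)))
```

If the business cannot tolerate the gap, the backup interval has to shrink until the worst-case window fits inside the RPO.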
The only thing I'd have tried would have been an fsck from the ESX host before the VMs were brought back up.
Hehe, thanks for the warm welcome.
Another question: is there a white paper, article, or document that covers in more depth how
ESX handles partial faults, like the loss of an HBA, the loss of a path to the SAN, or the loss of a SAN controller?