ESXi 4 - VMFS corruption after multipath failure: what was our error?
Yesterday my company had a severe server outage, of a kind we thought could not happen.
Even though everything is now working again with minimal time and data loss, we still have no idea what the cause was - any ideas are welcome.
So, we have 2 sites, with an IBM DS-8300 and 3 ESX servers at each site (6 ESXi hosts and 2 DS arrays in total).
There are 2 Brocade fiber channel switches at each site, with dedicated fiber links.
So each ESX server has 4 distinct paths to a LUN.
The failure started when we "killed" one of the paths by disabling a port on one of the fiber switches - we had done this before, and until today it had always gone smoothly.
Over an hour or so there were 4 or 5 such port drops, with at least 4-5 minutes between them. Then at one moment the sky fell: the ESX servers started losing LUN connectivity, and running VMs went into a "suspended" state.
We have several non-ESX servers that use the same fiber switches and DS storage - some Win2008R2 (Exchange, MS SQL), some AIX and Linux machines.
None of those machines failed or lost connectivity 😕
We had a high-priority VM that had to be made available ASAP, so we started issuing "shut down guest" commands, hoping to shut down everything unimportant and bring back that machine and its LUN.
Hours later, after restarting all the ESX servers, we were left with a corrupted 1.5 TB VMFS partition holding ~10 VMs, most of which were running at the time of the incident.
Mounting the partition on only one server and running vmkfstools "fixed" it, and all VMs were successfully restored.
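For anyone hitting the same thing, the recovery steps were along these lines (a sketch from memory, not an exact transcript; the datastore name is a placeholder):

```shell
# On the single host the datastore was still mounted on:
# dump the VMFS volume attributes and extent metadata at high verbosity
vmkfstools -P -v 10 /vmfs/volumes/our-datastore

# Watch for VMFS lock/heartbeat errors while checking
# (log location differs: /var/log/vmkernel on ESX classic,
# /var/log/messages on ESXi 4)
tail -f /var/log/vmkernel
```

Having only one host touch the volume seems to have been the key part - with all six hosts retrying I/O against it, the metadata locks never settled.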
But we have no explanation for why this happened.
Is this a known problem with VMware 4.1 corrupting VMFS under high I/O load when a path fails?