A Linux machine I virtualized two weeks ago suddenly became unresponsive, both over the network (ping, HTTP, FTP, SSH) and on the console. The console showed the login screen but did not react to any input. The real-time performance data showed no activity at all: no CPU, no disk, no network.
After I hard-rebooted the VM, the OS itself seemed fine, but the roughly 9 TB LVM volume mounted at /srv was corrupted. Even the recommended "reiserfsck --rebuild-sb" could not recover the superblock. The LVM layer itself appeared intact, as pvdisplay and lvdisplay showed the expected values. The LV spans three virtual disks of 4 TB, 3 TB and 2 TB.
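For reference, before letting reiserfsck rewrite anything, one can check whether the superblock magic is still on disk at all. This is a minimal sketch; the offsets come from the ReiserFS 3.6 on-disk format, and the throwaway image file stands in for the real /dev/mapper device:

```python
import struct

# ReiserFS keeps its superblock 64 KiB into the device; the magic
# string sits at byte 52 within that superblock.
SB_OFFSET = 64 * 1024
MAGIC_OFFSET = 52
MAGICS = (b"ReIsErFs", b"ReIsEr2Fs", b"ReIsEr3Fs")  # 3.5 / 3.6 / relocated journal

def superblock_magic_ok(path):
    """Return True if the on-disk superblock still carries a ReiserFS magic."""
    with open(path, "rb") as dev:
        dev.seek(SB_OFFSET + MAGIC_OFFSET)
        magic = dev.read(10)
    return any(magic.startswith(m) for m in MAGICS)

# Demo on a sparse image file instead of the real LV device:
with open("fake.img", "wb") as img:
    img.seek(SB_OFFSET + MAGIC_OFFSET)
    img.write(b"ReIsEr2Fs\0")

print(superblock_magic_ok("fake.img"))  # True: magic intact
```

If the magic is gone, the question becomes what overwrote the first 64 KiB region, which narrows the suspects considerably.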
The vSphere cluster consists of two ESXi 5.5.0 hosts (build 2718055). The two HP DL380p Gen8 servers are directly attached via two redundant 16G FC paths to a dual-controller MSA 2040 SAN. Network connectivity comes from a Netgear 10G switch, with two 10GBase-T cables to each host. The cluster has been running for months without any problems, with other guests running SLES 11 and Windows Server 2012 R2. I did the last update round in February: all hardware firmware, the hypervisors and VMware Tools.
The VM is a SUSE Linux Enterprise Server 10 (x86_64) SP4 with kernel 2.6.16.60-0.103.1-smp, and it ran for years on physical hardware without any problems. The virtual hardware configuration has one LSI Logic Parallel controller with five thick-provisioned virtual disks (550 GB, 1.5 TB, 4 TB, 3 TB, 2 TB) and one e1000 network card. The kernel uses the mptspi module for the controller.
At the time of the crash there were no snapshots present, no machine needed consolidation, and the load was no higher than usual (moderate to high).
No other VMs were affected, and I don't see any hardware problems on either of the two hosts.
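Since the host side looks clean, the guest's own kernel log from just before the hang is the next place to look; messages from the mptspi/mptbase driver or the SCSI layer would point at storage trouble. A minimal sketch of such a scan (the sample log lines below are invented for illustration):

```python
import re

# Patterns that typically appear in the kernel log when the virtual
# LSI controller or one of its disks misbehaves.
ERROR_PATTERNS = re.compile(
    r"mptspi|mptbase|I/O error|task abort|device reset|Medium Error",
    re.IGNORECASE,
)

def scan_log(lines):
    """Return the log lines that hint at storage trouble."""
    return [line for line in lines if ERROR_PATTERNS.search(line)]

sample = [
    "kernel: mptbase: ioc0: LogInfo(0x11030000): Originator={PL}",
    "kernel: sd 0:0:2:0: [sdc] task abort: SUCCESS",
    "kernel: end_request: I/O error, dev sdc, sector 123456",
    "kernel: eth0: link up",
]
for hit in scan_log(sample):
    print(hit)
```

Running this over /var/log/messages (and over the VM's vmware.log on the datastore) for the hours before the hang would show whether the guest saw I/O errors before it froze.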
I am currently restoring the 4 TB, 3 TB and 2 TB disks with Veeam 7. Afterwards I expect the system to work again (hopefully), but I fear this will most probably happen again if I don't find the root cause.
I'm really frustrated, as this is our most important machine and I'm out of ideas. What would you do?
Restoring the three hard disks made the volume accessible again. Still, I need to find out why this happened, as I fear it could, and probably will, happen again.
I think I will update the MSA firmware and upgrade the VM's virtual hardware from version 8 to version 10.
Could there be data corruption on the MSA and/or the datastore? If so, wouldn't the regular scrubbing of the SAN detect it? And wouldn't some of the other VMs show problems as well?
Could periods of high IOPS load or high latency corrupt a filesystem inside a VM?
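For what it's worth, latency by itself should not damage data; writes just complete later. The risk is when latency escalates into SCSI timeouts and aborted commands, so that a metadata update spanning several blocks is only partially written. A toy model of such a torn update (the block names and checksum scheme are invented purely for illustration):

```python
import hashlib

# Toy "disk": a metadata update spans two blocks plus a checksum block.
# Slow storage alone is harmless, but if the stack gives up mid-update,
# only some of the blocks land on disk.
disk = {
    "meta1": b"old1",
    "meta2": b"old2",
    "sum": hashlib.sha1(b"old1old2").digest(),
}

def apply_update(disk, new1, new2, crash_after_first=False):
    disk["meta1"] = new1
    if crash_after_first:
        return  # timeout/abort: second block and checksum never written
    disk["meta2"] = new2
    disk["sum"] = hashlib.sha1(new1 + new2).digest()

def consistent(disk):
    return disk["sum"] == hashlib.sha1(disk["meta1"] + disk["meta2"]).digest()

apply_update(disk, b"new1", b"new2", crash_after_first=True)
print(consistent(disk))  # False: a torn update looks like corruption
```

A journaling filesystem is supposed to recover from exactly this on replay, which is why a destroyed superblock points more toward something overwriting that region than toward ordinary latency.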
Any other things I should investigate?