I see you are using ordinary Seagate Barracuda 7200.14 drives, which are *not* recommended for any serious server use. The problem is that they sometimes become temporarily unresponsive for quite a long time (even a few seconds) because of error recovery, reallocation of weak sectors, etc. That is quite normal for desktop drives (even new ones) and is tolerated by a desktop OS, but if it happens under high I/O load, it is all a server OS needs to mark the drive as "failed" and disconnect it. It is an even bigger problem with RAID controllers, which tend to drop a drive very readily once it becomes unresponsive.

If you want to use affordable SATA drives, pick at least "server" drives (sometimes marked as "24/7" or "RAID edition"), e.g. the "Barracuda ES". They come with TLER (time-limited error recovery) set to a short limit, so if recovering or reallocating a sector takes too long, the drive interrupts the attempt before the OS (or controller) disconnects the drive because of stalled I/O operations, and the disk-maintenance routine is performed later, when the drive is not under load.

I had a problem with symptoms very similar to those you describe: frequent disconnects of one drive without any apparent reason. I consulted our hardware supplier and got the information above. So I simply swapped the disk for a "24/7" edition one, and the problem went away. Interestingly, the disk that caused so many problems in my ESXi server has since been used for a few years in a heavily loaded workstation without a single disconnect or any other problem...
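To make the timing argument concrete, here is a rough Python sketch of the idea. It is only a toy model, not how any particular firmware or controller actually works. The function names and all the numbers (host timeout, recovery limit, recovery time) are my own assumptions for illustration, not measured values for these drives.

```python
from typing import Optional


def stall_seconds(recovery_s: float, erc_limit_s: Optional[float] = None) -> float:
    """How long the drive stays unresponsive while handling one bad sector.

    Without a recovery limit the drive keeps retrying for the full recovery
    time; with a limit it gives up and reports the error once the limit is hit.
    """
    if erc_limit_s is None:
        return recovery_s
    return min(recovery_s, erc_limit_s)


def is_dropped(recovery_s: float, host_timeout_s: float,
               erc_limit_s: Optional[float] = None) -> bool:
    """True if the stall outlasts the patience of the OS or controller."""
    return stall_seconds(recovery_s, erc_limit_s) > host_timeout_s


# Illustrative numbers only (assumptions, not vendor specifications).
HOST_TIMEOUT_S = 8.0   # assumed command timeout of the controller/OS
ERC_LIMIT_S = 7.0      # assumed recovery limit on a "24/7" drive
RECOVERY_S = 20.0      # assumed worst-case recovery of a weak sector

desktop_dropped = is_dropped(RECOVERY_S, HOST_TIMEOUT_S)
server_dropped = is_dropped(RECOVERY_S, HOST_TIMEOUT_S, ERC_LIMIT_S)
print(f"desktop drive dropped: {desktop_dropped}")
print(f"server drive dropped:  {server_dropped}")
```

With these made-up numbers the desktop drive stalls longer than the host is willing to wait and gets dropped, while the drive with a recovery limit answers (with an error) just in time and stays connected.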