I have four ESX 3.0.2 servers together with VirtualCenter 2.0.2. All SAN storage paths are redundant (two Brocade switches, a Hitachi AMS500 with two controllers).
Today one of the controllers in the SAN storage went down. All of my Microsoft guests (W2003 R2) were OK (they are still running without a problem),
but the SLES9/SP3 guest ran into a problem.
A Progress database running on this guest crashed. Here is the part of /var/log/messages that corresponds to the error:
Jan 24 12:02:26 progress kernel: SCSI error : <0 0 5 0> return code = 0x20008
Jan 24 12:02:26 progress kernel: end_request: I/O error, dev sdf, sector 10747983
Jan 24 12:02:26 progress kernel: Buffer I/O error on device sdf1, logical block 1343490
Jan 24 12:02:26 progress kernel: lost page write due to I/O error on sdf1
Jan 24 12:02:26 progress kernel: SCSI error : <0 0 5 0> return code = 0x20008
Jan 24 12:02:26 progress kernel: end_request: I/O error, dev sdf, sector 19398839
Jan 24 12:02:26 progress kernel: Buffer I/O error on device sdf1, logical block 2424847
Jan 24 12:02:26 progress kernel: lost page write due to I/O error on sdf1
Jan 24 12:02:26 progress kernel: SCSI error : <0 0 5 0> return code = 0x20008
Jan 24 12:02:26 progress kernel: end_request: I/O error, dev sdf, sector 19398895
Jan 24 12:02:26 progress kernel: Buffer I/O error on device sdf1, logical block 2424854
Jan 24 12:02:26 progress kernel: lost page write due to I/O error on sdf1
Jan 24 12:02:26 progress kernel: SCSI error : <0 0 5 0> return code = 0x20008
Jan 24 12:02:26 progress kernel: end_request: I/O error, dev sdf, sector 20185223
Jan 24 12:02:26 progress kernel: Buffer I/O error on device sdf1, logical block 2523145
Jan 24 12:02:26 progress kernel: lost page write due to I/O error on sdf1
Jan 24 12:02:26 progress kernel: SCSI error : <0 0 5 0> return code = 0x20008
Jan 24 12:02:26 progress kernel: end_request: I/O error, dev sdf, sector 20185271
Jan 24 12:02:26 progress kernel: Buffer I/O error on device sdf1, logical block 2523151
Jan 24 12:02:26 progress kernel: lost page write due to I/O error on sdf1
Jan 24 12:02:26 progress kernel: SCSI error : <0 0 5 0> return code = 0x20008
Jan 24 12:02:26 progress kernel: end_request: I/O error, dev sdf, sector 19398847
Jan 24 12:02:26 progress kernel: Buffer I/O error on device sdf1, logical block 2424848
Jan 24 12:02:26 progress kernel: lost page write due to I/O error on sdf1
Jan 24 12:02:38 progress kernel: mptscsih: ioc0: attempting task abort! (sc=f4a75c80)
Jan 24 12:02:38 progress kernel: scsi0 : destination target 1, lun 0
Jan 24 12:02:38 progress kernel: command = Write (10) 00 00 02 50 87 00 00 20 00
Jan 24 12:02:39 progress kernel: mptbase: ioc0: IOCStatus(0x0048): SCSI Task Terminated
Jan 24 12:02:39 progress kernel: mptscsih: ioc0: task abort: SUCCESS (sc=f4a75c80)
Jan 24 12:11:01 progress kernel: SCSI error : <0 0 1 0> return code = 0x20008
Jan 24 12:11:01 progress kernel: end_request: I/O error, dev sdb, sector 55168839
Jan 24 12:11:01 progress kernel: Buffer I/O error on device sdb1, logical block 6896097
Jan 24 12:11:01 progress kernel: lost page write due to I/O error on sdb1
Jan 24 12:11:01 progress kernel: SCSI error : <0 0 1 0> return code = 0x20008
Jan 24 12:11:01 progress kernel: end_request: I/O error, dev sdb, sector 55168847
Jan 24 12:11:01 progress kernel: Buffer I/O error on device sdb1, logical block 6896098
Jan 24 12:11:01 progress kernel: lost page write due to I/O error on sdb1
Jan 24 12:11:02 progress kernel: SCSI error : <0 0 1 0> return code = 0x20008
Jan 24 12:11:02 progress kernel: end_request: I/O error, dev sdb, sector 47860223
Jan 24 12:11:02 progress kernel: Buffer I/O error on device sdb1, logical block 5982520
Jan 24 12:11:02 progress kernel: lost page write due to I/O error on sdb1
Jan 24 12:11:03 progress kernel: SCSI error : <0 0 1 0> return code = 0x20008
Jan 24 12:11:03 progress kernel: end_request: I/O error, dev sdb, sector 47860231
Jan 24 12:11:03 progress kernel: Buffer I/O error on device sdb1, logical block 5982521
Jan 24 12:11:03 progress kernel: lost page write due to I/O error on sdb1
After a reboot, e2fsck finds no errors; all filesystems are clean.
Any idea what I can do so that this error doesn't come back the next time a SAN storage controller goes down?
Hello,
When a Linux VM detects a disk failure (a LUN failure in this case), it automatically remounts the affected filesystem read-only so that data is not corrupted. This is cleared by a reboot of the Linux system. How badly you are hit depends on your mix of reads versus writes and on how sensitive the system is to LUN failures.
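For what it's worth, the read-only behavior described above is, on ext3, governed by the filesystem's error policy. A hedged sketch of checking and changing it; the device name is taken from the log above, and this assumes the partition is ext3:

```shell
# Show how the filesystem reacts to I/O errors (assumption: ext3 on sdf1,
# as in the log). "Errors behavior" is Continue, Remount read-only, or Panic.
tune2fs -l /dev/sdf1 | grep -i 'errors behavior'

# Make it keep going instead of dropping to read-only (risky: this can
# mask corruption), either persistently in the superblock ...
tune2fs -e continue /dev/sdf1

# ... or per mount via /etc/fstab:
#   /dev/sdf1  /data  ext3  defaults,errors=continue  0 2
```

Whether "continue" is appropriate here is debatable, since a database on top of a silently erroring filesystem is its own risk.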
Is sdb an RDM or a VMDK? That could also be an issue. I have thought about this as well, and I am not sure there is anything you can do in this case. Clustering is perhaps your only recourse, but you would need to ensure that the VMDK/RDM in question is not using the same storage controller.
Best regards,
Edward L. Haletky
VMware Communities User Moderator
====
Author of the book 'VMware ESX Server in the Enterprise: Planning and Securing Virtualization Servers', Copyright 2008 Pearson Education. As well as the Virtualization Wiki at http://www.astroarch.com/wiki/index.php/Virtualization
All volumes are VMDKs.
Is it possible to make the Linux system less sensitive to LUN failures?
My Windows 2003 guests notice the problem but have no trouble with the LUN failure.
Also, the LUN failure isn't exactly the problem: when one SAN storage controller fails, the ESX servers switch to the alternate path because the paths are redundant.
I guess this failover takes about 60 seconds.
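If the failover window really is on the order of 60 seconds, one knob inside the guest is the per-device SCSI command timeout. A hedged sketch, assuming SLES9's 2.6 kernel exposes the timeout in sysfs; the value 180 is an arbitrary example, and SYSROOT exists only so the loop can be exercised against a fake tree:

```shell
# Raise each SCSI disk's command timeout well above the ~60 s failover
# window, so the guest keeps retrying instead of failing the I/O.
# SYSROOT defaults to empty, i.e. the real /sys.
SYSROOT=${SYSROOT:-}
for dev in "$SYSROOT"/sys/block/sd*/device/timeout; do
    [ -w "$dev" ] && echo 180 > "$dev"   # default is typically 30 or 60 s
done
```

This only buys time; it does not fix the slow failover itself, and a database may still stall for the duration.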
Hello,
60 seconds is a very long time; mine takes a lot less than that. Unfortunately, the only thing you can do is attempt to remount the partitions after the failure. You could set up a script that looks for the error and then uses mount -o remount.
I would, however, look at your failover and see why it is so slow; mine takes roughly 1 second.
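The remount idea could be sketched roughly like this. A sketch only; the device name, mount point, and the 65-second delay are assumptions, not tested values:

```shell
#!/bin/sh
# Watch the syslog for buffer I/O errors on a given device and, once the
# path has presumably failed over, try to remount the filesystem
# read-write. DEVICE and MOUNTPOINT are example names.
DEVICE=sdf1
MOUNTPOINT=/data
LOG=/var/log/messages

watch_and_remount() {
    tail -F "$LOG" | while read -r line; do
        case "$line" in
            *"Buffer I/O error on device $DEVICE"*)
                sleep 65    # wait out the path failover
                mount -o remount,rw "$MOUNTPOINT" \
                    && logger "remounted $MOUNTPOINT read-write"
                ;;
        esac
    done
}

# watch_and_remount &   # start in the background, e.g. from boot.local
```

This is a band-aid: the database will still see failed writes during the outage, so it papers over the filesystem state, not the application crash.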
Best regards,
Edward L. Haletky
Hi,
Are there any SAN tuning options in ESX that would make the failover happen faster?
Markus
Was there ever a solution to this? Either a setting on the ESX/SAN side or something within Linux?
Hello,
The solution lies in the SAN or in the multipath setup within ESX. There are not many SAN settings available within ESX, however; you can see them all under the Configuration tab, Advanced Settings link, in the VI Client. I would investigate your SAN/fabric for errors. If by SAN you actually mean iSCSI rather than an FC SAN, there are even fewer options available.
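On the ESX 3.x service console, a couple of commands are worth a look. A sketch; the advanced-option name is an assumption from memory of the ESX 3.x option set, so verify it in your own Advanced Settings list before changing anything:

```shell
# List all paths and their state (active/standby/dead) for each LUN.
esxcfg-mpath -l

# Show how often ESX re-evaluates paths (assumption: /Disk/PathEvalTime,
# default 300 seconds in ESX 3.x). Lowering it can shorten the time a
# dead path goes unnoticed.
esxcfg-advcfg -g /Disk/PathEvalTime
# esxcfg-advcfg -s 90 /Disk/PathEvalTime
```

These only run on the ESX host itself, and any change should be tested against your array's recommended multipathing policy.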
Best regards,
Edward L. Haletky