VMware Cloud Community
rtmanager
Enthusiast

Problem with SAN

I have four ESX 3.0.2 servers together with VirtualCenter 2.0.2. All of the SAN storage paths are redundant (two Brocade switches / a Hitachi AMS500 with two controllers).

Today one of the controllers in the SAN storage went down. All of my Microsoft guests (Windows 2003 R2) were OK (they are still running without a problem), but the SLES 9 SP3 guest ran into a problem.

On this guest a Progress database is running, and it crashed. Here is the part of the /var/log/messages file that corresponds to the error:

Jan 24 12:02:26 progress kernel: SCSI error : <0 0 5 0> return code = 0x20008

Jan 24 12:02:26 progress kernel: end_request: I/O error, dev sdf, sector 10747983

Jan 24 12:02:26 progress kernel: Buffer I/O error on device sdf1, logical block 1343490

Jan 24 12:02:26 progress kernel: lost page write due to I/O error on sdf1

Jan 24 12:02:26 progress kernel: SCSI error : <0 0 5 0> return code = 0x20008

Jan 24 12:02:26 progress kernel: end_request: I/O error, dev sdf, sector 19398839

Jan 24 12:02:26 progress kernel: Buffer I/O error on device sdf1, logical block 2424847

Jan 24 12:02:26 progress kernel: lost page write due to I/O error on sdf1

Jan 24 12:02:26 progress kernel: SCSI error : <0 0 5 0> return code = 0x20008

Jan 24 12:02:26 progress kernel: end_request: I/O error, dev sdf, sector 19398895

Jan 24 12:02:26 progress kernel: Buffer I/O error on device sdf1, logical block 2424854

Jan 24 12:02:26 progress kernel: lost page write due to I/O error on sdf1

Jan 24 12:02:26 progress kernel: SCSI error : <0 0 5 0> return code = 0x20008

Jan 24 12:02:26 progress kernel: end_request: I/O error, dev sdf, sector 20185223

Jan 24 12:02:26 progress kernel: Buffer I/O error on device sdf1, logical block 2523145

Jan 24 12:02:26 progress kernel: lost page write due to I/O error on sdf1

Jan 24 12:02:26 progress kernel: SCSI error : <0 0 5 0> return code = 0x20008

Jan 24 12:02:26 progress kernel: end_request: I/O error, dev sdf, sector 20185271

Jan 24 12:02:26 progress kernel: Buffer I/O error on device sdf1, logical block 2523151

Jan 24 12:02:26 progress kernel: lost page write due to I/O error on sdf1

Jan 24 12:02:26 progress kernel: SCSI error : <0 0 5 0> return code = 0x20008

Jan 24 12:02:26 progress kernel: end_request: I/O error, dev sdf, sector 19398847

Jan 24 12:02:26 progress kernel: Buffer I/O error on device sdf1, logical block 2424848

Jan 24 12:02:26 progress kernel: lost page write due to I/O error on sdf1

Jan 24 12:02:38 progress kernel: mptscsih: ioc0: attempting task abort! (sc=f4a75c80)

Jan 24 12:02:38 progress kernel: scsi0 : destination target 1, lun 0

Jan 24 12:02:38 progress kernel: command = Write (10) 00 00 02 50 87 00 00 20 00

Jan 24 12:02:39 progress kernel: mptbase: ioc0: IOCStatus(0x0048): SCSI Task Terminated

Jan 24 12:02:39 progress kernel: mptscsih: ioc0: task abort: SUCCESS (sc=f4a75c80)

Jan 24 12:11:01 progress kernel: SCSI error : <0 0 1 0> return code = 0x20008

Jan 24 12:11:01 progress kernel: end_request: I/O error, dev sdb, sector 55168839

Jan 24 12:11:01 progress kernel: Buffer I/O error on device sdb1, logical block 6896097

Jan 24 12:11:01 progress kernel: lost page write due to I/O error on sdb1

Jan 24 12:11:01 progress kernel: SCSI error : <0 0 1 0> return code = 0x20008

Jan 24 12:11:01 progress kernel: end_request: I/O error, dev sdb, sector 55168847

Jan 24 12:11:01 progress kernel: Buffer I/O error on device sdb1, logical block 6896098

Jan 24 12:11:01 progress kernel: lost page write due to I/O error on sdb1

Jan 24 12:11:02 progress kernel: SCSI error : <0 0 1 0> return code = 0x20008

Jan 24 12:11:02 progress kernel: end_request: I/O error, dev sdb, sector 47860223

Jan 24 12:11:02 progress kernel: Buffer I/O error on device sdb1, logical block 5982520

Jan 24 12:11:02 progress kernel: lost page write due to I/O error on sdb1

Jan 24 12:11:03 progress kernel: SCSI error : <0 0 1 0> return code = 0x20008

Jan 24 12:11:03 progress kernel: end_request: I/O error, dev sdb, sector 47860231

Jan 24 12:11:03 progress kernel: Buffer I/O error on device sdb1, logical block 5982521

Jan 24 12:11:03 progress kernel: lost page write due to I/O error on sdb1

After a reboot, e2fsck finds no errors. All filesystems are clean.

Any idea what I can do so that this error doesn't come back the next time one SAN controller of the storage goes down?

6 Replies
Texiwill
Leadership

Hello,

When a Linux VM detects a disk failure (a LUN failure in this case), it automatically switches the affected filesystem to read-only mode so that data is not corrupted. This is resolved by a reboot of the Linux system. Whether it happens depends on how many reads vs. writes you are doing and how sensitive the system is to LUN failures.
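
A quick way to confirm from inside the guest that this is what happened (a sketch only; sdb1 just mirrors the messages above, substitute your own devices):

# The options column in /proc/mounts starts with "ro" once the kernel
# has switched a filesystem to read-only after an I/O error.
grep sdb1 /proc/mounts

# ext3 also logs a message when it aborts the journal and remounts read-only.
grep -i 'read-only' /var/log/messages | tail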

Is sdb an RDM or a VMDK? That could also be an issue. I have thought about this as well and I am not sure there is anything you can do in this case. Clustering is perhaps your only recourse... but you would need to ensure that the VMDK/RDM in question is not using the same storage controller.


Best regards,

Edward L. Haletky

VMware Communities User Moderator

====

Author of the book 'VMware ESX Server in the Enterprise: Planning and Securing Virtualization Servers', Copyright 2008 Pearson Education, as well as the Virtualization Wiki at http://www.astroarch.com/wiki/index.php/Virtualization

--
Edward L. Haletky
vExpert XIV: 2009-2023,
VMTN Community Moderator
vSphere Upgrade Saga: https://www.astroarch.com/blogs
GitHub Repo: https://github.com/Texiwill
rtmanager
Enthusiast

All volumes are VMDKs.

Is it possible to make the Linux system less sensitive to LUN failures?

My Windows 2003 guests recognize the problem but have no trouble with the LUN failure.

Also, the LUN failure isn't exactly the problem: when one SAN storage controller fails, the ESX servers switch to the alternate path because the paths are redundant.

I guess this takes about 60 seconds.

Texiwill
Leadership

Hello,

60 seconds is a very long time. Mine takes less time than that, a lot less. Unfortunately, the only thing you can do is attempt to remount the partitions after the failure. You could set up a script that looks for the error and then uses mount -o remount...
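
A minimal sketch of such a watcher (my example only; the device and mount point are assumptions based on the log above, and ext3 may refuse the remount after a journal abort, in which case a reboot and fsck is still needed, as the original poster saw):

#!/bin/bash
# Hypothetical watcher: if a transient SAN failure has left the filesystem
# read-only, try to put it back to read-write and log what happened.
DEVICE=/dev/sdb1      # assumption: one of the devices from /var/log/messages
MOUNTPOINT=/data      # assumption: wherever the Progress database lives

while true; do
    # The options field in /proc/mounts begins with "ro" once the kernel
    # has switched the filesystem to read-only after an I/O error.
    if awk -v dev="$DEVICE" -v mp="$MOUNTPOINT" \
        '$1 == dev && $2 == mp && $4 ~ /(^|,)ro(,|$)/ { found = 1 } END { exit !found }' /proc/mounts
    then
        logger "san-watcher: $MOUNTPOINT is read-only, attempting remount"
        mount -o remount,rw "$MOUNTPOINT" \
            && logger "san-watcher: $MOUNTPOINT remounted read-write"
    fi
    sleep 30
done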

I would, however, look at your failover and see why it is so slow; mine takes roughly 1 second or so.


Best regards,

Edward L. Haletky

VMware Communities User Moderator


MDK_BUHL
Contributor

Hi,

Are there any SAN tuning options in ESX that would make the failover happen faster?

Markus

mcorey3
Contributor

Was there a solution to this? Either a setting on the ESX/SAN or something within Linux?

Texiwill
Leadership

Hello,

The solution is in the SAN or in the multipath setup within ESX. There are not a lot of SAN settings available within ESX, however. You can see them all under the Configuration tab, Advanced Settings link, within the VIC. I would investigate your SAN/fabric for errors. If by SAN you mean iSCSI rather than FC, there are even fewer options available.
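
For reference, the service console equivalents I would look at (from memory of the ESX 3.x tools, so treat the exact option names as assumptions and verify them on your build):

# List every LUN with its paths and which path is currently active;
# paths that died with the failed controller should show up here.
esxcfg-mpath -l

# One of the few path-related advanced options: how often ESX
# re-evaluates the state of its paths (value is in seconds).
esxcfg-advcfg -g /Disk/PathEvalTime
esxcfg-advcfg -s 30 /Disk/PathEvalTime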


Best regards,

Edward L. Haletky

VMware Communities User Moderator

====

Author of the book 'VMware ESX Server in the Enterprise: Planning and Securing Virtualization Servers', Copyright 2008 Pearson Education.

CIO Virtualization Blog: http://www.cio.com/blog/index/topic/168354

As well as the Virtualization Wiki at http://www.astroarch.com/wiki/index.php/Virtualization
