VMware Cloud Community
spinner
Contributor

Problem with CentOS VM crashing every night

Hello,

I have a CentOS VM running on ESX 3.5. Every night the VM crashes with the following error:

Jun 28 02:48:28 vmon kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
Jun 28 02:48:28 vmon kernel: sda: Current: sense key: Aborted Command
Jun 28 02:48:28 vmon kernel: Add. Sense: Some commands cleared by iSCSI Protocol event
Jun 28 02:48:28 vmon kernel:
Jun 28 02:48:28 vmon kernel: Info fld=0x0
Jun 28 02:48:28 vmon kernel: end_request: I/O error, dev sda, sector 23369653
Jun 28 02:48:28 vmon kernel: Buffer I/O error on device dm-0, logical block 2895053
Jun 28 02:48:28 vmon kernel: lost page write due to I/O error on dm-0
Jun 28 02:48:28 vmon kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
Jun 28 02:48:28 vmon kernel: sda: Current: sense key: Aborted Command
Jun 28 02:48:28 vmon kernel: Add. Sense: Some commands cleared by iSCSI Protocol event

After this event either the / partition or the /var partition becomes read-only.

I have to reboot the VM to get it back online.
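Before rebooting, it may be worth checking exactly which mounts flipped to read-only; a minimal sketch using generic commands (nothing here is specific to this VM, and the remount will often fail until the journal is replayed):

```shell
# list mount points whose option string starts with "ro" (read-only);
# field 4 of /proc/mounts is the comma-separated mount-option list
awk '$4 ~ /^ro(,|$)/ {print $2}' /proc/mounts

# an ext3 filesystem that hit "Aborted Command" usually refuses a plain
# remount until its journal is replayed, so this may fail and a reboot
# (or umount plus fsck) will still be needed
mount -o remount,rw /var 2>/dev/null || echo "remount failed; fsck likely required"
```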

Any suggestions on how to fix this issue?

My fstab is as follows:

# cat /etc/fstab
/dev/VolGroup00/LogVol00 /          ext3    defaults        1 1
/dev/VolGroup00/LogVol02 /var       ext3    defaults        1 2
LABEL=/boot              /boot      ext3    defaults        1 2
tmpfs                    /dev/shm   tmpfs   defaults        0 0
devpts                   /dev/pts   devpts  gid=5,mode=620  0 0
sysfs                    /sys       sysfs   defaults        0 0
proc                     /proc      proc    defaults        0 0
/dev/VolGroup00/LogVol01 swap       swap    defaults        0 0
# Beginning of the block added by the VMware software
.host:/                  /mnt/hgfs  vmhgfs  defaults,ttl=5  0 0
# End of the block added by the VMware software

I am running:

Linux vmon. 2.6.18-53.el5 #1 SMP Mon Nov 12 02:22:48 EST 2007 i686 i686 i386 GNU/Linux

# pvscan
  PV /dev/sda2   VG VolGroup00   lvm2 [19.88 GB / 0    free]
  Total: 1 [19.88 GB] / in use: 1 [19.88 GB] / in no VG: 0
# lvscan
  ACTIVE            '/dev/VolGroup00/LogVol00' [15.00 GB] inherit
  ACTIVE            '/dev/VolGroup00/LogVol02' [4.12 GB] inherit
  ACTIVE            '/dev/VolGroup00/LogVol01' [768.00 MB] inherit
# vgscan
  Reading all physical volumes.  This may take a while...
  Found volume group "VolGroup00" using metadata type lvm2

Thank you.

8 Replies
Texiwill
Leadership

Hello,

What is happening on that ESX server every time the system shows this error? Note that you should also review your /var/log/vmkernel logfile to correlate times and actions. Are you running backups at this time, or anything else that could cause the iSCSI server to go away?
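One rough way to do that correlation (the log paths are the usual defaults and are assumptions; adjust to your setup) is to pull the minutes in which the guest logged SCSI aborts, then grep the service console's vmkernel log for the same window:

```shell
# inside the guest: list the minutes in which SCSI aborts were logged
# (LOG is the usual syslog path; adjust for your distribution)
LOG=/var/log/messages
[ -r "$LOG" ] || LOG=/dev/null   # stay harmless if run elsewhere
awk '/SCSI error/ {print $1, $2, substr($3, 1, 5)}' "$LOG" | sort -u

# then, on the ESX service console, grep /var/log/vmkernel for the
# same window, e.g. for the excerpt above:
#   grep 'Jun 28 02:4' /var/log/vmkernel
```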


Best regards,

Edward L. Haletky

VMware Communities User Moderator

====

Author of the book 'VMware ESX Server in the Enterprise: Planning and Securing Virtualization Servers', Copyright 2008 Pearson Education.

CIO Virtualization Blog: http://www.cio.com/blog/index/topic/168354

As well as the Virtualization Wiki at http://www.astroarch.com/wiki/index.php/Virtualization

stumpr
Virtuoso

Take a look at this VMware KB. It sounds like an old bug I saw on some RHEL P2Vs. This looks surprisingly similar; last time it was blatant time-outs, and those time-outs would set my / partition to read-only.

I ran into this problem about a year ago with RHEL kernels < 2.6.9-55.EL and the mpt SCSI driver. I believe I found a kernel >= 2.6.9-55.EL that didn't have this driver version and its associated problem. I see from your uname that you are at 2.6.18. The linked KB was updated recently to indicate kernel 2.6.22 as a fix; it was probably fixed, regressed, and fixed again. A simple kernel upgrade to > 2.6.22 might resolve the problem.

This link might provide some more explanation into the problem as well.

You should be able to get your mpt driver version from 'cat /proc/mpt/version'.

If you want to isolate whether it is an mpt driver issue (since it may not be the same problem), install the BusLogic driver into the VM (it probably isn't there by default; it wasn't for RHEL), and then switch the VM's disk controller type to BusLogic.
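If it helps, the controller type is set by the scsi0.virtualDev option in the VM's .vmx file; a minimal sketch (power the VM off before editing, and make sure the guest's initrd actually includes the BusLogic module first, or the root filesystem won't mount at boot):

```
scsi0.virtualDev = "buslogic"
```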

Hope that gets you started on a fix.

Reuben Stump | http://www.virtuin.com | @ReubenStump
Guillir
Enthusiast

Why are you using an iSCSI initiator inside the VM?

A best practice would be to configure ESX as the iSCSI initiator for those LUNs and present them to the VM as RDMs.

That would improve performance when accessing those LUNs and might solve your problem.

Regards,

Guilherme Schäffer

Infiniit | www.infiniit.com.br

spinner
Contributor

I am not running iSCSI on the VM.

iSCSI is being used to present three LUNs to the ESX server.

There is no need for the VM to use iSCSI or RDM.

I am not sure why iSCSI is showing up in the VM logs...

I am going to try upgrading the kernel; perhaps that will help.

I am also running Vizioncore vRanger Pro; I am going to disable it and see if it makes a difference.

Guillir
Enthusiast

What does lvdisplay show?

Could you post the lvdisplay output here?

Guilherme Schäffer

Infiniit | www.infiniit.com.br

Guillir
Enthusiast

This may also be useful to you:

RHEL5, RHEL4 U4, RHEL4 U3, SLES10, and SLES9 SP3 File Systems may Become Read-Only

spinner
Contributor

According to the KB article I should upgrade from RHEL 5 to 5.1.

I am already running CentOS 5.1, which is the "same" as RHEL 5.1.

I upgraded to CentOS 5.2 to see if that fixes the issue.

I also upgraded the kernel from 2.6.18-53.el5 to 2.6.18-92.1.6.el5.

I will watch it tonight and see what happens.

spinner
Contributor

# lvdisplay
  --- Logical volume ---
  LV Name                /dev/VolGroup00/LogVol00
  VG Name                VolGroup00
  LV UUID                o2gHwk-yHCQ-xymd-Fntc-G3Ip-0rLZ-ucT9FW
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                15.00 GB
  Current LE             480
  Segments               1
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:0

  --- Logical volume ---
  LV Name                /dev/VolGroup00/LogVol02
  VG Name                VolGroup00
  LV UUID                p3i7SS-gely-M0y4-XANq-Aiff-y8Zz-rFhlpg
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                4.12 GB
  Current LE             132
  Segments               1
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:1

  --- Logical volume ---
  LV Name                /dev/VolGroup00/LogVol01
  VG Name                VolGroup00
  LV UUID                nruE1i-iZIB-pxg9-0gVR-Ykiu-TYOa-uulBcz
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                768.00 MB
  Current LE             24
  Segments               1
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:2
#
