VMware Cloud Community
rossb2b
Hot Shot

SCSI errors on rhel5 VM

I logged in and noticed this in dmesg (and some of it in an SSH terminal left open overnight):

sd 0:0:0:0: SCSI error: return code = 0x00020008
end_request: I/O error, dev sda, sector 358621
sd 0:0:0:0: SCSI error: return code = 0x00020008
end_request: I/O error, dev sda, sector 4666181
Buffer I/O error on device dm-0, logical block 557119
lost page write due to I/O error on dm-0
sd 0:0:0:0: SCSI error: return code = 0x00020008
end_request: I/O error, dev sda, sector 358645
Buffer I/O error on device dm-0, logical block 18677
lost page write due to I/O error on dm-0
sd 0:0:0:0: SCSI error: return code = 0x00020008
end_request: I/O error, dev sda, sector 1373093
Buffer I/O error on device dm-0, logical block 145483
lost page write due to I/O error on dm-0
Aborting journal on device dm-0.
journal commit I/O error
ext3_abort called.
EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only

The VM's storage is on our SAN, along with other VMs on the same LUN that seem unaffected. The storage team says the SAN is OK.

Anyone have any ideas as to what caused this?
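
For anyone reading the dmesg output above: the return code can be split into its component bytes, which gives a first hint about where the failure was reported. The sketch below assumes the standard Linux SCSI midlayer layout (status in bits 0-7, message in bits 8-15, host in bits 16-23, driver in bits 24-31) and uses only a partial table of names; the function names are made up for illustration.

```python
# Partial tables of names from the Linux SCSI midlayer and the SCSI status codes.
# Only a handful of common values are listed; unknown values print as "unknown".
HOST_BYTE = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
}
STATUS_BYTE = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
    0x28: "TASK SET FULL",
}


def decode_scsi_result(result):
    """Split a 32-bit SCSI result word into its four bytes."""
    return {
        "status": result & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "driver": (result >> 24) & 0xFF,
    }


def describe(result):
    """Return a one-line, human-readable summary of a SCSI result word."""
    d = decode_scsi_result(result)
    host = HOST_BYTE.get(d["host"], "unknown")
    status = STATUS_BYTE.get(d["status"], "unknown")
    return (
        f"host={host} ({d['host']:#04x}), status={status} ({d['status']:#04x}), "
        f"msg={d['msg']:#04x}, driver={d['driver']:#04x}"
    )


if __name__ == "__main__":
    # The value reported in the dmesg output in this thread.
    print(describe(0x00020008))
```

For 0x00020008 this gives a host byte of DID_BUS_BUSY and a status byte of BUSY, which says the device reported itself busy; it does not by itself say why.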

rossb2b
Hot Shot

up

admin
Immortal

Take a look at the following knowledge base article for a workaround for this issue:

http://kb.vmware.com/kb/1001778

rossb2b
Hot Shot

That is it, thanks!

smgoller
Contributor

I can't seem to access this knowledge base article via this link. Can someone post the workaround or a better link to the article? I'm having this exact problem.

UofS
Enthusiast

I am also having this problem, and the link is now dead. Can someone repost the solution?

DSTAVERT
Immortal

The KB article may have changed or been obsoleted. I have submitted this page to the KB team. We can hope that it is tracked down.

-- David -- VMware Communities Moderator
mcowger
Immortal

This is pretty common with RHEL5 running in a VM. It usually happens when the underlying storage gets slow enough that it does not respond to an I/O in a reasonable amount of time (either dead paths or overloaded storage). At that point, the sd driver fails the I/O up to the VFS layer, which fails it to ext3, which then aborts the journal for safety and marks the filesystem read-only.

Have your storage team look again for any slowdowns in the storage or pathing failures along the way (controller trespasses too).
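
On the guest side, one setting that bears on how long the sd driver waits before giving up on a command is the per-disk timeout exposed in sysfs at /sys/block/sdX/device/timeout (in seconds). Here is a rough sketch for listing it across disks, assuming Python 3 is available in the guest. The 180-second threshold is only an illustrative value, not something taken from the KB article, and the function names are made up.

```python
from pathlib import Path


def read_disk_timeouts(sys_block="/sys/block"):
    """Return {disk_name: timeout_seconds} for sd* disks found under sys_block."""
    timeouts = {}
    for dev in sorted(Path(sys_block).glob("sd*")):
        timeout_file = dev / "device" / "timeout"
        try:
            timeouts[dev.name] = int(timeout_file.read_text().strip())
        except (OSError, ValueError):
            # Attribute missing or unreadable: skip this device.
            continue
    return timeouts


def below_threshold(timeouts, minimum):
    """Return the names of disks whose timeout is lower than `minimum` seconds."""
    return [name for name, seconds in timeouts.items() if seconds < minimum]


if __name__ == "__main__":
    ILLUSTRATIVE_MINIMUM = 180  # assumption for this sketch, tune for your environment
    found = read_disk_timeouts()
    for name, seconds in found.items():
        print(f"{name}: {seconds}s")
    for name in below_threshold(found, ILLUSTRATIVE_MINIMUM):
        print(f"{name} has a timeout below {ILLUSTRATIVE_MINIMUM}s")
```

This only reports the current values. A longer timeout can mask short storage stalls, but it does not fix slow or failing paths, so the storage-side checks still apply.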

--Matt VCDX #52 blog.cowger.us
Texiwill
Leadership

Hello,

While the storage folks are looking at their end, I suggest looking at the VM as well. Some questions:

* Are there any snapshots on this VM? If so, how many? It may be better to commit all snapshots.

* Is there anything related to storage in the vmware.log file associated with the VM?

* Is there anything in the vmkernel.log file relating to the storage on which this VM lives? If so, investigate those errors. Note that you may need to look at the entries before and after a specific error to determine what happened.

* Are there any hardware failures with respect to the disk, controller, etc.? These may show up in the logs, but most modern server hardware also records this information somewhere.
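
As a rough aid for the log questions above, here is a small sketch that pulls storage-related lines out of a log file together with a few lines of context on either side. The keyword list is only a guess at what might be relevant, not a list of actual vmkernel messages, and the function name is made up; adjust both for your environment.

```python
import re
import sys

# Hypothetical keyword list: a starting point, not an authoritative set of messages.
STORAGE_PATTERN = re.compile(r"scsi|timeout|abort|reservation|path", re.IGNORECASE)


def storage_lines(lines, pattern=STORAGE_PATTERN, context=2):
    """Return [(line_number, text)] for matching lines plus `context` lines around each."""
    lines = list(lines)
    keep = set()
    for i, line in enumerate(lines):
        if pattern.search(line):
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            keep.update(range(start, end))
    return [(i + 1, lines[i].rstrip("\n")) for i in sorted(keep)]


if __name__ == "__main__":
    # Usage: python scan_log.py /path/to/logfile
    with open(sys.argv[1], errors="replace") as handle:
        for number, text in storage_lines(handle):
            print(f"{number}: {text}")
```

The context lines matter because, as noted above, the cause often appears shortly before or after the error itself.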

Best regards,

Edward L. Haletky

Communities Moderator, VMware vExpert,

Author: VMware vSphere and Virtual Infrastructure Security,VMware ESX and ESXi in the Enterprise 2nd Edition

Podcast: The Virtualization Security Podcast Resources: The Virtualization Bookshelf

--
Edward L. Haletky
vExpert XIV: 2009-2023,
VMTN Community Moderator
vSphere Upgrade Saga: https://www.astroarch.com/blogs
GitHub Repo: https://github.com/Texiwill
UofS
Enthusiast

We suspect that it may be a SCSI locking issue, as there were approximately 25 VMs on that particular LUN. Moving VMs to other LUNs seemed to resolve the issue.

Would using the VAAI storage APIs reduce this problem? 

mcowger
Immortal

If SCSI locking were the problem, yes, ATS/VAAI would help.

However, it's pretty unlikely that just a few VMs caused such a locking issue, unless they were all being booted simultaneously.

--Matt VCDX #52 blog.cowger.us