pwatvu
Contributor

Linux file systems becoming read-only

The file systems on several of our guest OSes running RHEL4 have suddenly started becoming read-only. ESX is 3.0.0. The two file systems affected are usually / and /var. Both systems are HP DL385s with QLogic 2430 cards. Any ideas what might be causing this?
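
For anyone searching on the same symptom: it shows up in the guest's own logs just before the remount. A quick way to confirm (these are the standard ext3/SCSI kernel messages; device names will differ):

    # From inside an affected guest
    dmesg | grep -i "remounting filesystem read-only"
    grep -i -e "EXT3-fs error" -e "SCSI error" /var/log/messages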

47 Replies
pwatvu
Contributor

Natiboy, what are you using for switches? We have a Cisco 9509.

KnowItAll
Hot Shot

What SCSI adapters are you using in your guest OS (BusLogic or LSI Logic)?

There are problems with the BusLogic driver. All of your Linux guests should be using the LSI Logic driver.

Damin
Enthusiast

All guests are using the LSI Logic adapters.

tsightler
Hot Shot

I've posted this in another thread; however, in the interest of giving this issue the most visibility, I'll post here as well.

I've researched this issue and think it is caused by a change to the LSI Logic driver that went into the Linux kernel on Sept 15, 2005 and is slowly making its way into current distros (it was included in RHEL4 U3 and higher). The specific change is here:

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blobdiff;h=8dd25aac53557777ca...

The specifics around this change and its negative impact on RHEL4 U3 guests running within VMware are documented in Red Hat Bugzilla 197158.

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=197158
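
For those curious what a change this small looks like, a low-level driver like mptscsih has roughly two ways to complete a command when the (virtual) device reports BUSY (illustrative C only, reconstructed from the discussion above rather than copied from the actual diff):

    /* Report a host-level "bus busy" condition to the SCSI mid-layer */
    sc->result = DID_BUS_BUSY << 16;

    /* Or pass the device's BUSY status straight up with the host byte OK */
    sc->result = (DID_OK << 16) | SAM_STAT_BUSY;

The difference that matters here is how, and how many times, the mid-layer retries the command before declaring a write error, which is exactly what ext3 ends up seeing under ESX.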

I have posted a patched driver which reverts this small change along with installation instructions at

http://www.tuxyturvy.com/blog/index.php?/archives/31-VMware-ESX-and-ext3-journal-aborts.html#extende...
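
For the impatient, the install at that link boils down to something like the following on RHEL4 (a sketch only; follow the actual instructions, and adjust the kernel version and paths to your system):

    # Back up the stock module and drop in the patched one
    cd /lib/modules/$(uname -r)/kernel/drivers/message/fusion
    cp mptscsih.ko mptscsih.ko.orig
    cp /path/to/patched/mptscsih.ko .    # wherever you unpacked the download
    # Rebuild the initrd so the replacement module is used at boot
    mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)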

So far our internal testing, which was able to reproduce this problem within minutes with the old driver, seems to be showing excellent success at resolving this issue. I would be very interested in hearing others' test results.

Please note that this is provided "as is", so if it breaks, you get to keep all of the pieces. However, it is a VERY simple change that merely reverts the handling of a SCSI_STATUS_BUSY to the previous behavior from RHEL4 U2.

Since this change has been in the mainline kernel for over a year, it will likely affect pretty much any recent distro as well.

Later,

Tom

garybrown
Enthusiast

I was seeing this error on a test box I built today, so I implemented this fix, and there have been no further problems since.

So my question is: who owns the 'supported' fix? Do we have an official word from VMware?

tsightler
Hot Shot

I don't know who would own the "official" fix. VMware might be able to simply change their code to send a different type of "bus busy" signal to the guest; however, that would likely require a significant amount of testing.

Another solution would of course be for LSI Logic to back out this code. However, I suspect the new code is technically the more "correct" code; it just happens to have behavior we probably don't want in a VM (although in a cluster scenario we may still want this behavior).

Another option would be for VMware to ship their own custom version of the mptscsih driver, just like they always shipped a BusLogic driver with ESX 2.5.x. This is my least favorite approach, but somehow the one I think is most likely, because it's probably the only option VMware can control that poses no risk to their other supported environments.

Of course, they may also continue to simply ignore the situation; that appears to be the current strategy: claim every system that has this problem is not on the HCL and dismiss it. Never mind that the problem can be duplicated on local disk, and even on "certified" SAN hardware (I have reproduced it on our Fibre Channel CX400 arrays, although it is much more difficult to trigger there).

I guess we will see. Our production systems with my current "workaround" have both passed 8 days of uptime since I installed the slightly modified drivers. The previous record for one of the systems was 7 days, and usually it managed only 2-3 days under normal load.

Later,

Tom

CTeague
Contributor

My CentOS 4.4 VM, which usually managed only 2-3 days of uptime before I rolled back to the old SCSI driver, has been 100% stable since 10/24.

From my point of view this "fix" takes care of the read-only file system issue in my situation, and this weekend I will apply it to my production VMs and move them back to the SAN datastore.

thanks tsightler!

edp4you
Contributor

We have experienced the same issue at a customer site.

The OS is SLES10, and it has happened randomly many times.

We will open a support incident with VMware.

Very dangerous.

garybrown
Enthusiast

Any resolution to this? We have the same problem on SLES9.

CTeague
Contributor

The link below fixed my issue on CentOS 4.4 VMs.

http://www.tuxyturvy.com/blog/index.php?/archives/31-VMware-ESX-and-ext3-journal-aborts.html#extende...

They have been 100% stable since October with the rolled-back SCSI driver. I have put my Apache servers back into production and have been more than happy with their stability & performance.

tonywieczorek
Contributor

For those who come to this forum and need a VMware fix: after much searching, I've come up with this link: http://kb.vmware.com/vmtnkb/search.do?cmd=displayKC&docType=kc&externalId=51306&sliceId=SAL_Public

We've been running this patch on several of our production Oracle and web servers for a few days now, and no more of those messages.

Hope this saves some trouble for people!

williambishop
Expert

We've had this happen a few times, mostly when changing the active path on our DS4800. When we change from the A controller to the B controller, it happens every time.

--"Non Temetis Messor."
doomdevice
Enthusiast

I've got this issue at a customer site too, with SuSE Linux Enterprise Server 10, and published my way of changing the driver here:

http://www.vmachine.de/kb/index.php/Linux_Kernel_2.6_Problem_-_Read-Only_Filesystem_nach_Path_Failov...

The text is in German, but the commands should be understandable for everyone.

Furthermore, SLES9 should work in nearly the same way.

Dennis

VI PowerScripter [http://www.powerscripter.net] - every click can be a customized function within the VI client
egr
Contributor

Hi,

not very nice...

Unfortunately I have to upgrade our ESX 2.5.2 to 3.0.1.

We have an EMC CX300 and, for example, SuSE Linux Enterprise Server 9 SP3 running an IBM Domino cluster...

So my question:

Does this also happen when using ReiserFS instead of EXT3?

Thanks in advance.

/egr

tsightler
Hot Shot

egr wrote:
Hi, not very nice... Unfortunately I have to upgrade our ESX 2.5.2 to 3.0.1. We have an EMC CX300 and, for example, SuSE Linux Enterprise Server 9 SP3 running an IBM Domino cluster... So my question: does this also happen when using ReiserFS instead of EXT3?

Actually, that's an excellent question. My non-expert opinion is that ReiserFS would very likely have this problem as well. The issue is not directly related to ext3, but rather to the way the mptscsih driver reports a BUS BUSY condition back to the SCSI mid-layer. This can create both minor and major timeouts.

Now, interestingly, ext3 is oversensitive to these minor errors, at least in RHEL4. This has been fixed in RHEL4 for kernels 2.6.9-42.0.8 and above, but that fix was not enough to resolve the VMware issue because major timeouts are still a failure mode for ext3 (as they should be). Effectively, the mid-layer reports write errors on the disk, and I would suspect that both ext3 and ReiserFS would fail in that scenario. Actually, I think I remember reading that ReiserFS is even more paranoid about write failures, although this might be a little dated because I think I read it in the paper at http://www.cs.wisc.edu/wind/Publications/sfa-dsn05.pdf
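
One guest-side note: you can check how ext3 is configured to react to ordinary I/O errors with tune2fs, although as far as I can tell a journal abort forces the read-only remount even with errors=continue (the device name below is only an example):

    # Prints continue, remount-ro, or panic
    tune2fs -l /dev/sda2 | grep -i "error behavior"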

Now, I'm not a SuSE user, but a friend of mine who is says he has had good success with SLES9 and the BusLogic driver, which I think is still a supported configuration for SLES9 on ESX 3 as long as the VM has less than 4GB of RAM.

Later,

Tom

egr
Contributor

Hi tsightler,

thanks for your answer.

After reading and studying, I am also of the opinion that this should happen on both EXT3 and ReiserFS.

Furthermore, SuSE will probably use EXT3 as the default in future versions (as in openSUSE 10.2).

BusLogic would be the supported solution, but our SLES guest running Domino DBs (incl. ERP, desktop support tools, etc.) is using LSI Logic.

I think I will migrate the SLES9 Domino cluster node to ESX 3.0.1 and patch the SCSI driver.

If it doesn't work... well, it's a cluster ;-)

My Domino colleagues will thank me...

/egr

mj820
Contributor

VMware has provided a workaround for this problem. Please see the link below.

http://kb.vmware.com/KB/51306

managedservices
Contributor

I've recently come upon this same problem. I've found the KB 51306 'fix', but it will not even complete installation. The initial error I receive is a failure to complete the 'mv -f' command to back up the old mptscsi.ko file, because that file does not exist. I've created a file in its place with this name, run the install again, and I receive this:

"Failed to build the new initrd for 2.6.5-7.283-default kernel. Installation Failed"

Now, I realize that the example in the doc says 2.6.5-7.244-default, which would indicate a different kernel version (I'm not a Linux guy, so I'm just guessing). However, when I run 'SPident' I receive:

"found SLES-9-i386-SP3 + Online Updates"

That leads me to believe that my OS is up to date.
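
For what it's worth, this is how I'm checking which fusion modules actually exist on the box, in case the installer is simply looking for the wrong filename:

    find /lib/modules/$(uname -r) -name 'mpt*'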

Any suggestions?

dalepa
Enthusiast

We are running ESX 3.0.1 using an NFS datastore, and we also see this problem during an NFS cluster failover. If the failover takes more than 60 seconds, the RHEL4 U4 and RHEL5 filesystems go read-only. We need to raise the timeout from 60 seconds to 300+ seconds.
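
The only guest-side knob we've found so far is the per-device timeout that 2.6 kernels expose in sysfs; something like this may work (the path assumes the first SCSI disk, and the value does not survive a reboot, so it would have to go into rc.local or similar):

    cat /sys/block/sda/device/timeout      # current value, in seconds
    echo 300 > /sys/block/sda/device/timeout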

We know the datastore goes offline during the NAS failover. If you are using a SAN and this occurs, you should be looking into why your SAN is going offline for more than 60 seconds.

Anyone know a better or officially supported way to increase the (SCSI) timeout?

astanfor
Contributor

On RHEL5, this blog link:

http://www.tuxyturvy.com/blog/index.php?/archives/31-VMware-ESX-and-ext3-journal-aborts.html

has a link to patches that installed successfully. Thanks go to Tom for this!

http://www.tuxyturvy.com/files/fusion-el5.tar.gz
