The file systems on several of our RHEL4 guest OSs have suddenly started becoming read-only. ESX is 3.0.0. The two file systems affected are usually / and /var. Both systems are HP DL385s with QLogic 2430 cards. Any ideas what might be causing this?
Natiboy, what are you using for switches? We have a Cisco 9509.
What scsi adapters are you using in your Guest OS (Buslogic or LSILogic)?
There are problems with the Buslogic driver. All should be using LSILogic drivers in your linux guest.
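If you're not sure which driver a guest actually loaded, a quick check from inside the Linux guest (just a sketch; the fallback message is illustrative):

```shell
# Check which virtual SCSI driver the guest loaded.  Both adapters
# register as kernel modules when loaded dynamically, so lsmod is
# usually enough to tell them apart.
lsmod 2>/dev/null | grep -E 'mptscsih|mptbase|BusLogic' \
    || echo "no matching module listed (driver may be built into the kernel)"
```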
All guests are using the LSI Logic adapters.
I've posted this in another thread, but in the interest of getting the most visibility for this issue I'll post here as well.
I've researched this issue and believe it is caused by a change to the LSI Logic driver that went into the Linux kernel on Sept 15, 2005 and is slowly making its way into current distros (it was included in RHEL4 U3 and higher). The specific change is here:
The specifics of this change and its negative impact on RHEL4 U3 guests running under VMware are documented in Red Hat Bugzilla 197158.
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=197158
I have posted a patched driver which reverts this small change along with installation instructions at
So far our internal testing, which was able to reproduce this problem within minutes with the old driver, seems to show excellent success at resolving this issue. I would be very interested in hearing others' test results.
Please note that this is provided "as is", so if it breaks, you get to keep all of the pieces. However, it is a very simple change that just reverts the handling of SCSI_STATUS_BUSY to the RHEL4 U2 behavior.
Since this change has also been in the mainline kernel for over a year, I have no doubt it will affect pretty much any recent distro.
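As a quick way to tell whether a given guest is in the affected range (a sketch; version strings vary by distro, and the module may not record a version at all):

```shell
# Print the running kernel and the mptscsih module version, if any.
# Kernels rebased after 2005-09-15 (RHEL4 U3's 2.6.9-34 and later)
# carry the changed SCSI_STATUS_BUSY handling discussed above.
uname -r
modinfo mptscsih 2>/dev/null | grep -i '^version' \
    || echo "mptscsih module not found (or version not recorded)"
```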
Later,
Tom
I was seeing this error on a test box I built today so implemented this fix and no further problems since.
So my question is: who owns the 'supported' fix? Do we have an official word from VMware?
I don't know who would own the "official" fix. It's possible that VMware could simply change their code to send a different type of "bus busy" signal to the guest; however, that would likely require a significant amount of testing.
Another solution would of course be for LSI Logic to back out this code; however, I suspect the new code may technically be the more "correct" code. It just happens to have behavior we probably don't want in a VM (although in a cluster scenario we may still want it).
Another option would be for VMware to ship their own custom version of the mptscsih driver just like they always shipped a BusLogic driver with ESX 2.5.x. This is my least favorite approach, but somehow the one I think most likely because it's probably the only one VMware can control that offers no risk to their other supported environments.
Of course, they may also simply continue to ignore the situation, which appears to be the current strategy: claim every system that has this problem is not on the HCL and move on. Never mind that it can be duplicated on local disk, and even on "certified" SAN hardware (I have reproduced it on our Fibre Channel CX400 arrays, although it is much more difficult to trigger there).
I guess we will see. Our production systems with my current "workaround" have both made 8 days of uptime since I installed these slightly modified drivers. The previous record for one of the systems was 7 days, and usually only 2-3 days of normal load.
Later,
Tom
My CentOS 4.4 VM which usually went 2-3 days of uptime (before rolling back to the old scsi driver) has been 100% stable since 10/24.
From my point of view this "fix" takes care of the read-only file system for my situation and this weekend I will apply it to my production VMs and move them back to the SAN data store.
thanks tsightler!
We have experienced the same issue at a customer site.
The OS is SLES10 and it has happened randomly many times.
We will open a support incident with VMware.
Very dangerous.
Any resolution to this? We have the same problem on SLES9.
The link below fixed my issue on CentOS 4.4 VMs.
They have been 100% stable since October with the rolled back SCSI driver. I have put my Apache servers back into production and have been more than happy with their stability & performance.
For those who come to this forum and need a VMWare fix, after much searching I've come up with this link: http://kb.vmware.com/vmtnkb/search.do?cmd=displayKC&docType=kc&externalId=51306&sliceId=SAL_Public
We've been running this patch on several of our production Oracle and web servers for a few days now, and no more of those messages.
Hope this saves some trouble for people!
We've had this happen a few times, mostly when changing the active path on our DS4800. When we change from the A controller to the B controller, it happens every time.
I've got this issue at a customer site too, with SuSE Linux Enterprise Server 10, and published my way of changing the driver here:
The text is in German, but the commands should be understandable for everyone.
Furthermore, SLES9 should work nearly the same way.
Dennis
Hi,
not very nice...
Unfortunately I have to upgrade our ESX 2.5.2 to 3.0.1.
We have an EMC CX300 and, for example, SuSE Enterprise Server 9 SP3 running an IBM Domino cluster...
So my question:
Does this also happen when using ReiserFS instead of EXT3?
Thanks in advance.
/egr
Actually, that's an excellent question. My non-expert opinion is that ReiserFS would very likely have this problem as well. The issue is not directly related to ext3, but rather to the way the mptscsih driver reports a BUS_BUSY condition back to the SCSI mid-layer. This can create both minor and major timeouts.
Now, interestingly, ext3 is "oversensitive" to these minor errors, at least in RHEL4. That has been fixed in RHEL4 kernels 2.6.9-42.0.8 and above, but the fix was not enough to resolve the VMware issue, because major timeouts are still a failure mode for ext3 (as they should be). Effectively, the mid-layer reports write errors on the disk, and I would suspect that both ext3 and ReiserFS fail in that scenario. Actually, I think I remember reading that ReiserFS is even more paranoid about write failures, although this may be a little dated because I think I read it in the paper at http://www.cs.wisc.edu/wind/Publications/sfa-dsn05.pdf
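For what it's worth, you can see exactly how an ext3 volume will react to write errors with tune2fs (a sketch; /dev/sda1 is just an example device, and changing the behavior only masks the driver problem rather than fixing it):

```shell
DEV=${1:-/dev/sda1}    # example device; point at your root filesystem
# "Errors behavior" is what turns a journal abort into a read-only
# remount: remount-ro (the usual default), continue, or panic.
tune2fs -l "$DEV" 2>/dev/null | grep -i 'errors behavior' \
    || echo "could not read $DEV (wrong device, or not ext2/ext3?)"
# To change it (not recommended; it hides the underlying failure):
# tune2fs -e continue "$DEV"
```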
Now, I'm not a SuSE user, but a friend of mine who is tells me he has had good success with SLES9 and the BusLogic driver, which I believe is still a supported configuration for SLES9 on ESX 3 as long as the VM has less than 4GB of RAM.
Later,
Tom
Hi tsightler,
thanks for your answer.
After reading and studying, I am also of the opinion that this should happen with both EXT3 and ReiserFS.
Furthermore, SuSE will probably use EXT3 as the default in future versions (as in OpenSuSE 10.2).
BusLogic would be the supported solution, but our SLES guests running Domino DBs (incl. ERP, desktop support tools, etc.) use LSI.
I think I will migrate the SLES9 Domino cluster node to ESX 3.0.1 and patch the SCSI driver.
If it doesn't work... it's a cluster
My Domino colleagues will thank me...
/egr
VMware has provided a workaround for this problem. Please see the link below.
I've recently run into this same problem. I found the KB 51306 'fix', but it will not even complete installation. The initial error is a failure of the 'mv -f' command to back up the old mptscsi.ko file, because the file does not exist. I created a file with that name in its place, ran the install again, and received this:
"Failed to build the new initrd for 2.6.5-7.283-default kernel. Installation Failed"
Now I realize that the example in the doc says 2.6.5-7.244-default, which suggests a different kernel version (I'm not a Linux guy, so I'm just guessing). However, when I run 'SPident' I get:
"found SLES-9-i386-SP3 + Online Updates"
That leads me to believe that my OS is up to date.
Any suggestions?
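One thing worth trying by hand, under the assumption of standard SLES9 SP3 paths: on some installs the module is named mptscsih.ko rather than the mptscsi.ko the KB script moves, which would explain the failed 'mv -f'. A sketch:

```shell
KVER=$(uname -r)           # e.g. 2.6.5-7.283-default on SLES9 SP3
MODDIR=/lib/modules/$KVER/kernel/drivers/message/fusion
# First confirm what the module is actually called on this system:
ls "$MODDIR"/mptscsi* 2>/dev/null || echo "no mptscsi* module under $MODDIR"
# Then rebuild the initrd against the running kernel (SLES mkinitrd
# syntax; other distros' mkinitrd tools take different arguments):
command -v mkinitrd >/dev/null \
    && mkinitrd -k /boot/vmlinuz-"$KVER" -i /boot/initrd-"$KVER" \
    || echo "mkinitrd not available or failed here"
```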
We are running ESX 3.0.1 with an NFS datastore. We also see this problem during an NFS cluster failover: if the failover takes more than 60 seconds, the RHAS4 U4 and RHEL5 filesystems go read-only. We needed to raise the timeout from 60 seconds to 300+ seconds.
We know the datastore goes offline during the NAS failover; if you are using a SAN and this occurs, you should be looking into why your SAN is going offline for more than 60 seconds.
Anyone know how to increase (scsi) timeout?
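On 2.6 kernels the usual knob is the per-device timeout file under sysfs. A sketch (the sysfs root is parameterized only so the loop can be tried safely; 300 s matches the failover window mentioned above):

```shell
SYSFS_ROOT=${1:-/sys/block}
# Raise the SCSI command timeout from the 60 s default to 300 s so a
# slow NFS/SAN failover is retried rather than surfacing as a write
# error.  Note this resets at reboot; persist it with a boot script.
for t in "$SYSFS_ROOT"/sd*/device/timeout; do
    [ -w "$t" ] || continue            # skip if absent or not writable
    echo 300 > "$t"
    printf '%s -> %ss\n' "$t" "$(cat "$t")"
done
```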
On RHEL5, this blog post has a link to patches that installed successfully:
http://www.tuxyturvy.com/blog/index.php?/archives/31-VMware-ESX-and-ext3-journal-aborts.html
Thanks go to Tom for this!