Ok, I've got the driver installed, and we haven't had the file system (ReiserFS 3) go read only yet. However, I'm concerned. I'm still seeing tons of mpt errors on the SLES 10 server in /var/log/messages. Such as:
Jun 7 01:00:41 MoodleFinal kernel: mptbase: ioc0: IOCStatus(0x0002): Busy
Jun 7 01:01:02 MoodleFinal kernel: mptbase: ioc0: IOCStatus(0x0002): Busy
Jun 7 10:35:15 MoodleFinal kernel: mptscsih: ioc0: attempting task abort! (sc=c1df1980)
Jun 7 10:35:15 MoodleFinal kernel: mptbase: ioc0: IOCStatus(0x004b): SCSI IOC Terminated
Jun 7 10:35:15 MoodleFinal kernel: mptscsih: ioc0: task abort: SUCCESS (sc=c1df1980)
I'm sincerely hoping that, even with these errors present, the file system driver doesn't continue to remount the volume read-only. I already announced to our community that the problem has been fixed. Thankfully it hasn't happened since the driver update.
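To see whether the mpt errors are getting more or less frequent after a driver change, it can help to tally them instead of eyeballing /var/log/messages. Below is a minimal sketch, assuming the messages keep the exact format quoted above; the function name `tally_mpt_events` and the regular expressions are my own illustration and not part of any driver tooling.

```python
import re
from collections import Counter

# Matches lines such as:
#   Jun 7 01:00:41 MoodleFinal kernel: mptbase: ioc0: IOCStatus(0x0002): Busy
IOC_STATUS = re.compile(r"mptbase: (ioc\d+): IOCStatus\((0x[0-9a-fA-F]+)\): (.+)$")
# Matches the start of an abort sequence (not the later "task abort: SUCCESS" line).
TASK_ABORT = re.compile(r"mptscsih: (ioc\d+): attempting task abort!")


def tally_mpt_events(lines):
    """Count IOCStatus codes and task-abort attempts found in syslog lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        match = IOC_STATUS.search(line)
        if match:
            controller, code, text = match.groups()
            counts[(controller, code, text.strip())] += 1
            continue
        match = TASK_ABORT.search(line)
        if match:
            counts[(match.group(1), "task abort", "attempted")] += 1
    return counts


if __name__ == "__main__":
    # Path is the one mentioned in the post; adjust for your distro.
    with open("/var/log/messages", errors="replace") as log:
        for key, count in sorted(tally_mpt_events(log).items()):
            print(count, *key)
```

Running it before and after a driver or SAN change gives a rough per-status count to compare.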
Some of you have asked so I'm posting the relevant error entries in ESX /var/log/vmkernel:
Jun 7 20:14:11 esx1 vmkernel: 145:08:22:29.648 cpu3:1687)VSCSI: 1829: Reset request on handle 8335 (0 outstanding commands)
Jun 7 20:14:11 esx1 vmkernel: 145:08:22:29.648 cpu3:1687)VSCSI: 1829: Reset request on handle 8336 (0 outstanding commands)
Jun 7 20:14:11 esx1 vmkernel: 145:08:22:29.648 cpu3:1663)VSCSI: 2028: Resetting handle 8335 [0/0]
Jun 7 20:14:11 esx1 vmkernel: 145:08:22:29.648 cpu3:1663)SCSI: 3222: handle 306590 / orig 0x7d71bb0
Jun 7 20:14:11 esx1 vmkernel: 145:08:22:29.648 cpu3:1663)VSCSI: 1878: Completing reset on handle 8335 (0 outstanding commands)
Jun 7 20:14:11 esx1 vmkernel: 145:08:22:29.648 cpu3:1663)VSCSI: 2028: Resetting handle 8336 [0/0]
Jun 7 20:14:11 esx1 vmkernel: 145:08:22:29.648 cpu3:1663)SCSI: 3222: handle 735951 / orig 0x7d70e08
Jun 7 20:14:11 esx1 vmkernel: 145:08:22:29.648 cpu3:1663)<6>scsi(1:0:0:12): DEVICE RESET SUCCEEDED.
Jun 7 20:14:11 esx1 vmkernel: 145:08:22:29.648 cpu3:1663)VSCSI: 1878: Completing reset on handle 8336 (0 outstanding commands)
Jun 7 20:17:24 esx1 vmkernel: 145:08:25:42.730 cpu2:1040)WARNING: SCSI: 1726: Unexpected status returned: bad000a I/O error
Where should I go from here to eliminate these errors?
Also, we're not currently backing up our Linux VMs (gasp) but we're going to be implementing ESX ranger within a month or two. So heavy disk I/O during backup isn't the cause. And our "read only" issue I don't believe is related to storage load. It seems to happen randomly.
Our hardware:
IBM bladecenter
2 LS20 blades running ESX 3.0.0
Qlogic SFF FC adapters
Bladecenter integrated 14x2 port FC switch module
Qlogic Sanbox 3050
IBM FasTt 600 and Nexsan Satablade SAN arrays
The two FC switches run in tandem with 2 uplinks between them for bandwidth and failover between the switches. The SAN arrays are connected to the 8 port Sanbox 3050. We are not multipathing. Both FC ports on Satablade are connected to the switch and I'm exposing LUNs from both ports. I have mapped raw LUNs exposed out of both ports for bandwidth balancing but I'm only exposing a given LUN on one port since we're not multipathing. The FasTt 600 only has one port connected to the switch.
Anyone try this with RHEL 5 ?
My kernel version is 2.6.18-8.1.1.el5.
I am trying to compile the module from source and am getting the following errors.
In file included from /home/sri/mptscsi_vmware/mptscsi-rhel-3.02.62.01/mptbase.c:50:
include/linux/config.h:6:2: warning: #warning Including config.h is deprecated.
/home/sri/mptscsi_vmware/mptscsi-rhel-3.02.62.01/mptbase.c: In function ‘mpt_suspend’:
/home/sri/mptscsi_vmware/mptscsi-rhel-3.02.62.01/mptbase.c:1646: error: switch quantity not an integer
/home/sri/mptscsi_vmware/mptscsi-rhel-3.02.62.01/mptbase.c: In function ‘MakeIocReady’:
/home/sri/mptscsi_vmware/mptscsi-rhel-3.02.62.01/mptbase.c:2510: error: implicit declaration of function ‘crashdump_mode’
make[2]: *** [/home/sri/mptscsi_vmware/mptscsi-rhel-3.02.62.01/mptbase.o] Error 1
make[1]: *** [_module_/home/sri/mptscsi_vmware/mptscsi-rhel-3.02.62.01] Error 2
make[1]: Leaving directory `/usr/src/kernels/2.6.18-8.1.4.el5-i686'
Sriram
Same problem here. Also, the in-kernel driver is somewhat newer than the version provided by VMware.
I currently believe that the solution in RHEL4 U5 (kernel versions >= 2.6.9-55.EL) does fix this issue acceptably. I still think it may be possible for a very long (say >5 minute) pause in IO to cause the error to occur but if your SAN is pausing/stopping IO for >5 minutes then you probably have other issues you should address.
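For anyone checking a batch of guests against that 2.6.9-55.EL threshold, here is a rough sketch of the comparison. It only compares the leading numeric fields of the `uname -r` string and is not a full RPM version comparison; the helper names `kernel_key` and `has_rhel4_u5_fix` are made up for illustration.

```python
import platform
import re


def _leading_ints(text):
    """Return the leading dotted number of text as a tuple, e.g. '55.0.2.EL' -> (55, 0, 2)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    return tuple(int(part) for part in match.group(0).split(".")) if match else ()


def kernel_key(release):
    """Split '2.6.9-55.0.2.EL' into ((2, 6, 9), (55, 0, 2)) for comparison."""
    version, _, build = release.partition("-")
    return _leading_ints(version), _leading_ints(build)


def has_rhel4_u5_fix(release):
    """True for RHEL4-series (2.6.9) kernels at or above build 55.

    Assumption: only the 2.6.9 series is judged here. RHEL5 (2.6.18) kernels
    need a separate fix, so they deliberately return False.
    """
    version, build = kernel_key(release)
    return version == (2, 6, 9) and build >= (55,)


if __name__ == "__main__":
    running = platform.release()  # same string as `uname -r`
    print(running, "has RHEL4 U5 fix:", has_rhel4_u5_fix(running))
```

A suffix such as `smp` after `EL` is ignored, since only the numeric fields are compared.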
I've run every stress test I can throw at it, including rebooting my Equallogic array and removing the zones on our CX700, and so far have not seen the issue with these kernels. Based on my testing results I moved our development Oracle systems, and several other fairly high I/O systems, to this kernel without patches, and they have survived several weeks without issue. These systems would typically fail in days, or even hours, with the 2.6.9-42.x series of kernels.
Based on this, I believe the issue is "fixed" for RHEL4 Update 5, however, that still leaves a lot of distros that either still need the VMware fix or the manual compiled driver with my small patch.
RHEL5 still needs a "fix" as well, and since it's not yet officially supported by VMware for ESX my guess is it won't get a formal fix until it is certified. I plan to post a patched driver for RHEL5 on my website in the next day or so. I would suspect this newer driver would also work with other "newish" distros based on more recent kernels.
Later,
Tom
Is anyone running the latest Update 4 kernel but all the other Update 5 patches from RHN? Was wondering if that may be a good route to go until they fully support Update 5.
Does anyone know of a good thread that talks about RHEL 4 Update 5 support? This one has good info, I just can't find any others.
Thanks
I've been happily running RHEL4 U5 on about a dozen VMs for weeks now with no issues so far, so while RHEL4 U5 is not yet officially supported, it seems to work fine.
That being said I'm sure you can happily run the RHEL4 U4 kernel with all of the U5 updates if you want to.
Later,
Tom
For those running SLES9 or SLES10, see Novell's TID 3584352:
SUSE Linux Enterprise 10
This issue has been fixed as of Service Pack 1.
SUSE Linux Enterprise Server 9 and Open Enterprise Server (Linux based)
This issue has been fixed as of kernel 2.6.5-7.286.
Hi - U5? I am running the 55.0.2 kernel from Red Hat following this link http://kbase.redhat.com/faq/FAQ_85_10846.shtm but have not got any more U5 updates (running up2date), largely because my /usr is almost full on the guest machine. So I am running RHEL 4 AS with a U5 kernel - but will this fix my issue? And are there any other specific RPMs I should get hold of?
Thanks
Andy
Has anybody tried this fix yet (for RH5)? Does it work or is tsightler's fix still required?
http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&externalId=1001778
I'm a little confused by the statement: "it also requires a SCSI mid-layer patch"
Where exactly is the SCSI mid-layer patch applied? The OS, the HBA firmware on my side or the SAN side, the fibre switch? I'm just a little confused: if VMware isn't providing it, are we waiting on Red Hat, LSI, or someone else?
The fix posted by VMware for RHEL5 is equivalent to my fix for RHEL5 which I have posted on my website and is basically just a patch to the LSIlogic SCSI driver which gives the required "retry forever" behavior.
VMware is correct in stating that RHEL5 has another issue caused by code in the SCSI mid-layer that means that, even with the fix in the LSIlogic SCSI driver, it's still possible for RHEL5 to timeout during very long stalls in I/O. This is much more difficult to patch because I believe that it requires changes to code that is compiled directly into the kernel rather than simply loaded as a module.
That being said, the SCSI mid-layer takes a pretty large stall before it times out, enough that I'm not really sure it's a major issue. The old problem was that even fairly short stalls, on the order of 10-30 seconds, could cause the issue, but, so far in my testing, it seems to take minutes before the SCSI mid-layer in RHEL5 times out. If your storage array is pausing for minutes, you've got serious problems anyway.
I've been running RHEL5 through the paces pretty hard for about a month, first using my own patch, and now using VMware's patch, and I haven't been able to trigger this problem just using high loads. I have been able to trigger a timeout by doing things like disconnecting all cables or restarting the storage array, but I would actually expect those things to fail. In my opinion it was always a luxury that RHEL4 with the patch would actually survive a storage array reboot that took minutes.
Later,
Tom
Tom,
I've been testing the waters on our ESX 3.0.1 (w/ all patches) cluster w/ a bunch of Centos 5 virtual machines, and all of them tend to go ReadOnly at some point. Same behavior as the last time, while the Centos 4 boxes happily crank along. I'm running stock Centos kernels, so I'll look at the patches to see what might need to be done and automate it. I've also been seeing this with Rpath based appliances running the later 2.6.18 kernels.
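Since the read-only remount tends to go unnoticed until something fails to write, a small check that can be run from cron or a monitoring agent may be useful. This is a sketch only, assuming the standard /proc/mounts layout (device, mount point, type, options); the function name `read_only_mounts` and the default filesystem list are illustrative choices. It reads /proc/mounts and not /etc/mtab because, as far as I know, mtab is not updated when the kernel remounts a file system on error.

```python
def read_only_mounts(mounts_text, fstypes=("ext3", "reiserfs")):
    """Return (device, mountpoint, fstype) for watched filesystems mounted read-only."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        # Options are comma separated; look for the exact flag "ro",
        # not a substring (so "errors=remount-ro" does not count).
        if fstype in fstypes and "ro" in options.split(","):
            hits.append((device, mountpoint, fstype))
    return hits


if __name__ == "__main__":
    with open("/proc/mounts") as mounts:
        for device, mountpoint, fstype in read_only_mounts(mounts.read()):
            print("READ-ONLY:", device, "on", mountpoint, "(" + fstype + ")")
```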
Damin,
Are you saying you are seeing this even with the VMware provided RHEL5 LSILogic patches or the RHEL5 patches from my site? I know that a completely unpatched RHEL5 system still has the problem, that's what I stated above (although perhaps not clearly).
Also, if you have access you may want to test the RHEL5.1 beta kernels, as they appear to include the same fixes that Red Hat included in RHEL4 U5.
Later,
Tom
Unpatched RedHat 5 kernels (Centos) will break. I've patched several of the virtual machines w/ your Centos patches and so far so good. I guess I was under the mistaken impression that the mainlined redhat kernels in R5 had the patches.
OK, good, that's what I thought.
So just to clarify for future readers of the thread here is the current status as far as I know:
RHEL4 U3 & U4 -- VMware provides a patch for these OS versions in VMware KB ID 51306.
You can also upgrade to RHEL4 U5 which includes the required fixes.
SLES9 SP3, SLES10 -- VMware provides a patch for these OS versions in VMware KB ID 51306.
RHEL5 FCS (Initial Release) -- VMware provides a patch for these OS versions in VMware KB ID 1001778
The RHEL5 U1 beta kernel includes a fix for this issue and thus it is expected that when RHEL5 U1 final is released (probably sometime in October) it will also include the fix.
For other, unsupported distros (or even the distros above if you don't want VMware's patches for some reason) you can also continue to use the patched drivers available on my site, or follow the generic instructions to manually apply the source patch to the driver included with your distro's kernel source package. I have success reports from many Debian and Ubuntu users as well as users of several other distros.
Later,
Tom
Hi All,
For what it's worth, we ran into this issue too. In our case the symptom (ext3 remounting read-only) occurred even after applying the mptscsi hotfix to a 2.6.9-42.0.8 kernel (the one that carries another fix for ext3 on SAN-based storage). We also found another thread that recommends increasing vm.min_free_kbytes to 10240 (see links below). Even with all three of these in place, we still hit the issue!
However, we did finally pinpoint another cause. There is a recent update for ESX 3.0.2 (3.0.2 Update 1) plus three further patches for that release, detailed in ESX-1002424, ESX-1002425 and ESX-1002429. These updates fix a problem on various hardware that causes high network latency (high ping response times) and extremely slow SAN/iSCSI/NFS/SMB speeds.
To summarize, what we had to do to finally alleviate the issue is this:
1: Set vm.min_free_kbytes = 10240 in /etc/sysctl.conf on guests:
http://www.noah.org/wiki/index.php/VMware_notes
http://communities.vmware.com/message/249823
2: Update to at least kernel 2.6.9-42.0.8 which fixes an issue with ext3fs on san/iscsi/nfs turning read-only by interpreting certain scsi layer messages as severe:
http://kbase.redhat.com/faq/FAQ_85_9610.shtm
https://bugzilla.redhat.com/show_bug.cgi?id=213921
3: Apply the fix mentioned on this page relating to the mptscsi driver and an upstream patch relating to failover/multipath functionality:
http://www.tuxyturvy.com/blog/index.php?/archives/31-VMware-ESX-and-ext3-journal-aborts.html
http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&externalId=51306
ESX 3.0.1 - Linux Guests go ReadOnly
4: Update ESX 3.0.2 to 3.0.2 Update 1 + three additional fixes: ESX-1002424, ESX-1002425 and ESX-1002429, fixing high networking latency and other issues:
http://www.vmware.com/download/vi/
http://www.vmware.com/download/vi/vi3_patches_302.html
We have pretty thoroughly stress-tested the above configuration with a very high load on io and cpu for about three days now and have survived stuff that would previously lead to remounting read-only pretty quickly (used to be less than an hour in most cases).
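Step 1 is easy to get wrong across many guests (a typo in the key, or the line left commented out), so here is a small sketch that checks it. It assumes the plain `key = value` format of /etc/sysctl.conf; the helper names `parse_sysctl_conf` and `min_free_ok` are hypothetical, and treating any value at or above 10240 as acceptable is my assumption, not something stated in the linked notes.

```python
def parse_sysctl_conf(text):
    """Parse sysctl.conf-style text into a dict, skipping comments and blanks."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def min_free_ok(settings, wanted=10240):
    """True if vm.min_free_kbytes is set to at least `wanted` (assumed threshold)."""
    value = settings.get("vm.min_free_kbytes", "")
    return value.isdigit() and int(value) >= wanted


if __name__ == "__main__":
    try:
        with open("/etc/sysctl.conf") as conf:
            print("sysctl.conf ok:", min_free_ok(parse_sysctl_conf(conf.read())))
        # The live value; it differs from the file until `sysctl -p` or a reboot.
        with open("/proc/sys/vm/min_free_kbytes") as live:
            print("running value:", live.read().strip())
    except OSError as err:
        print("could not read:", err)
```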
Regards,
Rubin.
Has anybody else had this problem with the addition of corrupted file systems? Almost all our rhel4u3 vms encountered this at some point this autumn and became unusable afterwards.
I only see people in this thread talking about the file systems being remounted read-only. Do your VMs also get kernel panics at reboot after this has happened?