|
75.
Re: ESX 3.0.1 - Linux Guests go ReadOnly Jun 7, 2007 6:50 PM
Ok, I've got the driver installed, and we haven't had the file system (ReiserFS 3) go read only yet. However, I'm concerned. I'm still seeing tons of mpt errors on the SLES 10 server in /var/log/messages. Such as:
Jun 7 01:00:41 MoodleFinal kernel: mptbase: ioc0: IOCStatus(0x0002): Busy
Jun 7 01:01:02 MoodleFinal kernel: mptbase: ioc0: IOCStatus(0x0002): Busy
Jun 7 10:35:15 MoodleFinal kernel: mptscsih: ioc0: attempting task abort! (sc=c1df1980)
Jun 7 10:35:15 MoodleFinal kernel: mptbase: ioc0: IOCStatus(0x004b): SCSI IOC Terminated
Jun 7 10:35:15 MoodleFinal kernel: mptscsih: ioc0: task abort: SUCCESS (sc=c1df1980)

I'm sincerely hoping that, even in the presence of these errors, the file system driver doesn't continue to remount the volume as read only. I already announced to our community that the problem has been fixed. Thankfully it hasn't happened yet since the driver update.

Some of you have asked, so I'm posting the relevant error entries in ESX /var/log/vmkernel:

Jun 7 20:14:11 esx1 vmkernel: 145:08:22:29.648 cpu3:1687)VSCSI: 1829: Reset request on handle 8335 (0 outstanding commands)
Jun 7 20:14:11 esx1 vmkernel: 145:08:22:29.648 cpu3:1687)VSCSI: 1829: Reset request on handle 8336 (0 outstanding commands)
Jun 7 20:14:11 esx1 vmkernel: 145:08:22:29.648 cpu3:1663)VSCSI: 2028: Resetting handle 8335 [0/0]
Jun 7 20:14:11 esx1 vmkernel: 145:08:22:29.648 cpu3:1663)SCSI: 3222: handle 306590 / orig 0x7d71bb0
Jun 7 20:14:11 esx1 vmkernel: 145:08:22:29.648 cpu3:1663)VSCSI: 1878: Completing reset on handle 8335 (0 outstanding commands)
Jun 7 20:14:11 esx1 vmkernel: 145:08:22:29.648 cpu3:1663)VSCSI: 2028: Resetting handle 8336 [0/0]
Jun 7 20:14:11 esx1 vmkernel: 145:08:22:29.648 cpu3:1663)SCSI: 3222: handle 735951 / orig 0x7d70e08
Jun 7 20:14:11 esx1 vmkernel: 145:08:22:29.648 cpu3:1663)<6>scsi(1:0:0:12): DEVICE RESET SUCCEEDED.
Jun 7 20:14:11 esx1 vmkernel: 145:08:22:29.648 cpu3:1663)VSCSI: 1878: Completing reset on handle 8336 (0 outstanding commands)
Jun 7 20:17:24 esx1 vmkernel: 145:08:25:42.730 cpu2:1040)WARNING: SCSI: 1726: Unexpected status returned: bad000a I/O error

Where should I go from here to eliminate these errors?
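To get a feel for how often each mpt status actually shows up in the guest log, the messages can be tallied with a few lines of Python. This is only a rough sketch: the regular expression is inferred from the `mptbase: ioc0: IOCStatus(...)` lines quoted above, the function name `tally_ioc_status` is made up for illustration, and the sample input is just the excerpt from this post.

```python
import re
from collections import Counter

# Pattern inferred from the log excerpt in this post, e.g.
#   "... kernel: mptbase: ioc0: IOCStatus(0x0002): Busy"
IOC_STATUS = re.compile(
    r"mptbase: (ioc\d+): IOCStatus\((0x[0-9a-fA-F]+)\): (.+?)\s*$"
)


def tally_ioc_status(lines):
    """Count mptbase IOCStatus messages per (controller, code, text)."""
    counts = Counter()
    for line in lines:
        match = IOC_STATUS.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


# Sample input: the guest log lines quoted in the post.
sample = [
    "Jun 7 01:00:41 MoodleFinal kernel: mptbase: ioc0: IOCStatus(0x0002): Busy",
    "Jun 7 01:01:02 MoodleFinal kernel: mptbase: ioc0: IOCStatus(0x0002): Busy",
    "Jun 7 10:35:15 MoodleFinal kernel: mptscsih: ioc0: attempting task abort! (sc=c1df1980)",
    "Jun 7 10:35:15 MoodleFinal kernel: mptbase: ioc0: IOCStatus(0x004b): SCSI IOC Terminated",
    "Jun 7 10:35:15 MoodleFinal kernel: mptscsih: ioc0: task abort: SUCCESS (sc=c1df1980)",
]

for (ioc, code, text), n in tally_ioc_status(sample).most_common():
    print(f"{n:6d}  {ioc}  {code}  {text}")
```

For a real log, an open file object can be passed in place of `sample`, for example `tally_ioc_status(open("/var/log/messages", errors="replace"))`; reading that file normally needs root.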
Also, we're not currently backing up our Linux VMs (gasp), but we're going to be implementing ESX ranger within a month or two, so heavy disk I/O during backup isn't the cause. I also don't believe our "read only" issue is related to storage load; it seems to happen randomly.

Our hardware:
IBM bladecenter
2 LS20 blades running ESX 3.0.0
Qlogic SFF FC adapters
Bladecenter integrated 14x2 port FC switch module
Qlogic Sanbox 3050
IBM FasTt 600 and Nexsan Satablade SAN arrays

The two FC switches run in tandem with 2 uplinks between them for bandwidth and failover between the switches. The SAN arrays are connected to the 8 port Sanbox 3050. We are not multipathing. Both FC ports on the Satablade are connected to the switch and I'm exposing LUNs from both ports. I have mapped raw LUNs exposed out of both ports for bandwidth balancing, but I'm only exposing a given LUN on one port since we're not multipathing. The FasTt 600 only has one port connected to the switch. |
|
Hi - U5? I am running the 55.0.2 kernel from Red Hat following this link http://kbase.redhat.com/faq/FAQ_85_10846.shtm but have not got any more U5 updates (running up2date), largely because /usr is almost full on the guest machine. So I am running RHEL 4 AS with a U5 kernel, but will this fix my issue? And are there any other specific RPMs I should get hold of?
Thanks Andy |
|
Has anybody tried this fix yet (for RH5)? Does it work or is tsightler's fix still required?
http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&externalId=1001778
I'm a little confused by the statement: "it also requires a SCSI mid-layer patch". Where exactly is the SCSI mid-layer patch applied: the OS, the HBA firmware on my side or the SAN side, or the fiber switch? If VMware isn't providing it, are we waiting on Red Hat, LSI, or someone else? |
|
83.
Re: ESX 3.0.1 - Linux Guests go ReadOnly Aug 31, 2007 2:56 PM
The fix posted by VMware for RHEL5 is equivalent to my fix for RHEL5 which I have posted on my website and is basically just a patch to the LSIlogic SCSI driver which gives the required "retry forever" behavior.
VMware is correct in stating that RHEL5 has another issue, caused by code in the SCSI mid-layer, which means that even with the fix in the LSIlogic SCSI driver it's still possible for RHEL5 to time out during very long stalls in I/O. This is much more difficult to patch because I believe it requires changes to code that is compiled directly into the kernel rather than simply loaded as a module.

That being said, the SCSI mid-layer tolerates a pretty long stall before it times out, long enough that I'm not really sure it's a major issue. The old problem was that even fairly short stalls, on the order of 10-30 seconds, could cause the issue, but so far in my testing it seems to take minutes before the SCSI mid-layer in RHEL5 times out. If your storage array is pausing for minutes, you've got serious problems anyway.

I've been running RHEL5 through its paces pretty hard for about a month, first using my own patch and now using VMware's patch, and I haven't been able to trigger this problem just using high loads. I have been able to trigger a timeout by doing things like disconnecting all cables or restarting the storage array, but I would actually expect those things to fail. In my opinion it was always a luxury that RHEL4 with the patch would actually survive a storage array reboot that took minutes.

Later, Tom |
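For anyone repeating this kind of stress test, one way to notice that a guest has tripped is to look for read-only entries among its mounted filesystems. The sketch below is an illustration only: it assumes the usual `/proc/mounts` field layout (device, mount point, type, comma-separated options), the helper name `read_only_mounts` is made up, and the sample text is hypothetical rather than taken from any machine in this thread.

```python
def read_only_mounts(mounts_text, fs_types=("ext3", "reiserfs")):
    """Return (device, mount_point) pairs that are mounted read-only.

    mounts_text is expected to look like the contents of /proc/mounts:
    device, mount point, filesystem type, comma-separated options, ...
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fs_type, options = fields[:4]
        if fs_type in fs_types and "ro" in options.split(","):
            hits.append((device, mount_point))
    return hits


# Hypothetical sample, not output from a real guest.
sample = """\
/dev/sda1 / ext3 rw,data=ordered 0 0
/dev/sdb1 /data ext3 ro,data=ordered 0 0
proc /proc proc rw 0 0
"""

for device, mount_point in read_only_mounts(sample):
    print(f"{device} is mounted read-only on {mount_point}")
```

On a live guest the same function can be fed `open("/proc/mounts").read()`. Note that it cannot tell an error-triggered remount from a filesystem that was deliberately mounted read-only.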
|
Hi All,
For what it's worth, we ran into this issue too. In our case the symptom (ext3 remounting read-only) occurred even after applying the mptscsi hotfix to a 2.6.9-42.0.8 kernel (the kernel that also carries a fix for ext3 on SAN-based storage). We also found another thread that recommends increasing vm.min_free_kbytes to 10240 (see links below). Even with all three of these in place, we still hit the issue!

However, we did finally pinpoint another cause. There is a recent update for ESX 3.0.2 (3.0.2 Update 1) and three patches for this release, detailed in ESX-1002424, ESX-1002425 and ESX-1002429. This update fixes a networking issue on various hardware that leads to high network latency (high ping response times) and extremely slow SAN/iSCSI/NFS/SMB speeds.

To summarize, what we had to do to finally alleviate the issue is this:

1: Set vm.min_free_kbytes = 10240 in /etc/sysctl.conf on guests:
http://www.noah.org/wiki/index.php/VMware_notes
http://communities.vmware.com/message/249823

2: Update to at least kernel 2.6.9-42.0.8, which fixes an issue with ext3 on SAN/iSCSI/NFS turning read-only because certain SCSI layer messages were interpreted as severe:
http://kbase.redhat.com/faq/FAQ_85_9610.shtm
https://bugzilla.redhat.com/show_bug.cgi?id=213921

3: Apply the fix mentioned on this page relating to the mptscsi driver and an upstream patch relating to failover/multipath functionality:
http://www.tuxyturvy.com/blog/index.php?/archives/31-VMware-ESX-and-ext3-journal-aborts.html
http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&externalId=51306
http://communities.vmware.com/thread/58121?tstart=0&start=0

4: Update ESX 3.0.2 to 3.0.2 Update 1 plus three additional fixes (ESX-1002424, ESX-1002425 and ESX-1002429), which fix high networking latency and other issues:
http://communities.vmware.com/thread/97117
http://www.vmware.com/download/vi/
http://www.vmware.com/download/vi/vi3_patches_302.html

We have pretty thoroughly stress-tested the above configuration with a very high I/O and CPU load for about three days now, and it has survived conditions that previously led to a read-only remount pretty quickly (usually in less than an hour). Regards, Rubin. |
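Steps 1 and 2 above are easy to spot-check on a guest with a short script. What follows is a minimal sketch and not a tested tool: the function names are made up, the sample inputs are hypothetical, and the kernel comparison simply compares the leading numeric parts of the release string, which is only meaningful within the same RHEL4 2.6.9 kernel series.

```python
import re

MIN_FREE_KBYTES = 10240       # value from step 1
MIN_KERNEL = "2.6.9-42.0.8"   # kernel release from step 2


def sysctl_value(conf_text, key):
    """Return the last value assigned to key in sysctl.conf-style text, or None."""
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;" or "=" not in line:
            continue  # blank line, comment, or not an assignment
        name, _, val = line.partition("=")
        if name.strip() == key:
            value = val.strip()
    return value


def kernel_tuple(release):
    """Turn '2.6.9-42.0.8.ELsmp' into (2, 6, 9, 42, 0, 8)."""
    match = re.match(r"\d+(?:[.-]\d+)*", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in re.split(r"[.-]", match.group(0)))


def check_guest(conf_text, release):
    """Report whether the guest meets the settings from steps 1 and 2."""
    value = sysctl_value(conf_text, "vm.min_free_kbytes")
    return {
        "min_free_kbytes_ok": value is not None
        and value.isdigit()
        and int(value) >= MIN_FREE_KBYTES,
        "kernel_ok": kernel_tuple(release) >= kernel_tuple(MIN_KERNEL),
    }


# Hypothetical inputs; on a real guest use the contents of /etc/sysctl.conf
# and platform.release() (the same string that `uname -r` prints).
sample_conf = "# VMware guest tuning\nvm.min_free_kbytes = 10240\n"
print(check_guest(sample_conf, "2.6.9-42.0.8.ELsmp"))
```

This only reads the config file, so a value set in /etc/sysctl.conf but not yet loaded with `sysctl -p` (or a reboot) will still show as OK. Steps 3 and 4 are not covered.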