VMware
99 Replies. Last post: Mar 8, 2008 10:19 PM by Damin

Re: ESX 3.0.1 - Linux Guests go ReadOnly

45. Jan 20, 2007 11:26 AM in response to: osterber
tsightler (Hot Shot, 177 posts since Sep 30, 2005)
I'm seeing this behavior (Linux guest goes read-only
suddenly) on a guest OS that is running on _local_
storage. (I.e., not on a SAN, and not on iSCSI.)
Has anyone seen this?

Oh yeah, I can definitely reproduce this issue on local storage. Basically, anything that causes a delay of more than 5 seconds in reading data from the disks can trigger it. It's especially easy to trigger if you have a storage system with a large amount of cache that creates long pauses during forced flushes. We've reproduced the issue on IBM ServeRAID 6i and Dell PERC 4/Di controllers using locally attached UltraSCSI 320 drives.

Re: ESX 3.0.1 - Linux Guests go ReadOnly

46. Jan 20, 2007 11:40 AM in response to: letoatrads
tsightler (Hot Shot, 177 posts since Sep 30, 2005)
Is this being asked because there is a possibility
that having a NIC bond going to the iSCSI target
could mitigate this issue? The reason I ask is I have
several ESX hosts hitting an EQL iSCSI SAN, and I
have several RHEL4 UP4 guests with Sybase databases
running on the guest, and I've never seen this
issue. Even when I've spent 9 or 10 hours loading
a 1TB+ database.

OK, I know you didn't ask me this question, but I thought my experiences might be interesting to you. We have ESX servers running against several different storage arrays: a CX400 via FC, a CX700 via FC, an AX150i via iSCSI, and an EqualLogic PS300E via iSCSI (and also a few systems running on local storage like IBM ServeRAID and Dell PERC controllers).

We've found the problem fairly easy to reproduce on the AX150i as well as the CX700. For whatever reason, it's actually more difficult to trigger on the CX400. We theorize this has to do with its smaller write cache, and thus lower latency during heavy writes, but it may also be related to the fact that the CX400 simply has less contention because it services fewer hosts.

The EqualLogic PS300E is by far the most difficult array for us to reproduce the issue on. We did manage to trigger it even against this array, but it took a pretty crazy level of fake I/O load running in multiple VMs to do it.

On the other hand, if you pull the plug on a network cable, it's likely that you'll see the issue on every Linux guest by the time the iSCSI connection fails over to another port. My opinion is that this fix is an important proactive fix if you want your guest to continue running during any failure scenario, especially since that's usually the reason you invest in all of the redundancy.

Later,
Tom

Re: ESX 3.0.1 - Linux Guests go ReadOnly

48. Feb 3, 2007 1:55 PM in response to: Damin
tsightler (Hot Shot, 177 posts since Sep 30, 2005)
The patch in question appears to be as follows:

--- linux-2.6.9/fs/ext3/dir.c~ 2007-01-08 16:19:36.000000000 -0500
+++ linux-2.6.9/fs/ext3/dir.c 2007-01-08 16:22:10.000000000 -0500
@@ -155,7 +155,7 @@ static int ext3_readdir(struct file * fi
             brelse (tmp);
     }
     if (num) {
-        ll_rw_block (READA, num, bha);
+        ll_rw_block (READ, num, bha);
         for (i = 0; i < num; i++)
             brelse (bha[i]);
     }

I'm not sure that this is enough to correct the issue in VMware, because the Bugzilla seems to imply the issue was only seen in configurations using PowerPath or dm-multipath. When a READA (readahead) request is issued under heavy load, the bio request can sometimes return EAGAIN, which is basically saying "I had a temporary error, please try again", but the ext3 code did not expect to have to handle this condition.

Red Hat initially looked at backporting the code from more recent 2.6 kernels, but this appeared to have more complex interactions, so they opted for the simple, low-risk fix, which basically forgoes readahead on directory lookups. This might have a minor impact on performance, but it is probably not noticeable in most workloads.
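
To make the failure pattern concrete, here is a toy Python model of it. This is not kernel code: `submit_read`, the `busy` flag, and the three `*_readdir` functions are hypothetical names invented for illustration. It only shows why a caller that treats every non-zero status as fatal breaks on EAGAIN, and why switching from a readahead request to a plain read sidesteps it. The retrying variant is just one possible alternative, not a description of what newer kernels do.

```python
import errno


def submit_read(block, readahead, busy):
    """Toy stand-in for a block-layer submit call.

    A readahead request may be refused with EAGAIN when the device is busy;
    a plain read is modelled as always succeeding.
    """
    if readahead and busy:
        return errno.EAGAIN  # temporary condition: "please try again"
    return 0  # success


def naive_readdir(block, busy):
    """Pre-patch pattern: issue readahead and treat any error as fatal."""
    status = submit_read(block, readahead=True, busy=busy)
    return "ok" if status == 0 else "io-error"


def patched_readdir(block, busy):
    """Pattern of the simple fix: use a plain read, so EAGAIN never comes back."""
    status = submit_read(block, readahead=False, busy=busy)
    return "ok" if status == 0 else "io-error"


def retrying_readdir(block, busy):
    """Hypothetical alternative: keep readahead, fall back to a plain read on EAGAIN."""
    status = submit_read(block, readahead=True, busy=busy)
    if status == errno.EAGAIN:
        status = submit_read(block, readahead=False, busy=busy)
    return "ok" if status == 0 else "io-error"
```

Under load (`busy=True`) only `naive_readdir` reports an error. In the real guest, an error of that kind is what leads ext3 to go read-only.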

I'll be interested in seeing how your tests turn out so let us know.

Later,
Tom

Re: ESX 3.0.1 - Linux Guests go ReadOnly

50. Feb 5, 2007 7:47 AM in response to: Damin
tsightler (Hot Shot, 177 posts since Sep 30, 2005)
So, as I suspected, the patch in question is simply not enough. Of interest, however, is that Red Hat has committed a patch to the U5 stream which does include a supposed fix (and one that looks "sufficient" to me). Test kernels that include this fix are available at the following location:

http://people.redhat.com/~jbaron/rhel4/RPMS.kernel/

It would be interesting to see what your tests turn up with those kernels.

Later,
Tom

Re: ESX 3.0.1 - Linux Guests go ReadOnly

51. Feb 6, 2007 8:01 AM in response to: tsightler
doomdevice (Enthusiast, 98 posts since Dec 4, 2005)
If you use Novell SLES, I've created a short guide on how to change the LSI driver, based on Tom's Red Hat guide (thanks, Tom!).
You can find it here:
http://www.vmachine.de/kb/index.php/Linux_Kernel_2.6_Problem_-_Read-Only_Filesystem_nach_Path_Failover

It's in German, but the commands should be understandable by everyone.

Dennis

Re: ESX 3.0.1 - Linux Guests go ReadOnly

52. Feb 9, 2007 6:13 AM in response to: doomdevice
egr (Enthusiast, 70 posts since Sep 22, 2005)
Hi,

VMware has updated the article:
http://kb.vmware.com/vmtnkb/search.do?cmd=displayKC&docType=kc&externalId=51306&sliceId=SAL_Public

Now a hotfix for SLES 9 SP3 is available.
Does anyone know if this works for SLES 10, too?

@doomdevice : Thanks for your work!
But I ask because we are looking for an officially supported solution.

Thanks in advance.
/egr

Re: ESX 3.0.1 - Linux Guests go ReadOnly

53. Feb 12, 2007 2:09 PM in response to: Damin
magusnet (Novice, 29 posts since Jan 30, 2006)
The VMware patch won't help us since we are on SLES9_SP3 kernel build -282.

1. Can someone provide a clear set of steps to reproduce this in a lab?
2. Has anyone gotten any information from Novell on a SLES9 fix like the one Red Hat has done?

Re: ESX 3.0.1 - Linux Guests go ReadOnly

54. Feb 13, 2007 4:28 PM in response to: magusnet
tsightler (Hot Shot, 177 posts since Sep 30, 2005)
1. Can someone provide a clear set of steps to
reproduce this in a lab?

You mean reproduce the error? It's a little difficult to provide clear steps, since it can be harder to trigger on some hardware. However, the basic formula in my case is to run a continuous loop of 'iozone -r 1m -s 1024m -t 8' in 2 VMs against a single VMFS LUN. Given enough time, this has failed on every one of my SANs without doing much of anything else, usually within just a few hours.
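
A minimal sketch of such a continuous loop, written in Python for anyone who wants to leave it running unattended. Only the iozone arguments come from the post. The function name `run_until_failure`, the `max_runs` cap, and the assumption that iozone exits non-zero once the filesystem has gone read-only are illustrative and untested on the hardware discussed here.

```python
import subprocess

# iozone arguments exactly as quoted in the post; the wrapper is hypothetical.
IOZONE_CMD = ["iozone", "-r", "1m", "-s", "1024m", "-t", "8"]


def run_until_failure(cmd=IOZONE_CMD, max_runs=None):
    """Run cmd repeatedly until it exits non-zero.

    Returns the 1-based number of the run that failed, or None if
    max_runs runs completed without a failure. With max_runs=None the
    loop continues until a failure occurs.
    """
    run = 0
    while max_runs is None or run < max_runs:
        run += 1
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            return run
    return None
```

Calling `run_until_failure()` inside each of the two test VMs, with the working directory on the filesystem under test, approximates the setup described above.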

Damin seems to have a more formal test that he feels comfortable with on his setup.

Later,
Tom

Re: ESX 3.0.1 - Linux Guests go ReadOnly

55. Feb 14, 2007 7:56 AM in response to: tsightler
magusnet (Novice, 29 posts since Jan 30, 2006)
This is just what I need.
Thanks for the assist :)

If I come up with any variations of:
'iozone -r 1m -s 1024m -t 8'
with iozone or any other tools, I'll post back here.

Re: ESX 3.0.1 - Linux Guests go ReadOnly

56. Feb 14, 2007 8:35 AM in response to: magusnet
ddecker (Lurker, 2 posts since Sep 26, 2005)
Our company had thorough discussions and meeting sessions set up with Red Hat on this 'read-only issue'. I'm not sure if it's identical, but we actually had a hot-fix and a new kernel created from our testing. If you're using RHEL, you're going to need to lay down the 2.6.9-42.0.8 kernel; 42.0.3 is not sufficient. Bugzilla 213921 is a good reference to the trouble we had.

Re: ESX 3.0.1 - Linux Guests go ReadOnly

57. Feb 14, 2007 8:52 AM in response to: tsightler
petr (Champion, VMware Employees, 7,223 posts since Jul 10, 2003)
Do you have logs from the guest and the vmkernel from around the time of the failure? You should hit this problem only if I/O takes longer than 30 seconds to finish (and you'll hit it on both real hardware and in the VM). Perhaps your SAN decided to do something else in the middle of the test?
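
To see which timeout a given guest is actually using, the per-device SCSI command timeout can be read from sysfs on 2.6 kernels at `/sys/block/<dev>/device/timeout` (value in seconds). A rough Python sketch follows; the function name `scsi_timeouts` and the `sysfs_block` parameter are made up for this example.

```python
import os


def scsi_timeouts(sysfs_block="/sys/block"):
    """Map each block device exposing device/timeout to its value in seconds."""
    timeouts = {}
    for dev in sorted(os.listdir(sysfs_block)):
        path = os.path.join(sysfs_block, dev, "device", "timeout")
        try:
            with open(path) as f:
                timeouts[dev] = int(f.read().strip())
        except (OSError, ValueError):
            # No timeout attribute (not a SCSI disk) or unreadable content.
            continue
    return timeouts
```

Running `print(scsi_timeouts())` in the guest lists the disks and their current timeouts, which can then be compared against the 30 seconds mentioned above.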

Re: ESX 3.0.1 - Linux Guests go ReadOnly

58. Feb 14, 2007 12:43 PM in response to: ddecker
tsightler (Hot Shot, 177 posts since Sep 30, 2005)
Our company had thorough discussions and meeting
sessions set up with Red Hat on this 'read-only issue'.
I'm not sure if it's identical, but we actually had a
hot-fix and a new kernel created from our testing. If
you're using RHEL, you're going to need to lay down the
2.6.9-42.0.8 kernel; 42.0.3 is not sufficient.
Bugzilla 213921 is a good reference to the trouble we
had.

The bug fixed in 2.6.9-42.0.8, and documented in Bugzilla 213921, is an issue with ext3 and is not really related to VMware, although I suspect that VMware may make the problem slightly more likely to occur.

The "VMware" bug is an issue with the mptscsih driver: its interaction with VMware server and how it reports timeouts to the SCSI midlayer in the guest.

While the failure modes of both bugs produce a nearly identical end result (an ext3 filesystem going read-only), key differences can be noted in the log files in most cases.
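
Since both bugs show the same symptom, a guest-side check for it can be sketched in Python as below. It assumes that a remounted-read-only ext3 filesystem shows `ro` in its options in `/proc/mounts`. The function name is hypothetical, and the check cannot tell the two bugs apart; that still requires the logs.

```python
def readonly_ext3_mounts(mounts_text):
    """Return mount points of ext3 filesystems whose options include 'ro'.

    mounts_text is the content of /proc/mounts: one mount per line with
    device, mount point, filesystem type and comma-separated options.
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type == "ext3" and "ro" in options.split(","):
            hits.append(mount_point)
    return hits
```

Typical use would be `readonly_ext3_mounts(open("/proc/mounts").read())`, run from cron or a monitoring agent inside the guest.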

Later,
Tom

Re: ESX 3.0.1 - Linux Guests go ReadOnly

59. Feb 23, 2007 8:54 PM in response to: Damin
Ops admin (Novice, 9 posts since Jun 5, 2006)
This came up on our farm.

In fact, one of the servers was rebooted, and when it came back up all that was left of the "/" filesystem was "lost+found". The entire rest was gone!!

One thing we have noticed: the 3 systems that have the problem all have the updated Red Hat kernel from RHN, "2.6.9-42.0.8.ELsmp #1 SMP", and the following entry for the driver: "mptlinux-3.02.62.01rh". We can apply the patch to these servers.

We have 20+ other systems that are running the stock stuff "2.6.9-22.ELsmp #1 SMP" and "mptlinux-3.02.18". We need to update the kernel to apply the patch.

Does anyone know if we need to apply the patch to the stock kernel systems? I am just wondering if the problem won't show itself with what we have on the older systems.
