Linux file systems becoming read-only

pwatvu · ‎10-12-2006

The file systems on several of our guest OSs running RHEL4 have suddenly started becoming read-only. ESX is 3.0.0. The two file systems affected are usually / and /var. Both systems are HP DL385 with Qlogic 2430 cards. Any ideas what might be causing this?

admin · ‎10-12-2006

Are these guest OS's using VMDKs or RDMs?

pwatvu · ‎10-12-2006

They are VMDKs. Also, I mistyped the HBAs. They are 2340 cards.

admin · ‎10-12-2006

Sounds like a problem inside the guest OS. Have you run fsck on the file systems of the offending boxes?

pwatvu · ‎10-12-2006

All file systems are marked clean upon reboot. No problems found when fsck is run. The odd part is that it is happening to multiple VMs on multiple ESX servers.
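
Since this is hitting multiple VMs, it can help to check quickly which file systems a guest currently has mounted read-only. The following is a minimal sketch, assuming the guest exposes the usual Linux /proc/mounts format (device, mount point, type, options); the function name read_only_mounts is made up for illustration.

```python
def read_only_mounts(mounts_text):
    """Return (mount_point, fs_type) pairs whose mount options include 'ro'.

    mounts_text is the content of /proc/mounts, one mount per line:
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        # Options are comma-separated; match 'ro' exactly, not as a substring.
        if "ro" in options.split(","):
            result.append((mount_point, fs_type))
    return result
```

Inside a guest, the input would be the content of /proc/mounts. Some pseudo file systems may legitimately be mounted read-only, so the result is only a starting point.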

KnowItAll · ‎10-12-2006

What type of SAN array are you connected to?

This problem could be in 2 places:

#1 The SCSI driver in the guest (BusLogic or LSI Logic?)

#2 Your array (this is the most likely cause).

In the guest, check /var/log/messages to pinpoint when the file system goes read-only. Then, on ESX, check /var/log/vmkernel and /var/log/vmkwarnings for warnings or errors that correspond to the time frame in which your guest file systems go read-only.
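
The log comparison described here can be sketched in a few lines. This is only an illustration under several assumptions: both logs use the classic syslog "Mon DD HH:MM:SS" timestamp prefix, the guest and ESX clocks are in sync, and the guest kernel logs the text "Remounting filesystem read-only" when ext3 gives up (adjust the marker to whatever your guest actually logs). The function names and the five-minute window are hypothetical choices.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year):
    """Parse the 'Mon DD HH:MM:SS' prefix of a syslog line; None if absent."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def correlate(guest_lines, host_lines, year,
              marker="Remounting filesystem read-only",
              window=timedelta(minutes=5)):
    """Pair each guest line containing marker with nearby host log lines.

    Returns a list of (guest_line, [host_lines within window]) tuples.
    """
    # Keep only host lines that carry a parseable timestamp.
    host = [(parse_syslog_time(l, year), l) for l in host_lines]
    host = [(t, l) for t, l in host if t is not None]

    matches = []
    for guest_line in guest_lines:
        if marker not in guest_line:
            continue
        when = parse_syslog_time(guest_line, year)
        if when is None:
            continue
        nearby = [l for t, l in host if abs(t - when) <= window]
        matches.append((guest_line, nearby))
    return matches
```

Syslog timestamps carry no year, so it has to be supplied by the caller. An empty list of nearby host lines does not prove the host was healthy; it may just mean the clocks differ or the window is too narrow.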

Also, what update of RHEL 4 are you running?

manuel_wenger · ‎10-13-2006

I've had a customer setup with the same problem, running ESX 3.0.0. Out of the blue, all RHEL4 guests (not sure which update) on both physical servers connected to the same SAN became read-only and had to be rebooted. W2003 guests were OK.

The servers are 2 IBM x326m (Dual Core AMD Opteron 285), the storage is an IBM DS4300 Express and the fiber channel cards are QLogic 2340.

What I think is that the storage had a timeout of some sort. The customer told me that this happened at the same time as they accidentally "looped" their management network by connecting two switches to each other twice. The storage controllers were connected to that network, so maybe the controllers momentarily froze when this happened.

I didn't find anything in the vmkernel logs, and didn't find the real cause of the problem with 100% certainty.

-Manuel

KnowItAll · ‎10-13-2006

That would make sense.

If there is an underlying problem with the array, I have seen Linux guests go read-only on their file systems.

There is ALWAYS something in vmkernel or vmkwarnings that can point to an underlying problem. I can agree that you really need to know how to read all the warnings and all the errors in the logs but that is for a different thread.

The Windows guest will not go to readonly but I am sure you can see filesystem retries (scsi bus resets) in the Windows event log and in vmware.log located with the .vmx files.

plankers · ‎10-16-2006

Hey folks,

Is this happening on any SAN that is built with switches other than Brocade? What version of FabricOS are you running on your Brocade switches? I'm seeing this and I'm running 5.0.3.

VMware did issue an alert about FabricOS 5.0.4 not handling failover correctly, so maybe this is related.

manuel_wenger · ‎10-16-2006

I've seen this happen on a directly-connected SAN (no switches) with no failover at all (single port HBA). I'm not saying it couldn't be a failover problem, but certainly it's not the only issue.

RobMokkink · ‎10-17-2006

Are the Linux guests using LVM?

If so are the vmware-tools installed?

manuel_wenger · ‎10-17-2006

In my case, the guests are all using LVM and vmware-tools are installed.

CTeague · ‎10-17-2006

I have had this issue 3 separate times while my Linux VM's have been on the SAN data store.

The issue was finally brought to light when the group managing the SAN mentioned they had a card reset in the SAN backbone which caused everything to reset on their end (VM wise, only my Linux VMs were affected and Win2k3 was okay).

They said it should have been okay even if one of the cards did reset. Now it shows there is a multipathing issue and they are currently troubleshooting it.

Was a LUN communication issue after all.

pwatvu · ‎10-17-2006

Ok, so we discovered another non-VM RHEL server that is having the same IO issues with our SAN. It looks like ESX is not our cause. We've opened a call with EMC to see if they can help. Thanks everyone for the community help.

Natiboy · ‎10-17-2006

We have seen this very thing. We traced this into our EMC SAN (DMX) and believe that it has something to do with SRDF and the way that our FC switches are configured. All connected FC hosts, VM and physical, take a hit and cause the event (read-only in Linux systems and errors in Windows) at the server level. The SRDF system registers an event at the same time. Our ticket is still running within EMC, but I believe this is an EMC-only issue. Has anyone else seen this behavior on a non-EMC system?

admin · ‎10-17-2006

Yes, we are seeing this issue with an IBM DS4300Turbo SAN, although it only seems to be affecting Ubuntu guests. It's happened to one particular VM twice in the last day or so.

manuel_wenger · ‎10-17-2006

It would be interesting to know if this happened with 2.5.x as well, or if it's a 3.0.x issue only.

admin · ‎10-17-2006

I have only seen this issue since upgrading to 3.0.1. That of course proves nothing, but it does seem a bit coincidental, as I had no problems of this nature with 2.5.x.

Damin · ‎10-17-2006

This topic is also being discussed here:

http://www.vmware.com/community/thread.jspa?threadID=58121&tstart=0

jlauro · ‎10-17-2006

I have seen it a little too. It does appear to be 3.0 only (or at least it happens far less often under 2.5). It has happened mainly when the SAN was under stress (busy slow SATA drives, or path failover), so possibly there is a timeout or retry parameter that can be adjusted....???
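
One timeout that can at least be inspected from inside a Linux guest is the per-disk SCSI command timeout, which 2.6 kernels expose in sysfs. The sketch below only reads the current values; it assumes the guest provides /sys/block/sd*/device/timeout (value in seconds), and whether changing it would help with the problem in this thread is not established here. The function name scsi_disk_timeouts is made up for illustration.

```python
import glob
import os


def scsi_disk_timeouts(sysfs_root="/sys"):
    """Return {disk_name: timeout_seconds} for every sd* disk found in sysfs.

    sysfs_root is a parameter so the function can be pointed at a copy
    of the tree; on a live guest the default "/sys" applies.
    """
    timeouts = {}
    pattern = os.path.join(sysfs_root, "block", "sd*", "device", "timeout")
    for path in sorted(glob.glob(pattern)):
        # Path looks like <root>/block/<disk>/device/timeout.
        disk = path.split(os.sep)[-3]
        with open(path) as f:
            timeouts[disk] = int(f.read().strip())
    return timeouts
```

Running scsi_disk_timeouts() in each affected guest would show whether they all share the same value, which is a useful detail to include when opening a support call.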