VMware Cloud Community
pwatvu
Contributor

Linux file systems becoming read-only

The file systems on several of our guest OSes running RHEL4 have suddenly started becoming read-only. ESX is 3.0.0. The two file systems affected are usually / and /var. Both hosts are HP DL385s with QLogic 2430 cards. Any ideas what might be causing this?

47 Replies
admin
Immortal

Are these guest OSes using VMDKs or RDMs?

pwatvu
Contributor

They are VMDKs. Also, I mistyped the HBAs. They are 2340 cards.

admin
Immortal

Sounds like a problem inside the guest OS. Have you run fsck on the file systems of the offending boxes?

pwatvu
Contributor

All file systems are marked clean upon reboot. No problems found when fsck is run. The odd part is that it is happening to multiple VMs on multiple ESX servers.
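
Since the file systems come back clean after a reboot, it can help to catch the read-only state while it is happening. The following is a minimal sketch, not something posted in this thread: it parses text in the `/proc/mounts` format and reports entries mounted with the `ro` option. The function name `read_only_mounts` and the `fs_types` filter are hypothetical helpers introduced for illustration.

```python
def read_only_mounts(mounts_text, fs_types=None):
    """Return (mount_point, fs_type) for every entry mounted with the 'ro' option.

    mounts_text: contents in /proc/mounts format
                 (device, mount point, fs type, options, dump, pass).
    fs_types:    optional iterable of file system types to keep, e.g. ("ext3",).
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        if fs_types is not None and fs_type not in fs_types:
            continue
        if "ro" in options.split(","):
            found.append((mount_point, fs_type))
    return found


def check_guest(path="/proc/mounts", fs_types=("ext3",)):
    """Read the live mount table of the guest this runs on."""
    with open(path) as f:
        return read_only_mounts(f.read(), fs_types)
```

Run from cron inside a guest, `check_guest()` returning a non-empty list for / or /var would give a timestamp to compare against the SAN and ESX logs. Restricting the check to ext3 avoids false alarms from media that are normally read-only, such as a mounted CD-ROM.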

KnowItAll
Hot Shot

What type of SAN array are you connected to?

This problem could be in 2 places:

#1 The SCSI driver in the guest (BusLogic or LSI Logic?)

#2 Your array (this is the most likely cause).

In the guest, check /var/log/messages to pinpoint when the file system goes read-only. Then, on the ESX host, check /var/log/vmkernel and /var/log/vmkwarning for warnings or errors in the same time frame.

Also, what update of RHEL 4 are you running?
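
The log comparison suggested above can be sketched in Python. This is a rough illustration under several assumptions that do not come from the thread: both logs use the classic syslog timestamp prefix (month, day, time, no year), the guest and host clocks are roughly in sync, and the guest logs a line containing "read-only" when the remount happens. The names `parse_ts` and `correlate`, the default search pattern, and the 120-second window are all hypothetical choices.

```python
import re
from datetime import datetime, timedelta

# Classic syslog prefix: month abbreviation, day of month, HH:MM:SS (no year).
SYSLOG_TS = re.compile(r"^([A-Z][a-z]{2})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})")


def parse_ts(line, year):
    """Return the line's timestamp as a datetime, or None if it has no syslog prefix."""
    m = SYSLOG_TS.match(line)
    if not m:
        return None
    month, day, clock = m.groups()
    try:
        return datetime.strptime(
            f"{year} {month} {int(day):02d} {clock}", "%Y %b %d %H:%M:%S"
        )
    except ValueError:
        return None


def correlate(guest_lines, host_lines, year, pattern="read-only", window_s=120):
    """Pair each guest line matching `pattern` with host lines logged within
    +/- window_s seconds of it. The pattern is an assumption; adjust it to
    whatever the guest actually logs."""
    window = timedelta(seconds=window_s)
    needle = re.compile(pattern, re.IGNORECASE)
    host = [(parse_ts(l, year), l.rstrip("\n")) for l in host_lines]
    host = [(ts, l) for ts, l in host if ts is not None]
    results = []
    for line in guest_lines:
        ts = parse_ts(line, year)
        if ts is None or not needle.search(line):
            continue
        nearby = [l for hts, l in host if abs(hts - ts) <= window]
        results.append((line.rstrip("\n"), nearby))
    return results
```

To use it, pass the lines of the guest's messages log and of a copy of the ESX vmkernel log, plus the year the logs were written. An empty list of nearby host lines for a read-only event would suggest either clock skew or that the host logged nothing, as manuel_wenger reports below.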

manuel_wenger
Enthusiast

I've had a customer setup with the same problem, running ESX 3.0.0. Out of the blue, all RHEL4 guests (not sure which update) on both physical servers connected to the same SAN became read-only and had to be rebooted. W2003 guests were OK.

The servers are 2 IBM x326m (Dual Core AMD Opteron 285), the storage is an IBM DS4300 Express and the fiber channel cards are QLogic 2340.

I think the storage had a timeout of some sort. The customer told me that this happened at the same time as they accidentally "looped" their management network by connecting two switches to each other twice. The storage controllers were connected to that network, so maybe the controllers momentarily froze when this happened.

I didn't find anything in the vmkernel logs, and didn't find the real cause of the problem with 100% certainty.

-Manuel

KnowItAll
Hot Shot

That would make sense.

If there is an underlying problem with the array, I have seen Linux guests go read-only on their file systems.

There is ALWAYS something in vmkernel or vmkwarning that can point to an underlying problem. I agree that you really need to know how to read all the warnings and errors in those logs, but that is a topic for a different thread.

The Windows guests will not go read-only, but I am sure you can see file system retries (SCSI bus resets) in the Windows event log and in the vmware.log located with the .vmx files.

plankers
VMware Employee

Hey folks,

Is this happening on any SAN that is built with switches other than Brocade? What version of FabricOS are you running on your Brocade switches? I'm seeing this and I'm running 5.0.3.

VMware did issue an alert about FabricOS 5.0.4 not handling failover correctly, so maybe this is related.

manuel_wenger
Enthusiast

I've seen this happen on a directly-connected SAN (no switches) with no failover at all (single port HBA). I'm not saying it couldn't be a failover problem, but certainly it's not the only issue.

RobMokkink
Expert

Are the Linux guests using LVM?

If so, are the vmware-tools installed?

manuel_wenger
Enthusiast

In my case, the guests are all using LVM and vmware-tools are installed.

CTeague
Contributor

I have had this issue 3 separate times while my Linux VMs have been on the SAN data store.

The issue was finally brought to light when the group managing the SAN mentioned they had a card reset in the SAN backbone, which caused everything to reset on their end (VM-wise, only my Linux VMs were affected and Win2k3 was okay).

They said it should have been okay even if one of the cards did reset. Now it looks like there is a multipathing issue, and they are currently troubleshooting it.

It was a LUN communication issue after all.

pwatvu
Contributor

OK, so we discovered another non-VM RHEL server that is having the same I/O issues with our SAN. It looks like ESX is not the cause. We've opened a call with EMC to see if they can help. Thanks, everyone, for the community help.

Natiboy
Enthusiast

We have seen this very thing. We traced it to our EMC SAN (DMX) and believe that it has something to do with SRDF and the way our FC switches are configured. All connected FC hosts, virtual and physical, take a hit, which shows up at the server level as read-only file systems on Linux and as errors on Windows. The SRDF system registers an event at the same time. Our ticket with EMC is still open, but I believe this is an EMC-only issue. Has anyone else seen this behavior on a non-EMC system?

admin
Immortal

Yes, we are seeing this issue with an IBM DS4300 Turbo SAN, although it only seems to be affecting Ubuntu guests. It has happened to one particular VM twice in the last day or so.

manuel_wenger
Enthusiast

It would be interesting to know if this happened with 2.5.x as well, or if it's a 3.0.x issue only.

admin
Immortal

I have only seen this issue since upgrading to 3.0.1. That of course proves nothing, but it does seem a bit coincidental, as I had no problems of this nature with 2.5.x.

Damin
Enthusiast

jlauro
Expert

I have seen it a little too. It does appear to be a 3.0-only issue (or at least it happened far less often under 2.5). It has happened mainly when the SAN was under stress (busy, slow SATA drives, or path failover), so possibly there is a timeout or retry parameter that can be adjusted?
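
One timeout that can at least be inspected from inside a Linux 2.6 guest is the per-device SCSI command timeout exposed in sysfs at `/sys/block/<dev>/device/timeout`. The sketch below only reads those values. Whether raising them avoids the read-only remounts described here is an assumption that nothing in this thread confirms, and the function name `scsi_timeouts` is a hypothetical helper.

```python
import glob
import os


def scsi_timeouts(sys_block="/sys/block"):
    """Map each sd* block device to its SCSI command timeout in seconds,
    as exposed by sysfs. Devices without a readable value are skipped."""
    timeouts = {}
    pattern = os.path.join(sys_block, "sd*", "device", "timeout")
    for path in sorted(glob.glob(pattern)):
        dev = path.split(os.sep)[-3]  # .../<dev>/device/timeout
        try:
            with open(path) as f:
                timeouts[dev] = int(f.read().strip())
        except (OSError, ValueError):
            continue
    return timeouts
```

Comparing the reported values across affected and unaffected guests would be a cheap first check before changing anything. The `sys_block` parameter exists only so the function can be pointed at a test directory.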
