VMware Cloud Community
jarends3
Contributor

RHEL5 guest has become read-only twice

I have a RHEL5 guest whose file system has suddenly gone read-only on 2 occasions. The service running on it (Sassafras KeyServer) then died because it could not write anything. Rebooting the VM solved the problem, but I lost some valuable data.

Any idea what could have caused this? We previously ran the VM on VMware Server for about a year and never had a problem. Since migrating the VM to our ESX server, this has happened twice in the span of about 4 months.

7 Replies
Texiwill
Leadership

Hello,

This is usually caused by the disk subsystem suddenly becoming unavailable. Perhaps there was a problem during a vMotion, or a problem with the shared storage. Check /var/log/vmkernel from the CLI of the ESX host for errors.


Best regards,

Edward L. Haletky

VMware Communities User Moderator

====

Author of the book 'VMWare ESX Server in the Enterprise: Planning and Securing Virtualization Servers', Copyright 2008 Pearson Education. As well as the Virtualization Wiki at http://www.astroarch.com/wiki/index.php/Virtualization

--
Edward L. Haletky
vExpert XIV: 2009-2023,
VMTN Community Moderator
vSphere Upgrade Saga: https://www.astroarch.com/blogs
GitHub Repo: https://github.com/Texiwill
jarends3
Contributor

This is all I could find in the log, and none of it means much to me:

Apr 20 01:03:22 my-esx-server vmkernel: 65:13:49:16.112 cpu3:1118)VSCSIFs: 439: fd 4115 status Busy
Apr 20 01:03:22 my-esx-server vmkernel: 65:13:49:16.349 cpu2:1147)VSCSIFs: 439: fd 8221 status Busy
Apr 20 01:03:22 my-esx-server last message repeated 2 times
Apr 20 01:03:23 my-esx-server vmkernel: 65:13:49:16.512 cpu3:1118)VSCSIFs: 439: fd 4115 status Busy
Apr 20 01:03:23 my-esx-server vmkernel: 65:13:49:16.512 cpu3:1118)VSCSIFs: 439: fd 4115 status Busy
Apr 20 01:03:23 my-esx-server vmkernel: 65:13:49:16.634 cpu1:1061) drv 4.31]
Apr 22 11:46:32 my-esx-server vmkernel: 68:00:32:26.264 cpu0:1118)VSCSI: 2803: Reset request on handle 8197 (0 outstanding commands)
Apr 22 11:46:32 my-esx-server vmkernel: 68:00:32:26.264 cpu1:1056)VSCSI: 3019: Resetting handle 8197
Apr 22 11:46:32 my-esx-server vmkernel: 68:00:32:26.264 cpu1:1056)VSCSI: 2871: Completing reset on handle 8197 (0 outstanding commands)

fmateo
Hot Shot

Hello,

Are you using multipath? I had a similar issue. I'm using iSCSI shared storage (with storage HA), and in the original installation someone forgot to enable an important option: disable the channel if the server needs to be stopped or the server hangs. The problem is that the ESX server believes it has access to the LUN (where the VMs are running) through a channel (in my environment I have 4 possible channels). If the storage server doesn't disable the channel, or doesn't notify the application server (ESX) that the channel has gone down, the LUN disappears for that ESX host, and the VM's disk disappears too. The file system (on any version of Red Hat) then becomes read-only.

In older versions of Red Hat (RHEL4.4 or earlier), the system needs a package for multipath: mpio-iscsi-mpath (or something similar). From RHEL4.5 to RHEL5, this RPM is included in the system.

Byee

Texiwill
Leadership

Hello,

Your storage failed over, and either the failover is set up incorrectly (look at the settings on the array) or SAN multipathing is set up incorrectly. In effect you lost access to the LUN the VM resides on, so RHEL went read-only to protect the system from failed writes. I would investigate your iSCSI/SAN device and discuss it with the vendor.
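To catch the read-only condition before the service dies, something like the following could be run periodically inside the guest. This is only a minimal sketch: it assumes the guest exposes the usual Linux /proc/mounts layout (device, mount point, type, options), and the function names and the list of file system types are illustrative choices, not anything taken from this thread.

```python
def read_only_mounts(mounts_text, fstypes=("ext2", "ext3", "ext4", "xfs")):
    """Return mount points of disk file systems mounted with the 'ro' option."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # not a regular /proc/mounts entry
        device, mount_point, fstype, options = fields[:4]
        # Options are comma-separated; match 'ro' exactly, not as a substring.
        if fstype in fstypes and "ro" in options.split(","):
            found.append(mount_point)
    return found


def check_guest(path="/proc/mounts"):
    """Read the guest's mount table (path is an assumption) and list ro mounts."""
    with open(path) as handle:
        return read_only_mounts(handle.read())
```

Calling check_guest() from cron and alerting when the list is non-empty would at least give a timestamp to compare against the ESX and SAN logs.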


Best regards,

Edward L. Haletky

jarends3
Contributor

Does a specific line in the logs refer to the failover? I don't know enough about what the log entries mean to pick the correct one out.

None of the other VMs on the ESX host reacted this way so I am puzzled.

Texiwill
Leadership

Hello,

The VSCSI busy statements in your log imply something is wrong, as do the following:

cpu1:1061)<6>qla24xx_abort_command(1): handle to abort=912
Apr 21 17:25:09 my-esx-server vmkernel: 67:06:11:03.148 cpu1:1061)<6>qla24xx_abort_command(1): handle to abort=913
Apr 21 17:25:09 my-esx-server vmkernel: 67:06:11:03.169

I would investigate your SAN fabric and correlate these dates and times with actions or issues on the SAN. These errors and the VSCSI Busy statements are very bad to see in log files: they indicate problems with the storage subsystem.
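To make that correlation easier, a rough sketch like the one below can reduce the vmkernel log to a count of suspicious storage lines per minute. The substrings are simply the ones that appear in the excerpts posted in this thread, and the function names are made up for illustration; it also assumes each log entry sits on one line with a standard syslog timestamp in front.

```python
import re
from collections import Counter

# Substrings taken from the log excerpts in this thread; extend as needed.
PATTERNS = ("status Busy", "abort_command", "Reset request")

# Syslog prefix such as "Apr 20 01:03:22" (month, day, HH:MM:SS).
STAMP = re.compile(r"^([A-Z][a-z]{2})\s+(\d{1,2})\s+(\d{2}):(\d{2}):\d{2}\s")


def storage_events_per_minute(lines):
    """Count suspicious storage lines per 'Mon DD HH:MM' bucket, in log order."""
    counts = Counter()
    for line in lines:
        match = STAMP.match(line)
        if match is None:
            continue  # wrapped or truncated line without a timestamp
        if any(pattern in line for pattern in PATTERNS):
            month, day, hour, minute = match.groups()
            counts[f"{month} {int(day):2d} {hour}:{minute}"] += 1
    return counts


def report(path="/var/log/vmkernel"):
    """Print one line per minute that contained at least one matching entry."""
    with open(path, errors="replace") as handle:
        for minute, count in storage_events_per_minute(handle).items():
            print(f"{minute}  {count} event(s)")
```

The resulting minutes can then be compared against the array's or switch's own event log for the same window.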


Best regards,

Edward L. Haletky

tsightler
Hot Shot

While the answers in this thread are potentially correct, and seeing VSCSI busy is not a good thing, I still do not think these messages should cause your guest to go read-only. The VSCSI messages lasted about 10 seconds, which is not great but should not cause the failure you are seeing; the guest should easily survive outages of 30-60 seconds without a major issue.

Can you please tell me what kernel version you are running in your RHEL5 guest? There were several known issues with Linux guests going read-only during relatively minor timeouts. Early RHEL5 kernel versions had this problem, but recent versions (RHEL5.1 kernels) should not. See VMware KB article 1001778 for details.
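One thing worth checking alongside this is how long the guest itself waits on a stalled disk command. As a small sketch, assuming the guest's virtual disks show up as sd* devices and that the kernel exposes the per-device timeout under /sys/block/<disk>/device/timeout (the function name is just illustrative):

```python
import glob
import os


def scsi_disk_timeouts(sys_block="/sys/block"):
    """Map each sd* disk to its SCSI command timeout in seconds."""
    timeouts = {}
    pattern = os.path.join(sys_block, "sd*", "device", "timeout")
    for path in sorted(glob.glob(pattern)):
        disk = path.split(os.sep)[-3]  # .../<disk>/device/timeout
        try:
            with open(path) as handle:
                timeouts[disk] = int(handle.read().strip())
        except (OSError, ValueError):
            continue  # unreadable or unexpected content; skip this disk
    return timeouts
```

If the reported value is much shorter than the outages seen on the ESX side, that would be one more data point to bring to the kernel question below.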

Later,

Tom
