Hello,
My hopes that the 3.0.1 release would solve this problem have been dashed. Per the suggestion of VMware support, we ripped out our QLogic HBAs, went with software iSCSI, and upgraded our 3.0.0 machine to 3.0.1.
Same problem. Under heavy load, the Linux guests remount their filesystems read-only.
Consoles indicate the following:
SCSI Error : <0 0 0 0> return code = 0x20008
end_request: I/O error, dev sda, sector 4928181
Aborting journal on device dm-0
ext3_abort called.
EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only.
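For what it's worth, that return code can be decoded: the Linux SCSI layer packs the driver, host, message, and status bytes into one word. A quick sketch (interpreting 0x02 as DID_BUS_BUSY and 0x08 as BUSY follows the definitions in the kernel's include/scsi/scsi.h):

```shell
# Decode the SCSI result word: (driver<<24)|(host<<16)|(msg<<8)|status.
# 0x20008 -> host byte 0x02 (DID_BUS_BUSY), status byte 0x08 (BUSY).
code=0x20008
printf 'host=0x%02x status=0x%02x\n' $(( (code >> 16) & 0xff )) $(( code & 0xff ))
```

So the console errors above are DID_BUS_BUSY completions surfacing to the guest, which ties in with the driver behavior discussed later in this thread.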
All guests are using the LSI Logic driver and have the latest VMware Tools installed.
This is very clearly a VMware issue, and we've now proven that it is NOT related to the QLogic QLA4010 HBAs, as we migrated to software iSCSI and are getting the same problem.
It's similar to issues with Red Hat SCSI drivers in a multipath environment, where a SAN switch reset confuses the multipath driver and causes the filesystem to flag a corrupt journal. Is it possible that under heavy load your iSCSI network is dropping packets or otherwise performing badly? That might expose issues in the underlying iSCSI stack which in turn get exposed to the filesystem. It might be worth getting an idea of the health of the network carrying your iSCSI I/O, to see if something is underperforming or failing under load.
I totally agree that this problem should not be happening though...
Message was edited by: jonhutchings
Can you see if there are any errors reported in ESX's /var/log/vmkernel?
Also, make sure network connectivity is healthy and there are no intermittent failures. A path to storage being down for an extended amount of time (for RH it is typically 1 min, I guess) can take the FS read-only. Which target are you using?
Is this a shared network or dedicated to iSCSI only? Is any VLAN configuration in the picture? Are any traffic shaping or bandwidth allocation restrictions placed on the physical switch that is connected to iSCSI?
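On the guest side, how long Linux tolerates a stalled path before erroring out is bounded by the per-device SCSI command timeout, which a 2.6 kernel exposes through sysfs. A sketch (device names vary; raising the value is an assumption about your tolerance for stalls, not a VMware recommendation):

```shell
# Inspect each SCSI disk's command timeout (in seconds); a storage path
# stalled longer than this surfaces as I/O errors in the guest.
for t in /sys/block/*/device/timeout; do
    [ -e "$t" ] || continue            # skip if no SCSI disks are present
    echo "$t = $(cat "$t")s"
    # echo 180 > "$t"                  # as root: tolerate longer SAN pauses
done
```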
Wow, having this same problem I didn't even have to search to find another with it!
I'm running CentOS 4.3 (4 guests) on ESX 3.0 and 3.0.1 with VMware Tools installed.
I experience the exact same read-only problem. It has happened on three occasions now, and a reboot temporarily fixed it each time. The odd part is that I don't have any heavy load on these guests (Apache web servers and MySQL servers, currently with little web or DB traffic as the site is still in heavy development).
I'm using a PE2850 (dual dual-core 2.8 GHz / 8 GB RAM / 2x QLogic PCI-E HBAs) against a Sun-based SAN. The SAN is controlled by a higher technical department within my school, so the details I can give right now are lacking.
Any ideas what could be causing this?
There's another thread discussing the same problem, in case you missed it:
No. This happens on a direct crossover cable (i.e., no switch) between the HBA and the iSCSI target box. The ports indicate no errors, dropped packets, etc., so the underlying network layer (all six feet of Cat-5e) is fine.
This might expose some issues with the underlying iSCSI stack which are in turn getting exposed to the filesystem?
I suspect that it is related to how VMware passes SAN timeouts to its guests. I was told by second-level support that VMware is aware of the problem, that they have fixed it for Windows in 3.0.1, but that they have NOT yet fixed it for Linux. I strongly, strongly, STRONGLY disagree that this is a problem with the iSCSI target. In all of our testing over 12 months, we've never had this issue with any other clients. Our iSCSI target is used by more than just VMware here.
Might be worth trying to get an idea of the health of the network which you are sending your iSCSI I/O to, to see if something is underperforming/failing under load.
The network is 100% clean. No dropped packets, no resends, no errors, no overruns. Nothing. It's clean.
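For anyone wanting to check the same thing on their own guests, the per-interface error and drop counters can be read straight out of /proc/net/dev (a sketch; the column positions follow the standard 2.4/2.6 format of that file):

```shell
# Print error/drop counters per interface from /proc/net/dev.
# After the interface name the fields are: rx bytes,packets,errs,drop,...
# then tx bytes,packets,errs,drop,... -- all should stay at 0 on a clean link.
awk 'NR > 2 {
    gsub(":", " ")
    print $1, "rx_errs=" $4, "rx_drop=" $5, "tx_errs=" $12, "tx_drop=" $13
}' /proc/net/dev
```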
I totally agree that this problem should not be happening though... It's similar to issues with Red Hat SCSI drivers in a multipath environment, where a SAN switch reset confuses the multipath driver and causes the filesystem to flag a corrupt journal. Is it possible that under heavy load your iSCSI network is dropping packets or performing badly in some way?
Can you see if there are any errors reported in ESX's /var/log/vmkernel?
Yep, and those have been provided to VMware. When using the software iSCSI initiator, the problem happens much more rapidly than when using an HBA. This makes a lot of sense, as the HBA is interrupt driven, while the software initiator is system-load dependent. On a highly utilized host, the software iSCSI initiator is NOT going to be able to service the data as quickly as an HBA will.
So, back to your question: we see buffer overruns for reserve/release activity, and several iSCSI timeouts. The timeouts would appear to indicate a networking issue such as dropped packets (none showed on either end, or on either port!), but we can account for every packet back and forth between the client and the target and there ARE NO DROPPED PACKETS!
We've even removed the switch from the mix and gone to crossover cables between the host and the target. Same problem.
Also, make sure network connectivity is healthy and there are no intermittent failures. A path to storage being down for an extended amount of time (for RH it is typically 1 min, I guess) can take the FS read-only.
Which target are you using?
IETD, specifically patched with reserve/release support and several other patches that have been submitted back to the project as a result of 9 months of testing with VMware.
Is this a shared network or dedicated to iSCSI only?
Dedicated to iSCSI only.
Is any VLAN configuration in the picture?
No.
Any traffic shaping or bandwidth allocation restrictions placed on the physical switch that is connected to iSCSI?
On a crossover cable? No.
Who did you provide the vmkernel log to? Did you file an SR? Do you have an SR number?
1. The vmkernel log was provided to VMware support.
2. Yes, I have an SR#.
Can you post the SR number? We will take a look at the vm-support logs.
#316658
BTW, this was closed by VMware because the "iSCSI target you are using is not supported."
Never mind that the behavior occurs on HCL-supported iSCSI targets as well.
The basic cause of this problem, as explained to me by VMware engineering, is that if a SAN timeout happens, the VMkernel will send a message up to the guest reflecting the timeout. I was told that this is a known issue.
I'm wondering if this fact is a "known issue" as in "we will/must do something to fix it", or if it's an expected behaviour.
I'm wondering if this fact is a "known issue" as in "we will/must do something to fix it", or if it's an expected behaviour.
From what I understand, when the SAN times out (which SANs can and will do occasionally!), VMware sends a signal to the guest. This signal can be either "Storage Unavailable" or "Storage Timeout". The guest's storage drivers can react differently to these signals. Under Linux, the LSI driver freaks out and tells the OS that storage is unavailable, and ext3 goes into read-only mode.
According to the engineer that I spoke with, this was fixed for Windows guests, but not yet for Linux. I'm not sure how they plan on fixing it.
I've been battling this issue for weeks so it's good to know I'm not alone.
In researching the issue tonight I found the following bug, which looks like it might be related:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=197158
It looks like this problem may be caused by a fairly recent (within the last few months) change in the upstream LSI Logic driver. For RHEL4 systems this code changed between U2 and U3. I wonder if we could simply recompile the mptscsih driver from U2 for use on U3 and newer as a temporary solution. Has anybody tried that?
Basically, the newer driver adds an extra DID_BUS_BUSY host status to a SCSI command failure of MPI_SCSI_STATUS_BUSY, which causes the SCSI mid-layer to retry only 5 times before reporting a failure to the upper layers. In previous versions, without this extra return status, the SCSI mid-layer would retry indefinitely.
If this is really the issue, it should be pretty simple to patch and compile a new mptscsih driver to temporarily work around it. Maybe I'll give that a try tomorrow.
Later,
Tom
OK, today I decided to try patching the mptscsih driver to revert its behavior to that of RHEL4 U2 and prior. So far, this seems to be working. I've run significant tests on several systems today, including a set of my most troublesome boxes, connected to a lowly AX150i, that previously would fail in 5-10 minutes. So far the boxes have survived the day without issue.
There have been several iSCSI timeout issues on the hosts (as noted in the vmkernel logs) which would normally have been propagated as SCSI timeouts in the guests, but with the patched driver the Linux systems seem to just pause and then return to normal operation.
I've posted the patched files and some crude instructions at http://www.tuxyturvy.com/blog/index.php?/archives/31-VMware-ESX-and-ext3-journal-aborts.html so others can give it a try if they want to.
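If you try the patched module, it's worth confirming the guest actually loaded your build rather than the stock one. Standard module tooling shows this (a sketch; the module name follows RHEL4's mptscsih, and the srcversion/filename will differ between builds):

```shell
# Confirm which mptscsih build is loaded; filename and srcversion
# differ between the stock module and a locally rebuilt one.
modinfo mptscsih 2>/dev/null | egrep '^(filename|version|srcversion)' \
    || echo "mptscsih module not present on this kernel"
```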
I'm not willing to call this 100% yet; perhaps the systems are just behaving today, but so far it looks good.
Later,
Tom
What kind of test are you doing to get to the read-only situation?
I assume the boot disk is on an iSCSI volume. Do you use a virtual disk on a VMFS volume or an RDM for the guest's boot disk? If using VMFS, do you perform any I/O to the VMFS volume apart from booting (and the guest's own I/O)?
Are you sharing the LUN with any other server?
I can't remember, but did you mention which target array you are using?
Do you use NIC teaming? If yes, could you check if the problem happens when there is only one uplink?
Wow, you sure ask a lot of questions!
What kind of test are you doing to get to the read-only situation?
Basically, anything that eventually leads to a timeout on the ESX host. For the most part this simply involves running disk-intensive benchmarks on multiple guests. We've been using iozone, but I suspect you could use almost anything that generates a lot of I/O.
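For anyone without iozone handy, even a few parallel fsync-heavy writers will generate the same kind of pressure. A minimal stand-in (the sizes here are deliberately tiny for illustration; scale count/bs way up to stress a real datastore):

```shell
# Crude I/O load generator: parallel writers that fsync before closing.
for i in 1 2 3 4; do
    dd if=/dev/zero of=/tmp/load.$i bs=64k count=16 conv=fsync 2>/dev/null &
done
wait
ls -l /tmp/load.*      # four 1 MB files when all writers finish
rm -f /tmp/load.*
```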
I assume the boot disk is on an iSCSI volume.
No, the boot disk is on locally attached RAID storage (Dell PERC 4).
Do you use a virtual disk on a VMFS volume or an RDM for the guest's boot disk? If using VMFS, do you perform any I/O to the VMFS volume apart from booting (and the guest's own I/O)?
We use a virtual disk on a VMFS volume. Yes, of course there is I/O other than booting, since the entire virtual disk for the guest is on the VMFS volume.
Are you sharing the LUN with any other server?
One other ESX server.
I can't remember, but did you mention which target array you are using?
Do you use NIC teaming? If yes, could you check if the problem happens when there is only one uplink?
Yes, but the problem will happen even with a single uplink.
You seem to be under the impression that the problem can only happen with iSCSI; however, we have reproduced the issue with locally attached RAID and with Fibre Channel attached storage. Basically, if you can generate enough I/O that you get an async I/O error, the chances of seeing your ext3 volumes go read-only are high.
If you see a message like this in your log:
Oct 22 01:28:19 esxhost1 vmkernel: 13:20:19:56.753 cpu3:1028)SCSI: 3731: AsyncIO timeout (5000); aborting cmd w/ sn 399366, handle b78/0x6a028a8
You are probably going to have a problem. With the small change I made to the LSI Logic driver, the Linux systems seem to be able to survive these timeouts.
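Counting those events across a log is a one-liner. Here's a sketch run against a pasted sample of the line above (on a live host you would point grep at /var/log/vmkernel instead):

```shell
# Count AsyncIO timeout events; sample line pasted from the host log above.
cat > /tmp/vmkernel.sample <<'EOF'
Oct 22 01:28:19 esxhost1 vmkernel: 13:20:19:56.753 cpu3:1028)SCSI: 3731: AsyncIO timeout (5000); aborting cmd w/ sn 399366, handle b78/0x6a028a8
EOF
grep -c "AsyncIO timeout" /tmp/vmkernel.sample
# on ESX: grep -c "AsyncIO timeout" /var/log/vmkernel
```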
Later,
Tom
I have updated my development CentOS 4.4 VM with your proposed fix. It usually takes 24-48 hours for me to encounter a read-only filesystem; so far it's been working well since 9:00 AM EST.
If the VM is stable after a week I'll be fairly certain the cause of this problem has been found.
Thank you Tom for the patch and directions.
Thanks. I'd like to hear from Damin too, regarding his configuration and how the iSCSI was being used.
We have now survived 24 hours running our simple iozone benchmark, which used to trigger the issue in only a few minutes, so I'm very interested in your results.
If you do hit the problem, it would be great if you could capture the output of dmesg.
Later,
Tom