Re: Console Journal Errors

acowl · ‎02-21-2007

Hi,

I have installed ESX 3.0.1 on four DL380G5 servers, each with mirrored 72GB drives on the P400 controller. All of the servers report a journal I/O error on the console. I have rebuilt them all, and disk space is not a problem.

The exact error is:

EXT3-fs error (device cciss0(104,5)) in start_transaction: Journal has aborted

I also get

Error (-5) journal on device cciss0(104,5)
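For anyone trying to work out which partition these messages point at: the pair in parentheses is the block device's major and minor number. The sketch below decodes it, assuming the standard Linux cciss numbering (majors 104-111 for controllers 0-7, 16 minors per logical drive). The function names are made up for illustration, and this is written for a current Python 3 rather than whatever ships on the ESX service console.

```python
import re

# Standard Linux numbering for the cciss driver (assumed to apply here):
# majors 104-111 map to controllers 0-7, and each logical drive gets
# 16 minors (0 = whole disk, 1-15 = partitions).
CCISS_FIRST_MAJOR = 104
CCISS_LAST_MAJOR = 111
MINORS_PER_DISK = 16

# Matches the "cciss0(104,5)" fragment found in both error messages.
DEVICE_RE = re.compile(r"cciss\d+\((\d+),(\d+)\)")


def cciss_device_path(major, minor):
    """Translate a cciss (major, minor) pair into a /dev/cciss path."""
    if not CCISS_FIRST_MAJOR <= major <= CCISS_LAST_MAJOR:
        raise ValueError("major %d is not a cciss major" % major)
    controller = major - CCISS_FIRST_MAJOR
    disk, partition = divmod(minor, MINORS_PER_DISK)
    path = "/dev/cciss/c%dd%d" % (controller, disk)
    if partition:
        path += "p%d" % partition
    return path


def device_from_error(line):
    """Return the device path named in a console error line, or None."""
    match = DEVICE_RE.search(line)
    if match is None:
        return None
    return cciss_device_path(int(match.group(1)), int(match.group(2)))


if __name__ == "__main__":
    msg = ("EXT3-fs error (device cciss0(104,5)) in start_transaction: "
           "Journal has aborted")
    print(device_from_error(msg))
```

So (104,5) should correspond to the fifth partition on the first logical drive of the first controller, i.e. the first logical partition inside the extended partition.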

All the system firmware has been updated to the latest version available. The error occurs regardless of whether the Insight Manager agents are loaded. The console memory has been raised to 350MB to allow for the Insight agents that will be loaded later, and the swap file has been increased to 1024MB.

Any help/suggestions gratefully received.

Andy

vmako · ‎02-22-2007

We are still suffering these errors intermittently on four servers despite reinstalling ESX.

The console is logging the errors mentioned previously along with:

"journal commit I/O error"

"EXT3-fs error (device cciss0(104,5)) in start_transaction: Journal has aborted"

The problem becomes apparent when the HA agent fails on the servers, and it appears that the /opt filesystem has become read-only, although the mount command still shows /opt mounted as rw.
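One possible explanation for the mismatch (an assumption, not something confirmed in this thread) is that the mount command reports what is recorded in /etc/mtab, while the kernel's own view in /proc/mounts already shows the filesystem as ro after the journal abort. A rough way to compare the two views is to read /proc/mounts directly and then try an actual write. The helper names below are hypothetical, and the sketch targets a current Python 3.

```python
import os
import tempfile


def mount_options(mount_point, mounts_file="/proc/mounts"):
    """Return the kernel-reported option list for mount_point, or None."""
    options = None
    with open(mounts_file) as f:
        for line in f:
            fields = line.split()
            # Fields: device, mount point, fs type, options, dump, pass.
            if len(fields) >= 4 and fields[1] == mount_point:
                # Keep the last match; later mounts shadow earlier ones.
                options = fields[3].split(",")
    return options


def is_writable(directory):
    """Try to create and delete a temp file; False if the write fails."""
    try:
        fd, path = tempfile.mkstemp(dir=directory)
    except OSError:
        return False
    os.close(fd)
    os.remove(path)
    return True


if __name__ == "__main__":
    for mount_point in ("/opt", "/var"):
        print(mount_point, mount_options(mount_point), is_writable(mount_point))
```

If the options list contains ro, or the write test fails while mount still says rw, the filesystem has most likely been remounted read-only underneath you.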

Can anyone shed any light on this infuriating problem?

TIA,

Dan.

Rumple · ‎02-26-2007

I've had the same problem twice on our ESX Servers running on IBM x366 servers. There was apparently a firmware upgrade for both the ServeRAID 8i controller and the drives.

I've upgraded the firmware and hopefully won't see it again, although the funniest part is that after upgrading the firmware on one server it came up and stated that 3 out of 4 drives were missing in my array (nice).

I was working with IBM and missed the prompt to boot into the ServeRAID CD, and ESX freaking loaded and has been running all day.

IBM cannot explain why ServeRAID Manager sees only a single drive in the array running while ESX is able to load the array with no problems.

vmako · ‎02-27-2007

This is currently logged with and being investigated by HP.

acowl · ‎03-08-2007

It is still with HP!

Can anyone else shed some light on what may be wrong?

Thanks

Andy

dpomeroy · ‎04-17-2007

I have a DL580G2 that was running ESX 2.5.x fine for over a year. I just rebuilt it as ESX 3.0.1 and started having this problem.

EXT3-fs error (device cciss0(104,5)) in start_transaction: Journal has aborted

Has anyone else with this issue found any resolution?

It seems in our case it will be fine until I plug in the fiber. /var goes read only and we get the journal I/O errors.

What is interesting is that we have 4 other same servers that were rebuilt as ESX 3.0.1 and are working fine. All have the same hardware and same BIOS/Firmware.

acowl · ‎04-18-2007

Hi All,

This has now been diagnosed as a bug by HP/VMware. It is relevant to both the P400 and P600 controllers. It is hoped that a fix will be included in the May patch release bundle.

It seems to occur specifically when partition 5 is used. For example, in our case the partition table looked like this:

/boot = 100MB

swap = 1024MB

/ = 5000MB

Then in extended

/var = 2000MB

/opt = 2000MB

/home = 2000MB

/usr = 2000MB

Regards,

Andy

dpomeroy · ‎04-18-2007

It's more than just the P400/600 controller.

I have this on a DL580G2 that uses the onboard 5i+ controller. If I use the default partitions you get with the install it's fine; if I change it to our standard:

/boot=250mb

/=10gb

swap=2gb

EXTENDED

var=5gb

tmp=5gb

core=100mb

vmfs3=rest of disk

then it doesn't work.

The odd thing is that I have other DL580G2s running ESX 3.0.1 without this issue. I thought they were all at the same BIOS/firmware level, but now I'm wondering if that has something to do with it.

vmako · ‎04-20-2007

According to VMware, the issue is a bug in the partition protection code. Therefore, from what I understand, this issue is not confined to the DL380G5 and may occur on other server hardware.

It appears that the partition layout that we used caused the issue to appear in our environment. We await a patch that may or may not appear before the May monthly patch release.

Don, which partition do you see switch to read only in your configuration?

dpomeroy · ‎04-20-2007

In our config it's /var that goes read-only.

This is the layout:

Disk /dev/cciss/c0d0: 36.4 GB, 36414750720 bytes
64 heads, 32 sectors/track, 34727 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes

Device             Boot  Start    End    Blocks    Id  System        Mount
/dev/cciss/c0d0p1  *         1    250    255984    83  Linux         /boot
/dev/cciss/c0d0p2          251  10250  10240000    83  Linux         /
/dev/cciss/c0d0p3        10251  12298   2097152    82  Linux swap    swap
/dev/cciss/c0d0p4        12299  34727  22967296     f  Win95 Ext'd
/dev/cciss/c0d0p5        12299  17298   5119984    83  Linux         /tmp
/dev/cciss/c0d0p6        17299  22298   5119984    83  Linux         /var
/dev/cciss/c0d0p7        22299  34627  12624880    fb  Unknown       local vmfs
/dev/cciss/c0d0p8        34628  34727    102384    fc  Unknown       core dump

dpomeroy · ‎04-20-2007

One other thing I see on these systems that have this problem is this in the hostd log:

[2007-04-20 02:26:03.050 'Partitionsvc' 3076464768 info] Error Stream from partedUtil while getting partitions: Error: The partition table on /vmfs/devices/disks/vmhba0:0:0:0 is inconsistent. There are many reasons why this might be the case. However,the most likely reason is that Linux detected the BIOS geometry for /vmfs/devices/disks/vmhba0:0:0:0 incorrectly. GNU Parted suspects the real geometry should be 34727/64/32 (not 4427/255/63). You should check with your BIOS first, as this may not be correct. You can inform Linux by adding the parameter /devices/disks/vmhba0:0:0:0=34727,64,32 to the command line. See the LILO or GRUB documentation for more information. If you think Parted's suggested geometry is correct, you may select Ignore to continue (and fix Linux later). Otherwise, select Cancel (and fix Linux and/or the BIOS now).
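As a sanity check on that message, the two geometries Parted mentions appear to be two cylinder/head/sector views of the same disk. The small sketch below redoes the arithmetic using the byte count from the fdisk output posted earlier in this thread and an assumed 512-byte sector. It only shows that the numbers are consistent with each other; it says nothing about whether the geometry mismatch is related to the journal errors.

```python
SECTOR_BYTES = 512  # assumed sector size, matching the fdisk "Units" line


def cylinders(disk_bytes, heads, sectors_per_track):
    """Whole cylinders that fit on the disk for a given heads/sectors geometry."""
    total_sectors = disk_bytes // SECTOR_BYTES
    return total_sectors // (heads * sectors_per_track)


if __name__ == "__main__":
    disk_bytes = 36414750720  # from the fdisk output for /dev/cciss/c0d0

    # Geometry reported by fdisk and suggested by Parted: 64 heads, 32 sectors.
    print(cylinders(disk_bytes, 64, 32))    # 34727

    # Geometry Parted says Linux detected: 255 heads, 63 sectors.
    print(cylinders(disk_bytes, 255, 63))   # 4427
```

Both results match the cylinder counts in the log line, so the disk size itself is not in dispute, only how it is carved into cylinders.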

twwlogin · ‎05-16-2007

We're seeing this error on a Sun X4200M2 with ESX 3.0.1. It is very frustrating. We have ESX loaded on one of the internal drives and have the guest OS's accessed via NFS. When the NFS file server and the X4200M2 are connected to a GigE switch, the problem surfaces every few hours. We're running with both on a 10/100 switch for now which seems to alleviate the problem.

Anyone with an idea when this issue will be resolved?

Rand · ‎07-16-2007

I am getting the exact error. Has anyone found a resolution?

mlydy · ‎10-16-2007

Has anybody heard of a fix for this problem?