VMware Cloud Community
monkalnakor
Contributor

Lost ESX 4.1 node with HBA error

Hi,

I have an IBM BladeCenter H with four HS22 blades installed.

I use vSphere Enterprise Plus with ESX 4.1.

This morning I found one cluster node down with all of its VMs powered off, and HA did not restart them on the other hosts.

In /var/log/messages on that node I found many messages like these:

Oct 10 01:08:06 esx2 vobd: Oct 10 01:08:06.360: 5295395498800us: [esx.clear.storage.redundancy.restored] Path redundancy to storage device naa.600a0b800074cc480000014f4c47ba58 (Datastores: "VMDisk1430_A") restored. Path vmhba2:C0:T0:L0 is active again..

Oct 10 01:09:43 esx2 vobd: Oct 10 01:09:43.364: 5295492544743us: [vob.scsi.scsipath.por] Power-on Reset occurred on vmhba2:C0:T0:L0.

These messages started appearing on October 7, but the system only crashed today.
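To see exactly when the path flapping started and how often it recurred, a small script can count vobd events per day in a copy of /var/log/messages. This is a minimal sketch, assuming the standard syslog prefix shown in the lines above ("Oct 10 01:08:06 esx2 vobd:") and that event ids look like esx.clear.storage.redundancy.restored or vob.scsi.scsipath.por; the function name tally_vobd_events is hypothetical, not a VMware tool. Run it with Python 3 on any machine, not on the ESX service console.

```python
import re
import sys
from collections import Counter

# Syslog prefix of a vobd line, e.g. "Oct 10 01:08:06 esx2 vobd:" (assumed format).
PREFIX_RE = re.compile(
    r"^([A-Z][a-z]{2})\s+(\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+\S+\s+vobd:"
)
# vobd event id such as "esx.clear.storage.redundancy.restored" or
# "vob.scsi.scsipath.por"; matches with or without surrounding [brackets].
EVENT_RE = re.compile(r"\b((?:esx|vob)\.[a-z0-9_]+(?:\.[a-z0-9_]+)+)")


def tally_vobd_events(lines):
    """Count vobd events per (month, day, event id)."""
    counts = Counter()
    for line in lines:
        prefix = PREFIX_RE.match(line)
        if prefix is None:
            continue  # not a vobd line (vmkernel, hostd, ...)
        # Look for the event id only after the "vobd:" tag.
        event = EVENT_RE.search(line, prefix.end())
        if event is not None:
            key = (prefix.group(1), int(prefix.group(2)), event.group(1))
            counts[key] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 vobd_tally.py messages   (a copy of /var/log/messages)
    with open(sys.argv[1]) as fh:
        for (month, day, event), n in sorted(tally_vobd_events(fh).items()):
            print(f"{month} {day:2d}  {event:45s} {n}")
```

A sharp rise in vob.scsi.scsipath.por (Power-on Reset) counts from October 7 onward would point at the vmhba2 path rather than a single transient event.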

In /var/log/vmkernel I found many messages like these:

Oct 14 06:01:05 esx2 vmkernel: 65:11:49:34.674 cpu4:62021)WARNING: LinScsi: SCSILinuxAbortCommand: The driver failed to call done from its abort handler and yet it returned SUCCESS

Oct 14 06:01:05 esx2 vmkernel: 65:11:49:34.674 cpu4:62021)WARNING: LinScsi: SCSILinuxAbortCommands: Failed, Driver qla2xxx, for vmhba2

Oct 14 06:01:05 esx2 vmkernel: 65:11:49:34.674 cpu4:62021)WARNING: ScsiPath: 5176: Set retry timeout for failed TaskMgmt abort for CmdSN 0x0, status Failure, path vmhba2:C0:T0:L0

Oct 14 06:01:05 esx2 vmkernel: 65:11:49:34.777 cpu11:62023) drv 8.46]

Oct 14 06:01:07 esx2 vmkernel: 65:11:49:36.514 cpu6:61981)ScsiDeviceIO: 1672: Command 0x16 to device "naa.600a0b800074cc480000014f4c47ba58" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

Oct 14 06:01:07 esx2 vmkernel: 65:11:49:36.514 cpu6:61981)WARNING: NMP: nmp_DeviceStartLoop: NMP Device "naa.600a0b800074cc480000014f4c47ba58" is blocked. Not starting I/O from device.

Oct 14 06:01:07 esx2 vmkernel: 65:11:49:36.514 cpu13:4135)WARNING: FS3: 7030: Reservation error: Timeout
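The ScsiDeviceIO line packs most of the diagnosis into hex codes: the SCSI opcode (Command 0x16) and the host, device and plugin status bytes (H:, D:, P:). The sketch below decodes them. It is a minimal, hypothetical helper (decode_scsi_failure is not a VMware tool); the opcode and device-status tables use standard SCSI values, while the host-status names follow the Linux SCSI midlayer DID_* numbering, which I am assuming ESX uses for the H: field.

```python
import re

# Host status ("H:"); names assume the Linux SCSI midlayer DID_* numbering.
HOST_STATUS = {
    0x0: "OK", 0x1: "NO_CONNECT", 0x2: "BUS_BUSY", 0x3: "TIME_OUT",
    0x4: "BAD_TARGET", 0x5: "ABORT", 0x6: "PARITY", 0x7: "ERROR",
    0x8: "RESET", 0x9: "BAD_INTR",
}
# Device status ("D:"); standard SCSI status codes.
DEVICE_STATUS = {
    0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY",
    0x18: "RESERVATION CONFLICT", 0x28: "TASK SET FULL",
}
# A few SCSI opcodes; 0x16/0x17 are the SCSI-2 reservation commands.
OPCODES = {0x16: "RESERVE(6)", 0x17: "RELEASE(6)", 0x28: "READ(10)", 0x2A: "WRITE(10)"}

LINE_RE = re.compile(
    r'Command (0x[0-9a-fA-F]+) to device "([^"]+)" failed '
    r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)"
)


def decode_scsi_failure(line):
    """Decode a ScsiDeviceIO 'Command ... failed H: D: P:' line, or return None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    opcode, host, dev_status, plugin = (int(x, 16) for x in m.group(1, 3, 4, 5))
    return {
        "device": m.group(2),
        "command": OPCODES.get(opcode, hex(opcode)),
        "host": HOST_STATUS.get(host, hex(host)),
        "device_status": DEVICE_STATUS.get(dev_status, hex(dev_status)),
        "plugin": hex(plugin),
    }
```

Under these assumptions, the line above reads as a RESERVE(6) command that was aborted on the host side (H:0x5) while the array itself reported GOOD, which fits the FS3 "Reservation error: Timeout" that follows it.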

This morning I resolved the problem by restarting the physical blade from the IBM AMM (Advanced Management Module).

After the restart, the HBA error messages stopped.

What happened?

Why didn't HA work?

One last note: when I connected to the machine through the console, I saw this error screen (see attached file).

monkalnakor
Contributor

Can anyone help me?
