ESXi 5 and hardware RAID: SCSI error messages duri...

Loc2262 · ‎04-01-2012

Hello!

I'm operating two servers at a German hoster, and they have an SAS MegaRAID controller for hardware RAID-1.

Once per week, in the night from Friday to Saturday, from about 3 AM to 9 AM (GMT), the controller performs a self-check of the array. This is accompanied by the following entries in /scratch/log/vmkernel.log:

2012-03-24T03:00:36.732Z cpu5:3547)<6>megasas_service_aen[6]: aen received
2012-03-24T03:00:36.732Z cpu0:2659)<6>megasas_hotplug_work[6]: event code 0x0027
2012-03-24T03:00:41.172Z cpu0:2659)<6>megasas_hotplug_work[6]: aen registered
2012-03-24T03:00:41.178Z cpu5:339084)<6>megasas_service_aen[6]: aen received
2012-03-24T03:00:41.178Z cpu3:2655)<6>megasas_hotplug_work[6]: event code 0x0042
2012-03-24T03:00:41.185Z cpu3:2655)<6>megasas_hotplug_work[6]: aen registered
2012-03-24T03:00:41.191Z cpu5:3779)<6>megasas_service_aen[6]: aen received
2012-03-24T03:00:41.191Z cpu0:2653)<6>megasas_hotplug_work[6]: event code 0x0023
2012-03-24T03:00:41.191Z cpu4:2653)<6>megasas_hotplug_work[6]: aen registered
2012-03-24T03:00:41.191Z cpu5:3779)<6>megasas_service_aen[6]: aen received
2012-03-24T03:00:41.191Z cpu7:2652)<6>megasas_hotplug_work[6]: event code 0x0041
2012-03-24T03:00:41.191Z cpu7:2652)<6>megasas_hotplug_work[6]: aen registered
2012-03-24T03:01:30.854Z cpu5:3547)<6>megasas_service_aen[6]: aen received
2012-03-24T03:01:30.854Z cpu0:2650)<6>megasas_hotplug_work[6]: event code 0x0041
2012-03-24T03:01:30.863Z cpu0:2650)<6>megasas_hotplug_work[6]: aen registered
2012-03-24T03:02:19.287Z cpu5:3769)<6>megasas_service_aen[6]: aen received
[...]
2012-03-31T08:44:12.622Z cpu1:2647)<6>megasas_hotplug_work[6]: event code 0x0041
2012-03-31T08:44:12.631Z cpu1:2647)<6>megasas_hotplug_work[6]: aen registered
2012-03-31T08:45:32.106Z cpu5:3758)<6>megasas_service_aen[6]: aen received
2012-03-31T08:45:32.106Z cpu0:2652)<6>megasas_hotplug_work[6]: event code 0x0041
2012-03-31T08:45:32.115Z cpu0:2652)<6>megasas_hotplug_work[6]: aen registered
2012-03-31T08:46:50.790Z cpu5:2053)<6>megasas_service_aen[6]: aen received
2012-03-31T08:46:50.791Z cpu0:2650)<6>megasas_hotplug_work[6]: event code 0x0041
2012-03-31T08:46:50.800Z cpu6:2650)<6>megasas_hotplug_work[6]: aen registered
2012-03-31T08:46:55.061Z cpu5:2053)<6>megasas_service_aen[6]: aen received
2012-03-31T08:46:55.061Z cpu1:2646)<6>megasas_hotplug_work[6]: event code 0x003a
2012-03-31T08:47:01.341Z cpu1:2646)<6>megasas_hotplug_work[6]: aen registered

The hoster says this is normal behavior. But, during the process, in irregular intervals, there are error messages logged too, like so:

2012-03-31T06:14:42.664Z cpu0:1872490)megasas: ABORT sn 97000334 cmd=0x2a retries=0 tmo=0
2012-03-31T06:14:42.664Z cpu0:1872490)megaraid_sas: megasas_abort Line 2022: ABORT instance->fw_outstanding: 32 , instance->disableOnlineCtrlReset: 0
2012-03-31T06:14:42.664Z cpu0:1872490)<5>0 :: megasas: RESET -97000334 cmd=2a retries=0
2012-03-31T06:14:42.664Z cpu0:1872490)megaraid_sas: HBA reset handler invoked without an internal reset condition.
2012-03-31T06:14:43.886Z cpu2:3547)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1055 Host Busy vmhba1:2:0:0 (driver name: LSI Logic SAS based MegaRAID driver) - Message repeated 799 times
2012-03-31T06:14:43.887Z cpu2:3547)ScsiDeviceIO: 2305: Cmd(0x4124007cc2c0) 0x2a, CmdSN 0x360d21 to dev "naa.600605b004135f2016ac39ca03d6753e" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-03-31T06:14:43.887Z cpu2:3547)ScsiDeviceIO: 2305: Cmd(0x4124007badc0) 0x2a, CmdSN 0x360d23 to dev "naa.600605b004135f2016ac39ca03d6753e" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-03-31T06:14:43.887Z cpu2:3547)ScsiDeviceIO: 2305: Cmd(0x412400781180) 0x28, CmdSN 0xdd to dev "naa.600605b004135f2016ac39ca03d6753e" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-03-31T06:14:43.887Z cpu2:3547)ScsiDeviceIO: 2305: Cmd(0x412400768a00) 0x28, CmdSN 0xd0 to dev "naa.600605b004135f2016ac39ca03d6753e" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-03-31T06:14:43.887Z cpu2:3547)ScsiDeviceIO: 2305: Cmd(0x4124007a6d40) 0x88, CmdSN 0x360d24 to dev "naa.600605b004135f2016ac39ca03d6753e" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-03-31T06:14:43.887Z cpu4:3801)ScsiDeviceIO: 2305: Cmd(0x412400781380) 0x2a, CmdSN 0xc6 to dev "naa.600605b004135f2016ac39ca03d6753e" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-03-31T06:14:43.942Z cpu5:2053)ScsiDeviceIO: 2305: Cmd(0x4124007cddc0) 0x28, CmdSN 0xd3 to dev "naa.600605b004135f2016ac39ca03d6753e" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-03-31T06:14:44.440Z cpu1:2049)ScsiDeviceIO: 2305: Cmd(0x412400725c40) 0x2a, CmdSN 0x360d28 to dev "naa.600605b004135f2016ac39ca03d6753e" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-03-31T06:14:44.685Z cpu0:1872490)<7>megaraid_sas: megasas_wait_for_outstanding: line 1875: AFTER HBA reset handler invoked without an internal reset condition: took 2 seconds. Max is 180.
2012-03-31T06:14:44.685Z cpu0:1872490)megaraid_sas: no more pending commands remain after reset handling.
2012-03-31T06:14:44.685Z cpu0:1872490)<5>megasas: reset successful

2012-03-31T06:14:44.685Z cpu5:3060)ScsiDeviceIO: 2288: Cmd(0x4124007cc2c0) 0x2a, CmdSN 0x360d21 to dev "naa.600605b004135f2016ac39ca03d6753e" failed H:0x8 D:0x0 P:0x0
2012-03-31T06:14:44.685Z cpu5:3060)ScsiDeviceIO: 2288: Cmd(0x4124007badc0) 0x2a, CmdSN 0x360d23 to dev "naa.600605b004135f2016ac39ca03d6753e" failed H:0x8 D:0x0 P:0x0
2012-03-31T06:14:44.685Z cpu5:3060)ScsiDeviceIO: 2288: Cmd(0x4124007097c0) 0x2a, CmdSN 0x360d25 to dev "naa.600605b004135f2016ac39ca03d6753e" failed H:0x8 D:0x0 P:0x0
2012-03-31T06:14:44.685Z cpu5:3060)ScsiDeviceIO: 2288: Cmd(0x412400711940) 0x2a, CmdSN 0x360d22 to dev "naa.600605b004135f2016ac39ca03d6753e" failed H:0x8 D:0x0 P:0x0
2012-03-31T06:14:44.685Z cpu5:3060)ScsiDeviceIO: 2288: Cmd(0x412400776c80) 0x2a, CmdSN 0x360d26 to dev "naa.600605b004135f2016ac39ca03d6753e" failed H:0x8 D:0x0 P:0x0
2012-03-31T06:14:44.685Z cpu0:2120)HBX: 2313: Waiting for timed out [HB state abcdef02 offset 4079616 gen 165 stampUS 3315978691840 uuid 4f440815-e7ef42fc-1fae-5404a6a68908 jrnl <FB 112800> drv 14.54] on vol 'Store1'
2012-03-31T06:14:44.685Z cpu0:3547)HBX: 2313: Waiting for timed out [HB state abcdef02 offset 4079616 gen 165 stampUS 3315978691840 uuid 4f440815-e7ef42fc-1fae-5404a6a68908 jrnl <FB 112800> drv 14.54] on vol 'Store1'
2012-03-31T06:14:44.686Z cpu7:3790)HBX: 2313: Waiting for timed out [HB state abcdef02 offset 4079616 gen 165 stampUS 3315978691840 uuid 4f440815-e7ef42fc-1fae-5404a6a68908 jrnl <FB 112800> drv 14.54] on vol 'Store1'
2012-03-31T06:14:44.785Z cpu7:2069)HBX: 231: Reclaimed heartbeat for volume 4f32a059-77730f9f-af26-5404a6a68908 (Store1): [Timeout] [HB state abcdef02 offset 4079616 gen 165 stampUS 3315990813475 uuid 4f440815-e7ef42fc-1fae-5404a6a68908 jrnl <FB 112800>$

While I haven't experienced concrete problems due to this yet, the messages leave me a bit worried. What commands are lost there? Are they repeated after recovery from the "busy state"? Is data that's supposed to get written to the HDD lost? Will VMs get informed about this?

Any insight on this would be greatly appreciated.

Kind regards,

Frank

VTsukanov · ‎04-01-2012

Accoding to Interpreting SCSI sense codes your SCSI code

H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0

means that the device is busy. Self-check controller may cause to such errors.

So I suggest to watch the logs after the finishing of the self-check procedure

Loc2262 · ‎04-01-2012

Hellos and thanks for your reply!

Yeah I quite agree about the "busy" part, since the log even explicitly mentions that, like so:

2012-03-31T06:14:43.886Z cpu2:3547)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1055 Host Busy vmhba1:2:0:0 (driver name: LSI Logic SAS based MegaRAID driver) - Message repeated 799 times

I do not see any "evident errors" that would suggest trouble after the process finished. But I'm not sure as to how to interpret the lines I see there.

I listed the entire sequence of error messages in my first post. I see those failure messages from ScsiDeviceIO, and "wait timed out" messaged from HBX, yet then it says

2012-03-31T06:14:44.685Z cpu0:1872490)<7>megaraid_sas: megasas_wait_for_outstanding: line 1875: AFTER HBA reset handler invoked without an internal reset condition: took 2 seconds. Max is 180.
2012-03-31T06:14:44.685Z cpu0:1872490)megaraid_sas: no more pending commands remain after reset handling.
2012-03-31T06:14:44.685Z cpu0:1872490)<5>megasas: reset successful
2012-03-31T06:14:44.785Z cpu7:2069)HBX: 231: Reclaimed heartbeat for volume 4f32a059-77730f9f-af26-5404a6a68908 (Store1): [Timeout] [HB state abcdef02 offset 4079616 gen 165 stampUS 3315990813475 uuid 4f440815-e7ef42fc-1fae-5404a6a68908 jrnl <FB 112800>$

How do I interpret the "no more pending commands remain"? Have those commands that ran into the "busy" situation been dropped? Or have they been successfully repeated and thus executed after the busy state ended?

I only see this kind of messages (host busy, SCSI command failed, timeout, reset successful, reclaimed heartbeat) during or immediately after the self-test procedure.

So, does this mean the vmkernel notes like so a "hiccup and slight delay, nothing to worry about"? Or rather a "your device became too busy during self-test, and data was lost"?

VTsukanov · ‎04-01-2012

Hi

From our experience, when RAID controller performs self-check, the host has very poor performance, but data are not lost. Nevertheless the backup is very important

> How do I interpret the "no more pending commands remain"?

Command queue is cleared after the reset command, so as far as I think it means that it is the result of execution reset command

I suppose that it more probable that assumption "hiccup and slight delay, nothing to worry about" is true, than the another.

Nevertheless you have to watch the logs

All

ESXi 5 and hardware RAID: SCSI error messages during array self-check