Re: Repeat SDSTAT_GOOD to SDSTAT_BUSY msgs in vmkernel log

msimms · ‎07-19-2007

Hello Experts!

I've been getting these messages in my vmkernel log file, repeated every 5 minutes, for a few weeks now. Is this a sign of bad things to come? The repeated messages are as follows:

Jul 17 22:04:00 MIGVMCL10 vmkernel: 24:11:20:55.153 cpu0:1024)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Jul 17 22:04:06 MIGVMCL10 vmkernel: 24:11:21:00.393 cpu0:1024)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

Jul 17 22:04:06 MIGVMCL10 vmkernel: 24:11:21:00.393 cpu0:1024)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Jul 17 22:04:10 MIGVMCL10 vmkernel: 24:11:21:04.625 cpu0:1024)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

Jul 17 22:04:10 MIGVMCL10 vmkernel: 24:11:21:04.625 cpu0:1024)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Jul 17 22:06:45 MIGVMCL10 vmkernel: 24:11:23:39.398 cpu0:1024)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

Jul 17 22:06:45 MIGVMCL10 vmkernel: 24:11:23:39.398 cpu0:1024)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Jul 17 22:06:46 MIGVMCL10 vmkernel: 24:11:23:41.010 cpu0:1175)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

Jul 17 22:06:46 MIGVMCL10 vmkernel: 24:11:23:41.010 cpu0:1175)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Jul 17 22:06:48 MIGVMCL10 vmkernel: 24:11:23:43.227 cpu0:1175)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

Jul 17 22:06:48 MIGVMCL10 vmkernel: 24:11:23:43.227 cpu0:1175)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Jul 17 22:06:50 MIGVMCL10 vmkernel: 24:11:23:44.537 cpu0:1175)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

Jul 17 22:06:50 MIGVMCL10 vmkernel: 24:11:23:44.537 cpu0:1175)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Jul 17 22:06:52 MIGVMCL10 vmkernel: 24:11:23:46.652 cpu0:1075)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

Jul 17 22:06:52 MIGVMCL10 vmkernel: 24:11:23:46.652 cpu0:1075)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Jul 17 22:06:53 MIGVMCL10 vmkernel: 24:11:23:47.962 cpu0:1024)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

Jul 17 22:06:53 MIGVMCL10 vmkernel: 24:11:23:47.962 cpu0:1024)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Jul 17 22:06:54 MIGVMCL10 vmkernel: 24:11:23:49.172 cpu0:1075)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

Jul 17 22:06:54 MIGVMCL10 vmkernel: 24:11:23:49.172 cpu0:1075)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Jul 17 22:06:56 MIGVMCL10 vmkernel: 24:11:23:50.582 cpu0:1173)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

Jul 17 22:06:56 MIGVMCL10 vmkernel: 24:11:23:50.582 cpu0:1173)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Jul 17 22:06:57 MIGVMCL10 vmkernel: 24:11:23:51.892 cpu0:1173)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

Jul 17 22:06:57 MIGVMCL10 vmkernel: 24:11:23:51.892 cpu0:1173)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Jul 17 22:06:59 MIGVMCL10 vmkernel: 24:11:23:54.109 cpu0:1075)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

-My setup consists of a 4-node ESX cluster sharing the same disks from an HP EVA4000. The above messages come from only 1 of the nodes; the other 3 ESX nodes report no such messages in /var/log/vmkernel.

-All disks presented to ESX boxes are used as RDM disks running in Virtual mode except for 2.

-There are no errors in the Command View logs for the EVA that I can interpret.

Could anybody shed some light on what could be causing these messages in the vmkernel log? Any thoughts would be greatly appreciated.

Thanks,

Matt
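To check whether the messages really recur on a fixed interval, the log can be bucketed by minute. The following is only a rough sketch: the regular expression is written against the sample lines quoted above (three of them are embedded as test data), and the function name `count_per_minute` is made up for illustration. In practice the lines would be read from /var/log/vmkernel.

```python
import re
from collections import Counter

# Three sample lines copied from the post above; replace with the real log file.
SAMPLE = """\
Jul 17 22:04:00 MIGVMCL10 vmkernel: 24:11:20:55.153 cpu0:1024)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY
Jul 17 22:04:06 MIGVMCL10 vmkernel: 24:11:21:00.393 cpu0:1024)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK
Jul 17 22:06:45 MIGVMCL10 vmkernel: 24:11:23:39.398 cpu0:1024)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY
"""

# Captures the syslog timestamp truncated to the minute and the message kind.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+\S+\s+vmkernel:.*"
    r"LinSCSI: \d+: Forcing (?P<kind>device|host) status"
)

def count_per_minute(lines):
    """Count 'Forcing device/host status' messages per (minute, kind)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[(match.group("stamp"), match.group("kind"))] += 1
    return counts

counts = count_per_minute(SAMPLE.splitlines())
for (minute, kind), n in sorted(counts.items()):
    print(minute, kind, n)
```

Gaps between the minutes that show up in the output give a quick picture of how regular the bursts are.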

BUGCHK · ‎07-20-2007

I would check whether the server's Fibre Channel links are stable, and whether the host entry on the EVA has the proper operating system assigned (='VMware' on current XCS firmware).

msimms · ‎07-20-2007

Looks like 'evaperf ps' is listing a couple more 'Discard Frames' on FP1 each day than the day before.

Yesterday's 'evaperf ps' listed Discard Frames for FP1 at 18 and 34.

Today's 'evaperf ps' lists Discard Frames for FP1 at 20 and 38.

Does 'Discard Frames' in 'evaperf ps' indicate possible physical Fibre Channel link errors?

Thanks,

Matt
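The day-over-day growth of those counters is simple arithmetic, sketched here with the two pairs of readings quoted above. The helper name `daily_growth` is hypothetical, and what the two values per reading represent is not stated in the post.

```python
# Readings quoted in the post above: two Discard Frames values for FP1.
yesterday = (18, 34)
today = (20, 38)

def daily_growth(prev, curr):
    """Return the per-counter increase between two readings."""
    return tuple(c - p for p, c in zip(prev, curr))

growth = daily_growth(yesterday, today)
print(growth)  # (2, 4)
```

Tracking this difference over several days would show whether the counters grow steadily or only when traffic is generated.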


BUGCHK · ‎07-24-2007

I have done a little digging, but did not find an explanation of these counters.

Have you looked at the Fibre Channel switch counters yet?

RParker · ‎07-24-2007

How do you look at the Fibre Channel switch counters?

BUGCHK · ‎07-24-2007

Well, that depends on the switch vendor.

For Brocade (this is an old, old model 2800) I use:

fcswa1:admin> portshow 0

portFlags: 0x20041 PRESENT U_PORT LED

portType: 3.1

portState: 2 Offline

portPhys: 4 No_Light

portScn: 2 Offline

portRegs: 0x80030000

portData: 0x10308950

portId: 050000

portWwn: 20:00:00:60:69:xx:xx:xx

portWwn of the device(s) connected:

None

Distance: normal

Speed: 1Gbps

Interrupts: 102 Link_failure: 4 Frjt: 0

Unknown: 8 Loss_of_sync: 8 Fbsy: 0

Lli: 50 Loss_of_sig: 5

Proc_rqrd: 52 Protocol_err: 0

Timed_out: 0 Invalid_word: 0

Rx_flushed: 0 Invalid_crc: 0

Tx_unavail: 0 Delim_err: 0

Free_buffer: 0 Address_err: 0

Overrun: 0 Lr_in: 6

Suspended: 0 Lr_out: 6

Parity_err: 0 Ols_in: 6

Ols_out: 4

fcswa1:admin> porterrshow

         frames   enc   crc   too   too   bad   enc  disc  link  loss  loss  frjt  fbsy
       tx    rx    in   err  shrt  long   eof   out    c3  fail  sync   sig
---------------------------------------------------------------------------------------
 0:    49    52     0     0     0     0     0     7     0     4     8     5     0     0
 1:    37    39     0     0     0     0     0     7     0     6     6     5     0     0
 2:   145   156     0     0     0     0     0    24     0    10    24    13     0     0
 3:   156   156     0     0     0     0     0    24     0    12    24    13     0     0
 4:     0     0     0     0     0     0     0     0     0     0     0     1     0     0
 5:     0     0     0     0     0     0     0     0     0     0     0     1     0     0
 6:     0     0     0     0     0     0     0     0     0     0     0     2     0     0
 7:     0     0     0     0     0     0     0     0     0     0     0     1     0     0
 8:  632k  351k     0     0     0     0     0    23     0     0    26    12     0     0
 9:     0     0     0     0     0     0     0     0     0     0     0     1     0     0
10:     0     0     0     0     0     0     0     0     0     0     0     1     0     0
11:     0     0     0     0     0     0     0     0     0     0     0     1     0     0
12:     0     0     0     0     0     0     0     0     0     0     0     1     0     0
13:     0     0     0     0     0     0     0     0     0     0     0     1     0     0
14:  186m  264m     0     0     0     0     0  306k     0    42  327k   124     0     0
15:  271m  129m     1     1     2     0     2    11     0     1     4     4     0     0

fcswa1:admin>
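For comparing ports, the porterrshow rows can be turned into numbers and filtered. This is a minimal sketch, not a Brocade tool: the column names are taken from the two header lines of the output above, the `k`/`m` suffixes are assumed to be rounded decimal multipliers, and the threshold of 1000 is an arbitrary, hypothetical cut-off.

```python
# Column names derived from the porterrshow header above.
COLUMNS = [
    "frames_tx", "frames_rx", "enc_in", "crc_err", "too_shrt", "too_long",
    "bad_eof", "enc_out", "disc_c3", "link_fail", "loss_sync", "loss_sig",
    "frjt", "fbsy",
]

# Assumption: 'k' and 'm' are decimal multipliers and the values are rounded.
SUFFIX = {"k": 1_000, "m": 1_000_000}

def to_int(text):
    """Convert '327k' or '186m' to an approximate integer; plain digits pass through."""
    if text[-1] in SUFFIX:
        return int(float(text[:-1]) * SUFFIX[text[-1]])
    return int(text)

def parse_row(line):
    """Parse one 'port: v1 v2 ...' row into (port, {column: value})."""
    port, rest = line.split(":", 1)
    values = [to_int(v) for v in rest.split()]
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} counters, got {len(values)}")
    return int(port), dict(zip(COLUMNS, values))

# Three rows copied from the output above.
ROWS = """\
8: 632k 351k 0 0 0 0 0 23 0 0 26 12 0 0
14: 186m 264m 0 0 0 0 0 306k 0 42 327k 124 0 0
15: 271m 129m 1 1 2 0 2 11 0 1 4 4 0 0
"""

THRESHOLD = 1000  # hypothetical cut-off for "suspiciously high"

ports = dict(parse_row(line) for line in ROWS.splitlines())
noisy = [port for port, c in ports.items() if c["loss_sync"] > THRESHOLD]
print(noisy)  # [14]
```

Because the counters are cumulative, comparing two snapshots taken some time apart says more than a single reading.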

RParker · ‎07-24-2007

We use Netapp, not nearly as easy.

Is there a way to show bandwidth on a Fibre Channel link? I see frame rate, but I wish I had a way to tell which of our four 2 Gbps Fibre Channel cards are at their bandwidth limit (I doubt any are, but it would be nice to see numbers to support this).

Thanks.

BUGCHK · ‎07-25-2007

Bandwidth is shown in this line of the portshow command:

Speed: 1Gbps

Or do you mean something like the throughput shown in portperfshow?

Unfortunately, it adds RX and TX together into a single figure.
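As a rough way to turn such a combined figure into a utilisation estimate, something like the following could be used. It is a sketch under several assumptions that are not from this thread: the throughput is taken to be in bytes per second, the link is assumed to use 8b/10b encoding (so each nominal Gbps carries about 100 MB/s per direction), and the function name `link_utilization` is made up.

```python
def link_utilization(combined_bytes_per_sec, link_gbps):
    """Estimate utilisation from a combined RX+TX byte rate.

    Assumptions (not from the thread): 8b/10b encoding, so each nominal
    Gbps carries roughly 100 MB/s per direction, and the link is full duplex.
    """
    per_direction = link_gbps * 100_000_000
    full_duplex_capacity = 2 * per_direction
    return combined_bytes_per_sec / full_duplex_capacity

# Example: 120 MB/s combined on a 2 Gbps link.
utilization = link_utilization(120_000_000, 2)
print(f"{utilization:.0%}")  # 30%
```

Since RX and TX are summed, a result like this cannot show whether one direction alone is saturated; it only gives an upper-level indication.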

msimms · ‎07-26-2007

Bugchk,

Port 14 on your Brocade has a really high reading in the 'loss sync' field... is that something for you to be concerned about?

I changed the HBA mezzanine card for the blade creating the errors in evaperf, but it didn't seem to help. I'm sure the FC lines are running clean, because the orange HBA cable coming out of the blade enclosure is a trunk that all blades communicate through, so all blades use the same cable to the EVA. However, only one of the blades (node 3) causes the discard frames when traffic is generated. The 'porterrshow' command on the Brocade switch doesn't show any abnormally high numbers in the counters for the problem blade compared with the rest of the blades. I can't think of anything else to troubleshoot.

Any Thoughts?

Thanks,

Matt

BUGCHK · ‎07-27-2007

"...is that something for you to be concerned about?"

Nope, it's a test server with a QLogic adapter. Looks like it is jumpered so that the laser is not under firmware control. In that case the signal is invalid and confuses the Fibre Channel switch. Thanks for the info, though.

Do you use Virtual Connect in the Blade chassis?

msimms · ‎07-29-2007

Nope, no Virtual Connect. We have a Brocade 4/12 switch in the back of the enclosure.

Chris_S_UK · ‎07-30-2007

Is the server an IBM xSeries server without an RSA card, i.e. with just the BMC? If so, do you have the Director Agents loaded?

I get these messages on all such servers, and have yet to find a fix. In my case, they repeat constantly and fill up /var.....

Chris

MarkBK · ‎10-25-2007

We are having similar problems with an EVA 5000 connecting through a Brocade 4/12. I found a KB article (link below) that suggests that the Brocade firmware needs to be 5.1.0b or higher. What firmware are you using? We are also using p-class blades.

msimms · ‎11-28-2007

We are using 5.0.4 firmware on a Brocade 4/24 switch in a c-class enclosure. (I made a mistake earlier: the Brocade wasn't a 4/12.) I've tried swapping out hardware with no success. I will try the firmware update to 5.1.0b.
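Checking an installed version against the "5.1.0b or higher" requirement can be sketched as below. The parsing assumes the version format is three dot-separated numbers with an optional trailing lowercase letter, which fits the two versions mentioned in this thread but is otherwise an assumption; `parse_version` and `needs_update` are hypothetical helper names.

```python
import re

def parse_version(text):
    """Split a version like '5.1.0b' into (5, 1, 0, 'b') for comparison.

    Assumed format: major.minor.patch plus an optional lowercase letter.
    """
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)([a-z]?)", text)
    if not match:
        raise ValueError(f"unrecognised version string: {text!r}")
    major, minor, patch, letter = match.groups()
    return (int(major), int(minor), int(patch), letter)

def needs_update(current, minimum):
    """True if the current version sorts below the required minimum."""
    return parse_version(current) < parse_version(minimum)

print(needs_update("5.0.4", "5.1.0b"))  # True
```

Comparing tuples instead of raw strings avoids mistakes such as ranking "5.10.0" below "5.9.0".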
