Hello Experts!
For a few weeks now I've been getting the messages below in my vmkernel log file, repeated every 5 minutes. Is this a sign of bad things to come? The repeated messages are as follows:
Jul 17 22:04:00 MIGVMCL10 vmkernel: 24:11:20:55.153 cpu0:1024)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY
Jul 17 22:04:06 MIGVMCL10 vmkernel: 24:11:21:00.393 cpu0:1024)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK
Jul 17 22:04:06 MIGVMCL10 vmkernel: 24:11:21:00.393 cpu0:1024)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY
Jul 17 22:04:10 MIGVMCL10 vmkernel: 24:11:21:04.625 cpu0:1024)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK
Jul 17 22:04:10 MIGVMCL10 vmkernel: 24:11:21:04.625 cpu0:1024)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY
Jul 17 22:06:45 MIGVMCL10 vmkernel: 24:11:23:39.398 cpu0:1024)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK
Jul 17 22:06:45 MIGVMCL10 vmkernel: 24:11:23:39.398 cpu0:1024)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY
Jul 17 22:06:46 MIGVMCL10 vmkernel: 24:11:23:41.010 cpu0:1175)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK
Jul 17 22:06:46 MIGVMCL10 vmkernel: 24:11:23:41.010 cpu0:1175)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY
Jul 17 22:06:48 MIGVMCL10 vmkernel: 24:11:23:43.227 cpu0:1175)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK
Jul 17 22:06:48 MIGVMCL10 vmkernel: 24:11:23:43.227 cpu0:1175)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY
Jul 17 22:06:50 MIGVMCL10 vmkernel: 24:11:23:44.537 cpu0:1175)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK
Jul 17 22:06:50 MIGVMCL10 vmkernel: 24:11:23:44.537 cpu0:1175)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY
Jul 17 22:06:52 MIGVMCL10 vmkernel: 24:11:23:46.652 cpu0:1075)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK
Jul 17 22:06:52 MIGVMCL10 vmkernel: 24:11:23:46.652 cpu0:1075)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY
Jul 17 22:06:53 MIGVMCL10 vmkernel: 24:11:23:47.962 cpu0:1024)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK
Jul 17 22:06:53 MIGVMCL10 vmkernel: 24:11:23:47.962 cpu0:1024)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY
Jul 17 22:06:54 MIGVMCL10 vmkernel: 24:11:23:49.172 cpu0:1075)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK
Jul 17 22:06:54 MIGVMCL10 vmkernel: 24:11:23:49.172 cpu0:1075)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY
Jul 17 22:06:56 MIGVMCL10 vmkernel: 24:11:23:50.582 cpu0:1173)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK
Jul 17 22:06:56 MIGVMCL10 vmkernel: 24:11:23:50.582 cpu0:1173)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY
Jul 17 22:06:57 MIGVMCL10 vmkernel: 24:11:23:51.892 cpu0:1173)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK
Jul 17 22:06:57 MIGVMCL10 vmkernel: 24:11:23:51.892 cpu0:1173)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY
Jul 17 22:06:59 MIGVMCL10 vmkernel: 24:11:23:54.109 cpu0:1075)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK
-My setup consists of a 4-node ESX cluster sharing the same disks from an HP EVA4000. The above messages come from only 1 of the nodes; the other 3 ESX nodes report no such messages in /var/log/vmkernel.
-All disks presented to the ESX boxes are used as RDM disks running in Virtual mode, except for 2.
-As far as I can interpret them, there are no errors in the Command View logs for the EVA.
Could anybody shed some light on what could be causing these messages in the vmkernel log? Any thoughts would be greatly appreciated.
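To check how regular the bursts really are, the log lines can be grouped by timestamp. Here is a minimal Python sketch, not a definitive tool: the names `parse_linscsi` and `group_bursts`, the 30-second gap, and the `year` default are all assumptions made for illustration (syslog timestamps carry no year).

```python
import re
from datetime import datetime, timedelta

# Matches the syslog prefix and the LinSCSI message body seen in the log above.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+vmkernel:.*?"
    r"LinSCSI:\s+\d+:\s+(?P<msg>Forcing .+)$"
)


def parse_linscsi(lines, year=2007):
    """Yield (timestamp, message) for each LinSCSI 'Forcing ...' line.

    `year` is an assumed placeholder because syslog omits the year.
    """
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
            yield ts, m.group("msg")


def group_bursts(events, max_gap=timedelta(seconds=30)):
    """Group events into bursts; a new burst starts after a gap > max_gap."""
    bursts = []
    for ts, msg in events:
        if bursts and ts - bursts[-1][-1][0] <= max_gap:
            bursts[-1].append((ts, msg))
        else:
            bursts.append([(ts, msg)])
    return bursts
```

Feeding it the lines of /var/log/vmkernel and printing the first timestamp and length of each burst shows the actual spacing between bursts.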
Thanks,
Matt
I would check if the Fibre Channel links of the server are stable and I would check the host entry on the EVA whether it has the proper operating system assigned (='VMware' on current XCS firmware).
It looks like 'evaperf ps' lists a couple more 'Discard Frames' on FP1 each day than the day before.
Yesterday's evaperf 'ps' lists Discard Frames for FP1 at 18 and 34.
Today's evaperf 'ps' lists Discard Frames for FP1 at 20 and 38.
Does 'Discard Frames' in the 'evaperf ps' mean possible physical Fiber Channel line errors?
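To see whether a counter is really growing, successive samples can be diffed. A minimal sketch using the two readings quoted above; the keys `a` and `b` are placeholders, since the post does not say which port or controller each of the two values belongs to:

```python
def counter_deltas(previous, current):
    """Return the per-key increase between two counter snapshots."""
    return {key: current[key] - previous[key] for key in current if key in previous}


# The two 'Discard Frames' values listed for FP1 on each day.
yesterday = {"a": 18, "b": 34}
today = {"a": 20, "b": 38}

print(counter_deltas(yesterday, today))
```

With these readings the increases are 2 and 4 frames per day.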
Thanks,
Matt
I have done a little digging, but did not find an explanation of these counters.
Have you looked at the Fibre Channel switch counters yet?
How do you look at the Fibre Channel switch counters?
Well, that depends on the switch vendor.
For Brocade (this is an old, old model 2800) I use:
portFlags: 0x20041 PRESENT U_PORT LED
portType: 3.1
portState: 2 Offline
portPhys: 4 No_Light
portScn: 2 Offline
portRegs: 0x80030000
portData: 0x10308950
portId: 050000
portWwn: 20:00:00:60:69:xx:xx:xx
portWwn of the device(s) connected:
None
Distance: normal
Speed: 1Gbps
Interrupts: 102 Link_failure: 4 Frjt: 0
Unknown: 8 Loss_of_sync: 8 Fbsy: 0
Lli: 50 Loss_of_sig: 5
Proc_rqrd: 52 Protocol_err: 0
Timed_out: 0 Invalid_word: 0
Rx_flushed: 0 Invalid_crc: 0
Tx_unavail: 0 Delim_err: 0
Free_buffer: 0 Address_err: 0
Overrun: 0 Lr_in: 6
Suspended: 0 Lr_out: 6
Parity_err: 0 Ols_in: 6
Ols_out: 4
fcswa1:admin> porterrshow
         frames   enc   crc   too   too   bad   enc  disc  link  loss  loss  frjt  fbsy
       tx    rx    in   err  shrt  long   eof   out    c3  fail  sync   sig
---------------------------------------------------------------------------------------
 0:    49    52     0     0     0     0     0     7     0     4     8     5     0     0
 1:    37    39     0     0     0     0     0     7     0     6     6     5     0     0
 2:   145   156     0     0     0     0     0    24     0    10    24    13     0     0
 3:   156   156     0     0     0     0     0    24     0    12    24    13     0     0
 4:     0     0     0     0     0     0     0     0     0     0     0     1     0     0
 5:     0     0     0     0     0     0     0     0     0     0     0     1     0     0
 6:     0     0     0     0     0     0     0     0     0     0     0     2     0     0
 7:     0     0     0     0     0     0     0     0     0     0     0     1     0     0
 8:  632k  351k     0     0     0     0     0    23     0     0    26    12     0     0
 9:     0     0     0     0     0     0     0     0     0     0     0     1     0     0
10:     0     0     0     0     0     0     0     0     0     0     0     1     0     0
11:     0     0     0     0     0     0     0     0     0     0     0     1     0     0
12:     0     0     0     0     0     0     0     0     0     0     0     1     0     0
13:     0     0     0     0     0     0     0     0     0     0     0     1     0     0
14:  186m  264m     0     0     0     0     0  306k     0    42  327k   124     0     0
15:  271m  129m     1     1     2     0     2    11     0     1     4     4     0     0
fcswa1:admin>
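With many ports it can help to scan the porterrshow output automatically. Here is a rough Python sketch, under stated assumptions: the column names are my own labels for the two header rows, the `k`/`m` suffixes are taken to mean thousands and millions, and the threshold of 1000 is arbitrary. The function names are hypothetical helpers, not part of any Brocade tool.

```python
# Labels for the 14 counter columns, derived from the two header rows above.
COLUMNS = [
    "frames_tx", "frames_rx", "enc_in", "crc_err", "too_shrt", "too_long",
    "bad_eof", "enc_out", "disc_c3", "link_fail", "loss_sync", "loss_sig",
    "frjt", "fbsy",
]

# Assumed meaning of the abbreviated counters (k = thousands, m = millions).
SUFFIX = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}


def to_int(token):
    """Convert a counter such as '306k' or '42' to an approximate integer."""
    token = token.strip().lower()
    if token[-1] in SUFFIX:
        return int(float(token[:-1]) * SUFFIX[token[-1]])
    return int(token)


def parse_porterrshow(text):
    """Return {port: {column: value}} for every 'N: v1 v2 ...' data row."""
    ports = {}
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        if not sep or not head.strip().isdigit():
            continue  # header, separator, or prompt line
        values = rest.split()
        if len(values) != len(COLUMNS):
            continue  # unexpected layout; skip rather than guess
        ports[int(head)] = dict(zip(COLUMNS, map(to_int, values)))
    return ports


def suspicious(ports, fields=("crc_err", "enc_out", "loss_sync"), threshold=1000):
    """Return the ports whose selected counters reach the (arbitrary) threshold."""
    flagged = {}
    for port, counters in ports.items():
        high = {f: counters[f] for f in fields if counters[f] >= threshold}
        if high:
            flagged[port] = high
    return flagged
```

Run against the output above, only port 14 would be flagged, for its enc out and loss sync counters. Absolute values matter less than how fast a counter grows, so comparing two runs taken some time apart is more telling than a single snapshot.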
We use NetApp, which is not nearly as easy.
Is there a way to show bandwidth on a Fibre Channel link? I can see the frame rate, but I wish I had a way to tell which of our four 2 Gbit Fibre Channel cards are at their bandwidth limit (I doubt any are, but it would be nice to have numbers to support this).
Thanks.
Bugchk,
Port 14 on your Brocade has a really high reading in the 'loss sync' field... is that something for you to be concerned about?
I changed the HBA mezzanine card for the blade creating the errors in evaperf, but it didn't seem to help. I'm sure the FC lines are running clean, because the actual orange HBA cables coming out of the blade enclosure form a trunk that all blades communicate through, so all blades use the same HBA cables to the EVA. However, only one of the blades (node 3) causes the discard frames when traffic is generated. The 'porterrshow' command on the Brocade switch doesn't show any abnormally high counters for the problem blade compared with the rest of the blades. I can't think of anything else to troubleshoot.
Any Thoughts?
Thanks,
Matt
"... is that something for you to be concerned about?"
Nope, it's a test server with a QLogic adapter. Looks like it is jumpered so that the laser is not under firmware control. In that case the signal is invalid and confuses the Fibre Channel switch. Thanks for the info, though.
Do you use Virtual Connect in the Blade chassis?
Nope, no virtual connect. We have a Brocade 4/12 switch in the back of the enclosure.
Is the server an IBM xSeries server without an RSA card, i.e. with just the BMC? If so, do you have the Director Agents loaded?
I get these messages on all such servers, and have yet to find a fix. In my case, they repeat constantly and fill up /var...
Chris
We are using 5.0.4 firmware on a Brocade 4/24 switch in a c-Class enclosure (I made a mistake earlier; the Brocade wasn't a 4/12). I've tried swapping out hardware with no success. I will try the firmware update to 5.1.0b.