VMware Cloud Community
DrNickT
Contributor

After upgrading to Cisco UCS 2.2(1d), my VMware VMs are logging LSI errors in the Event Viewer

I upgraded my virtual environments to Cisco UCS Release 2.2(1d) over the past couple of weeks. We have had a few VMs freeze, and one error I have noticed in every Event Viewer is:

LSI_SAS

Reset to device, \Device\RaidPort0, was issued.

Some VMs log it more often than others. I went back through the history, and in each environment it only started after the upgrade to UCS 2.2. Is this a known issue? I am running ESXi 5.1 Update 2 on my hosts. Each host is a B200 M3.

Thanks

32 Replies
JPM300
Commander

Hey DrNickT

Since you can correlate the timing of the errors with the 2.2(1d) firmware upgrade, I would open a Cisco TAC case. It could be a combination of the 2.2(1d) firmware and some of the other firmware versions on the blades. I have opened other VMware-related cases with Cisco TAC for some weird Cisco UCS issues, and they have been very helpful.

More than likely, the 2.2(1d) firmware for the onboard LSI RAID controller isn't passing notifications up to VMware properly. When you go into the Hardware Status screen in the VI Client, does it show a warning there as well?

DrNickT
Contributor

Thanks!  I just opened a case with them. 

dhanarajramesh

Dear friend, may I know how you upgraded the firmware and which steps you followed? I had the same issue. What I did was disassociate the service profile and then re-associate it. Now it is functioning normally.

DrNickT
Contributor

Really? Exactly the same error? I created a new firmware package policy, associated it with my service templates, then restarted each host.

dhanarajramesh

Most of the time, whenever I do an upgrade through the service profile, the hardware does not pick up all of the information from the service profile, so I always disassociate and re-associate the service profile. Try it on one host if you have time and check the result. Most likely you will get the same advice from Cisco; I am speaking from experience. Please check the HCL as well.

joeboyd
Contributor

I'm having the exact same issue. I'm also at version 2.2(1d).

Did the firmware upgrade correct the problem, or did you re-associate the blades with the service profiles?

Do you have RDMs attached to your VMs? I do, and that seems to be where most of my issues are coming from.

What kind of storage are you using?

Thanks


Joe

DrNickT
Contributor

I just tried re-associating a blade to see if that makes a difference.  We may be having another fabric issue.  So once that issue is resolved, I want to see if the errors go down.

I don't have any RDMs. It's all EMC VNX storage.

joeboyd
Contributor

Thanks for the information. I am using a CLARiiON CX4 and plan on upgrading to VNX later this year, so I was hoping that, worst case, it would get corrected then, but that doesn't seem likely if you are seeing this on VNX.

Did you look at the DAVG values in esxtop? When this problem happens, ours go through the roof.

Also, if you look in the vmkernel logs, do you see a lot of fnic aborts?

I have cases open with VMware, Cisco, and EMC. If I don't get a solution soon, I'm going to get our VAR in here to look at it.

If I come up with anything, I'll be sure to let you know. I've been dealing with this for over a month now, and your post was the first I saw that looked anything like my problem, let alone exactly like it. So I know how frustrating this problem is.

JPM300
Commander

Hey Joeboyd,

I had a similar issue with a set of CLARiiON/VNX arrays where the DAVG was excessively high on some LUNs at times, in the range of 300 to 2,000. We never really got to the bottom of it, but it did lead to a crash when one of our SAN switches went down during a bad firmware upgrade. Because of the high DAVG, when the LUN tried to switch to the other path, it just timed out. If you do get to the bottom of it, I would be curious to know.

DrNickT
Contributor

For DAVG, what kind of numbers do you consider normal? And what counts as through the roof?

DrNickT
Contributor

Has anyone downgraded their UCS firmware to see if these errors go away?

joeboyd
Contributor

From what I understand, DAVG is considered normal under 20-25. Generally I see anywhere from 0.x to about 20. When this problem is occurring, it goes into the tens of thousands for that particular LUN for 10-20 seconds. I have been able to reproduce the problem by copying a large (3 GB) file to RDMs attached to some VMs. I can see the file copy freeze, the DAVG skyrocket, and the VM become unresponsive. Once the VM becomes responsive again, the file copy finishes and then the LSI_SAS device reset event shows up.

On your array, are your hosts set for active/active (ALUA) or active/passive? Are you running PowerPath/VE on your hosts, or are you using NMP set to Round Robin?

I ask because I found something very interesting today. On one host I uninstalled PowerPath, ran my test using NMP Round Robin, and saw exactly the same results. Then, in Storage Adapters, I changed that particular LUN to Fixed path and ran the test on each vmhba one at a time (setting each as preferred), performing the same file copy several times. Not once did the problem occur! It seems to be a multipathing problem.

On my array my hosts are set to active/active, so I'm not sure yet whether it will work properly if I change that to active/passive and run it in Round Robin. I'm testing that next. I talked to the VMware engineer I have been working this problem with, and he said it may be a problem on the array side. I also need to check compatibility for vSphere 5.1 Update 2, active/active ALUA, and UCS 2.2(1d) to make sure they should all work together.

Maybe upgrading to vSphere 5.5 will fix it? That may be my next step before considering downgrading the UCS firmware. I haven't had my environment running on anything other than 5.1 U2 and 2.2(1d), as I migrated from older Dell physical hosts running vSphere 4.1 directly to the B200 M3 hosts on 5.1 U2.
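
To make the DAVG figures in this post easier to compare, here is a minimal Python sketch that classifies latency samples and finds stall windows. It assumes the samples have already been collected as (timestamp in seconds, DAVG/cmd in milliseconds) pairs, for example from an esxtop batch capture. The function names and the 10,000 ms stall cutoff are illustrative assumptions taken from the "tens of thousands" description, not values from VMware or EMC.

```python
from typing import Iterable, List, Tuple

# Thresholds taken from the numbers quoted in this thread (assumptions, not vendor limits).
NORMAL_MAX_MS = 25.0     # upper end of the "normal under 20-25" range
STALL_MIN_MS = 10_000.0  # "tens of thousands" observed while a VM is frozen


def classify_davg(value_ms: float) -> str:
    """Label a single DAVG/cmd sample as normal, elevated, or stall."""
    if value_ms <= NORMAL_MAX_MS:
        return "normal"
    if value_ms >= STALL_MIN_MS:
        return "stall"
    return "elevated"


def find_stalls(samples: Iterable[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return (first_ts, last_ts) for each consecutive run of stall samples."""
    runs: List[Tuple[float, float]] = []
    start = None
    last = None
    for ts, davg in samples:
        if classify_davg(davg) == "stall":
            if start is None:
                start = ts
            last = ts
        elif start is not None:
            runs.append((start, last))
            start = None
    if start is not None:
        runs.append((start, last))
    return runs
```

Running `find_stalls` over samples captured during the large file copy would show whether each freeze lines up with a 10-20 second window like the one described above.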

DrNickT
Contributor

Right now one of my hosts shows a DAVG/cmd of 0.32 and 0.37 for the two HBAs.

We use PowerPath/VE 5.9 on all our hosts.

The VMware engineer I talked to said the same thing, so we are digging into the array side more.

joeboyd
Contributor

http://www.vmware.com/resources/compatibility/detail.php?deviceCategory=san&productid=4427&deviceCat...

Based on the link above, PowerPath/VE 5.7 is supported for vSphere 5.1 U2. I've read that 5.9 is too, but I may try downgrading PowerPath to 5.7 to see what happens.

joeboyd
Contributor

I just realized that link is for the CX4, which I have, and you are on VNX, so it may not apply.

DrNickT
Contributor

In your vmkernel log, are you seeing FC aborts like this?

vmkernel: cpu22:8378)<7>fnic : 1 :: Abort Cmd called FCID 0x7909ef, LUN 0x6 TAG a0 flags 3
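
To compare abort counts between hosts, a minimal Python sketch like the following could tally fnic aborts per FCID and LUN. The regular expression is written against the single log line quoted above, so it is an assumption that other fnic driver versions use the same wording; `count_aborts` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern derived from the one sample line in this post; other driver
# versions may format the message differently (assumption).
ABORT_RE = re.compile(
    r"fnic\s*:\s*\d+\s*::\s*Abort Cmd called "
    r"FCID (?P<fcid>0x[0-9a-fA-F]+), LUN (?P<lun>0x[0-9a-fA-F]+)"
)


def count_aborts(lines):
    """Count fnic abort messages, keyed by (FCID string, LUN number)."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            key = (match.group("fcid"), int(match.group("lun"), 16))
            counts[key] += 1
    return counts
```

Feeding it the lines of a vmkernel log (for example `count_aborts(open(path))`, where `path` is wherever the log was saved) gives a quick view of whether the aborts cluster on one target port or one LUN.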

joeboyd
Contributor

Yes, I get those messages.

What are you using for Fibre Channel switches?

What version of UCS firmware were you at before you upgraded to 2.2(1d)?

I am using two Nexus 5548s.

I'm at the point where I almost think there is some compatibility issue between vSphere 5.1 Update 2 and UCS firmware 2.2(1d). I may try upgrading both in the next few weeks: first VMware to 5.5, and then the UCS firmware if the VMware upgrade doesn't change anything, even though I'm not sure the UCS upgrade will do anything.

DrNickT
Contributor

Cisco MDS 9513s.

2.1(2a), I think.

I just upgraded all the enic/fnic drivers and the problem still exists.  If UCS is the issue, it is also affecting other hosts attached to the fabric. 

I plan on upgrading to 5.5 shortly also.  Let me know how the upgrade goes.

DrNickT
Contributor

Hey joeboyd,

On your ScsiDeviceIO errors, what hex codes are you getting? The two I see most often are:

ScsiDeviceIO: 2331: Cmd(0x4124003a6a80) 0x1a, CmdSN 0x362 from world 9497 to dev "naa.610f311e19f500001978dac706a94406" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0

ScsiDeviceIO: 2331: Cmd(0x4124003a6a80) 0x4d, CmdSN 0x361 from world 9497 to dev "naa.610f311e19f500001978dac706a94406" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0


http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=103038...

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=289902


So if I am reading these documents right, it should mean something like:


H:0x0 D:0x2 P:0x0


Host - 0x0 GOOD

Device - 0x2 CHECK CONDITION

Plugin - 0x0 GOOD

Sense Key - 0x5 ILLEGAL REQUEST


The part that confuses me is the additional sense data. What do 0x24 0x0 and 0x20 0x0 translate to?


All my errors seem to be 0x24 and 0x20. Just curious what you are seeing on yours.
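
For reference, the fields in those two log lines can be decoded mechanically. Below is a minimal Python sketch, assuming the standard SCSI code assignments published by T10 (operation codes, status codes, sense keys, and ASC/ASCQ pairs). The lookup tables are deliberately partial and cover only a few common values, the regular expression is written against the two lines quoted above, and `decode_scsi_error` is a hypothetical helper name.

```python
import re

# Partial tables based on the standard SCSI assignments (assumed, not from this thread).
OPCODES = {0x1A: "MODE SENSE(6)", 0x4D: "LOG SENSE"}
DEVICE_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY"}
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x2: "NOT READY",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
}
ADDITIONAL_SENSE = {
    (0x20, 0x00): "INVALID COMMAND OPERATION CODE",
    (0x24, 0x00): "INVALID FIELD IN CDB",
}

# Matches the ScsiDeviceIO format shown in the two sample lines.
LINE_RE = re.compile(
    r"Cmd\(0x[0-9a-f]+\) (?P<op>0x[0-9a-f]+),.*?"
    r"H:(?P<host>0x[0-9a-f]+) D:(?P<dev>0x[0-9a-f]+) P:(?P<plugin>0x[0-9a-f]+) "
    r"Valid sense data: (?P<key>0x[0-9a-f]+) (?P<asc>0x[0-9a-f]+) (?P<ascq>0x[0-9a-f]+)",
    re.IGNORECASE,
)


def decode_scsi_error(line):
    """Return a dict of decoded fields, or None if the line does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    f = {name: int(value, 16) for name, value in match.groupdict().items()}
    return {
        "opcode": OPCODES.get(f["op"], hex(f["op"])),
        "host_status": f["host"],
        "device_status": DEVICE_STATUS.get(f["dev"], hex(f["dev"])),
        "plugin_status": f["plugin"],
        "sense_key": SENSE_KEYS.get(f["key"], hex(f["key"])),
        "additional_sense": ADDITIONAL_SENSE.get(
            (f["asc"], f["ascq"]), "not in this partial table"
        ),
    }
```

Under those tables, the first line decodes as a MODE SENSE(6) command rejected with INVALID FIELD IN CDB, and the second as a LOG SENSE command rejected with INVALID COMMAND OPERATION CODE. Whether these rejections are related to the device resets is not established anywhere in this thread.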

