shankarsingh
Enthusiast

Hosts not responding / frozen after upgrade from ESXi 5.5 U3 to 6.5 U2


We recently upgraded ESXi 5.5 U3 to ESXi 6.5 U2 with the Cisco customized image on C240-M4S servers. We first upgraded the Cisco firmware from 2.0(6) to 4.0(1c) and then upgraded the ESXi hosts from 5.5 U3 to 6.5 U2. (Please see the attached text file for driver and firmware details before and after the upgrade.)

After the upgrade, hosts go into a not-responding/frozen state: the ESXi hosts remain reachable via ping over the network, but we are unable to reconnect them to vCenter.

While a host is in the not-responding state, we can log in via PuTTY with multiple sessions, but we can't run any commands (for example, df -h, or viewing logs under /var/log with cat). When we run df -h, the host doesn't display anything and the session hangs until we close PuTTY and reconnect.

While a host is not responding, VMs continue to run, but we can't migrate them to another host, and we are also unable to manage them via the vCloud panel.

We have to reboot the host to bring it back, after which it reconnects to vCenter.

We have been working with VMware and Cisco for three weeks now, with no resolution yet.

We can see a lot of "Valid sense data: 0x5 0x24 0x0" entries in vmkernel.log, and VMware suspects something with the LSI MegaRAID (MRAID12G) driver. VMware therefore asked us to contact the hardware vendor to check for hardware/firmware and LSI issues as well.

2019-02-18T19:51:27.802Z cpu20:66473)ScsiDeviceIO: 3001: Cmd(0x439d48ebd740) 0x1a, CmdSN 0xea46b from world 0 to dev "naa.678da6e715bb0c801e8e3fab80a35506" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0
This command failed 4234 times on "naa.678da6e715bb0c801e8e3fab80a35506"

Display Name: Local Cisco Disk (naa.678da6e715bb0c801e8e3fab80a35506)
Vendor: Cisco | Model: UCSC-MRAID12G | Is Local: true | Is SSD: false

Cisco did not see any issues with the server/hardware after analyzing the tech-support logs, and we also ran Cisco diagnostics tests on a few servers; all component tests/checks look good. The only recommendation from Cisco was to change the power management policy from Balanced to High Performance under ESXi host > Configure > Hardware > Power Mgmt > Active policy > High Performance.
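For reference, the same policy change can presumably also be made from the ESXi shell via the Power.CpuPolicy advanced option; this is a sketch only, with the option name and values assumed for 6.5, so verify them against your build before use:

```
# Assumed CLI equivalent of the vSphere Client path quoted above --
# the Power.CpuPolicy advanced option (verify the name on your build).
esxcli system settings advanced set --option=/Power/CpuPolicy --string-value="High Performance"
esxcli system settings advanced list --option=/Power/CpuPolicy   # confirm the change
```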

Can someone help me find the cause or a fix?

  Thanks in advance



Accepted Solutions
Madmax01
Expert

Once you switch back to megaraid_sas, you then need to disable the native driver (lsi_mr3). Just in case, here are the commands:

esxcli system module set --enabled=true --module=megaraid_sas

esxcli system module load --module=megaraid_sas -f

esxcli system module set --enabled=false --module=lsi_mr3

After a reboot you can then check with esxcfg-scsidevs -a.

If you have any Intel controller (SATA/SAS) that is not in use, it's a good idea to disable it in the BIOS, just to avoid spurious interrupts.

Hope this helps, fingers crossed 😉


best regards

Max

sk84
Expert

After the upgrade, hosts go into a not-responding/frozen state: the ESXi hosts remain reachable via ping over the network, but we are unable to reconnect them to vCenter.

What's the version of vCenter?

We can see a lot of "Valid sense data: 0x5 0x24 0x0" entries in vmkernel.log, and VMware suspects something with the LSI MegaRAID (MRAID12G) driver. VMware therefore asked us to contact the hardware vendor to check for hardware/firmware and LSI issues as well.

What's the exact product name of the LSI controller and the Cisco PID for it?

//EDIT

In addition, you can grep through vmkernel.log looking for lsi_mr3 and megaraid_sas driver issues:

grep 'lsi_mr3\|megaraid_sas' /var/log/vmkernel.log

Especially interesting are error events like these:

megaraid_sas: Event : Controller encountered a fatal error and was reset
megaraid_sas: Reset successful.

--- Regards, Sebastian | VCP6.5-DCV // VCP7-CMA // vSAN 2017 Specialist. Please mark this answer as 'helpful' or 'correct' if you think your question has been answered correctly.
shankarsingh
Enthusiast

Thanks for your reply .

Please find the details below:

vCenter build number: 9451637; version: vCenter Server 6.5 U2c

ESXi build number: 10884925; version: ESXi 6.5 P03

Product name: Cisco 12G SAS Modular RAID Controller

Product ID: LSI Logic

Product PID: UCSC-MRAID12G

Thanks

shankarsingh
Enthusiast

Hi,

I ran grep 'lsi_mr3\|megaraid_sas' /var/log/vmkernel.log and did not find any entries related to lsi_mr3.

I can only see numerous entries like: cpu3:66472)ScsiDeviceIO: 3001: Cmd(0x439552d82a40) 0x1a, CmdSN 0x4239d from world 0 to dev "naa" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

Has anyone experienced this issue, or can anyone help me identify it?

Thanks

sk84
Expert

Cmd(...) 0x1a ... from world 0 to dev "naa" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

According to the SCSI sense codes in this error message, I would assume there is a problem between the driver and the controller firmware.

Sense data 0x5 0x24 0x0 means an illegal request (sense key 0x5) due to an invalid field in the SCSI Command Descriptor Block (ASC 0x24).

See: https://www.virten.net/vmware/esxi-scsi-sense-code-decoder/?host=0&device=2&plugin=0&sensekey=5&asc=...

And this SCSI communication takes place between controller driver and controller firmware, both provided by Cisco (the controller driver is listed as async in the VMware HCL).

Based on this information, I therefore share VMware's view that Cisco needs to investigate and resolve this issue more closely.

And from personal experience I can say that we also had problems with storage components in Cisco UCS hardware in combination with ESXi and/or vSAN in the past. And I know the game with Cisco TAC where they say there is no problem with the hardware. Each time we unfortunately had to escalate these problems through our Cisco account manager to a different Cisco BU so the UCS engineers could investigate further. And each time it turned out to be a problem with the firmware, drivers, or chipset of the controllers (e.g. a faulty production batch at LSI).

--- Regards, Sebastian
shankarsingh
Enthusiast

Thanks for info.

FYI: the lsi_mr3 (LSI RAID controller) driver is running version 7.703.19.00-1OEM.650.0.0.4598673 with firmware 24.12.1-0411, both of which are supported by VMware and Cisco.

Thanks

sk84
Expert

This driver is displayed as async in the HCL. So it is not maintained by VMware, but by the manufacturer.

cisco-raid-controller-vmware-hcl.png

--- Regards, Sebastian
SureshKumarMuth
Commander

Are you using local storage or SAN ?

Regards, Suresh https://vconnectit.wordpress.com/
shankarsingh
Enthusiast

ESXi is installed on a local disk drive, and the OS loads from the local disk.

SureshKumarMuth
Commander

What about virtual machines ?

Regards, Suresh
shankarsingh
Enthusiast

Since we upgraded the ESXi hosts using the Cisco customized image, the drivers were installed from the Cisco OEM image.

If it is not maintained by VMware, is there any possibility of an issue?

sk84
Expert

If it is not maintained by VMware, is there any possibility of an issue?

The driver, firmware, and controller hardware are listed on the VMware HCL, so the combination is supported by VMware.

However, VMware does not test this combination of hardware and software itself. It only issues test specifications to the vendors; the vendors test according to these specifications and send the results back to VMware, which only checks the results against the test reports. If everything is fine, VMware adds the combination to the HCL.

Since the ESXi driver is maintained and provided directly by Cisco, the controller hardware and firmware come from Cisco, and Cisco is also responsible for the testing, the ball is clearly in Cisco's court when the SCSI communication between these components is problematic.

--- Regards, Sebastian
shankarsingh
Enthusiast

The VMs are on shared storage (FC storage).

SureshKumarMuth
Commander

When only ESXi resides on the local datastore, the LSI controller performs minimal read/write I/O on local storage, since the OS is already loaded into memory; reads/writes happen only for changes at the ESXi config level, or if your scratch partition is configured on local storage.

2019-02-18T19:51:27.802Z cpu20:66473)ScsiDeviceIO: 3001: Cmd(0x439d48ebd740) 0x1a, CmdSN 0xea46b from world 0 to dev "naa.678da6e715bb0c801e8e3fab80a35506" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0

Generally, failures of command 0x1a (MODE SENSE) against "naa.678da6e715bb0c801e8e3fab80a35506" can be ignored: they are just status checks. Here your hostd agent is freezing, which is what is causing the issue; if this were storage-related, you should also see read (0x28) and write (0x2a) execution failures.

Hence, I think something other than the LSI firmware/driver is causing the host to freeze.
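The distinction above can be checked mechanically by tallying which opcodes are actually failing. A self-contained sketch, where two sample log lines quoted in this thread stand in for the real /var/log/vmkernel.log:

```shell
# Tally failed SCSI opcodes in ScsiDeviceIO errors: 0x1a (MODE SENSE) is
# usually harmless status-probe noise, while 0x28/0x2a (READ/WRITE) point
# at real I/O trouble. Sample lines stand in for /var/log/vmkernel.log.
log=$(mktemp)
cat > "$log" <<'EOF'
2019-02-18T19:51:27.802Z cpu20:66473)ScsiDeviceIO: 3001: Cmd(0x439d48ebd740) 0x1a, CmdSN 0xea46b from world 0 to dev "naa.678da6e715bb0c801e8e3fab80a35506" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0
2019-03-23T04:19:19.576Z cpu11:66468)ScsiDeviceIO: 2980: Cmd(0x43955db34080) 0x28, CmdSN 0x1 from world 67662 to dev "naa.678da6e715bb0c801e8e3fab80a35506" failed H:0x0 D:0x8 P:0x0 Invalid sense data: 0x61 0x74 0x68.
EOF
# Extract the opcode that follows "Cmd(...)" and count occurrences per opcode
counts=$(sed -n 's/.*ScsiDeviceIO.*Cmd(0x[0-9a-f]*) \(0x[0-9a-f]*\),.*/\1/p' "$log" | sort | uniq -c)
echo "$counts"
```

On a host you would point the sed at /var/log/vmkernel.log instead; a tally dominated by 0x1a would support the view that the logged errors themselves are benign probes rather than failing reads/writes.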

1. In your case the VMs are working fine without any issues, which indicates that your HBA/iSCSI network is fine and there is no driver or firmware issue at the HBA/NIC level.

2. Since the VMs work fine and only the host is unmanageable via vCenter, the ESXi kernel is working fine; only the hostd agent is impacted.

3. hostd must be up and running to manage the ESXi host; if it goes down, you will see issues like the host disconnecting from vCenter, inability to execute commands over SSH, etc.

You have to check the vmkernel logs for the hostd non-responsive alert (e.g. "hostd detected to be non-responsive"), and analyze what was happening at the host level at that time that drove hostd into the unresponsive state.
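A minimal, self-contained sketch of that check; the sample ALERT line is quoted from the vmkwarning excerpt elsewhere in this thread, and on a host you would grep the live and rotated vmkernel/vmkwarning logs instead:

```shell
# Count hostd non-responsive alerts. A sample ALERT line from this thread
# stands in for the real log file.
log=$(mktemp)
printf '%s\n' '2019-03-22T23:45:01.405Z cpu34:2412446)ALERT: hostd detected to be non-responsive' > "$log"
hits=$(grep -c 'hostd detected to be non-responsive' "$log")
echo "$hits alert(s) found"
```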

Since you have a VMware support contract: the next time you face the issue, create a hostd live dump and send it to VMware for analysis if the logs don't give enough information to find the actual cause. Also ensure your hardware is supported for the 6.x version (I hope VMware and Cisco have already checked the support matrix and confirmed that ESXi 6.5 is supported on this hardware).

When your host freezes next time, contact VMware immediately to generate the dump and ask them to analyze it.

Regards, Suresh
shankarsingh
Enthusiast

Thanks for info

We already uploaded the vCenter and ESXi host logs (after rebooting the host). VMware says the host goes unresponsive due to too many of the ScsiDeviceIO errors below on the local disk drive, so they suspect either a driver or a firmware issue and asked us to check with the hardware vendor.

These commands appear in the logs a large number of times:

1904 times on "naa.abcd"

3448 times "naa.abcd"

2628 times on "naa.abcd"

1054 times on "naa.abcd"

1399 times on "naa.abcd"

4234 times on "naa.abcd"

VMware found a Cisco bug (https://quickview.cloudapps.cisco.com/quickview/bug/CSCuw38385) which indicates that if this error occurs a large number of times it can cause unresponsive hosts under the following conditions, but Cisco says the current model is not affected by any bug and the linked bug has been closed:

Server: C240-M4SX or C240-M4S.

OS: ESXi 5.5 and 6.0.

RAID Controller: Cisco 12G SAS Modular Raid Controller

SureshKumarMuth
Commander

The initial suspicion may be related to the driver; if Cisco confirms there is no issue at the driver/firmware level, then we have to prove it to them with the logs. I feel the next step at the ESXi level is to debug the hostd log to confirm what causes the hostd agent to go into the unresponsive state. If that points to the LSI, you can ask Cisco to re-check whether this is a new bug in the 6.5 driver/firmware.

Regards, Suresh
SureshKumarMuth
Commander

Sorry, I meant hostd dump analysis, not hostd logs.

Regards, Suresh
shankarsingh
Enthusiast

We are still working with Cisco and VMware; there is still no resolution or identified cause.

However, we made the changes below to try to isolate the issue, since the local disk drive seems to be driving the hosts into the not-responding state. VMware suspects an issue with either the boot disk or the lsi_mr3 driver/firmware, but Cisco did not find any known issue with the HDD or the firmware version. So we performed the following:

We configured the scratch partition to store logs on a dedicated LUN for further analysis.
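For reference, a sketch of how that scratch relocation can be done from the ESXi shell; the datastore name "LogLUN01" is hypothetical, and the change takes effect only after a reboot:

```
# Point the persistent scratch location at a directory on the dedicated LUN
# so logs survive a host freeze. "LogLUN01" is a hypothetical datastore name.
mkdir -p /vmfs/volumes/LogLUN01/.locker-esx01
vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /vmfs/volumes/LogLUN01/.locker-esx01
# reboot the host for the new scratch location to take effect
```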

We upgraded the Seagate HDD firmware from 0003/0004 to A005 (the ESXi hosts are Cisco C240-M4S).

Even after the HDD firmware upgrade we still had issues, so we ordered new Seagate disks and replaced all the old disks with new Seagate HDDs running firmware N0B1.

But we are still facing the host-not-responding issue.

During the last not-responding state, I analyzed the logs from the scratch partition and found local disk drive I/O errors:

vmkwarning.log:

2019-03-22T23:43:00.828Z cpu13:68580)WARNING: Partition: 1158: Partition table read from device naa.678da6e715bb0c801e8e3fab80a35506 failed: I/O error

2019-03-22T23:44:20.833Z cpu1:68583)WARNING: Partition: 1158: Partition table read from device naa.678da6e715bb0c801e8e3fab80a35506 failed: I/O error

2019-03-22T23:45:01.405Z cpu34:2412446)ALERT: hostd detected to be non-responsive

vmkernel.log:

2019-03-23T04:19:19.576Z cpu0:68539)WARNING: Partition: 1158: Partition table read from device naa.678da6e715bb0c801e8e3fab80a35506 failed: I/O error

2019-03-23T04:19:19.576Z cpu11:66468)NMP: nmp_ThrottleLogForDevice:3562: last error status from device naa.678da6e715bb0c801e8e3fab80a35506 repeated 456 times

2019-03-23T04:19:19.576Z cpu11:66468)NMP: nmp_ThrottleLogForDevice:3616: Cmd 0x28 (0x43955db34080, 67662) to dev "naa.678da6e715bb0c801e8e3fab80a35506" on path "vmhba4:C2:T0:L0" Failed: H:0x0 D:0x8 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:NONE

2019-03-23T04:19:19.576Z cpu11:66468)ScsiDeviceIO: 2980: Cmd(0x43955db34080) 0x28, CmdSN 0x1 from world 67662 to dev "naa.678da6e715bb0c801e8e3fab80a35506" failed H:0x0 D:0x8 P:0x0 Invalid sense data: 0x61 0x74 0x68.

2019-03-23T04:19:19.715Z cpu0:66468)NMP: nmp_ThrottleLogForDevice:3545: last error status from device naa.678da6e715bb0c801e8e3fab80a35506 repeated 10 times

Please find the attached logs for more info, and please help if you find any clue or resolution.

Madmax01
Expert

I'm not sure if it's the same issue on 6.5.

I was on 6.0 and had lots of issues with the lsi_mr3 driver: either sfcbd-watchdog got stuck and was not killable, or I got buffer errors once a bigger bunch of VMs was started after a host crash, or the system just got stuck after a while.

I also have an LSI controller, and for me the only stable driver was megaraid_sas. I had to disable lsi_mr3 and enable the legacy megaraid driver.

Since then everything has been running fine.


May be worth checking:

0x2  = STATUS CODE:CHECK CONDITION

0x5 = SENSE CODE:ILLEGAL REQUEST

0x24 = ADDITIONAL SENSE CODE:INVALID FIELD IN CDB

0x1a = COMMAND CODE:MODE SENSE(6)

If I understand this correctly, the controller doesn't understand the commands that are being tried.

Best regards

Max

shankarsingh
Enthusiast

Thanks for info.

We have an Avago (LSI / Symbios Logic) MegaRAID SAS Invader controller, installed with the 7.703.19.00-1OEM driver and firmware 24.12.1-0433.

Below are the driver details. I can at least try changing the driver to 6.610.16.00-1OEM.600.0.0.2494585 to see how the host responds.

May I know which version of the megaraid driver you enabled as the stable one, so I can check compatibility and try the change? And could you please let me know how to disable the current driver and enable the legacy megaraid driver?

[root:~] esxcli software vib list | grep mega

scsi-megaraid-sas              6.610.16.00-1OEM.600.0.0.2494585       Avago               VMwareCertified   2019-01-25
scsi-megaraid-mbox             2.20.5.1-6vmw.650.0.0.4564106          VMW                 VMwareCertified   2019-01-25
scsi-megaraid2                 2.00.4-9vmw.650.0.0.4564106            VMW                 VMwareCertified   2019-01-25
lsu-lsi-megaraid-sas-plugin    1.0.0-8vmw.650.1.26.5969303            VMware              VMwareCertified   2019-01-25

[root:~] esxcli software vib list | grep lsi

lsi-mr3                        7.703.19.00-1OEM.650.0.0.4598673       Avago               VMwareCertified   2019-01-25
lsi-msgpt35                    04.00.03.00-1OEM.650.0.0.4598673       Avago               VMwareCertified   2019-01-25
lsi-msgpt2                     20.00.01.00-4vmw.650.2.50.8294253      VMW                 VMwareCertified   2019-01-25
lsi-msgpt3                     16.00.01.00-1vmw.650.2.50.8294253      VMW                 VMwareCertified   2019-01-25
lsu-lsi-lsi-mr3-plugin         1.0.0-11vmw.650.2.75.10884925          VMware              VMwareCertified   2019-01-25
lsu-lsi-lsi-msgpt3-plugin      1.0.0-7vmw.650.1.26.5969303            VMware              VMwareCertified   2019-01-25
lsu-lsi-megaraid-sas-plugin    1.0.0-8vmw.650.1.26.5969303            VMware              VMwareCertified   2019-01-25
lsu-lsi-mpt2sas-plugin         2.0.0-6vmw.650.1.26.5969303            VMware              VMwareCertified   2019-01-25

vmhba4  lsi_mr3           link-n/a  sas.578da6e715bb0c80                    (0000:03:00.0) Avago (LSI / Symbios Logic) MegaRAID SAS Invader Controller

Key Value Instance: lsi_mr3-578da6e715bb0c80/LSI Incorporation

Listing keys:

Name: MR-DriverVersion

Type:   string

value:  7.703.19.00

Name:   MR-HBAModel

Type:   string

value:  Avago (LSI) HBA 1000:5d:1137:db

Name:   MR-FWVersion

Type:   string

value:  Fw Rev. 24.12.1-0433

Name: MR-ChipRevision

Type:   string

value:  Chip Rev. C0

Name:   MR-CtrlStatus

Type:   string

value:  FwState c0000000

Key Value Instance: MOD_PARM/qlogic

Listing keys:

Name:   DRIVERINFO

Type:   string

value:

Driver version 2.1.53.0
