VMware Cloud Community
shankarsingh
Enthusiast

Hosts go into a not responding / frozen state after upgrade from ESXi 5.5 U3 to 6.5 U2

We recently upgraded ESXi 5.5 U3 to ESXi 6.5 U2 with the Cisco customized image on C240-M4S servers. We first upgraded the Cisco firmware from 2.0(6) to 4.0(1c) and then upgraded the ESXi hosts from 5.5 U3 to 6.5 U2. (Please see the attached text file for driver and firmware details before and after the upgrade.)
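For context, an offline-bundle upgrade like this is usually run from the ESXi shell roughly as follows (the depot path and profile name below are placeholders, not our actual ones):

esxcli software sources profile list -d /vmfs/volumes/datastore1/cisco-custom-esxi-6.5u2.zip   # list the image profiles in the bundle
esxcli software profile update -d /vmfs/volumes/datastore1/cisco-custom-esxi-6.5u2.zip -p <profile-name>   # apply the chosen profile, then reboot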

After the upgrade, hosts go into a not responding / frozen state: the ESXi hosts remain reachable via ping over the network, but we are unable to reconnect them to vCenter.

While a host is in the not responding state, we can log in via PuTTY with multiple sessions, but we can't run any commands (for example df -h, or cat on the logs under /var/log). When we run df -h, the host displays nothing and the session hangs until we close PuTTY and reconnect.

While a host is in the not responding state, the VMs keep running, but we can't migrate them to another host and we are also unable to manage them through the vCloud panel.

We have to reboot the host to bring it back, after which it reconnects to vCenter.
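For anyone hitting the same thing: the usual first attempt before a full reboot is restarting the management agents from an SSH session, although in a hang like ours these restarts may get stuck as well:

/etc/init.d/hostd restart   # restart the host agent
/etc/init.d/vpxa restart    # restart the vCenter agent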

We have been working with VMware and Cisco for three weeks now, with no resolution yet.

We can see a lot of "Valid sense data: 0x5 0x24 0x0" entries in vmkernel.log, and VMware suspects something with the LSI MegaRAID (MRAID12G) driver, so VMware asked us to contact the hardware vendor to check for hardware/firmware issues and LSI issues as well.

2019-02-18T19:51:27.802Z cpu20:66473)ScsiDeviceIO: 3001: Cmd(0x439d48ebd740) 0x1a, CmdSN 0xea46b from world 0 to dev "naa.678da6e715bb0c801e8e3fab80a35506" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0
This command failed  4234 times on "naa.678da6e715bb0c801e8e3fab80a35506"

Display Name: Local Cisco Disk (naa.678da6e715bb0c801e8e3fab80a35506)
Vendor: Cisco | Model: UCSC-MRAID12G | Is Local: true | Is SSD: false
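(For reference, opcode 0x1a is MODE SENSE(6), and sense data 0x5/0x24/0x0 decodes to ILLEGAL REQUEST / INVALID FIELD IN CDB.) If anyone wants to check their own hosts, the matching entries can be counted and the affected device inspected roughly like this, using the device ID from the log above:

grep -c "Valid sense data: 0x5 0x24 0x0" /var/log/vmkernel.log              # count matching entries in the current log
esxcli storage core device list -d naa.678da6e715bb0c801e8e3fab80a35506     # show details for the affected local disk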

Cisco did not see any issues with the server/hardware after analyzing the tech-support logs, and we also ran Cisco diagnostics tests on a few servers; all component tests/checks look good. The only recommendation from Cisco was to change the power management policy from Balanced to High Performance under ESXi host -> Configure -> Hardware -> Power Management -> Active policy -> High Performance.
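If anyone prefers to apply that power policy change from the shell rather than the vSphere Client, it should also be possible through the /Power/CpuPolicy advanced option (please verify on your own build first); a minimal sketch:

esxcli system settings advanced set -o /Power/CpuPolicy -s "High Performance"   # switch from Balanced to High Performance
esxcli system settings advanced list -o /Power/CpuPolicy                        # confirm the active value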

Can someone help me find the cause / a fix?

Thanks in advance.


Madmax01
Expert

I have to say that I don't have Cisco hardware, but I saw quite similar painful issues with IBM/Supermicro once the native driver or older legacy ones were in use.

I currently have this legacy driver in use: version 6.612.07.00.

I also installed storcli, as it's very helpful.
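storcli on ESXi comes as a VIB; on my hosts it ends up under /opt/lsi/storcli (your path and VIB file name may differ, so treat the ones below as placeholders):

esxcli software vib install -v /vmfs/volumes/datastore1/vmware-esx-storcli-1.23.02.vib   # install the storcli VIB (placeholder path)
/opt/lsi/storcli/storcli /c0 show                # controller summary: firmware, virtual drives, state
/opt/lsi/storcli/storcli /c0 /eall /sall show    # state of every physical drive behind the controller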

I also disabled /etc/init.d/smartd, as it makes no sense for a RAID device; it is only useful when it talks directly to the disks.
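To stop smartd and keep it from coming back, something like this (chkconfig is present on my builds, but double-check yours):

/etc/init.d/smartd stop    # stop the SMART daemon now
chkconfig smartd off       # keep it disabled across reboots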

Are you using the LSI providers?
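A quick way to see which LSI pieces are actually on the host, as a sketch:

esxcli software vib list | grep -i lsi                 # any LSI CIM provider / driver VIBs installed?
esxcli system module list | grep -iE "lsi|megaraid"    # which storage modules are loaded and enabled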

Best regards

Max

shankarsingh
Enthusiast

Thanks for the response.

No, we are not using the LSI providers.

Madmax01
Expert

Once you switch back to megaraid_sas, you need to disable the native one. Just in case, I'll paste the commands:

esxcli system module set --enabled=true --module=megaraid_sas

esxcli system module load --module=megaraid_sas -f

esxcli system module set --enabled=false --module=lsi_mr3

After a reboot you can then check with esxcfg-scsidevs -a.
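For completeness, a small verification sketch after the reboot:

esxcfg-scsidevs -a                                           # the MRAID12G vmhba should now be claimed by megaraid_sas
esxcli system module list | grep -E "megaraid_sas|lsi_mr3"   # confirm which of the two modules is loaded and enabled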

If you have any Intel controller (SATA/SAS) that is not in use, it's a good idea to disable it in the BIOS, just to avoid unnecessary interrupts.

Hope this helps.

Fingers crossed 😉


best regards

Max

shankarsingh
Enthusiast

Finally, the issue has been fixed by enabling the megaraid_sas driver and disabling the native LSI (lsi_mr3) driver.

Thanks for your great help
