thatnetguy
Contributor

ESXi 7.0 U2 / HP Z6 G4 workstation / LSI 9341-8i - cannot get it to work

Time to punt and ask for help...

HP Z6 G4 workstation, dual Xeon processors (supported processors for ESXi 7), LSI 9341-8i, latest HP BIOS for this model (though the previous two BIOS versions showed the same issue). The card is flashed to the latest firmware. ESXi has been updated to U2d, and the latest VIB for the card is installed.

The install works fine and sees the LSI/Avago card and RAID volume. Note that I'm installing to an NVMe module, NOT to anything on the RAID card.

On reboot the system comes up, but the LSI card is now gone. The hardware is still visible (various SSH-based commands show it, and it even appears in the GUI PCI device list as a MegaRAID Fury controller), but there are boot/dmesg references from the megaraid driver to "FW in FAULT state" and "Diag reset adapter never cleared".
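For anyone following along, the quoted warnings can be pulled out of the boot log with a simple filter. A minimal sketch; the two messages quoted above are written to a sample file here, standing in for a real `dmesg` capture on the host:

```shell
# Write the two messages quoted above into a sample capture file,
# standing in for a real `dmesg` dump from the ESXi host.
cat > dmesg-sample.txt <<'EOF'
WARNING: lsi_mr3: mfiCheckFwReady:1961: megasas: FW in FAULT state!!
WARNING: lsi_mr3: mfiDoChipReset:4337: megaraid_sas: Diag reset adapter never cleared!
EOF

# Filter for the MegaRAID driver's fault/reset warnings.
grep -iE 'lsi_mr3|megaraid' dmesg-sample.txt | grep -iE 'fault|reset'
```

On the host itself, the same filter applied directly to `dmesg` output is what surfaced these lines for me.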

I have tried every BIOS combination I can related to Legacy, Secure Boot, and UEFI, since this seems to be what everyone points to: legacy on/Secure Boot off, legacy off/Secure Boot off, and legacy off/Secure Boot on. I also set the option ROM settings to either all-legacy or all-UEFI (that setting depends on the Secure Boot setting, so only some combinations can be set).

If I go pure legacy, the system boots, but I hit the firmware issue above and can't see (or create) a datastore on the RAID array. Other combinations of legacy/Secure Boot/UEFI either produce the same behavior OR leave the system stuck at the 'shutting down firmware services' part of the boot process.

I have booted two different Linux-based live CD distros on this same hardware. Both see the card, both see megaraid, and both report the firmware as good/online. I even used one of them to reformat the partition to NTFS just to be sure they were talking to the card/array properly.

I also reloaded and tested this with 6.7 U3. The same issue occurs, though I didn't experiment as much because I need to avoid 6.7 due to its EOL later this year.

So it seems there is some odd interaction between the card, the system BIOS, and ESXi. Maybe if it would just get past the 'shutting down firmware services' stage...

I reached out to an HP contact, but nothing yet. I may downgrade the BIOS to the oldest version I can find just to test.

The card is good: I tried the card and drives in an HP Z640. It initially had the same problem, but after tweaking the BIOS settings to UEFI it worked fine, and the controller firmware was reported as online/good. The same settings on the Z6 G4 don't seem to have the same effect, but those are different BIOS versions.

It's almost as if this ESXi version expects the card/interaction to be UEFI-controlled: if it isn't, ESXi reports the FW-not-ready error, but when everything is UEFI it freezes during boot. It seems to be a hard freeze: the drive light stays on and Num Lock stops responding.

Anyone done this? Does anyone know a magic combination of HP BIOS settings, something more to get past the stuck boot, or a specific BIOS version that works?

I'm sure someone will say "just use the Z640". I would, but it's out of warranty and the Z6 G4 is not.

3 Replies
ESXiClash
Enthusiast

Follow this article, specifically the storage section:

https://kb.vmware.com/s/article/1027206

Can you share the output?

thatnetguy
Contributor

To get it to boot at all, I had to go to legacy mode.
[root@localhost:~] lspci|grep -i mega
0000:15:00.0 RAID bus controller: Broadcom MegaRAID SAS Fury Controller [vmhba1]

[root@localhost:~] esxcfg-scsidevs -a
vmhba0 vmw_ahci link-n/a sata.vmhba0 (0000:00:11.5) Intel Corporation Wellsburg RAID Controller
vmhba2 nvme_pcie link-n/a pcie.1600 (0000:16:00.0) Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
[root@localhost:~] esxcli storage core adapter list
HBA Name Driver Link State UID Capabilities Description
-------- --------- ---------- ----------- ------------ -----------
vmhba0 vmw_ahci link-n/a sata.vmhba0 (0000:00:11.5) Intel Corporation Wellsburg RAID Controller
vmhba2 nvme_pcie link-n/a pcie.1600 (0000:16:00.0) Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
[Note that the RAID array isn't listed.]

[root@localhost:~] dmesg|grep -i lsi_mr3
2022-01-20T14:27:20.553Z cpu7:1049188)Loading module lsi_mr3 ...
2022-01-20T14:27:20.555Z cpu7:1049188)Elf: 2060: module lsi_mr3 has license ThirdParty
2022-01-20T14:27:20.562Z cpu7:1049188)lsi_mr3: 7.719.02.00
2022-01-20T14:27:20.562Z cpu7:1049188)Device: 211: Registered driver 'lsi_mr3' from 48
2022-01-20T14:27:20.562Z cpu7:1049188)Mod: 4789: Initialization of lsi_mr3 succeeded with module ID 48.
2022-01-20T14:27:20.562Z cpu7:1049188)lsi_mr3 loaded successfully.
2022-01-20T14:27:20.569Z cpu7:1049188)lsi_mr3: mfi_AttachDevice:863: mfi: Attach Device.
2022-01-20T14:27:20.569Z cpu7:1049188)lsi_mr3: mfi_AttachDevice:871: mfi: mfiAdapter Instance Created(Instance Struct Base_Address): 0x430a37602000
2022-01-20T14:27:20.569Z cpu7:1049188)lsi_mr3: mfi_SetupIOResource:379: mfi bar: 1.
2022-01-20T14:27:20.569Z cpu7:1049188)lsi_mr3: fusion_init:1688: RDPQ mode not supported
2022-01-20T14:27:20.570Z cpu7:1049188)lsi_mr3: fusion_init:1722: fusion_init Allocated MSIx count 2 MaxNumCompletionQueues 2
2022-01-20T14:27:20.570Z cpu7:1049188)lsi_mr3: fusion_init:1737: Dual QD exposed
2022-01-20T14:27:20.570Z cpu7:1049188)lsi_mr3: fusion_init:1794: maxSGElems 64 max_sge_in_main_msg 8 max_sge_in_chain 64
2022-01-20T14:27:20.570Z cpu7:1049188)lsi_mr3: fusion_init:1881: fw_support_ieee = 67108864.
2022-01-20T14:27:20.570Z cpu7:1049188)lsi_mr3: fusionIocInit:579: FW doest not support interrupt coalescing feature
2022-01-20T14:27:20.570Z cpu7:1049188)lsi_mr3: fusionIocInit:589: Driver doesn't enable interrupt coalescing
2022-01-20T14:27:21.570Z cpu7:1049188)WARNING: lsi_mr3: fusion_init:1888: Failed to Initialise IOC
2022-01-20T14:27:21.570Z cpu7:1049188)lsi_mr3: fusion_cleanup:1974: mfi: cleanup fusion.
2022-01-20T14:27:21.570Z cpu7:1049188)WARNING: lsi_mr3: mfi_FirmwareInit:2329: adapter init failed.
2022-01-20T14:27:21.570Z cpu7:1049188)WARNING: lsi_mr3: mfi_AttachDevice:915: mfi: failed to init firmware.
2022-01-20T14:27:21.570Z cpu7:1049188)lsi_mr3: mfi_FreeAdapterResources:680: mfi: destroying timer queue.
2022-01-20T14:27:21.570Z cpu7:1049188)lsi_mr3: mfi_FreeAdapterResources:691: mfi: destroying locks.
2022-01-20T14:27:21.570Z cpu7:1049188)WARNING: lsi_mr3: mfi_AttachDevice:948: Failed - Failure
2022-01-20T14:27:22.502Z cpu36:1049253)lsi_mr3: mfi_AttachDevice:863: mfi: Attach Device.
2022-01-20T14:27:22.502Z cpu36:1049253)lsi_mr3: mfi_AttachDevice:871: mfi: mfiAdapter Instance Created(Instance Struct Base_Address): 0x430a37602000
2022-01-20T14:27:22.502Z cpu36:1049253)lsi_mr3: mfi_SetupIOResource:379: mfi bar: 1.
2022-01-20T14:27:22.503Z cpu36:1049253)WARNING: lsi_mr3: mfiCheckFwReady:1961: megasas: FW in FAULT state!!
2022-01-20T14:27:22.503Z cpu36:1049253)WARNING: lsi_mr3: mfi_FirmwareInit:2319: FW not in READY state
2022-01-20T14:29:05.685Z cpu36:1049253)WARNING: lsi_mr3: mfiDoChipReset:4337: megaraid_sas: Diag reset adapter never cleared!
2022-01-20T14:29:05.685Z cpu36:1049253)WARNING: lsi_mr3: mfi_AttachDevice:915: mfi: failed to init firmware.
2022-01-20T14:29:05.685Z cpu36:1049253)lsi_mr3: mfi_FreeAdapterResources:680: mfi: destroying timer queue.
2022-01-20T14:29:05.685Z cpu36:1049253)lsi_mr3: mfi_FreeAdapterResources:691: mfi: destroying locks.
2022-01-20T14:29:05.685Z cpu36:1049253)WARNING: lsi_mr3: mfi_AttachDevice:948: Failed - Failure

Some of the other commands return nothing because the vmhba1 device isn't there.

Obviously it comes back to the firmware initialization. But this works in multiple other OSes, all of which report the FW as ready.

thatnetguy
Contributor

Tried this on another Z6 G4 with an older BIOS (from 2019). Same result. I can't get any of the older versions to try a downgrade because HP doesn't host them; they appear to get pulled (HP's site doesn't list any BIOS versions older than 2020). My HP contact couldn't give me a source for archived versions. I was going to see if I could go back to the first BIOS released for this hardware, but that doesn't seem to be an option.

I am now pretty convinced that this is some odd interaction between the BIOS/UEFI mode AND ESXi.

It boots if I set legacy (as much legacy as I can set), but then ESXi thinks there is an issue with the card's firmware.

If I enable UEFI instead of legacy, the system won't boot (it freezes at the point in the boot process where it shuts down firmware services), and no logs are generated that I can look at afterward.

I know others have encountered the freeze at firmware unload, but for them going to full legacy mode seems to have helped. With my config, it doesn't look like I can go fully legacy on this HP/BIOS combination. It's really a shame there isn't more from the VMware side about the freeze and what it's actually doing...

The fact that all of this works pretty much flawlessly with every other OS I have loaded on this hardware tells me there is an issue in the ESXi/driver/BIOS interaction.

Oh, and the freeze at 'shutting down firmware services' happens even with the LSI card removed, further indicating an issue between ESXi and the BIOS (when in UEFI mode).
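Since the freeze happens before anything is written to disk, about the only way I know to capture the vmkernel's last messages is to redirect logging to a serial port with boot options (press Shift+O at the boot loader). A sketch, assuming the board exposes a COM port; the option names come from VMware's serial-line-logging KB and are worth double-checking for your build:

```
debugLogToSerial=1 logPort=com1
```

I haven't tried this myself yet on the Z6 G4, so treat it as an idea rather than a confirmed workaround.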

I think I'll give up on this combination - too much time spent with nothing to show for it.

I've reported it to HP but don't really expect anything to come of it. Maybe I'll keep an eye out for new BIOS releases and try them. Or if VMware ever re-releases 7.0.3 (or newer), maybe I'll give that a try.
