VMware Cloud Community
YanSi
Enthusiast

Dell R730xd on VSAN 6.2 Boot Hung at "vmkfbft loaded successfully"

Hi Guys

I am facing an ESXi host hang during boot, at the stage "vmkfbft loaded successfully".

We have installed the lsi_mr3 drivers and also upgraded the RAID controller firmware, but we still face the same issue. If anyone has faced the same problem and resolved it, could you please help?

Thanks in advance.

Pressing ALT+F12 to look for error messages always displays: "lsi_mr3: fusionReset:2779: megraid_sas: Hardware critical error, returning FAILED"

[Attached image: QQ图片20160516164822.png]

My Hardware Environment:

BIOS 2.0.2

iDRAC/LCC 2.30.30

My Software Environment:

[root@esxi01:~] vmkload_mod -s lsi_mr3
vmkload_mod module information
input file: /usr/lib/vmware/vmkmod/lsi_mr3
Version: 6.903.85.00-1OEM.600.0.0.2768847
Build Type: release
License: GPLv2
Required name-spaces:
  com.vmware.vmkapi#v2_3_0_0
Parameters:
  lb_pending_cmds: int
    Change raid-1 load balancing outstanding threshold.Valid Values are 1-128. Default: 4
  msix_vectors: int
    MSI-X max vector count. Default: Set by FW
  disable_dual_qd: int
    Disable dual queue depths. Default: 0
  mfiDumpFailedCmd: int
    Hex dump of failed command in driver log
  max_sectors: int
    Maximum number of sectors per IO command
[root@esxi01:~]

[root@esxi01:~] vmkload_mod -s lsi_mr3 | grep Version
Version: 6.903.85.00-1OEM.600.0.0.2768847
[root@esxi01:~]
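For scripted checks, the bare driver version can be pulled out of the same output. This is a minimal sketch, assuming the `Version:` line keeps the `<version>-<build suffix>` layout shown above; `driver_version` is a hypothetical helper name, not a VMware tool.

```shell
# Hypothetical helper: reads `vmkload_mod -s <module>` output on stdin and
# prints the version without the suffix that follows the first "-".
driver_version() {
    awk '$1 == "Version:" { split($2, parts, "-"); print parts[1]; exit }'
}

# Example using the Version line from the output above.
printf 'Version: 6.903.85.00-1OEM.600.0.0.2768847\n' | driver_version
```

On a host this would be run as `vmkload_mod -s lsi_mr3 | driver_version`.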

[root@esxi01:~] for a in $(esxcfg-scsidevs -a |awk '{print $1}') ;do vmkchdev -l |grep $a ;done
0000:03:00.0 1000:005d 1028:1f47 vmkernel vmhba0
0000:00:11.4 8086:8d62 1028:0627 vmkernel vmhba1
0000:00:1f.2 8086:8d02 1028:0627 vmkernel vmhba2
[root@esxi01:~]
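The loop above pairs each adapter with its PCI IDs. If only the alias for one vendor:device pair is needed, a small filter over the `vmkchdev -l` output may be enough. This sketch assumes the column layout shown above (address, vendor:device, subsystem IDs, owner, alias); `hba_for_pci_id` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `vmkchdev -l` output on stdin and prints the
# alias (last column) of every device whose vendor:device ID equals $1.
hba_for_pci_id() {
    awk -v id="$1" '$2 == id { print $NF }'
}

# Example using the three lines from the output above.
printf '%s\n' \
    '0000:03:00.0 1000:005d 1028:1f47 vmkernel vmhba0' \
    '0000:00:11.4 8086:8d62 1028:0627 vmkernel vmhba1' \
    '0000:00:1f.2 8086:8d02 1028:0627 vmkernel vmhba2' |
    hba_for_pci_id 1000:005d
```

On a host this would be run as `vmkchdev -l | hba_for_pci_id 1000:005d`.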

[root@esxi01:~] vmware -v
VMware ESXi 6.0.0 build-3620759
[root@esxi01:~]
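A later reply in this thread cites ESXi 6.0.0 build 4192238 as the level Dell support asked for. A quick way to compare a host against a minimum build is sketched below, assuming the `build-<number>` format shown in the `vmware -v` output above; `esxi_build` and `build_at_least` are hypothetical helper names.

```shell
# Hypothetical helper: reads `vmware -v` output on stdin and prints only
# the numeric build.
esxi_build() {
    sed -n 's/.*build-\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: succeeds if the build read from stdin is >= $1.
build_at_least() {
    current=$(esxi_build)
    [ -n "$current" ] && [ "$current" -ge "$1" ]
}

# Example using the output above and the build number cited later in the thread.
if printf 'VMware ESXi 6.0.0 build-3620759\n' | build_at_least 4192238; then
    echo "build is at or above 4192238"
else
    echo "build is below 4192238"
fi
```

On a host this would be run as `vmware -v | build_at_least 4192238`.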

RAID Setting and Firmware Version:

[Attached images: 1.jpg, 2.jpg, 3.jpg, iDRAC_PERCH730PMini_Firmware_Information.png, 4.jpg]

26 Replies
marcseitz
Enthusiast

Hi again,

thanks for this message.

Unfortunately, this information came too late for us. At the moment I'm in the process of moving all data back to SAN storage.

We are going to disable VSAN completely!

Since we updated to ESXi 6.0 Update 2, we have been facing one problem after another.

VMware Support told us that it's a hardware issue. Dell Support told us we have to update ESXi to 6.0.0 build 4192238 (VMware KB 2144936) and disable T10 in the RAID controller.

So, we've....

- disabled the T10-Feature in the RAID-Controller via perccli

- updated ESXi to 6.0.0 4192238

- verified the timeout-settings for the disks

- verified the driver- and firmware version(s) of all components

=> Everything should be fine (confirmed by VMware and Dell support), and yet on Monday one host failed again...

So in my opinion, it's currently too risky to run VSAN in a production environment on ESXi 6.0 U2!

(I don't know if it is a VMware problem or a Dell problem.)

If you have more information about that, I'm really interested (maybe we can discuss it on a call...)!

Regards,

Marc

lkrishnarajpet
Contributor

Hi Marc,

My name is Lokesh Krishnarajpet from VSAN Engineering, and my email is lkrishnarajpet@vmware.com. I want to reach out to you regarding this issue. Can you please let me know the VMware Support Request number related to this issue? We will review the case and engage the required resources, and we can schedule a call if needed.

Lokesh

marcseitz
Enthusiast

Hi Lokesh,

thank you very much for that offer - I really appreciate that!!

I've sent a mail to you a few minutes ago!

Regards,

Marc

MichaelGi
Enthusiast

I also have issues with our VSAN cluster servers starting up. A server will hang on various services trying to start, and cold booting once or twice gets it to start. We are using HP ProLiant servers, and I didn't notice this happening until VSAN 6.2. SSD initialization also takes a very long time, which makes doing maintenance a nightmare.

admin
Immortal

This particular issue is not a problem for HP server and storage controllers.

--------

For those of you that suspect this is related to a problem you are seeing, you can narrow it down and help our support teams by doing the following as soon as possible at the time of the error:

1. Capture a screen shot of the alt+F11 log screen on the host console

NOTE: if you see a number of IO Abort, timeout messages, or hardware critical error on the alt+F11 screen, then continue

2. Get the host to boot, using the reset / cold boot method previously mentioned

3. Look through /var/log/vmkernel.log for "hardware critical error"; if you see this in the logs, continue.

4. Run vm-support and put the bundle somewhere safe so you can send it to VMware and Dell for support

5. Install latest perccli tools from Dell; see

6. collect hardware logs using perccli, per instructions in the above link

  cd /opt/lsi/perccli ; ./perccli /c0 show termlog

7. Save the hardware logs so you can send them to VMware and Dell for support

8. contact your vmware and Dell support representatives
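Step 3 above can be sketched as a small helper that counts matching lines, which makes the result easy to use in a script. `count_critical_errors` is a hypothetical name; the search string and the log path are the ones given in the steps, and the match is case-insensitive because the driver message quoted earlier in this thread capitalizes "Hardware".

```shell
# Hypothetical helper: prints how many lines of the log file $1 contain
# "hardware critical error" (case-insensitive).
count_critical_errors() {
    # grep -c still prints 0 when nothing matches, but exits non-zero;
    # "|| true" keeps the helper usable under `set -e`.
    grep -ci 'hardware critical error' "$1" || true
}

# Usage on a host (path taken from step 3):
#   count_critical_errors /var/log/vmkernel.log
```

A result greater than 0 corresponds to the "continue" case in step 3.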

Our support staff may have you install vsphere patches, BIOS updates, firmware updates, drivers, etc.  They may ask you to do a number of things to ensure your system is up to a known version state and if possible have you reproduce the issue, if it is feasible.  These things are important to ensure we haven't already fixed your problem in some update.

If you are not seeing the specific "hardware critical error" in the logs (step #3) then you may have a different issue, and you can save yourself some time by going through and making sure your vsphere patches, drivers and firmware are up to date with the HCL and Dell recommendations.

Once this issue is resolved on our end, it will go through certification process and you can expect a KB article, explaining this further.

We sincerely appreciate your patience and understanding while we fix this problem.

D.

admin
Immortal

Hi Guys

This is Shen from GSS China

My customer reported exactly the same issue: VSAN 6.2 boot stuck at vmkfbft. After my customer contacted Dell, Dell provided a new custom image, and the issue was fixed.

Here is my customer's hardware info and the ISO download link:

Key Value Instance: lsi_mr3-51866da0751f6600/LSI Incorporation
Listing keys:
Name: MR-DriverVersion
Type:   string
value:  6.903.85.00
Name:   MR-HBAModel
Type:   string
value:  Avago (LSI) HBA 1000:5d:1028:1f
Name:   MR-FWVersion
Type:   string
value:  Fw Rev. 25.4.1.0004
Name:   MR-ChipRevision
Type:   string
value:  Chip Rev. C0
Name:   MR-CtrlStatus
Type:   string
value:  FwState c0000000
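To pick a single value out of a listing like the one above, a small filter may help. This is a sketch that assumes the `Name:` / `Type:` / `value:` layout shown here and tolerates blank lines between entries; `keyval_get` is a hypothetical helper name, and the listing is expected on stdin.

```shell
# Hypothetical helper: reads a key-value listing on stdin and prints the
# value belonging to the key named in $1.
keyval_get() {
    awk -v want="$1" '
        $1 == "Name:"                 { key = $2 }
        $1 == "value:" && key == want { sub(/^[^:]*:[ \t]*/, ""); print; exit }
    '
}

# Example using two entries from the listing above.
printf '%s\n' \
    'Name: MR-DriverVersion' 'Type:   string' 'value:  6.903.85.00' \
    'Name:   MR-FWVersion'   'Type:   string' 'value:  Fw Rev. 25.4.1.0004' |
    keyval_get MR-FWVersion
```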

http://downloads.dell.com/FOLDER03955014M/1/VMware-VMvisor-Installer-6.0.0.update02-4192238.x86_64-D...

If your vendor is Dell, please contact Dell, which is the fastest way to fix this issue.

Thanks

Shen

Techie101
Contributor

I see that the VSAN HCL was updated for 6.0 U2, recommending driver lsi-mr3 version 6.904.43.00-1OEM.600.0.0.2768847 and FW version 25.5.0.0018, as stated on the VSAN HCL (VMware Compatibility Guide, I/O Device Search).

Is this issue addressed with these revisions?  I have seen this issue with the following configuration:
H730 mini - FW 25.4.1.0004 / driver lsi_mr3 version 6.903.85.00

esxcfg-advcfg -s 100000 /LSOM/diskIoTimeout

esxcfg-advcfg -s 4 /LSOM/diskIoRetryFactor

sas expander backplane FW version 3.31
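Checking an installed driver or firmware version against the HCL-recommended one comes down to comparing dotted version strings field by field. A minimal sketch, assuming all fields are numeric; `version_lt` is a hypothetical helper name, and the version numbers are the ones quoted in this post.

```shell
# Hypothetical helper: succeeds (exit 0) if dotted version $1 is strictly
# older than dotted version $2, comparing each field as a number.
version_lt() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = x[i] + 0; yi = y[i] + 0   # missing fields count as 0
            if (xi < yi) exit 0
            if (xi > yi) exit 1
        }
        exit 1                              # equal versions are not older
    }'
}

# Example: installed versions from this post versus the HCL values.
version_lt 6.903.85.00 6.904.43.00 && echo "driver is older than the HCL version"
version_lt 25.4.1.0004 25.5.0.0018 && echo "firmware is older than the HCL version"
```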
