VMware Cloud Community
YanSi
Enthusiast

Dell R730xd on VSAN 6.2 Boot Hung at "vmkfbft loaded successfully"

Hi Guys

I am facing an issue where the ESXi host hangs during boot at the "vmkfbft loaded successfully" stage.

We have installed the lsi_mr3 drivers and also upgraded the RAID controller firmware, but we are still facing the same issue. If anyone has faced the same problem and resolved it, could you please help?

Thanks in advance.

Pressing ALT+F12 to look for error messages always displays: "lsi_mr3: fusionReset:2779: megaraid_sas: Hardware critical error, returning FAILED"
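
To check how often this message shows up once a log is available, a minimal sketch that counts matching lines in a log file may help. The function name `count_critical_errors` is made up for illustration, and the default path `/var/log/vmkernel.log` is an assumption about where the vmkernel log lives on the host; pass another file (for example one extracted from a support bundle) as the first argument.

```shell
# Count lsi_mr3 "Hardware critical error" lines in a log file.
# count_critical_errors is a hypothetical helper name, not a VMware tool.
# The default log path is an assumption; pass your own file as $1.
count_critical_errors() {
    logfile="${1:-/var/log/vmkernel.log}"
    # grep -c prints the number of matching lines (0 if there are none)
    grep -c 'lsi_mr3.*Hardware critical error' "$logfile"
}
```

A non-zero count only confirms that the driver logged the error; it says nothing about the root cause.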

[Screenshot: QQ图片20160516164822.png]

My Hardware Environment:

BIOS 2.0.2

iDRAC/LCC 2.30.30

My Software Environment:

[root@esxi01:~] vmkload_mod -s lsi_mr3
vmkload_mod module information
input file: /usr/lib/vmware/vmkmod/lsi_mr3
Version: 6.903.85.00-1OEM.600.0.0.2768847
Build Type: release
License: GPLv2
Required name-spaces:
  com.vmware.vmkapi#v2_3_0_0
Parameters:
  lb_pending_cmds: int
    Change raid-1 load balancing outstanding threshold.Valid Values are 1-128. Default: 4
  msix_vectors: int
    MSI-X max vector count. Default: Set by FW
  disable_dual_qd: int
    Disable dual queue depths. Default: 0
  mfiDumpFailedCmd: int
    Hex dump of failed command in driver log
  max_sectors: int
    Maximum number of sectors per IO command
[root@esxi01:~]

[root@esxi01:~] vmkload_mod -s lsi_mr3 | grep Version
Version: 6.903.85.00-1OEM.600.0.0.2768847
[root@esxi01:~]

[root@esxi01:~] for a in $(esxcfg-scsidevs -a |awk '{print $1}') ;do vmkchdev -l |grep $a ;done
0000:03:00.0 1000:005d 1028:1f47 vmkernel vmhba0
0000:00:11.4 8086:8d62 1028:0627 vmkernel vmhba1
0000:00:1f.2 8086:8d02 1028:0627 vmkernel vmhba2
[root@esxi01:~]

[root@esxi01:~] vmware -v
VMware ESXi 6.0.0 build-3620759
[root@esxi01:~]

RAID Setting and Firmware Version:

[Screenshots: 1.jpg, 2.jpg, 3.jpg, iDRAC_PERCH730PMini_Firmware_Information.png, 4.jpg]

Linjo
Leadership

Did you open a support request with VMware and Dell?

There have been a lot of issues with this combo before so they would probably be able to help you out.

// Linjo

YanSi
Enthusiast

This problem occurs during the boot process.

Unfortunately, VMware GSS does not think the problem is VSAN.

They think it is a driver or firmware problem.

They have not raised the support level of the case.

depping
Leadership

What is the SR number?

YanSi
Enthusiast

VMware Support Request 16105201105

depping
Leadership

Thanks, I tried reading it but most of it is in Chinese, so I will ask one of my colleagues to have a look.

YanSi
Enthusiast

Hi Duncan Epping

Thank you for your support on this case.

My customer is paying close attention to this case.

I think it may be a driver or firmware issue.

We do not have sufficient technical capability.

So I cannot help the customer solve this problem.

My customer is in a hurry to solve this problem,

because their VSAN 6.2 will be used in a production environment.

We are just waiting for the manufacturers to release new firmware and drivers as soon as possible.

I don't know how this driver or firmware got VMware certified.

If you need any specific environment information, you can contact me directly.

My E-Mail: chen.yansi@msn.com

Deasion
Contributor

From the vmkernel log:

lsi_mr3 is the driver for the RAID controller.

The error it reports indicates that a hardware error was detected on the RAID card.

It looks like a Dell RAID card hardware issue. Have you opened a Dell case to follow up on this issue?

When the hardware hits an error, the firmware detects it and reports it to the driver. The driver then writes the error to the kernel log, because the driver is loaded as a module by the vmkernel.

As far as I know, the driver is provided by Dell, not VMware.

Asking Dell is the better way to find the root cause of why the driver detected the error.

admin
Immortal

1. The error is logged by the lsi_mr3 driver:

lsi_mr3 : fusionReset:2779 megaraid_sas Hardware critical error returning FAILED.

2. Which hardware critical error did the driver capture?

I strongly recommend collecting the Dell RAID card log and asking Dell to analyze it.

Avago_bootbank_lsi-mr3_6.903.85.00-1OEM.600.0.0.2768847:

  Name: lsi-mr3

  Version: 6.903.85.00-1OEM.600.0.0.2768847

  Type: bootbank

  Vendor: Avago

  Acceptance Level: VMwareCertified ----->>> this driver is provided by Dell

  Summary: Avago (LSI) Native MegaRAID SAS

  Description: Avago (LSI) Native MegaRAID SAS driver for vmkernel

  ReferenceURLs:

  Creation Date: 2016-04-25

  Depends: vmkapi_2_3_0_0

  Conflicts:

  Replaces:

  Provides:

  Maintenance Mode Required: False

  Hardware Platforms Required:

  Live Install Allowed: False

  Live Remove Allowed: False

  Stateless Ready: True

  Overlay: False

  Tags: module, driver

  Payloads: lsi-mr3

The HCL entry for this RAID card:

http://www.vmware.com/resources/compatibility/detail.php?deviceCategory=vsanio&productid=34857

On that page the driver type is "async", which means the driver is provided by Dell.

So we strongly recommend involving Dell to look into this issue.

YanSi
Enthusiast

Our PERC H730 series card has already been replaced, and the problem still occurs.

The driver is provided by vSphere Update Manager.

You cannot say that VMware has no responsibility.

How does VMware certify drivers?

Compatibility should be coordinated and tested with the vendors.

YanSi
Enthusiast

If I ask Dell to rewrite the driver and firmware, do you think they will pay attention to me?

Why don't the vendors guarantee compatibility to the customer?

I need to solve the technical problem, not the problem of how to open a case.

depping
Leadership

Dear Yansi, I understand your problem and concern, but at this point there isn't much anyone on the community forum can do for you unfortunately. We need the hardware logs to be filed and the VMware customer support contract details to be provided. Also, in the majority of these cases it is (SSD) firmware related, which Dell typically helps the customer with. Again, without all the details / logs it is impossible to solve the problem. Please provide these so we can help move this forward.

YanSi
Enthusiast

My Disk Model is:

[root@esxi01:~] esxcli storage core device list | egrep -i model

   Model: AL14SEB120N

   Model: AL14SEB120N

   Model: AL14SEB120N

   Model: AL14SEB120N

   Model: AL14SEB120N

   Model: AL14SEB120N

   Model: AL14SEB120N

   Model: AL14SEB120N

   Model: AL14SEB120N

   Model: AL14SEB120N

   Model: IDSDM

   Model: AL14SEB120N

   Model: AL14SEB120N

   Model: PX04SMB080

   Model: AL14SEB120N

   Model: AL14SEB120N

   Model: PX04SMB080

   Model: BP13G+EXP

   Model: AL14SEB120N

   Model: AL14SEB120N

[root@esxi01:~]
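
With this many devices the list is easier to read as a tally per model. Here is a minimal sketch, assuming the `Model:` lines look like the output above; `count_models` is a made-up name, and the function only reads text from standard input.

```shell
# Tally "Model:" lines read from standard input, most common model first.
# count_models is a hypothetical helper; it assumes lines shaped like
# "   Model: AL14SEB120N" as in the esxcli output above.
count_models() {
    awk -F': *' '/^ *Model:/ { n[$2]++ } END { for (m in n) print n[m], m }' |
        sort -rn
}

# Possible usage on the host:
#   esxcli storage core device list | count_models
```

Each output line is a count followed by the model string, which makes it quick to see how many HDDs and SSDs sit behind the controller.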

Firmware is DM05 (HDD) and AM04 (SSD).

[Screenshots: QQ图片20160531160004.png, QQ图片20160531160032.png]

HCL URL:

Product Id:PX04SMB080

http://www.vmware.com/resources/compatibility/detail.php?deviceCategory=ssd&productid=39769&vcl=true

Product Id: AL14SEB120N

http://www.vmware.com/resources/compatibility/detail.php?deviceCategory=hdd&productid=39402&vcl=true

YanSi
Enthusiast

Screenshots of the startup failures:

[Screenshots: 1.jpg, 2.jpg, 3.jpg, 4.jpg]

yangzhidong
Contributor

I had the same problem, but we didn't find any solution!

TonyLu0507
Contributor

I also encountered the same problem and hope there is a solution!!

admin
Immortal

Hey folks.  I just wanted you to know you are not alone in this problem; it is rare in the field and intermittent as far as we know.  I can't share details but we are actively engaged with our partners to root cause this issue.

I can't re-iterate enough how important it is to report this to Dell so they are aware of how critical and severe this issue is for you.

Here's what I know about how to clear the condition:

1. In some cases a warm reset is enough to clear the "hardware critical error" and the system will boot normally.

2. In other cases multiple warm resets (up to ~5) are required to clear it, and once it's clear it won't bother you again, unless you reboot :(

3. In some cases a full power cycle is required; a full power cycle is a graceful shutdown, wait 30s, power back on.
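
The three steps above can be summed up as a small decision helper. This is only a sketch: `next_recovery_step` is a hypothetical name, and the limit of 5 warm resets is taken from the "up to ~5 times" above, so it is approximate rather than a documented threshold.

```shell
# Suggest the next recovery action after N failed warm resets.
# next_recovery_step is a hypothetical helper; MAX_WARM_RESETS=5 is an
# approximation of the "up to ~5 times" mentioned above.
MAX_WARM_RESETS=5

next_recovery_step() {
    failed_resets="${1:-0}"
    if [ "$failed_resets" -lt "$MAX_WARM_RESETS" ]; then
        echo "warm reset"
    else
        # Full power cycle: graceful shutdown, wait 30s, power back on
        echo "full power cycle"
    fi
}
```

For example, `next_recovery_step 2` suggests another warm reset, while `next_recovery_step 5` suggests moving on to a full power cycle.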

Watch for updates to this thread, as I will update you when we can share more details.

I am following this thread as well.  If you are a new user experiencing this issue, chime in here but do contact Dell for support.

admin
Immortal

What did you replace your H730 controllers with?

Regarding "who is responsible for testing" and "why wasn't this caught in certification"?

There are more than just drivers and controllers at play in a system configuration such as this.

In the path of this issue are many things in addition to VMware:

device drivers -> IO controller HW -> IO controller firmware -> backplane HW -> backplane firmware -> drives -> drive firmware

Each of these paths has a unique relationship to potential issues.  The reason you choose a reputable OEM is to get support for these very tricky hardware paths.  On our side (VMware) we are investigating our potential play in this as well, and we are always looking for ways to improve your experience in both hardware and software.

Thank you,

D.

marcseitz
Enthusiast

Hi Guys,

We are facing the exact same issue. We are using a Dell R930 with dual PERC 730p controllers.

The issue just came up for the 4th time. We always have to reset the ESXi host (with only ~80 MSSQL VMs running on it :( ).

(It's no fun handling this failure with all the customers.)

I'm not sure VSAN is actually the best choice. In my opinion it's not ready for production use!!

Do you have any ideas on how to get the host back into vCenter so we can vMotion the VMs?

(VMware Support is unable to help in this case... we had opened some SRs in the past ...)

Regards,

Marc

admin
Immortal

We have identified the root cause of this issue with Dell, and testing for a fix is in progress.  I cannot provide any schedule details or other commitments regarding Dell testing or release plans.

What we have found is that this problem is intermittent and rare, but when it happens there is nothing VSAN can do to remedy the state of the controller, and a warm reset or cold boot is required to clear the condition; sometimes several times.  Once the condition is cleared, the system will boot and drives will be detected again by VSAN.  Unfortunately a successful boot doesn't mean the problem will not resurface, and we have not identified a workaround for avoiding this issue.

Please work with your Dell customer support path to ensure Dell is aware of your specific issue and can address your needs appropriately.

D.
