Hi guys,
I am facing an ESXi host hang during boot, at the stage where "vmkfbft loaded successfully" is displayed.
We have installed the lsi_mr3 drivers and also upgraded the RAID controller firmware, but we still face the same issue. If anyone has faced the same problem and resolved it, could you please help?
Thanks in advance.
Press ALT+F12 and look for error messages. It always displays "lsi_mr3: fusionReset:2779: megaraid_sas: Hardware critical error, returning FAILED".
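If the host does get through a boot, the same message can also be searched for in the kernel log from the ESXi shell. This is only a minimal sketch: the helper name count_lsi_critical is made up for illustration, and it assumes the log is at /var/log/vmkernel.log, which may differ on your build.

```shell
# Hypothetical helper (not part of ESXi): count lsi_mr3 "Hardware critical error"
# lines in a log file. The default path is an assumption; pass another path as $1.
count_lsi_critical() {
    log_file="${1:-/var/log/vmkernel.log}"
    # grep -c prints the number of matching lines (0 if there are none)
    grep -c 'lsi_mr3.*Hardware critical error' "$log_file"
}
```

Usage would be `count_lsi_critical` on the host, or `count_lsi_critical /path/to/copied/vmkernel.log` on a log extracted from a support bundle. A non-zero count only confirms that the message was logged; it does not say why the controller reported the error.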
My Hardware Environment:
BIOS 2.0.2
iDRAC/LCC 2.30.30
My Software Environment:
[root@esxi01:~] vmkload_mod -s lsi_mr3
vmkload_mod module information
input file: /usr/lib/vmware/vmkmod/lsi_mr3
Version: 6.903.85.00-1OEM.600.0.0.2768847
Build Type: release
License: GPLv2
Required name-spaces:
com.vmware.vmkapi#v2_3_0_0
Parameters:
lb_pending_cmds: int
Change raid-1 load balancing outstanding threshold.Valid Values are 1-128. Default: 4
msix_vectors: int
MSI-X max vector count. Default: Set by FW
disable_dual_qd: int
Disable dual queue depths. Default: 0
mfiDumpFailedCmd: int
Hex dump of failed command in driver log
max_sectors: int
Maximum number of sectors per IO command
[root@esxi01:~]
[root@esxi01:~] vmkload_mod -s lsi_mr3 | grep Version
Version: 6.903.85.00-1OEM.600.0.0.2768847
[root@esxi01:~]
[root@esxi01:~] for a in $(esxcfg-scsidevs -a |awk '{print $1}') ;do vmkchdev -l |grep $a ;done
0000:03:00.0 1000:005d 1028:1f47 vmkernel vmhba0
0000:00:11.4 8086:8d62 1028:0627 vmkernel vmhba1
0000:00:1f.2 8086:8d02 1028:0627 vmkernel vmhba2
[root@esxi01:~]
[root@esxi01:~] vmware -v
VMware ESXi 6.0.0 build-3620759
[root@esxi01:~]
RAID Setting and Firmware Version:
Did you open a support request with VMware and Dell?
There have been a lot of issues with this combo before so they would probably be able to help you out.
// Linjo
This problem occurs during the boot process.
Unfortunately, VMware GSS does not think the problem is VSAN.
They think it is a driver or firmware problem.
They have not raised the support level of the case.
What is the SR number?
VMware Support Request 16105201105
Thanks, I tried reading it but most of it is in Chinese, so I will ask one of my colleagues to have a look.
Hi Duncan Epping,
Thank you for your support on this case.
My customer is paying close attention to this case.
I think it may be a driver or firmware issue.
We do not have sufficient technical capacity,
so I cannot help the customer solve this problem.
My customer is in a hurry to solve this problem,
because their VSAN 6.2 is going to be used in a production environment.
We are just waiting for the manufacturers to release new firmware and drivers as soon as possible.
I don't know how this driver or firmware gets VMware certified.
If you need any specific environment information, you can contact me directly.
My E-Mail: chen.yansi@msn.com
From the vmkernel log:
lsi_mr3 is the driver that works for the RAID controller.
The error it reports means the driver detected a hardware error from the RAID card.
It should be a DELL RAID card hardware issue. Have you opened a DELL case to follow up on this issue?
As we know, when the hardware hits an error, the firmware detects it and reports it to the driver. The driver then writes the error to the kernel log, because the driver is loaded as a module by the vmkernel.
As far as I know, the driver is provided by DELL, not VMware.
Asking DELL is the better way to find the root cause of why the driver detected the error.
1. The error logged by the lsi_mr3 driver:
lsi_mr3: fusionReset:2779: megaraid_sas: Hardware critical error, returning FAILED
2. What hardware critical error did the driver capture?
I strongly recommend collecting the DELL RAID card log and asking DELL for analysis.
Avago_bootbank_lsi-mr3_6.903.85.00-1OEM.600.0.0.2768847:
Name: lsi-mr3
Version: 6.903.85.00-1OEM.600.0.0.2768847
Type: bootbank
Vendor: Avago
Acceptance Level: VMwareCertified   <<< this driver is provided by DELL
Summary: Avago (LSI) Native MegaRAID SAS
Description: Avago (LSI) Native MegaRAID SAS driver for vmkernel
ReferenceURLs:
Creation Date: 2016-04-25
Depends: vmkapi_2_3_0_0
Conflicts:
Replaces:
Provides:
Maintenance Mode Required: False
Hardware Platforms Required:
Live Install Allowed: False
Live Remove Allowed: False
Stateless Ready: True
Overlay: False
Tags: module, driver
Payloads: lsi-mr3
The HCL entry for this RAID card:
http://www.vmware.com/resources/compatibility/detail.php?deviceCategory=vsanio&productid=34857
On that page the driver type is "async", which means the driver is provided by DELL.
So we strongly recommend involving DELL to look into this issue.
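To check who actually ships the driver on a given host, the vendor and acceptance level can be pulled out of the VIB details. The sketch below assumes the listing quoted above came from `esxcli software vib get -n lsi-mr3` (the post does not show the command), and the helper name vib_field is made up for illustration.

```shell
# Hypothetical helper: print the value of one "Key: value" field from VIB details
# read on stdin. $1 is the exact field name, e.g. "Vendor" or "Acceptance Level".
vib_field() {
    awk -v key="$1" '
    {
        line = $0
        sub(/^[ \t]+/, "", line)           # drop leading indentation
        split(line, parts, ":")            # parts[1] is the field name
        if (parts[1] == key) {
            sub(/^[^:]*:[ \t]*/, "", line) # keep everything after the first colon
            print line
        }
    }'
}
```

On the host this could be used as `esxcli software vib get -n lsi-mr3 | vib_field "Vendor"`. For the listing in this thread the Vendor field reads Avago and the Acceptance Level reads VMwareCertified.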
Our PERC H730 series card has already been replaced, and the problem still occurs.
The driver is provided by vSphere Update Manager.
You cannot say that VMware has no responsibility.
How does VMware certify drivers?
Compatibility should be coordinated and tested with the vendors.
If I ask DELL to rewrite the driver and firmware, do you think they will pay attention to me?
Why doesn't the vendor guarantee compatibility to the customer?
I need to solve the technical problem, not the problem of how to open a case.
Dear Yansi, I understand your problem and concern, but at this point there isn't much anyone on the community forum can do for you unfortunately. We need the hardware logs to be filed and the VMware customer support contract details to be provided. Also, in the majority of these cases it is (SSD) firmware related, which Dell typically helps the customer with. Again, without all the details / logs it is impossible to solve the problem. Please provide these so we can help move this forward.
My Disk Model is:
[root@esxi01:~] esxcli storage core device list | egrep -i model
Model: AL14SEB120N
Model: AL14SEB120N
Model: AL14SEB120N
Model: AL14SEB120N
Model: AL14SEB120N
Model: AL14SEB120N
Model: AL14SEB120N
Model: AL14SEB120N
Model: AL14SEB120N
Model: AL14SEB120N
Model: IDSDM
Model: AL14SEB120N
Model: AL14SEB120N
Model: PX04SMB080
Model: AL14SEB120N
Model: AL14SEB120N
Model: PX04SMB080
Model: BP13G+EXP
Model: AL14SEB120N
Model: AL14SEB120N
[root@esxi01:~]
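A long model list like this is easier to compare against the HCL when it is collapsed into counts per model. The following is a minimal sketch; the function name tally_models is made up for illustration, and it assumes the "Model:" lines look like the ones above.

```shell
# Hypothetical helper: read device details on stdin and print "count model",
# most common model first.
tally_models() {
    awk -F': *' '
    /^[ \t]*Model:/ {
        model = $2
        sub(/[ \t]+$/, "", model)   # trim trailing whitespace
        count[model]++
    }
    END { for (m in count) print count[m], m }' | sort -rn
}
```

On the host it would be run as `esxcli storage core device list | tally_models`.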
Firmware is DM05 (HDD) and AM04 (SSD)
HCL URL:
Product Id: PX04SMB080
http://www.vmware.com/resources/compatibility/detail.php?deviceCategory=ssd&productid=39769&vcl=true
Product Id: AL14SEB120N
http://www.vmware.com/resources/compatibility/detail.php?deviceCategory=hdd&productid=39402&vcl=true
screenshot of startup failures
I had the same problem, but we didn't find any solution!
I also encountered the same problem and hope there is a solution!
Hey folks. I just wanted you to know you are not alone in this problem; it is rare in the field and intermittent as far as we know. I can't share details but we are actively engaged with our partners to root cause this issue.
I can't re-iterate enough how important it is to report this to Dell so they are aware of how critical and severe this issue is for you.
Here's what I know about how to clear the condition:
1. In some cases a warm reset is enough to clear the "hardware critical error" and the system will boot normally.
2. In other cases multiple warm resets (up to ~5) are required to clear it. Once it's clear, it won't bother you again unless you reboot.
3. In some cases a full power cycle is required: a graceful shutdown, a 30 s wait, then power back on.
Watch for updates to this thread, as I will update you when we can share more details.
I am following this thread as well. If you are a new user experiencing this issue, chime in here but do contact Dell for support.
What did you replace your H730 controllers with?
Regarding "who is responsible for testing" and "why wasn't this caught in certification"?
There are more than just drivers and controllers at play in a system configuration such as this.
In the path of this issue are many things in addition to VMware:
device drivers -> IO controller HW -> IO controller firmware -> backplane HW -> backplane firmware -> drives -> drive firmware
Each of these paths has a unique relationship to potential issues. The reason you choose a reputable OEM is to get support for these very tricky hardware paths. On our side (VMware) we are investigating our potential part in this as well, and we are always looking for ways to improve your experience in both hardware and software.
Thank you,
D.
Hi Guys,
We are facing the exact same issue. We are using a Dell R930 with dual PERC 730p controllers.
The issue just came up for the 4th time. We always have to reset the ESXi host (with only ~80 MSSQL VMs running on it).
(It's no fun handling this failure with all the customers.)
I'm not sure VSAN is the best choice at the moment. In my opinion it's not ready for production use!!
Are there any ideas on how to get the host back into vCenter so we can vMotion the VMs?
(VMware Support has been unable to help in this case; we have opened several SRs in the past.)
Regards,
Marc
We have identified the root cause of this issue with Dell, and testing for a fix is in progress. I cannot provide any schedule details or other commitments regarding Dell testing or release plans.
What we have found is that this problem is intermittent and rare, but when it happens there is nothing VSAN can do to remedy the state of the controller; a warm reset or cold boot is required to clear the condition, sometimes several times. Once the condition is cleared, the system will boot and the drives will be detected again by VSAN. Unfortunately, a successful boot doesn't mean the problem will not resurface, and we have not identified a workaround for avoiding this issue.
Please work with your Dell customer support path to ensure Dell is aware of your specific issue and can address your needs appropriately.
D.