Hi everybody!
I have a problem with VMware ESXi 6.0.
I have a VMware cluster with 3 ESXi 6.0 hosts. Yesterday evening 2 of the hosts became unresponsive. The affected hosts respond to ping, but they disconnect from vCenter, cannot be reached directly with the vSphere Client, and are unresponsive on the DCUI. The VMs running on the affected hosts also became unresponsive (VMware HA doesn't restart them, because the host still holds locks on the VM files). The only workaround is a hard reset of the hosts. After the hard reset, HA restarts the VMs on another host, and the affected host works normally again. The problem occurred during high I/O on the HBAs (file-level backup running inside a VM).
In /var/log/vmkernel.log I see a lot of these messages around the time of the "crash":
WARNING: lpfc: lpfc_sli_issue_abort:9956: 1:3169 Abort failed: Abort INP: Data: x0 xcd0 x8 x98
ScsiPath: 7133: Set retry timeout for failed TaskMgmt abort for CmdSN 0x0, status Failure, path vmhba5:C0:T0:L0
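For anyone who wants to quantify these messages, here is a minimal sketch that counts the lpfc abort failures and groups the failed TaskMgmt aborts by path. The function names are made up for illustration, and it assumes the shell provides grep, sed, sort and uniq (otherwise copy the log off the host and run it elsewhere).

```shell
# Count "Abort failed" warnings from the lpfc driver in a vmkernel log.
# The log path defaults to /var/log/vmkernel.log; pass another file as $1.
count_lpfc_aborts() {
    log="${1:-/var/log/vmkernel.log}"
    grep -c 'lpfc_sli_issue_abort.*Abort failed' "$log"
}

# Print each path named in a failed TaskMgmt abort, with a count per path.
aborts_per_path() {
    log="${1:-/var/log/vmkernel.log}"
    grep 'failed TaskMgmt abort' "$log" \
        | sed -n 's/.*path \(vmhba[0-9]*:C[0-9]*:T[0-9]*:L[0-9]*\).*/\1/p' \
        | sort | uniq -c
}
```

Running `aborts_per_path` with no argument reads the live log; if only one vmhba shows up, that may point at a single HBA or port rather than the driver as a whole.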
The hosts configuration:
Host type: IBM x3850 X5
VMware version: Lenovo Customized ESXi 6.0 + VMware ESXi 6.0 Express Patch 2
FC: 2 * Emulex LightPulse FC SCSI 10.4.236.0 (IBM 42D0494 8Gb 2-Port PCIe FC HBA for System x)
Emulex firmware version: 2.02X11
Emulex driver version: 10.4.236.0-1OEM.600.0.0.2159203
Host firmware versions are the latest.
ESXi is installed on a USB key (clean install, not upgraded); the log directory is on an FC datastore.
The storage and FC switches show no error or warning messages.
I have seen VMware KB 2086025 and 2125904. The symptoms in these KB articles are very similar to our situation, but our hosts have a newer Emulex driver version (the KB articles apply to versions earlier than 10.2.340.18; ours is 10.4.236.0).
I tried the latest Emulex firmware (version 10.6.126.0, installed and restarted the host), but the host became unresponsive again and the log shows the same messages as before.
Today there is a new problem: the host goes to a purple screen when I collect diagnostic info (Export Logs) from it.
I haven't found any solution.
Any ideas?
Thanks for your help!
Did you see the unresponsive-host issue before the 6.0 upgrade? If not, I strongly suspect this is due to "VMware KB: ESXi 6.0 network connectivity is lost with NETDEV WATCHDOG timeouts in the vmkernel.log". Try upgrading to the latest 6.0 release, which contains a fix for this.
The purple screen could be a different product bug, and I suspect it does not always happen. Based on the screenshot, it looks like log collection (vm-support) was going through some VSI nodes and failed at that stage, so that issue may occur only during certain vm-support invocations. It would be good to raise it with VMware support.
A couple of things come to mind:
This might also be a good time to open a support case with VMware if you can.
Did you get this fixed? We have similar issues with the same hardware that weren't resolved by the latest ESX 6.0 Update 1a.
Regards and TIA.
Hi Poort443!
We have had no problems for 3 weeks.
The changes (I'm not sure that all of the modifications are necessary):
esxcli software vib update -d "/vmfs/volumes/Datastore/DirectoryName/PatchName.zip"
esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5
esxcli system settings kernel set --setting=iovDisableIR -v TRUE
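Since it is unclear which of the three changes matters, it may help to record the current state before applying them. The following is only a sketch: the function name is made up, and it assumes the driver VIB name contains "lpfc" as in the versions quoted in this thread.

```shell
# Print the installed lpfc driver VIB and the current values of the two
# settings changed above. Read-only: nothing on the host is modified.
show_current_settings() {
    esxcli software vib list | grep -i lpfc
    esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5
    esxcli system settings kernel list -o iovDisableIR
}
```

Call `show_current_settings` on each host before and after the change so the values can be compared. As far as I know, a kernel setting such as iovDisableIR only takes effect after a reboot.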
Regards,
Hi Nower,
Did you ever resolve this issue? We have the same issue on HP hardware with Emulex LPe1605 HBAs.
Thanks
Regards,
Andreas
We are running the following Emulex HBA in our x3850 X5:
IBM 42D0494 LPe12000
VID: 10df
DID: f100
SVID: 10df
SSID: f100
firmware: 2.02x11
driver: lpfc
version: 11.0.237.0-1OEM.600.0.0.2768847
We are running ESXi 6.0 Update 2 (build 3620759), and it has been stable.
Obtained the driver from:
