VMware Cloud Community
nower
Contributor

ESXi 6.0 hosts become unresponsive

Hi everybody!

I have a problem with VMware ESXi 6.0.

I have a VMware cluster with 3 ESXi 6.0 hosts. Yesterday evening, 2 of the hosts became unresponsive. The affected hosts respond to ping, but they disconnect from vCenter, I cannot connect to them directly with the vSphere Client, and the DCUI is unresponsive. The VMs running on the affected hosts also became unresponsive (VMware HA cannot restart the VMs because the host still holds the locks on the VM files). The only workaround is a hard reset of the host. After the hard reset, HA restarts the VMs on another host and the affected host works normally again. The problem occurs during high I/O on the HBAs (file-level backup inside a VM).

In the /var/log/vmkernel.log I see a lot of messages at the "crash" time:

WARNING: lpfc: lpfc_sli_issue_abort:9956: 1:3169 Abort failed: Abort INP: Data: x0 xcd0 x8 x98

ScsiPath: 7133: Set retry timeout for failed TaskMgmt abort for CmdSN  0x0, status Failure, path vmhba5:C0:T0:L0
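To see whether these aborts cluster around the backup window, the log can be filtered and bucketed per minute. A quick sketch, to be run in the ESXi host shell; it assumes the standard `YYYY-MM-DDTHH:MM` timestamp prefix of `vmkernel.log` lines:

```shell
# Count lpfc abort / TaskMgmt abort messages per minute in vmkernel.log.
# cut -c1-16 keeps the "YYYY-MM-DDTHH:MM" prefix so uniq -c buckets by minute.
grep -E "lpfc_sli_issue_abort|TaskMgmt abort" /var/log/vmkernel.log \
  | cut -c1-16 | sort | uniq -c | sort -rn | head
```

If the busiest minutes line up with the file-level backup, that supports the high-I/O correlation.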

The hosts configuration:

Host type: IBM x3850 X5

VMware version: Lenovo Customized ESXi 6.0 + VMware ESXi 6.0 Express Patch 2

FC: 2 × Emulex LightPulse FC SCSI (IBM 42D0494 8Gb 2-Port PCIe FC HBA for System x)

Emulex firmware version: 2.02X11

Emulex driver version: 10.4.236.0-1OEM.600.0.0.2159203

All host firmware is at the latest version.

VMware is installed on a USB key (clean install, not upgraded); the log directory is on an FC datastore.

There are no error/warning messages on the storage or FC switch side.

I have seen VMware KB 2086025 and 2125904. The symptoms in these KB articles are very similar to our situation, but our hosts have a newer Emulex driver version (the KB articles apply to versions earlier than 10.2.340.18; ours is 10.4.236.0).

I tried the latest Emulex firmware (version 10.6.126.0, installed and restarted the host), but the host became unresponsive again and the log messages were the same as before.
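For reference, the currently installed lpfc driver VIB and the adapters it is bound to can be checked from the host shell. A sketch, assuming the standard esxcli namespaces on ESXi 6.0:

```shell
# Installed lpfc driver VIB (name and version as shipped by Emulex/Lenovo)
esxcli software vib list | grep -i lpfc

# HBA adapters and the driver bound to each (vmhba number, link state)
esxcli storage core adapter list
```

This makes it easy to confirm which driver version is actually loaded after a firmware or driver change.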

Today a new problem appeared while I was collecting diagnostic info (Export Logs) from the hosts:

  • first host: disconnected from vCenter for a few seconds 3 times (flapping), and the log download failed; while the host was disconnected, the VMs running on it did not respond on the LAN
  • second host: the log download started, then after 10 minutes the host crashed with a purple screen:

purple_screen.png

I didn't find any solution.

Any ideas?

Thanks for your help!

6 Replies
mcrape
Enthusiast

A couple of things come to mind:

  • Have you tried upgrading to ESXi 6.0 U1?
  • Have you tried the driver version recommended in the KB article (10.2.340.18)? Even though the problem should be fixed in the newer version, if you still see the error with the recommended version, you can hopefully rule out this KB as the cause.

This might also be a good time to open a support case with VMware if you can.

Techie01
Hot Shot

Have you seen the unresponsive-host issue before the 6.0 upgrade? If not, I strongly suspect this is the issue described in "VMware KB: ESXi 6.0 network connectivity is lost with NETDEV WATCHDOG timeouts in the vmkernel.log". Try upgrading to the latest 6.0 release, which contains a fix for this.

The purple screen could be a different product bug, and it may not happen every time. Based on the screenshot, log collection (vm-support) was walking through some VSI nodes and failed at that stage, so that issue may occur only during certain vm-support invocations. It would be worth raising a support request with VMware.

Poort443
Enthusiast

Did you get this fixed? We have similar issues with the same hardware that weren't resolved by the latest ESXi 6.0 Update 1a.

Regards and TIA.

nower
Contributor

Hi Poort443!

We have had no problems for 3 weeks.

The changes (I'm not sure all of them were necessary):

  • Patched the Lenovo Customized ESXi 6.0 to 6.0 U1a (VMware KB 2124669, as Techie01 wrote):

         esxcli software vib update -d "/vmfs/volumes/Datastore/DirectoryName/PatchName.zip"

  • VMware ESXi 5.5 U2 and later changed the VMFS heartbeat method; since we have IBM Storwize-based storage, we disabled the VAAI ATS heartbeat (VMware KB 2113956):

          esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5

  • Disabled interrupt remapping:

          esxcli system settings kernel set --setting=iovDisableIR -v TRUE

  • Restored the original Emulex driver from the Lenovo image (version: 10.4.236.0)
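To confirm that the two settings actually took effect after the reboot, they can be read back. A sketch, assuming the standard esxcli namespaces on ESXi 6.0:

```shell
# ATS heartbeat: Int Value should be 0 after the change
esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5

# Interrupt remapping: the configured value should be TRUE (i.e. disabled)
esxcli system settings kernel list -o iovDisableIR
```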

Regards,

andreasaster
Contributor

Hi Nower,

Did you ever resolve this issue? We have the same issue on HP hardware with Emulex LPe1605 HBAs.

Thanks

Regards,

Andreas

touimet
Enthusiast

We are running the following Emulex HBA in our x3850 X5:

IBM 42D0494 LPe12000

VID: 10df

DID: f100

SVID:10df

SSID: f100

firmware: 2.02x11

driver: lpfc

version: 11.0.237.0-1OEM.600.0.0.2768847

We are running ESXi 6.0 Update 2, build 3620759, and it has been stable.

Obtained the driver from:

Support Documents and Downloads
