VMware Cloud Community
Krede
Enthusiast

ESXi 5.1 hosts randomly disconnect from vCenter

Hi.

After installing vSphere vCenter and ESXi 5.1 (with the latest patches), we have started to see hosts randomly disconnecting from vCenter.

We have several different server models, but so far we have only seen the problem on our HP ProLiant BL465c Gen8 servers.

To work around this, we right-click the host in vCenter and click Reconnect, and sometimes we need to use the shell to restart the management agent.

Is anyone familiar with this issue?

59 Replies
asrarguna
Enthusiast

Hi Mark,

Thank you for your suggestions; they will help me a lot as well. Would it be possible for you to provide the HP case number so that we can refer to it when speaking with their tech support? I too have a call open with them.

Thanks,

AG

nirvy
Commander

Hi AG

I have not yet opened a call with HP.  It was VMware who told me that HP had confirmed (to them) that the latest NC550SFP firmware resolved the disconnection issues.

Cheers

Mark

Krede
Enthusiast

We are on version 4.1.450.7. If anyone dares to upgrade to the new version, please let us know the outcome :)

Gkeerthy
Expert

In my case, updating to the latest firmware did not solve the problem either. HP is still investigating the issue.

My versions are:

VMware ESXi 5.0.0 build-914586

NIC: vmnic14

Driver: be2net

Firmware Version: 4.2.401.605

Driver Version: 4.2.327.0

I am using HP G7 blades with Flex NICs, that is, the Emulex CNA.

These are the log entries I saw on the ESXi host:

2013-02-27T09:59:12.217Z cpu21:8213)be_get_stats_timer_handler: vmnic14 async ioctl timeout..

2013-02-27T09:59:12.545Z cpu40:12730)vmnic14: MBOX Timeout happened ...

vobd.log:2013-02-27T09:59:12.802Z: [netCorrelator] 98600724019us: [vob.net.vmnic.linkstate.down] vmnic vmnic19 linkstate down

vobd.log:2013-02-27T09:59:12.804Z: [netCorrelator] 98600724047us: [vob.net.vmnic.linkstate.down] vmnic vmnic18 linkstate down

vobd.log:2013-02-27T09:59:12.805Z: [netCorrelator] 98600724755us: [vob.net.vmnic.linkstate.down] vmnic vmnic17 linkstate down
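
To see how often each uplink drops, link-down entries like the ones above can be tallied per NIC. This is only a minimal Python sketch, assuming the vobd.log lines look exactly like the excerpt above; the function name `count_link_down` is made up for illustration.

```python
import re
from collections import Counter

# Matches link-down events in the format shown in the vobd.log excerpt above.
LINK_DOWN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T[\d:.]+Z): \[netCorrelator\] \d+us: "
    r"\[vob\.net\.vmnic\.linkstate\.down\] vmnic (?P<nic>vmnic\d+) linkstate down"
)

def count_link_down(lines):
    """Return a Counter mapping NIC name -> number of link-down events."""
    counts = Counter()
    for line in lines:
        match = LINK_DOWN.search(line)
        if match:
            counts[match.group("nic")] += 1
    return counts

# Two link-down lines from the excerpt plus one unrelated vmkernel line.
sample = [
    "vobd.log:2013-02-27T09:59:12.802Z: [netCorrelator] 98600724019us: "
    "[vob.net.vmnic.linkstate.down] vmnic vmnic19 linkstate down",
    "vobd.log:2013-02-27T09:59:12.804Z: [netCorrelator] 98600724047us: "
    "[vob.net.vmnic.linkstate.down] vmnic vmnic18 linkstate down",
    "2013-02-27T09:59:12.545Z cpu40:12730)vmnic14: MBOX Timeout happened ...",
]
for nic, events in sorted(count_link_down(sample).items()):
    print(nic, events)
```

To run it against a real log, pass an open file object instead of the sample list, since the function only needs an iterable of lines.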

Also, when you hit the NICs-down issue, check the output of:

ethtool -S vmnic14

rx_crc_errors: 80
rx_frame_errors: 738

tx_errors: 72018694161836

link_down_reason: 12884901888
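
If you collect this output from several NICs, a small script can pick out the counters worth looking at. The following is a rough Python sketch, assuming the `name: value` layout shown above; the helper names and the keyword list (`err`, `drop`, `link_down`) are my own choices and not anything defined by the driver.

```python
def parse_ethtool_stats(text):
    """Parse 'name: value' lines from `ethtool -S` output into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        # Skip headers and anything that is not a plain integer counter.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats

def nonzero_errors(stats, keywords=("err", "drop", "link_down")):
    """Return counters whose name contains a keyword and whose value is non-zero."""
    return {name: value for name, value in stats.items()
            if value and any(word in name for word in keywords)}

# The counters quoted above, used as sample input.
sample = """\
rx_crc_errors: 80
rx_frame_errors: 738
tx_errors: 72018694161836
link_down_reason: 12884901888
"""
for name, value in nonzero_errors(parse_ethtool_stats(sample)).items():
    print(name, value)
```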

Also open a case with VMware; there is an issue in the hardware/firmware.

Please don't forget to award point for 'Correct' or 'Helpful', if you found the comment useful. (vExpert, VCP-Cloud. VCAP5-DCD, VCP4, VCP5, MCSE, MCITP)
asrarguna
Enthusiast

Are there any updates on this? The same issue is going on here as well:

http://communities.vmware.com/thread/436713?start=0&tstart=0

nirvy
Commander

I've been running firmware 4.2.401.605 and driver 4.2.327.0 since Thursday. I have not had any further disconnects. If anything changes, I'll post back to this thread.

Krede
Enthusiast

No disconnects in the last week or two - haven't done anything yet.

On next disconnect we will upgrade firmware and driver!

Thanks everyone for sharing your experience.

nirvy
Commander

My environment suffered a massive network outage last night, on the 10Gb interfaces only. This has led to knock-on problems in my environment.

I can no longer recommend updating, at least until VMware figures out what caused this latest issue, though the only thing that changed was the firmware/driver patching I did on Thursday/weekend.

Disconnects were a major pain, but my situation is now a million times worse! :(

Krede
Enthusiast

Ouch!

Let us know if VMware/HP think it has anything to do with the new firmware/drivers.

cdc1
Expert

If you're using vNIC mode on the cards with tagged packets, turn it off.  There's a known issue with tagged packets "leaking" outside of the VLAN when the card is in vNIC mode.  You may not be experiencing that here, but just to be sure, thought I'd mention it.

asrarguna
Enthusiast

Finally I upgraded the firmware on 1 server 3 days ago. I now have the latest version of firmware and NIC driver:

driver: be2net

Driver version: 4.2.327.0

firmware-version: 4.2.401.605

As I mentioned, I am using HP BL685c blade servers, and today I got these alerts, in exactly this order, even after the firmware and NIC driver upgrade.

Host is not responding error (This causes Red Alarm on the host in vCenter).
3/10/2013 10:23:45 AM
VMs show disconnected (greyed out) and so does the esxi hosts.
Alarm 'Host connection and power state' on esxi-Host changed from Green to Red
info 3/10/2013 10:23:45 AM
Alarm 'Host memory usage' on esxi-Host changed from Green to Gray
info 3/10/2013 10:23:45 AM
Alarm 'Host cpu usage' on esxi-Host changed from Green to Gray
info 3/10/2013 10:23:45 AM
Alarm 'Host service console swap rates' on esxi-Host changed from Green to Gray
info 3/10/2013 10:23:45 AM
Alarm 'Network connectivity lost': an SNMP trap for entity esxi-Host was sent
info 3/10/2013 10:24:06 AM
Alarm 'Network connectivity lost' on esxi-Host triggered an action
info 3/10/2013 10:24:06 AM
Alarm 'Network uplink redundancy lost': an SNMP trap for entity esxi-Host was sent
info 3/10/2013 10:24:06 AM
Alarm 'Network uplink redundancy lost' on esxi-Host triggered an action
info 3/10/2013 10:24:06 AM
The hosts are back to normal after 3 or 4 seconds, and the VMs and hosts that were greyed out return to normal as well.
The firmware and NIC driver upgrade did not work.
Thanks,
AG

asrarguna
Enthusiast

I upgraded the firmware on the blade servers, updated the drivers, and patched ESXi using Update Manager. The hosts worked fine for 5 days, and today 3 out of 10 ESXi 5.1 hosts showed the error again: Host connection failure.

Thanks, AG

Krede
Enthusiast

Thanks for the update.

nirvy
Commander

I noticed the other day that HP have an earlier BIOS (4.1.450.1707) posted for the NC550 SFP card, dated 25 March. I haven't tried it yet, though; I'm a bit concerned by the fact that it's older than the 19 Feb version.

In any case I have partially solved my issue through the use of a replacement QLogic card (NC522 SFP).  Check out the dropped packet results (6 days of gathered stats):

[Attached image: dropped.png]

The host with 0 packets is, unsurprisingly, the one with the QLogic installed. I plan to replace the rest of my cards, since no fix is in sight yet for the Emulex line.

If anyone is interested, the PowerCLI code I'm using to pull out these stats is:

 

# Collect NIC statistics from every host in the cluster
$output = @()

foreach ($vmhost in (Get-Cluster -Name "<cluster_name>" | Get-VMHost)) {
    $esxcli = Get-EsxCli -VMHost $vmhost
    # NICs to query; adjust the names to match your uplinks
    foreach ($nic in "vmnic4", "vmnic5") {
        $output += $esxcli.network.nic.stats.get($nic) |
            Select-Object @{n="Host";e={$vmhost.Name}},*
    }
}

# Write the combined results to a CSV file
$path = "C:\test\results.csv"
$output | Export-Csv -Path $path -NoTypeInformation

Rebooting the hosts will clear the NIC stats.
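
Once the CSV exists, the rows can be summed per host to compare cards. Below is a hedged Python sketch of that step. The `Host` column comes from the PowerCLI snippet above, but the dropped-packet column names (`ReceivePacketsDropped`, `TransmitPacketsDropped`) are an assumption on my part, so check the header of your own results.csv and pass the real names in `columns`. The sample rows are invented, not real measurements.

```python
import csv
import io
from collections import defaultdict

def dropped_per_host(csv_file,
                     columns=("ReceivePacketsDropped", "TransmitPacketsDropped")):
    """Sum the given dropped-packet columns per host across all NIC rows."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        # Missing or empty cells count as zero.
        totals[row["Host"]] += sum(int(row.get(col) or 0) for col in columns)
    return dict(totals)

# Hypothetical sample data; the column names are assumptions (see above).
sample = io.StringIO(
    "Host,ReceivePacketsDropped,TransmitPacketsDropped\n"
    "esx01,120,0\n"
    "esx01,30,5\n"
    "esx02,0,0\n"
)
for host, total in sorted(dropped_per_host(sample).items()):
    print(host, total)
```

For the real file, open it with `open(r"C:\test\results.csv", newline="")` and pass the file object to `dropped_per_host`.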

Cheers

Mark

Krede
Enthusiast

Two of our hosts have been updated to:

NIC Firmware: 4.2.401.605

NIC driver: 4.4.231.0

Still seeing dropped received packets :(

TBKing
Enthusiast

I too am chasing random disconnects on new HP blades - BL460c Gen8

Flex-Fabric

ESXi5.0u1 - HP build ISO

Latest-greatest firmware, drivers

However, my symptoms point to name resolution: sometimes vCenter can resolve the ESX host name, sometimes not.

When the host drops, it still responds to a ping by IP.

I'm going to be working with our network guys to dig into this, assuming it really is a name resolution issue.

However, I haven't seen this issue on our Cisco UCS blades... which makes this thread all the more interesting.
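
One way to check whether lookups really fail intermittently is to resolve the host name in a loop from the vCenter server and count the failures. This is a minimal Python sketch of that idea, not a tested diagnostic; `probe_resolution` is a made-up helper and the `resolver` parameter exists only so a different lookup function can be swapped in. A local DNS cache on the machine may hide some failures.

```python
import socket
import time

def probe_resolution(hostname, attempts=10, delay=1.0,
                     resolver=socket.gethostbyname):
    """Resolve hostname repeatedly; return a list of (ok, detail) tuples."""
    results = []
    for attempt in range(attempts):
        try:
            results.append((True, resolver(hostname)))
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            results.append((False, str(exc)))
        if attempt < attempts - 1:
            time.sleep(delay)
    return results

def failure_count(results):
    """Count the failed lookups in a probe_resolution() result list."""
    return sum(1 for ok, _ in results if not ok)
```

For example, `failure_count(probe_resolution("esx01.example.local", attempts=60))` would report how many of 60 lookups failed over about a minute, where `esx01.example.local` is a placeholder for one of your own host names.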

DonalB
Enthusiast

I'm seeing something similar in our IBM PureFlex environment with the CN4054 CNA, which is Emulex OCE/be2net based. vCenter frequently loses heartbeat to hosts. vCenter is a VM running on one of the hosts, and VLAN tagging is in place, with the vCenter VM on a different VLAN from the vmkernel ports on the blade nodes. I haven't gotten around to digging into the dropped-packet stats yet, but I do see reports of missed heartbeats in vpxd.log on vCenter. The heartbeat traffic is UDP over 902 or 903, but, like TBKing, I can ping the hosts continuously with no loss even while the disconnect happens, which makes me wonder whether only UDP is affected (DNS is UDP 53). We are also using FCoE, which seems to be working fine. No major troubleshooting has gone on there yet, but I'm not seeing anything significant in the vmkernel logs to indicate storage disconnects.
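
To test the UDP-only theory separately from the real heartbeat, one option is a simple UDP echo probe between two machines on the VLANs involved. The Python sketch below rests on assumptions: it does not touch or emulate the vCenter heartbeat, the helper names are invented, and it needs a responder you start yourself (for example on a test VM) on a port of your choosing.

```python
import socket
import threading

def udp_loss(host, port, count=20, timeout=1.0, payload=b"probe"):
    """Send `count` UDP datagrams to an echo responder; return how many got no reply."""
    lost = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for seq in range(count):
            message = payload + b":" + str(seq).encode()
            try:
                sock.sendto(message, (host, port))
                # Skip late replies to earlier probes until ours arrives.
                while sock.recvfrom(2048)[0] != message:
                    pass
            except OSError:  # a timeout is also an OSError
                lost += 1
    return lost

def serve_echo(sock):
    """Echo every datagram received on a bound UDP socket back to its sender."""
    while True:
        try:
            data, addr = sock.recvfrom(2048)
        except OSError:
            return
        sock.sendto(data, addr)

def start_echo_server(bind_addr, port):
    """Bind a UDP socket and echo in a background thread; return the socket."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    threading.Thread(target=serve_echo, args=(sock,), daemon=True).start()
    return sock
```

Run `start_echo_server("0.0.0.0", 9999)` on the responder side (9999 is an arbitrary port) and call `udp_loss("<responder_ip>", 9999)` from the vCenter VM while a continuous ping is running. If UDP probes are lost while ICMP is not, that would point the same way as the missed heartbeats.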

Driver and Firmware versions are as recommended by Emulex/IBM:

Firmware: 4.4.180.3

Ethernet Driver: 4.2.327.0

FC Driver Kit: 8.2.4.141.55

I'll be doing some more investigation on it today and will report anything else I find.

Cheers

DB

jquest21
Contributor

I'm not sure all of this is 100% network related. There is an issue with HBAs disconnecting, and with 5.1, hosts that lose their connection to storage will also show as disconnected.

You need to enable SSH on the ESXi host and make sure the SSH port is open in the host's firewall. Then you can connect over SSH, for example with PuTTY.

Enter this command to show the current status:

esxcli system settings kernel list -o iovDisableIR

If it is FALSE, run the next command to set it to TRUE:

esxcli system settings kernel set --setting=iovDisableIR -v TRUE

Then reboot the host for the setting to take effect. You can re-run the first command after the reboot to verify the setting.

Please see VMware KB 1030265