Hi.
After installing vSphere vCenter and ESXi 5.1 (with latest patches), we are starting to see hosts disconnecting randomly from vCenter.
We have different server models, but so far we have only seen the problem on our HP ProLiant BL465c Gen8 servers.
To work around this, we right-click the host in vCenter and click Reconnect, and sometimes we need to use the Shell to restart the management agent.
Anyone familiar with this issue?
Hi Mark,
Thank you for your suggestions. They would help me a lot as well. Would it be possible for you to give us the HP case number so that we can refer to it when speaking with their tech support? I too have a call open with them.
Thanks,
AG
Hi AG
I have not yet opened a call with HP. It was VMware who told me that HP had confirmed (to them) that the latest NC550SFP firmware resolved the disconnection issues.
Cheers
Mark
We are on version 4.1.450.7. If anyone dares to upgrade to the new version, please let us know the outcome.
In my case, updating to the latest firmware didn't solve it either. HP is still investigating the issue.
My versions are:
VMware ESXi 5.0.0 build-914586
NIC: vmnic14
Driver: be2net
Firmware Version: 4.2.401.605
Driver Version: 4.2.327.0
I am using HP G7 blades with Flex NICs, that is, the Emulex CNA.
These are the logs I saw on the ESXi host:
2013-02-27T09:59:12.217Z cpu21:8213)be_get_stats_timer_handler: vmnic14 async ioctl timeout..
2013-02-27T09:59:12.545Z cpu40:12730)vmnic14: MBOX Timeout happened ...
vobd.log:2013-02-27T09:59:12.802Z: [netCorrelator] 98600724019us: [vob.net.vmnic.linkstate.down] vmnic vmnic19 linkstate down
vobd.log:2013-02-27T09:59:12.804Z: [netCorrelator] 98600724047us: [vob.net.vmnic.linkstate.down] vmnic vmnic18 linkstate down
vobd.log:2013-02-27T09:59:12.805Z: [netCorrelator] 98600724755us: [vob.net.vmnic.linkstate.down] vmnic vmnic17 linkstate down
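To pull just the link-down events out of a longer vobd.log, something along these lines may help. It is only a sketch: the regular expression is inferred from the three vobd.log lines quoted above, the function name link_down_events is made up for illustration, and other ESXi builds may format these lines differently.

```python
import re
from collections import Counter

# Pattern inferred from the vob.net.vmnic.linkstate.down lines quoted above
# (an assumption, not an official log format definition).
LINK_DOWN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T[\d:.]+Z).*"
    r"\[vob\.net\.vmnic\.linkstate\.down\]\s+vmnic\s+(?P<nic>vmnic\d+)\s+linkstate down"
)


def link_down_events(lines):
    """Return a list of (timestamp, nic) tuples, one per link-down event."""
    events = []
    for line in lines:
        match = LINK_DOWN.search(line)
        if match:
            events.append((match.group("ts"), match.group("nic")))
    return events


# Sample input: the log lines from the post above.
sample = [
    "2013-02-27T09:59:12.545Z cpu40:12730)vmnic14: MBOX Timeout happened ...",
    "vobd.log:2013-02-27T09:59:12.802Z: [netCorrelator] 98600724019us: [vob.net.vmnic.linkstate.down] vmnic vmnic19 linkstate down",
    "vobd.log:2013-02-27T09:59:12.804Z: [netCorrelator] 98600724047us: [vob.net.vmnic.linkstate.down] vmnic vmnic18 linkstate down",
    "vobd.log:2013-02-27T09:59:12.805Z: [netCorrelator] 98600724755us: [vob.net.vmnic.linkstate.down] vmnic vmnic17 linkstate down",
]

events = link_down_events(sample)
for timestamp, nic in events:
    print(timestamp, nic)

# Count events per NIC to see which uplinks flap most often.
print(Counter(nic for _, nic in events))
```

Seeing several vmnics go down within a few milliseconds of each other, as here, would point at the shared adapter rather than a single port.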
Also, when you hit the NICs-down issue, check the output of the command below:
ethtool -S vmnic14
rx_crc_errors: 80
rx_frame_errors: 738
tx_errors: 72018694161836
link_down_reason: 12884901888
Also open a case with VMware; there seems to be some issue in the hardware/firmware.
Are there any updates on this? The same issue is going on here as well:
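If you save the ethtool -S output from several hosts, a short script can flag the counters worth looking at. This is a rough sketch, assuming the plain "name: value" layout shown above. The function names and the keyword list (error, drop, crc) are my own choices for illustration, not anything defined by ethtool or the be2net driver.

```python
def parse_ethtool_stats(text):
    """Parse 'name: value' lines from `ethtool -S` output into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Skip headers and any line without a plain integer value.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def suspicious_counters(stats, keywords=("error", "drop", "crc")):
    """Return non-zero counters whose name contains one of the keywords."""
    return {
        name: value
        for name, value in stats.items()
        if value > 0 and any(word in name.lower() for word in keywords)
    }


# Sample input: the counters quoted in the post above.
sample = """\
rx_crc_errors: 80
rx_frame_errors: 738
tx_errors: 72018694161836
link_down_reason: 12884901888
"""

flagged = suspicious_counters(parse_ethtool_stats(sample))
for name, value in flagged.items():
    print(f"{name}: {value}")
```

Counters such as link_down_reason are not flagged by the keyword filter, so extend the keyword list if you want to track them too.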
http://communities.vmware.com/thread/436713?start=0&tstart=0
I've been running firmware 4.2.401.605 and driver 4.2.327.0 since Thursday. I have not had any further disconnects. If something changes, I'll post back to this thread.
No disconnects in the last week or two - haven't done anything yet.
On next disconnect we will upgrade firmware and driver!
Thanks everyone for sharing your experience.
My environment suffered a massive network outage last night, on the 10Gb interfaces only. This has led to knock-on problems in my environment.
I can no longer recommend updating, at least until VMware figures out what caused this latest issue... though the only thing that changed was the firmware/driver patching I did on Thursday and over the weekend.
Disconnects were a major pain, but my situation is now a million times worse!
Ouch!
Let us know if VMware / HP think it has anything to do with the new firmware / drivers.
If you're using vNIC mode on the cards with tagged packets, turn it off. There's a known issue with tagged packets "leaking" outside of the VLAN when the card is in vNIC mode. You may not be experiencing that here, but just to be sure, thought I'd mention it.
Finally I upgraded the firmware on 1 server 3 days ago. I now have the latest version of firmware and NIC driver:
driver: be2net
Driver version: 4.2.327.0
firmware-version: 4.2.401.605
As I mentioned, I am using HP BL685c blade servers, and today I got these alerts in the exact order, even after the firmware and NIC driver upgrade.
Emulex has released "another" new driver: 4.4.231.0
Upgraded the firmware on the blade servers. Updated the drivers. Did ESXi patching using Update Manager. The hosts worked fine for 5 days, and today 3 out of 10 ESXi 5.1 hosts showed the error again: Host connection failure.
Thanks, AG
Thanks for the update.
I noticed the other day that HP have an earlier BIOS (4.1.450.1707) posted for the NC550 SFP card, dated 25 March. I haven't tried it yet though. I'm a bit concerned by the fact that it's older than the 19 Feb version.
In any case I have partially solved my issue through the use of a replacement QLogic card (NC522 SFP). Check out the dropped packet results (6 days of gathered stats):
The host with 0 packets is unsurprisingly the one with the QLogic installed. I plan to replace the rest of my cards, since no fix is in sight yet for the Emulex line.
If anyone is interested, the PowerCLI code I'm using to pull out these stats is:
# Collect NIC stats for vmnic4 and vmnic5 from every host in the cluster
$output = @()
foreach ($hostname in (Get-Cluster -Name "<cluster_name>" | Get-VMHost)) {
    # esxcli handle for this host
    $esxcli = Get-EsxCli -VMHost $hostname
    # Prepend a Host column so rows from different hosts can be told apart
    $output += $esxcli.network.nic.stats.get("vmnic4") | Select-Object @{n="Host";e={$hostname.Name}},*
    $output += $esxcli.network.nic.stats.get("vmnic5") | Select-Object @{n="Host";e={$hostname.Name}},*
}
# Write everything to a single CSV file
$path = "C:\test\results.csv"
$output | Export-Csv -Path $path -NoTypeInformation
Rebooting the hosts will clear the NIC stats.
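Once the CSV exists, totalling the drops per host is a few lines of Python. Treat this as a sketch under assumptions: the Host column comes from the PowerCLI script above, but the dropped-packet column name used here (ReceivePacketsDropped) is a guess, so check the header row of your own results.csv and pass the real name. The function name and the sample rows are invented purely for illustration and are not real measurements.

```python
import csv
import io
from collections import defaultdict


def dropped_by_host(csv_file, drop_column):
    """Sum drop_column per Host and return (host, total) pairs, worst first."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[row["Host"]] += int(row[drop_column] or 0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up sample rows, two NICs per host, standing in for results.csv.
# The column name ReceivePacketsDropped is an assumption.
sample_csv = io.StringIO(
    "Host,ReceivePacketsDropped\n"
    "esx-a,120\n"
    "esx-a,30\n"
    "esx-b,0\n"
    "esx-b,0\n"
)

ranking = dropped_by_host(sample_csv, "ReceivePacketsDropped")
for host, total in ranking:
    print(host, total)
```

For a real file, open it with open(r"C:\test\results.csv", newline="") and pass that file object in place of sample_csv.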
Cheers
Mark
Has anyone tried the new firmware, OneConnect-Flash-4.1.450.1707?
Two of our hosts have been updated to:
NIC Firmware: 4.2.401.605
NIC driver: 4.4.231.0
Still seeing dropped received packets.
I too am chasing random disconnects on new HP blades - BL460c Gen8
Flex-Fabric
ESXi 5.0 U1 - HP build ISO
Latest-greatest firmware, drivers
However, my symptoms look like name resolution: sometimes vCenter can resolve the ESX host name, sometimes not.
When the host drops, it still responds to a ping by IP.
I'm going to be working with our network guys to dig into this, assuming it really is a name resolution issue.
However, I haven't seen this issue on our Cisco UCS blades... which makes this thread all the more interesting.
I'm seeing something similar in our IBM PureFlex environment with the CN4054 CNA, which is Emulex OCe/be2net based. vCenter frequently loses heartbeat to hosts. vCenter is a VM running on one of the hosts, with VLAN tagging in place and the vCenter VM on a different VLAN from the vmkernel ports on the blade nodes.
I haven't had a chance to dig into the dropped packet stats yet, but I do see missed heartbeats reported in vpxd.log on vCenter. The heartbeat traffic is UDP over 902 or 903. Like TBKing, I can ping the hosts continuously with no loss even while the disconnect happens, which makes me wonder whether only UDP is affected (DNS is UDP 53).
We are also using FCoE, which seems to be working fine. No major troubleshooting has gone on there yet, but I'm not seeing anything in the vmkernel logs to indicate storage disconnects.
Driver and Firmware versions are as recommended by Emulex/IBM:
Firmware: 4.4.180.3
Ethernet Driver: 4.2.327.0
FC Driver Kit: 8.2.4.141.55
I will be doing some more investigation on it today and will report anything else I find.
Cheers
DB
I'm not sure this is 100% network related. There is an issue with HBAs disconnecting, and with 5.1, hosts that lose their connection to storage will also go into a disconnected state.
You need to enable SSH on the ESX host and make sure the SSH port is open on the host's firewall. Then you can connect over SSH with PuTTY.
Run this command to show the current status:
esxcli system settings kernel list -o iovDisableIR
If it is FALSE, run the next command to set it to TRUE:
esxcli system settings kernel set --setting=iovDisableIR -v TRUE
Then reboot the host for the setting to take effect. You can re-run the first command after the reboot to verify the setting.
Please see VMware KB 1030265
