VMware Cloud Community
Krede
Enthusiast

ESXi 5.1 hosts randomly disconnect from vCenter

Hi.

After installing vSphere vCenter and ESXi 5.1 (with latest patches), we are starting to see hosts disconnecting randomly from vCenter.

We have different server models, but so far we have only seen the problem on our HP ProLiant BL465c Gen8 servers.

To solve this, we right click the host in vCenter and click reconnect – and sometimes we need to use the Shell to restart the management agent.

Anyone familiar with this issue?

59 Replies
pesante_pr
Enthusiast

Are you running vCenter on Windows Server 2008 R2?  If so, check to make sure that after the upgrade the Windows firewall profile or rules have not changed. 

VCP 4/5, VSTP 4/5, VSP 4/5 @fjpesante
skumflum42
Contributor

I have three generations of HP blades in one C7000 enclosure utilizing VC Flex-10.

BL460c G6 (Intel)

BL465c G7 (AMD)

BL465c Gen8 (AMD)

I too experience random disconnects from vCenter, and I see the same pattern as you people: “Receive packets dropped”, but NOT on our G6 blades (which also have no disconnects)!

I tried to chase the HBA angle as suggested by #jquest21, but I don’t see the “ALERT: APIC: 1823: APICID 0x00000000 - ESR = 0x40” message in the logs, so I’m not eager to just disable interrupt remapping.

Driver/firmware:

BL465c Gen8 (AMD)

NIC: Emulex Corporation HP FlexFabric 10Gb 2-port 554FLB Adapter
driver: be2net
version: 4.4.231.0
firmware-version: 4.2.401.605

HBA: Emulex Corporation LPe12000 8Gb Fibre Channel Host Adapter
vmhba5  lpfc820  link-up  fc.2000009c0224964e:1000009c0224964e  (0:5:0.0) Emulex Corporation LPe12000 8Gb Fibre Channel Host Adapter
Version: Version 0:8.2.3.1-127vmw, Build: 799733, Interface: 9.2 Built on: Aug  1 2012

BL465c G7 (AMD)

NIC: Emulex Corporation NC551i Dual Port FlexFabric 10Gb Adapter
driver: be2net
version: 4.1.255.11
firmware-version: 4.1.450.16
bus-info: 0000:04:00.0

HBA: QLogic Corp ISP2532-based 8Gb Fibre Channel to PCI Express HBA
Version: Version 911.k1.1-26vmw, Build: 472560, Interface: 9.2 Built on: Feb  8 2012

BL460c G6 (Intel)

NIC: Broadcom Corporation NetXtreme II BCM57711E/NC532i 10 Gigabit Ethernet
driver: bnx2x
version: 1.74.17.v50.1
firmware-version: bc 6.2.25 phy baa0.105
bus-info: 0000:02:00.0

HBA: QLogic Corp ISP2432-based 4Gb Fibre Channel to PCI Express HBA
Version: Version 934.5.6.0-1vmw, Build: 472560, Interface: 9.2 Built on: Sep 21 2012

DeaconZ
Enthusiast

jquest21 wrote:

Not sure if any of this is 100% network related. There is an issue with HBAs disconnecting, and with 5.1, hosts that lose their connection to storage will also go into a disconnected state.

You need to enable SSH on the ESX host and make sure the SSH port is open in the host's firewall. Then you can connect via SSH with PuTTY.

Enter this command to show the current status:

esxcli system settings kernel list -o iovDisableIR

If it is FALSE, run the next command to set it to TRUE:

esxcli system settings kernel set --setting=iovDisableIR -v TRUE

Then reboot the host for the setting to take effect. You can re-run the first command after the reboot to verify the setting.

Please see VMware KB 1030265
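
The check-then-set sequence quoted above could be scripted roughly as follows. This is only a sketch: it assumes the data row of the `list` output starts with the setting name and ends with the Configured, Runtime and Default columns (verify this on your build), and the function names `configured_value` and `needs_change` are made up for illustration.

```shell
#!/bin/sh
# Read the output of `esxcli system settings kernel list -o iovDisableIR`
# on stdin and print the Configured value in upper case.
# ASSUMPTION: Configured is the third column from the right.
configured_value() {
    awk '$1 == "iovDisableIR" { print toupper($(NF-2)) }'
}

# Exit 0 when the configured value is anything other than TRUE,
# i.e. when the setting still needs to be changed.
needs_change() {
    [ "$(configured_value)" != "TRUE" ]
}

# Usage on the host (both esxcli commands are the ones from the post):
#   if esxcli system settings kernel list -o iovDisableIR | needs_change; then
#       esxcli system settings kernel set --setting=iovDisableIR -v TRUE
#       echo "Reboot the host for the setting to take effect."
#   fi
```

Note that an unparseable listing also counts as "needs change" here, so it is worth looking at the raw output once before relying on the wrapper.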

I think this is the ticket. However, when you have a cluster of ESXi hosts running hundreds of production servers and they can't stay connected to vCenter long enough to migrate them, it's hard to reboot the host, isn't it? :)

jquest21
Contributor

In our case, when we initially had that issue, we would see the HBAs drop connectivity on the SAN switch. It was not always both ports that disconnected.

Sometimes it was just one, and then the second one would disconnect later.

It did not happen on all hosts simultaneously.

We do have a number of ESXi hosts and hundreds of servers running, but we didn't have the issue of not being able to vMotion off the host.

However, lately we have been having random disconnects as well: about once every few weeks a host would disconnect. We already had the HBA setting above in place, so that was not the issue this time.
VMware support said a hotfix was released last week to address this.

Patch ESXi510-201307401-BG

We applied it this week, so time will tell whether it ultimately resolves the issue.

We normally patch the ESX environment monthly, in the second week of the month, so we would have applied this in the next couple of weeks anyway, but we accelerated the patching because of the issue.

DeaconZ
Enthusiast

Hmm... yeah, mine are disconnecting constantly. They stay connected for about 1 minute and then disconnect. It makes it hard to vMotion. We're going to have to shut down all the customer VMs to be able to reboot. :(

And I think it's storage, because the clusters share datastores, except for a few hosts that use a different set of datastores, and those ones stay connected. This is affecting Dell on 4.1, and HP and Cisco UCS on 5.1. The ones staying connected are on the same UCS fabric as the others too; they just are not mapped to the common datastores.

a_p_
Leadership

The "1 minute" issue reminds me of a DNS resolution issue. Can you confirm that the FQDNs of all hosts as well as vCenter Server can properly be resolved by all hosts?

André
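
One way to run that check from a host's shell is a small loop over the names. A minimal sketch, assuming `getent` or `nslookup` is available and that `nslookup` exits non-zero on a failed lookup (this varies between implementations); the function names are hypothetical and the names in the usage line are placeholders.

```shell
#!/bin/sh
# resolves NAME -> exit 0 when NAME can be resolved on this machine.
resolves() {
    if command -v getent >/dev/null 2>&1; then
        getent hosts "$1" >/dev/null 2>&1
    else
        # ASSUMPTION: this nslookup returns non-zero when the lookup fails.
        nslookup "$1" >/dev/null 2>&1
    fi
}

# check_names NAME... -> print OK/FAIL per name, exit 1 if any name failed.
check_names() {
    failed=0
    for name in "$@"; do
        if resolves "$name"; then
            echo "OK   $name"
        else
            echo "FAIL $name"
            failed=1
        fi
    done
    return "$failed"
}

# Usage (placeholder names, replace with your own FQDNs):
#   check_names esx01.example.com esx02.example.com vcenter.example.com
```

Running it on every host, with the vCenter Server name included in the list, covers both directions mentioned above.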

jquest21
Contributor

Do you have access to the fiber switches?

On ours, they would show no link when the issue occurred.

So if it is truly the storage-related issue described in that article, you would see no link on your fiber switches.

If that is not the case, I think the issue may be elsewhere.

jquest21
Contributor

Also, to reiterate: at times only one switch would show no link. However, the host connection didn't drop until both switches were showing no link, because even with one down there was still a path. So be sure to look at both switches when a host disconnects.

DeaconZ
Enthusiast

Fiber switches seem to be ok. My VNX is also showing the hosts connected and the ones that are staying connected are on the same SAN, just some different LUNs. It is weird.

DeaconZ
Enthusiast

Found the problem. One of my engineers is implementing the Splunk/VMware app and was installing the OVAs. He made a change to the firewall between the ESX host network and vCenter and accidentally blocked UDP 902. When we opened it again, great things began to happen.

AlbertWT
Virtuoso

Yes, the same thing happened on my C7000 enclosure blades. I too have mixed blade servers like you:

On the HP BL 465c G7 and 465c G8 (all AMD), hosts disconnect intermittently and randomly as well, while my HP BL 460c G7 and 460c G8 (all Intel) have never had this problem.

Any help, suggestions or updates would be greatly appreciated.

Thanks

/* Please feel free to provide any comments or input you may have. */
Imcbride
Contributor

We had this issue too with C7000s, HP Flex-10 and ESXi 5.1. It turned out to be cached MAC addresses in ESXi.

We had physically moved some blades around and assigned new Virtual Connect profiles, but ESXi retained the old MAC address that Virtual Connect had assigned to it.

Once the old profile is deleted, the MAC address goes back into the free pool to be handed out again, which ends up causing MAC address conflicts.

The following KB was what helped me out:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=103111...

Hope this helps

Ian.

AlbertWT
Virtuoso

OK, I have found the resolution for my case here:

My ESXi host version is ESXi 5.1.0 build-1065491

This issue occurs because the vCenter Server agent on the host (vpxa) fails to send heartbeats to the vCenter Server.

This is a known issue affecting ESXi/ESX 4.x and ESXi 5.x. The resolution is to update the ESXi host to build 1157734 (ESXi 5.1 Patch 2).

Reference -  VMware KB: ESXi and ESX hosts randomly disconnect from VMware vCenter Server

Patch details - VMware KB: VMware ESXi 5.1, Patch Release ESXi510-201307001


Hope this helps you guys.

/* Please feel free to provide any comments or input you may have. */
Icecubbe
Contributor

Hi All,

Also make sure that you use the HP custom vSphere images.

Has anyone updated to a newer version/update of vSphere, and does the issue still occur?

Why not upgrade the hosts to 5.5 or to Update 2 of 5.1 and check the state then?

Kind Regards

Imcbride
Contributor

This issue reappeared for me when using the HP Custom ISO. The problem I had was with driver version ‘10.2.453.0-2263645’, as discussed at http://ict-freak.nl/2014/10/01/hps-september-vmware-driver-bundle-and-issues-with-emulex-cnas/

The latest driver from VMware for the Emulex card I am using, ‘be2net-4.9.234.8-2365770.zip’ dated 2014-12-17, works great.

However, Update Manager does not detect this driver as an upgrade, so I rolled back the driver and installed it manually as per KB 2079279.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=207927...

Before carrying this out, make sure that the firmware on the NIC is up to date as per:

http://vibsdepot.hp.com/hpq/recipes/HP-VMware-Recipe.pdf

The hardware that I have had the issue on:

HP ProLiant BL465c G7 - Emulex Corporation HP NC552m Dual Port Flex-10 10Gbe BL-c Adapter

HP ProLiant BL460c Gen8 - Emulex Corporation HP FlexFabric 10Gb 2-port 554FLB Adapter

Hope this helps.

ShoaibVM
Contributor

Sorry for replying to an old post, but Windows Firewall caused all this for me; for days I couldn't find the reason.

My hosts sit in a different VLAN, and when I installed vCenter it only opened the ports for the Domain profile, not Private or Public.

Thanks for your helpful tip.

AlbertWT
Virtuoso

Yes, I'm having the same problem as you guys. All of a sudden the ESXi 5.1 Update 1 host is disconnected from vCenter and I cannot right-click to reconnect it, so I had to perform a hard reset from the iLO.

I'm running HP BL 465c G7 and G8, and here are the detailed versions of my firmware and the hardware model:

NIC: Emulex Corporation NC551i

HBA: ISP2532

~ # ethtool -i vmnic0

driver: be2net

version: 4.2.327.0

firmware-version: 4.2.401.6

bus-info: 0000:04:00.0

May I know what the fix for this issue was?

/* Please feel free to provide any comments or input you may have. */
Imcbride
Contributor

Hi AlbertWT,

VMware have released a KB specifically for this: KB 2044681

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=204468...

Essentially, you need to upgrade the firmware and the driver. I have a BL 465c G7 with an NC551i too, and the following configuration is working fine now:

driver: be2net

version: 4.9.234.8

firmware-version: 4.9.416.2

If you can get SSH connected to your blade by other means, such as a mezzanine card, copying the .scexe firmware file locally and running it is much easier. If not, HP SUM will do the job.

http://h20564.www2.hp.com/hpsc/swd/public/readIndex?sp4ts.oid=5033634

https://my.vmware.com/web/vmware/details?downloadGroup=DT-ESXI51-EMULEX-BE2NET-492348&productId=285

AlbertWT
Virtuoso

@lmcbride: Yes, HP has an advisory page (HP Support document - HP Support Center), but it states the following:

Update the Emulex adapter firmware to version 4.2.401.2215 (or later) for the HP ProLiant BL465 G7 server and the HP ProLiant BL685 G7 server:


but when I checked the firmware version on the ESXi host, it is giving me:

~ # ethtool -i vmnic0

driver: be2net

version: 4.2.327.0

firmware-version: 4.2.401.6

bus-info: 0000:04:00.0


So should I apply the firmware version suggested by HP, knowing that my version above is greater than what the advisory suggests?

/* Please feel free to provide any comments or input you may have. */
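
For the version question above, one way to decide which of two dotted version strings is newer is to compare them component by component as numbers. The sketch below does that in plain shell; the function name `version_lt` is made up for illustration, and whether HP/Emulex firmware builds are really ordered this way is an assumption worth confirming with the vendor.

```shell
#!/bin/sh
# version_lt A B -> exit 0 when dotted version A is numerically lower than B.
# Components must be plain integers; missing components count as 0.
version_lt() {
    a=$1
    b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        # Take the leading component of each version string.
        ca=${a%%.*}
        cb=${b%%.*}
        [ -n "$ca" ] || ca=0
        [ -n "$cb" ] || cb=0
        if [ "$ca" -lt "$cb" ]; then return 0; fi
        if [ "$ca" -gt "$cb" ]; then return 1; fi
        # Drop the component that was just compared.
        case $a in *.*) a=${a#*.} ;; *) a= ;; esac
        case $b in *.*) b=${b#*.} ;; *) b= ;; esac
    done
    return 1  # the versions are equal
}

if version_lt 4.2.401.6 4.2.401.2215; then
    echo "4.2.401.6 is older than 4.2.401.2215"
fi
```

Under this reading the last component decides (6 is lower than 2215), so the installed firmware would count as older than the advisory's minimum rather than newer.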
bheemeswararao
Enthusiast

Have you tried deleting the vpxuser account? That automatically disconnects the server from vCenter. Then try adding the host again, which recreates vpxa; that should fix the issue. I had this issue on ESX 4.0 and fixed it the same way:

service vmware-vpxa stop

service mgmt-vmware stop

cat /etc/passwd |grep vpxuser

userdel vpxuser

rpm -qa |grep -iE 'vpx|aam'

rpm -e VMware-vpxa-5.0.0-773848

service mgmt-vmware start
