Solved: VMs intermittently lose network connectivity.

Johan77 · ‎11-25-2017

Hi,

We have a strange problem/bug in our new VMware cluster.

Environment

BL460 gen10 with HP FlexFabric 20Gb 2-port 650FLB Adapter

HPE C7000 chassis

vSphere 6.0 (build 6775062)

ESX01, ESX02 and ESX03 are in chassi01

ESX04, ESX05 and ESX are in chassi02

VMs intermittently lose network connectivity.

When this happens, the “remedy” is to migrate the affected VM to another host in the cluster.

So far it doesn’t seem to matter whether I migrate the VM to a host in the same chassis or in the other chassis; any migration seems to solve the issue. (I can’t migrate it back to the same host, though.)

I have around 150 VMs in this cluster and so far I’ve had issues with 5-6 of them, completely at random.

They could be on any of the hosts in the cluster.

I haven’t opened a support case with VMware or HPE yet; this forum post is my first attempt to tackle the problem.

All firmware is updated to the latest from HPE.

Has anyone seen similar issues?

Regards

Johan

andreaspa · ‎05-16-2018

HPE has published this advisory:

HPE Support document - HPE Support Center

Advisory: VMware - HPE ProLiant Server Configured With Certain Network Adapters And Running VMware ESXi 6.0 U3 May Randomly Lose Connection to Individual Virtual Machines

hussainbte · ‎11-25-2017

This is an issue with VLAN availability on the host the VM is currently running on.

Or with VLAN availability on one of the two or more NICs you are using for that port group.

Please check that the VLAN is available on all the NICs the switch uses.

You can use CDP to verify this,

or the command below from an ESXi SSH session:

vim-cmd hostsvc/net/query_networkhint
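A minimal sketch of the comparison described above, in Python. It assumes you have already collected the set of VLAN IDs observed on each uplink by some means (CDP, the network hint output, or the switch side). The function name `missing_vlans` and the `observed` data are made up for illustration and do not come from any host in this thread.

```python
def missing_vlans(observed):
    """Return {nic: sorted VLAN IDs seen on another uplink but not on this one}."""
    # Union of every VLAN seen on any uplink of the port group.
    all_vlans = set().union(*observed.values()) if observed else set()
    return {
        nic: sorted(all_vlans - vlans)
        for nic, vlans in observed.items()
        if all_vlans - vlans
    }


# Hypothetical example data: VLAN 20 is not trunked to vmnic1.
observed = {
    "vmnic0": {10, 20, 30},
    "vmnic1": {10, 30},
}
gaps = missing_vlans(observed)
print(gaps)
```

An empty result means every uplink saw the same VLANs, which would make a VLAN availability problem less likely.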

If you found my answers useful please consider marking them as Correct OR Helpful Regards, Hussain https://virtualcubes.wordpress.com/

Johan77 · ‎11-25-2017

Hi hussainbte,

It's not a VLAN availability problem.

We use SUS (Shared Uplink Sets) on our Virtual Connect switches, and the VLAN config is verified both on the VC/server profiles and on our Juniper switches.

A VM suddenly loses network connectivity; no vMotion has happened when this occurs.

As I wrote before, the remedy is to migrate the VM to another host; then, after 5-10 minutes, it's possible to vMotion the VM back to its original host.

To me, it sounds like a CAM table somewhere that won't update MAC addresses, or maybe a bug in the VC switches. Or maybe a GARP issue somewhere...

Regards,

Johan

YushkovSergey · ‎12-03-2017

Hi Johan!

I got the same problem, my setup:

esxi 6.5 - 6765664

HP C7000 enclosure with HP VC Flex-10/10D Module

ProLiant BL460c Gen9 with HP FlexFabric 20Gb 2-port 650FLB Adapter

I updated all hosts from the latest SPP (Service Pack for ProLiant, version 2017.10.1).

I've opened cases with VMware and HPE, but still no luck. Now we are trying to find the right combination of network card firmware and drivers, which sounds a little bit weird.

Can you show output from these commands?

esxcli software profile get

esxcli network nic get -n vmnic0

What version of VC do you have?

I have 4.61, and it looks like this could be a cause: https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00029108en_us

But to downgrade to 4.50 I have to shut down the whole enclosure and set everything up from scratch, so I hope to find another solution.

TylerDurden77 · ‎12-04-2017

Hi,

VC versions 4.60 and 4.61 seem to be the problem. (We run 4.61.)

We also have a case with HPE; they tell us to downgrade to 4.50.

But like you say, to be able to downgrade it seems that we have to shut down the whole VC domain, which isn't an option for us right now...

I find it very strange that there isn't a simple way to downgrade the VC.

Hopefully, HPE will get back to us during the day with some guidance.

Cheers

YushkovSergey · ‎12-05-2017

It may or may not be related to 4.60/4.61.

I have another VMware cluster in this enclosure; it uses the same distributed virtual switch and the same uplinks through the same Virtual Connect modules.

But the servers are ProLiant BL460c Gen8; three of them have the "HP Flex-10 10Gb 2-port 530FLB Adapter", and there are no issues with the virtual machines on them.

One has the "HP FlexFabric 20Gb 2-port 650FLB Adapter", like my problem cluster with BL460c Gen9.

And guess what? It also has this problem.

So the theory about the right combination of firmware and driver may be true.

These three combinations did not work:

Firmware: 11.1.183.62, Driver: 11.2.1149.0

Firmware: 11.2.1263.19, Driver: 11.4.1205.0

Firmware: 11.2.1263.19, Driver: 11.2.1149.0

And now I'm testing (as HPE support told me to):

Firmware: 11.1.183.23, Driver: 11.1.196.3

Can you check your firmware and driver with these commands?

esxcli network nic list

esxcli network nic get -n vmnic0
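Once the firmware and driver versions are known, they can be compared against the combinations listed in this post. The Python sketch below only encodes what this post reports; it is not an official HPE compatibility list, and the names `REPORTED_BAD`, `UNDER_TEST` and `classify` are made up for illustration.

```python
# (firmware, driver) pairs this post reports as not working.
REPORTED_BAD = {
    ("11.1.183.62", "11.2.1149.0"),
    ("11.2.1263.19", "11.4.1205.0"),
    ("11.2.1263.19", "11.2.1149.0"),
}

# The pair being tested at the time of this post; the outcome is not yet known here.
UNDER_TEST = {
    ("11.1.183.23", "11.1.196.3"),
}


def classify(firmware, driver):
    """Label a firmware/driver pair according to the lists above."""
    pair = (firmware, driver)
    if pair in REPORTED_BAD:
        return "reported problematic"
    if pair in UNDER_TEST:
        return "under test"
    return "unknown"


print(classify("11.2.1263.19", "11.2.1149.0"))
print(classify("11.1.183.23", "11.1.196.3"))
```

Any pair not mentioned in this post comes back as "unknown", which says nothing about whether it works.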

TylerDurden77 · ‎12-05-2017

Hi ,

Like you, I only see the issues on Gen9 and Gen10 servers.

Gen8 with HP "FlexFabric 10Gb 2-Port 534FLB Adapter" > No issues (we have around 12 Gen8 servers)

Gen9 with HP "FlexFabric 20Gb 2-port 650FLB Adapter" > Some issues; we have seen VMs acting weird, packet drops etc.

Gen10 with HP "FlexFabric 20Gb 2-port 650FLB Adapter" > Big issues; VMs randomly lose network connectivity, packet drops.

Gen10

esxcli network nic list

Name    PCI Device    Driver  Admin Status  Link Status  Speed  Duplex  MAC Address        MTU   Description
------  ------------  ------  ------------  -----------  -----  ------  -----------------  ----  -----------------------------------------------------------
vmnic0  0000:37:00.0  elxnet  Up            Up           10000  Full    70:10:6f:43:84:48  1500  Emulex Corporation HP FlexFabric 20Gb 2-port 650FLB Adapter
vmnic1  0000:37:00.1  elxnet  Up            Up           10000  Full    70:10:6f:43:84:50  1500  Emulex Corporation HP FlexFabric 20Gb 2-port 650FLB Adapter
vmnic2  0000:37:00.2  elxnet  Up            Up           10000  Full    70:10:6f:43:84:49  1500  Emulex Corporation HP FlexFabric 20Gb 2-port 650FLB Adapter
vmnic3  0000:37:00.3  elxnet  Up            Up           10000  Full    70:10:6f:43:84:51  1500  Emulex Corporation HP FlexFabric 20Gb 2-port 650FLB Adapter
vmnic4  0000:37:00.4  elxnet  Up            Up           10000  Full    70:10:6f:43:84:4a  1500  Emulex Corporation HP FlexFabric 20Gb 2-port 650FLB Adapter
vmnic5  0000:37:00.5  elxnet  Up            Down         0      Half    70:10:6f:43:84:52  1500  Emulex Corporation HP FlexFabric 20Gb 2-port 650FLB Adapter
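For anyone comparing many hosts, a rough Python sketch for reading rows like the ones above. It assumes the ten-column layout shown here, where the first nine fields contain no spaces and the description is the remainder of the line; other ESXi builds may print different columns. The helper name `parse_nic_row` is made up for illustration.

```python
KEYS = ["name", "pci", "driver", "admin", "link",
        "speed", "duplex", "mac", "mtu", "description"]


def parse_nic_row(line):
    """Split one data row of `esxcli network nic list` into named fields."""
    # Split on whitespace at most 9 times so the description keeps its spaces.
    return dict(zip(KEYS, line.split(None, 9)))


# Two rows copied from the listing above.
rows = [
    "vmnic0 0000:37:00.0 elxnet Up Up 10000 Full 70:10:6f:43:84:48 1500 "
    "Emulex Corporation HP FlexFabric 20Gb 2-port 650FLB Adapter",
    "vmnic5 0000:37:00.5 elxnet Up Down 0 Half 70:10:6f:43:84:52 1500 "
    "Emulex Corporation HP FlexFabric 20Gb 2-port 650FLB Adapter",
]
nics = [parse_nic_row(row) for row in rows]

# NICs whose link is not up.
link_down = [nic["name"] for nic in nics if nic["link"] != "Up"]
print(link_down)
```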

esxcli network nic get -n vmnic0

   Advertised Auto Negotiation: true
   Advertised Link Modes: 1000baseT/Full, 10000baseT/Full, 20000baseT/Full
   Auto Negotiation: true
   Cable Type:
   Current Message Level: 4631
   Driver Info:
         Bus Info: 0000:37:00:0
         Driver: elxnet
         Firmware Version: 11.2.1263.19
         Version: 11.2.1149.0
   Link Detected: true
   Link Status: Up
   Name: vmnic0
   PHYAddress: 0
   Pause Autonegotiate: true
   Pause RX: true
   Pause TX: true
   Supported Ports:
   Supports Auto Negotiation: true
   Supports Pause: true
   Supports Wakeon: true
   Transceiver: external
   Virtual Address: 00:50:56:5f:66:dc
   Wakeon: MagicPacket(tm)
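To pull just the firmware and driver versions out of output like the above, something along these lines could work. It is a sketch in Python that treats every line as `Key: value`, splits on the first colon, and flattens the nested "Driver Info" section; the function name `parse_nic_get` is made up, and the sample text is a shortened copy of the output above.

```python
def parse_nic_get(text):
    """Collect `Key: value` lines from `esxcli network nic get` output into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so MAC and PCI addresses stay intact.
        key, sep, value = line.strip().partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


sample = """\
Driver Info:
      Bus Info: 0000:37:00:0
      Driver: elxnet
      Firmware Version: 11.2.1263.19
      Version: 11.2.1149.0
Link Status: Up
Name: vmnic0
Virtual Address: 00:50:56:5f:66:dc
"""

info = parse_nic_get(sample)
print(info["Name"], info["Driver"], info["Firmware Version"], info["Version"])
```

Here "Version" is the driver version and "Firmware Version" is the adapter firmware, matching the pairs discussed earlier in the thread.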

Right now we have evacuated one chassis and are downgrading to 4.50.

Cheers

Johan

YushkovSergey · ‎12-05-2017

Hi Johan, thank you for information!

Please post your results on 4.50; I have to decide what to do next.

I have spent one day on firmware 11.1.183.23 and driver 11.1.196.3 with no errors.

msripada · ‎12-05-2017

Try to isolate it: when the VM is losing network, run esxtop, press n for the network view, and find which NIC the VM is using. If you have multiple NICs configured for the VM port group, uncheck the VM network (click OK) and check it again; this switches the NIC, which you can confirm in esxtop.

If the VM network then works fine, you have isolated the NIC that way.

If you see the same behaviour on multiple hosts, isolate the affected NIC on each host, then check the configuration of the physical switch ports those NICs are connected to, and compare it with the ports of the other NIC where the VM works.

Thanks,

MS

TylerDurden77 · ‎12-06-2017

Hi Sergey,

We have 4 chassis

In each chassis we have:

3 gen10 servers.

10 gen9 servers.

3 gen8 servers.

All blades except 2 are ESXi 6.0u3 hosts.

In our case it feels like the problem escalated when we took the Gen10 servers into production. But we are not certain...

To try to pinpoint the problem we have now done the following:

In chassis 1 and 2 we have downgraded the CNA firmware/driver (your hint) on the Gen10 ESXi hosts; VC firmware is still 4.61.

In chassis 3 and 4 we have put the Gen10 servers into maintenance mode; VC firmware is downgraded to 4.50.

After we downgraded the VC firmware in C3 and C4 we still had issues (random packet loss on VMs running in C3 and C4).

But since we downgraded the CNAs on the Gen10 blades in C1 and C2 we haven't seen any issues, and our environment seems stable (only 8 hours so far, though).

It's a very strange problem and hard to troubleshoot because it is so intermittent.

How are things in your environment? Still good after the downgrade?
Do you have any type of load balancer? (I wonder if our F5s could have something to do with the problem.)

Cheers

Johan

YushkovSergey · ‎12-06-2017

Hi Johan, thank you for sharing your results with VC 4.50.

I have not seen any issues for 50+ hours with CNA firmware 11.1.183.23 and driver 11.1.196.3.

We don't have any load balancers in this configuration, and we don't have any Gen10 servers yet.

During troubleshooting I try to simplify everything as much as possible. So right now our configuration looks like this:

bay 1 - HP VC Flex-10/10D Module

bay 2 - HP VC Flex-10/10D Module

Two SUS, each with only one physical uplink (no LACP). Every uplink is a trunk, so there is a bunch of vlans in it.

Profile attached to the ESXi server:

VLANs from SUS uplink 1 go to port 1, and from uplink 2 to port 2.

On the distributed switch we have a distributed port group called "servers", and all the problem virtual machines are attached to it. All traffic goes through one uplink, "servers01", which points to vmnic0 on every ESXi server. So no load balancing here either.

Our issues started after we replaced our old Virtual Connect modules with new HP VC Flex-10/10D modules (updated to 4.61 from the very beginning) and added new Gen9 servers to the enclosure (updated from the latest SPP). We did all this in one step, which is why I'm uncertain whether to blame the VC or the CNA.

I hope the right CNA firmware/driver will help us.

TylerDurden77 · ‎12-07-2017

Hi Sergey,

We are pretty sure that we have pinpointed the "bug".

It has nothing to do with our new Gen10 blades, and it's not the VC firmware.

It's the CNA firmware (11.2.1263.19) from the October SPP.

We have done a lot of testing and can reproduce the problem on hosts with the "11.2.1263.19" firmware (on both VC 4.50 and 4.61).

We have now downgraded the CNA firmware to "11.1.183.62" and our environment is stable again.
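The reasoning behind that conclusion can be written down as a small Python sketch: list what was observed and check which factor, on its own, always goes with the same outcome. The three observations are paraphrased from the posts in this thread ("downgraded" stands for the CNA firmware/driver downgrade on the hosts behind VC 4.61); the helper names `outcomes_by` and `explains` are made up, and three data points are a hint rather than proof.

```python
from collections import defaultdict

# Observations paraphrased from this thread.
observations = [
    {"vc": "4.50", "cna_fw": "11.2.1263.19", "stable": False},
    {"vc": "4.61", "cna_fw": "11.2.1263.19", "stable": False},
    {"vc": "4.61", "cna_fw": "downgraded", "stable": True},
]


def outcomes_by(factor, rows):
    """Group the observed outcomes by the value of one factor."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row[factor]].add(row["stable"])
    return dict(grouped)


def explains(factor, rows):
    """True if every value of `factor` always led to the same outcome."""
    return all(len(results) == 1 for results in outcomes_by(factor, rows).values())


for factor in ("vc", "cna_fw"):
    print(factor, explains(factor, observations))
```

With these observations the VC version does not separate stable from unstable hosts (4.61 appears with both outcomes), while the CNA firmware does.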

I find it very strange that HPE doesn't know about this problem; there must be many customers around the world with the same issues we had.

Cheers

Johan

glamic26 · ‎12-08-2017

Hello all,

Just wanted to add to the investigation here as we are seeing the same issue of VMs intermittently dropping off the network and the fix being to vMotion the VMs to another host.

We are running a similar setup:

6x BL460C Gen10 blades with 650FLB adapters

C7000 chassis with FlexFabric 20/40 F8 modules

vSphere 6.0 (build 6921384)

ESX01, 02 and 03 are in chassis01

ESX04, 05 and 06 are in chassis02

vDS version 6.0

PortGroup Settings:

Promiscuous Mode: Reject

MAC Address Changes: Accept

Forged Transmits: Accept

Load Balancing: route based on physical NIC load

Network Failover Detection: Beacon Probing

Notify Switches: Yes

Failback: Yes

dvUplink1 and dvUplink2 both Active Uplinks

The VCs are firmware version 4.50 (we previously downgraded this because 4.60 and 4.61 were revoked by HPE)

We first started seeing this issue with all 6 hosts running

Firmware Version: 11.1.183.62 (having to use older firmware due to an issue with recovering from a fibre cable loss on newer firmware - host unable to see paths to storage again even after fibre cable replaced until the host was rebooted)

Driver Version: 11.2.1149.0

We have since upgraded two of the hosts to firmware version 11.2.1263.19 but have had repeat issues with VMs on these hosts so this hasn't fixed the issue.

So, to summarise and rule out some of the suggestions made here:

VC firmware downgrade to 4.50 doesn't fix the issue

Firmware version 11.1.183.62 doesn't fix the issue (with driver 11.2.1149.0)

Firmware version 11.2.1263.19 doesn't fix the issue (with driver 11.2.1149.0)

I'll be logging this with VMware and HPE today. Does anyone else have any other open cases with them that I could reference to improve our chances of finding a fix?

There are suggestions in other posts (with not so similar hardware setups) that the issue is likely to be with the MAC address tables on the physical switches. Because we have Notify Switches turned on for the vDS port groups, a completed vMotion notifies the switches to update their MAC address tables, and this fixes the issue. So possibly the physical switches are somehow losing the correct MAC address for the VM's IP address, and the vMotion fixes this by notifying the switches of the MAC address.
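To illustrate the hypothesis, here is a toy Python model of a switch MAC address table. It only shows why a fresh notification after vMotion would repair a wrong entry; it does not explain how the entry would go wrong in the first place. The class name, the MAC address and the port names are all made up.

```python
class LearningSwitch:
    """Toy model of a switch MAC table: MAC address -> port it was last seen on."""

    def __init__(self):
        self.table = {}

    def learn(self, mac, port):
        # A frame sourced from `mac` arrived on `port`, for example the
        # notification sent on behalf of the VM after a vMotion.
        self.table[mac] = port

    def egress_port(self, mac):
        # None stands for "unknown unicast", which a real switch would flood.
        return self.table.get(mac)


switch = LearningSwitch()
vm_mac = "00:50:56:aa:bb:cc"  # made-up MAC address

switch.learn(vm_mac, "port-host-A")   # VM is running on host A
switch.table[vm_mac] = "stale-port"   # simulate the entry going wrong somehow
before = switch.egress_port(vm_mac)   # traffic for the VM now leaves on the wrong port

switch.learn(vm_mac, "port-host-B")   # vMotion to host B, switches are notified
after = switch.egress_port(vm_mac)

print(before, after)
```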

Thanks,

glamic26

TylerDurden77 · ‎12-08-2017

Hi,
Just a quick update.

On our gen10 servers, we have this setup which has been stable for the last 36 hours.

@glamic26

You say that you have seen problems with "11.1.183.62" ? (with driver 11.2.1149.0)

Cheers

Johan

quinny100 · ‎12-21-2017

Any updates from HPE on this?

I think we're experiencing the same issue: 2 C7000 enclosures, 24 blades with Virtual Connect modules and 650FLB NICs in the blades. On ESXi 6, VMs randomly drop off the network and come back when vMotioned.

Rebooting the hosts seems to make the issue go away for a while; last time we didn't see it for about 20 days after rebooting all the blades.

We are currently downgrading the firmware on the NICs to see if this helps.

rtortora · ‎12-21-2017

We are having very similar issues in our environment, but only since upgrading from 6.0 to 6.5. We've upgraded our Emulex drivers to 11.2.1269 and network drivers to 11.2.1149 but continue to have issues with VMs dropping communication with VMs outside the host on the same port group (they can still communicate with VMs on the same port group on the same host). Our VC firmware is at 4.45, but from this discussion it seems the VC isn't the problem. Additionally, these VMs cannot talk to any other port group, on the same or a different host. It's not until we vMotion the VM or disable/enable the VM NIC that the RARP brings the VM online again, as far as the upstream switches/gateway are concerned. We may have one or more VMs go down within a short time, or we may go a couple of days without an issue; we don't see any pattern in what triggers the event.

Environment:

C7000

BL460C Gen9 blades

FlexFabric 20Gb 2-port 650FLB Adapter 11.2.1269 / 11.2.1149

Virtual Connect firmware 4.45

esxi 6.5 w/ distributed switches

We have done a lot to try and stabilize the situation, including:

Initially upgraded our emulex drivers from 10.5 to 11.2.1269

Recreated the port-groups for the original migrated DVS

Recreated the DVS from scratch along with all of the port-groups.

Rebooted the upstream switches

Changed the port-group load balancing method to 'route based on originating virtual port' from 'NIC load'

Created static MAC address entries on 2 VMs to test communication between each other (failed)

Created interface IP(s) on the upstream switch(es) on the failed VM subnet to test connectivity to the VM (failed)

Removed MAC address entry in the address table on the upstream switch

Upstream switches do not show any issues with flapping during a failure event

VMware logs, Log Insight and vROps give no visibility into the issue, as no events are logged during these failures

We had VMs fail on both sides of the chassis/VC

Any update to your own situation would be appreciated.

Thank you.

jlonsdale · ‎01-10-2018

Hey,

Did anyone get an answer to this issue? Does anyone have an HPE case number I can reference with my local HPE support team?

Thanks

ashishsingh1508 · ‎02-02-2018

Hi There,

Have you tried doing this ?

To reduce burst traffic drops, increase the vmxnet3 buffer settings in Windows:

Click Start > Control Panel > Device Manager.
Right-click vmxnet3 and click Properties.
Click the Advanced tab.
Click Small Rx Buffers and increase the value. The default value is 512 and the maximum is 8192.
Click Rx Ring #1 Size and increase the value. The default value is 1024 and the maximum is 4096.

Ashish Singh VCP-6.5, VCP-NV 6, VCIX-6,VCIX-6.5, vCAP-DCV, vCAP-DCD

ashishsingh1508 · ‎02-02-2018

This applies to vmxnet3,

and most of the time it resolves the issue.

tomtom203 · ‎02-08-2018

Has anyone got an answer from HPE or VMware?

We had what looks like the same issue using the FlexFabric 650M.

But the issue went away after rebooting the host a few times, or after bringing the vmnic down and up with the esxcli command.

The issue happened on an E1000 adapter.

The guest's MAC address entry on the Flex-10 did not move from the old port to the new port when I did a vMotion.

I think the Flex-10 does not receive the RARP (or whatever packets are used) for updating its MAC address table...