VMware Cloud Community
Johan77
Enthusiast

VMs intermittently lose network connectivity.

Hi,

We have a strange problem/bug in our new VMware cluster.

Environment

BL460 gen10 with HP FlexFabric 20Gb 2-port 650FLB Adapter

HPE C7000 chassis

vSphere 6.0 (build 6775062)

ESX01, ESX02 and ESX03 are in chassi01

ESX04, ESX05 and ESX are in chassi02

VMs intermittently lose network connectivity.

When this happens the “remedy” is to migrate the specific VM to some other host in the cluster.

So far it doesn’t seem to matter whether I migrate the VM to a host in the same chassis or in the other chassis; any migration seems to solve the issue. (I can’t migrate it back to the same host, though.)

I have around 150 VMs in this cluster and so far I’ve had issues with 5-6 of them, completely at random.

They could be on any of my VMhosts in the cluster.

I haven’t opened a support case with VMware or HPE yet; this forum post is my first attempt at tackling the problem.

All firmware is updated to the latest from HPE.

Has anyone seen similar issues?

Regards

Johan

57 Replies
sorenemig
Enthusiast

Ok... so far, so good. No network issues on any VM since the VC upgrade. Nothing in the release notes explains this!

I am cautiously optimistic but not convinced.

I have had a case open with VMware.

The support engineer could not see anything in particular in the logs and was leaning toward the external switches as the cause. But what you can do in Virtual Connect is very limited when it is controlled by OneView.

In the core switch everything looked fine.

sorenemig
Enthusiast

I just had the issue again. This time on a Windows Server 2008 R2 VM with VMXNET3.

In the past, Linux servers were the most affected.


Damn, this is beginning to frustrate me :(

petestone
Contributor

We have upgraded everything except the Onboard Administrators (they are not part of the communication path anyway), and we are still seeing VMs randomly lose network connectivity. Everything seems fine, but we can't ping in or out.

Most of the time a simple disconnect/reconnect of the network interface within the VM works. But sometimes we have to do a vMotion to get it working again, and that always seems to work.

Will raise the priority of the case to HP on Monday and I will request that they engage VMware rather than me having to coordinate two vendors.

I will post again once I have an update.

MarcoCrv
Contributor

Hello,

We are experiencing the exact same problem.

We also have cases open with both HP and VMware.

It seems to occur only on Linux VMs.

Let's hope they fix it quickly. It is becoming very annoying.

Bye.

chnb
VMware Employee

You can try updating the elxnet driver to >=11.2.1271 if you are on an 11.2.x driver, or to >=11.4.1255 if you are on an 11.4.x driver. These versions contain a workaround for an issue that may cause loss of VM connectivity. If you see a large number for "rx_pkts_on_quiesce" in the stats of the elxnet vmnics after a loss of VM connectivity, it is almost certainly this issue, and these driver updates should resolve it.
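
For anyone who wants to check several hosts against the version rule above, here is a minimal Python sketch. The function names are hypothetical, only the first three numeric fields of the version string are compared, and the "name: value" layout assumed for the statistics text is a guess about how the counters are printed, not a documented format. The post gives no number for "large", so the code only extracts the counter value.

```python
import re

# Minimum elxnet versions containing the workaround, per driver branch,
# exactly as quoted in the post above. Other branches are not covered.
MIN_FIXED = {
    (11, 2): (11, 2, 1271),
    (11, 4): (11, 4, 1255),
}


def parse_version(text):
    """Turn '11.4.1205.0' into (11, 4, 1205); only the first three fields are used."""
    fields = text.strip().split(".")
    return tuple(int(field) for field in fields[:3])


def has_workaround(driver_version):
    """True/False for the 11.2.x and 11.4.x branches, None for any other branch."""
    version = parse_version(driver_version)
    minimum = MIN_FIXED.get(version[:2])
    if minimum is None:
        return None  # the post gives no threshold for this branch
    return version >= minimum


def read_counter(stats_text, name):
    """Find 'name: 123' or 'name = 123' in pasted stats text (assumed layout)."""
    match = re.search(r"\b" + re.escape(name) + r"\s*[:=]\s*(\d+)", stats_text)
    return int(match.group(1)) if match else None
```

For example, `has_workaround("11.2.1271.0")` is true, while a 12.x driver returns `None` because the post says nothing about that branch. Note that, as mentioned further down the thread, HP-packaged drivers reportedly use different numbering, so this comparison is only meaningful for VMware-numbered driver versions.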

sorenemig
Enthusiast

chnb, I have searched Google like crazy but I am unable to find anything related to Emulex/elxnet 11.4.1255 or 11.2.1271.

Can you provide a deep link to a release note or download page?

unsigned1138
Enthusiast

We've seen something similar on the HP 556FLR Emulex adapter. The latest driver HP has published is 11.4.1205....

andreaspa
Hot Shot

HPE has published this advisory:

HPE Support document - HPE Support Center

Advisory: VMware - HPE ProLiant Server Configured With Certain Network Adapters And Running VMware ESXi 6.0 U3 May Randomly Lose Connection to Individual Virtual Machines

petestone
Contributor

Hi,

We have installed the driver released by VMware (11.2.1271) with good results. We have had 1.5 weeks of stable operation and counting.

It may look like an older version than the one released by HP, but according to HP support they use different numbering (although the same basic format).

We have now upgraded all our hosts to the latest HP SPP (2018.03.0.B), the latest VMware release and the latest HP drivers (apart from the VMware-supplied elxnet driver). All our blades are BL460c Gen9 with 650FLB adapters, and we run Virtual Connect 4.62. Hopefully this means we will have stable operations during the summer, but you never know :)

ujjwal2018
VMware Employee

Hello,

Please provide the details below:

1. Can the virtual machine ping another virtual machine on the same host and the same port group when the issue occurs?

2. If yes, let us know which adapter type the virtual machine uses: e1000 or vmxnet3.

3. If e1000, have you tried changing the adapter to vmxnet3? (If possible, please install the latest VMware Tools and replace the e1000 adapter with vmxnet3.)

4. If vmxnet3, make sure the adapter has enough buffer ( VMware Knowledge Base ).

5. Also provide the output of the following commands:

  • vmware -lv
  • esxcfg-nics -l
  • esxcli network nic get -n vmnicX (where vmnicX is the uplink in question)
  • esxcfg-vswitch -l

If required, we can request some advanced stats after checking the above.
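
To gather the command output requested above in one go, something like the following Python sketch could be used. It assumes a Python 3.5+ interpreter is available wherever these commands can run. The function names are hypothetical, and the uplink names passed in (for example vmnic0 and vmnic1) are placeholders for whatever your hosts actually use.

```python
import subprocess


def diagnostic_commands(nics):
    """Build the argv lists for the commands requested above, one 'nic get' per uplink."""
    commands = [["vmware", "-lv"], ["esxcfg-nics", "-l"]]
    commands += [["esxcli", "network", "nic", "get", "-n", nic] for nic in nics]
    commands.append(["esxcfg-vswitch", "-l"])
    return commands


def collect(nics, run=subprocess.run):
    """Run each command and return one text report; stderr is merged into stdout."""
    report = []
    for argv in diagnostic_commands(nics):
        result = run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        report.append("$ " + " ".join(argv) + "\n" + result.stdout)
    return "\n".join(report)
```

On a host, `print(collect(["vmnic0", "vmnic1"]))` would then produce a single block of text that can be pasted into a reply or attached to a support case. The `run` parameter exists only so the function can be exercised without an ESXi shell.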

Regards,

UJ

ujjwal2018
VMware Employee

Is everything working fine after installing the updated driver?

andreaspa
Hot Shot

We haven't seen any issues so far, and we've been running them for about a month.

ujjwal2018
VMware Employee

Great! Let me know if the issue recurs.

Regards,

UJ

golddiggie
Champion

Are things still good, or did the problem come back?

munirasu
Contributor

Hey Johan

I have faced a similar issue on Dell servers.

In my case, downgrading the network driver resolved the issue, but the older driver is not supported for the hardware according to the VMware HCL.

I have now upgraded the driver to the latest version and disabled NetQueue on the ESXi host (whether this applies depends on your driver), and there seem to be no issues so far.

TylerDurden77
Enthusiast

Hi all,

We are facing the problem again in one of our environments.

ESXi 6.7

VC 4.62

HP FlexFabric 20Gb 2-port 650FLB Adapter

elxnet version 12.0.1115.0 with firmware 12.0.1110.11

Regards

Johan

RayEspinoza
Contributor

Hi, did you find a solution to this problem?

 

Regards,

Ray.
