VMware Cloud Community
Johan77
Enthusiast

VMs intermittently lose network connectivity.

Hi,

We have a strange problem/bug in our new VMware cluster.

Environment

BL460c Gen10 with HP FlexFabric 20Gb 2-port 650FLB Adapter

HPE C7000 chassis

vSphere 6.0 (build 6775062)

ESX01, ESX02 and ESX03 are in chassis01

ESX04, ESX05 and ESX06 are in chassis02

VMs intermittently lose network connectivity.

When this happens, the “remedy” is to migrate the affected VM to some other host in the cluster.

So far it doesn’t seem to matter whether I migrate the VM to a host inside the same chassis or to the other chassis; a migration alone seems to solve the issue. (I can’t migrate it back to the same host, though.)
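For reference, here is a minimal pyVmomi sketch of that migration workaround; the vCenter address, VM name and target host below are placeholders, not our real names.

```python
# Minimal sketch, assuming pyVmomi and placeholder names ("vcenter.example.com",
# "affected-vm", "esx04.example.com"): vMotion a VM to another host in the cluster.
import ssl
import time
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab convenience only; use proper certs in production
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="***", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine, vim.HostSystem], recursive=True)
    objs = list(view.view)
    vm = next(o for o in objs
              if isinstance(o, vim.VirtualMachine) and o.name == "affected-vm")
    host = next(o for o in objs
                if isinstance(o, vim.HostSystem) and o.name == "esx04.example.com")
    # A RelocateSpec with only the target host set performs a compute vMotion
    # (same datastore, same resource pool).
    task = vm.RelocateVM_Task(spec=vim.vm.RelocateSpec(host=host))
    while task.info.state not in (vim.TaskInfo.State.success, vim.TaskInfo.State.error):
        time.sleep(2)
    print("Migration finished with state:", task.info.state)
    view.Destroy()
finally:
    Disconnect(si)
```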

I have around 150 VMs in this cluster, and so far I’ve had issues with 5-6 of them, seemingly at random.

They could be on any of my VMhosts in the cluster.

I haven’t opened a support case with VMware or HPE yet; this forum post is my first attempt to tackle the problem.

All firmware is updated to the latest from HPE.

Has anyone seen similar issues?

Regards

Johan

57 Replies
bulabog
Enthusiast

hi,

we are also experiencing the same issue...

RAJ_RAJ
Expert

Hi ,

You have to update the firmware.

Remove the NIC from the server profile and add a new one, then reconfigure the ESXi host with the NIC; that should fix the issue.

DeepakNegi420
Contributor

Is it possible for you to provide the vmkernel.log, hostd.log and the VM's vmware.log files?

1- Are all virtual machines isolated from the network, or just one?

2- When one virtual machine is isolated from the network, can you ping it from another VM on the same VLAN and see if it's reachable?

3- I noticed you mentioned a replacement of the VC module; did you try to roll back that change?

4- Are you running VM snapshot-based backups?

Regards, Deepak Negi
YushkovSergey
Contributor

Our Gen9 servers are still stable on CNA firmware 11.1.183.23 and driver 11.1.196.3.

But recently I got several new Gen10 servers with the "HP FlexFabric 20Gb 2-port 650FLB Adapter".

I use VMware-ESXi-6.5.0-Update1-7388607-HPE-650.U1.10.2.0.23-Feb2018.iso for the ESXi installation, and there is no newer software/firmware in the 2017.10.1 SPP for Gen10 servers, so nothing to update there.

I will try the Gen10 servers with firmware version 11.4.1231.6 (Drivers & Software - HPE Support Center)

and driver version 11.4.1205.0 (this version comes with the HPE ESXi ISO).
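If you need to confirm which driver each host's 650FLB ports are actually bound to before and after an update, here is a minimal pyVmomi sketch (the vCenter address and credentials are placeholders) that lists every host's physical NICs and their driver module; the exact driver and firmware builds can then be read on each host with esxcli network nic get -n vmnicX.

```python
# Minimal sketch, assuming pyVmomi and a placeholder vCenter name: print each
# host's physical NICs with the driver module they are bound to (e.g. elxnet).
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab convenience only
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="***", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], recursive=True)
    for host in view.view:
        for pnic in host.config.network.pnic:
            # pnic.driver is the bound driver module; the MAC helps match blades to VC profiles
            print(f"{host.name:30} {pnic.device:8} driver={pnic.driver:10} mac={pnic.mac}")
    view.Destroy()
finally:
    Disconnect(si)
```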

steckrgx2
Contributor

Hello, I am facing exactly the same issue with 8 BL460c Gen9 blades with the HP FlexFabric 20Gb 2-port 650FLB adapter and two c7000 enclosures.

Virtual Connect was upgraded to version 4.60, and we then had a few VMs randomly losing connectivity. HP then advised downgrading VC to version 4.50, but we are still facing this issue.

We have tried the following firmware and driver combinations: 11.2.1263.19 (FW) + 11.2.1149.0 (driver), 11.2.1226.20 (FW) + 11.2.1149.0 (driver), and 11.1.183.62 (FW) + 11.2.1149.0 (driver).

The issue is still present at the moment, but my question is: has anyone logged a call with HPE?

petestone
Contributor

We are experiencing intermittent VM disconnects that we have to resolve by vMotioning the VM to another host. We have also experienced packet drops on some VMs in the past; not sure how common it is right now.

None of our driver/firmware combinations are supported according to the VMware Compatibility Guide (VMware Compatibility Guide - I/O Device Search), and it is also nearly impossible to find these firmware/driver combinations from HP. I have found firmware 11.1.183.62 for download, but not 11.2.1226.5.

I have logged a support case with HP and supplied the requested logs; I have also asked how to find firmware 11.2.1226.5 to comply with the VCG.

As steckrgx2 is also experiencing issues with 11.2.1226.20 (FW) + 11.2.1149.0 (driver), I suspect the best option is to downgrade/upgrade to firmware 11.1.183.63 and driver 11.1.145.0 across the board, as this combination is listed as supported by VMware.

I am hoping HP will tell us how to proceed and/or whether this is a confirmed issue with certain driver/firmware combinations. I will update if I get any information that I can relay.

See info below:

We have 2 sites with 2 separate VMware datacenters, each running multiple clusters.

Each blade is a BL460c Gen9 with 650FLB adapters.

Site A is running

vSphere 6.0 U3 b6921384

Firmware 11.2.1263.19

Driver 11.2.1149.0

Site B is running mainly

vSphere 6.0 U3 b5050593

Firmware 11.1.183.23

Driver 11.1.145.0

But some hosts are running the combination below, because they got updated drivers by mistake, so we elected to upgrade the firmware to the closest matching version we could find:

vSphere 6.0 U3 b6921384

Firmware 11.2.1263.19

Driver 11.2.1149.0

EugenRodekuhr
Contributor

VMs intermittently lose network connectivity

We are experiencing intermittent VM disconnects that we have to resolve by doing a vMotion to another host. We have also experienced packet drops on some VMs in the past; not sure how common it is right now.

Our base is HP ProLiant BL460c Gen9 servers in c7000 enclosures.

The main issue is that we can't nail down the root cause, as it is intermittent. It is happening in our production environment, so we can't do deep troubleshooting, and in most cases we need to vMotion the VM to get it working again.

We have involved VMware and HPE, but both think the issue is related to a network problem in the data centre network switches and is a layer 2 issue, somehow related to MAC address problems or duplicate MAC addresses.

We are not in control of the DC network switches, so we decided to tackle the issue from the server side, eliminating the involved components one by one.

We could not pin the start of the problem to one particular action we took to keep the VMware environment in a supported state.

It seems that the first time we saw it was with the following setup:

  • OA: 4.70
  • VC: 4.60
  • Blade BIOS: I36 - 02-17-2017
  • FLB650 firmware: 11.2.1226.20
  • FLB650 driver: 11.2.1149.0
  • ESXi build: 6.0 U3a, 5572656

We did some research and updated in single steps to the versions below.

Current Setup:

  • OA: 4.70
  • VC: 4.60
  • Blade BIOS: I36 - 25-10-2017
  • FLB650 firmware: 11.2.1263.19
  • FLB650 driver: 11.2.1149.0
  • ESXi build: 6.0 U3d, 6921384

It looks like the number of failures has decreased, but we are not 100% sure, as the failure is intermittent.

So far we could not get a final solution from anybody, so we decided to do a step-by-step upgrade using the versions below and try to finally solve the issue.

Future Setup:

  • OA: 4.70
  • VC: 4.62
  • Blade BIOS: I36 - 01-22-2018
  • FLB650 firmware: 11.4.1223.x
  • FLB650 driver: 11.4.1210.0
  • ESXi build: 6.0 U3d, 6921384

HPE has released VC firmware 4.62 and SPP 03.2018, but HPE has not confirmed that there is an issue, and a fix or solution is not mentioned in the release notes.

To keep the setup in a supported condition, we do not really have any choice other than to proceed this way.

By the way, the mentioned supported BIOS/firmware is not available for download on any of the sites we searched, and in any case it is a very old version.

The newer released versions should fix the issue anyway.

Our next steps are:

  1. Verify if the amount of failures has really decreased
  2. Update to VC firmware 4.62 and check if the failure is gone
  3. Update to SPP 03.2018 and check if the failure is gone

We keep you posted about the outcome.

Any feedback and suggestions are welcome.

Best Regards

Eugen

stevenanderson
Contributor

Hi Eugen

Were you able to see whether this fixed your environment? We have similar problems affecting VM communication in different VLANs; out of nowhere we get network timeouts affecting a particular VLAN.

As for the environment, we are running exactly what you have, except that the VC firmware is on 4.50. Operations are blaming the Cisco Nexus switches, but I am certain this is not the cause, since I have another ESXi cluster running on ProLiant DL580s with no reported issues.

thanks

EugenRodekuhr
Contributor

Hello Steven,

We have not been able to go forward so far, because we need an approved change ticket from our customer.

Hopefully the change will be approved and we can start with our activity on Wednesday this week.

I keep you posted.

A colleague of mine had similar failures with a Nexus vSwitch; he solved it by adding uplink ports to his Nexus. It may be worth checking whether you have a buffer overflow somewhere on your Nexus.

Another colleague has seen duplicate MAC addresses; even if they are in different VLANs, some component (VC Ethernet module or vSwitch) seems to see them and stops responding.
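If you want to rule that out on the vSphere side, here is a minimal pyVmomi sketch (the vCenter name and credentials are placeholders) that reports any MAC address assigned to more than one virtual NIC in the inventory:

```python
# Minimal sketch, assuming pyVmomi and a placeholder vCenter name: report MAC
# addresses that are assigned to more than one virtual NIC across all VMs.
import ssl
from collections import defaultdict
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab convenience only
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="***", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], recursive=True)
    macs = defaultdict(list)  # mac -> list of (vm name, nic label)
    for vm in view.view:
        if vm.config is None:  # skip inaccessible/orphaned VMs
            continue
        for dev in vm.config.hardware.device:
            if isinstance(dev, vim.vm.device.VirtualEthernetCard):
                macs[dev.macAddress].append((vm.name, dev.deviceInfo.label))
    for mac, owners in macs.items():
        if len(owners) > 1:
            print(f"Duplicate MAC {mac}: {owners}")
    view.Destroy()
finally:
    Disconnect(si)
```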

We are trying to identify the root cause step by step and will eliminate the VC module from the setup to verify where the issue is caused.

We will upgrade the VC module, then upgrade the BIOS of the blades, and then check the VMware setup again.

Keep you posted.

RSNTeam
Contributor

Hello all,

has anyone tried to solve the problem by updating the environment to the newest firmware?

CNA: 11.4.1205.0

VC: 4.62

This is from the latest SPP 03.2018.

We will begin with this next week, but I would be curious to know if anyone has had success with this already.

Thanks in advance!

cheers,

RSNTeam

EugenRodekuhr
Contributor

We will start this week with updating to VC firmware 4.62.

After checking the result and looking at the environment, we will update to SPP 03.2018.

Keep you posted.

EugenRodekuhr
Contributor

Hello All,

Sorry for any inconvenience caused, but we need to postpone our upgrade activities by one week.

Our customer has imposed a change freeze, and due to that we can't go forward right now.

Regards

Eugen

Keep you posted about our progress.

EugenRodekuhr
Contributor

We will start our update activity on the 20th of March 2018. We will first update to SPP 2018-03 to eliminate the onboard NIC (650FLB) as the root cause. The second step will be to update the VC firmware to 4.62 to see if things improve further.

Keep you posted about the outcome.

skumflum42
Contributor

We have the same problems. Have you had time to finish the upgrade?

One more thing: have you tried to disable/enable the NIC from within Windows? This, plus doing a vMotion, fixes things for us.
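For VMs where you can't get into the guest OS, a rough equivalent from the vSphere side is to disconnect and reconnect the virtual NIC. A minimal pyVmomi sketch (placeholder vCenter and VM names, toggles the first vNIC):

```python
# Minimal sketch, assuming pyVmomi and placeholder names: disconnect and reconnect
# a VM's first virtual NIC via ReconfigVM_Task, roughly the vSphere-side equivalent
# of disabling/enabling the adapter inside the guest.
import ssl
import time
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def set_nic_connected(vm, nic, connected):
    nic.connectable.connected = connected
    change = vim.vm.device.VirtualDeviceSpec(
        operation=vim.vm.device.VirtualDeviceSpec.Operation.edit, device=nic)
    task = vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[change]))
    while task.info.state not in (vim.TaskInfo.State.success, vim.TaskInfo.State.error):
        time.sleep(1)

ctx = ssl._create_unverified_context()  # lab convenience only
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="***", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], recursive=True)
    vm = next(v for v in view.view if v.name == "affected-vm")
    nic = next(d for d in vm.config.hardware.device
               if isinstance(d, vim.vm.device.VirtualEthernetCard))
    set_nic_connected(vm, nic, False)  # disconnect the vNIC
    set_nic_connected(vm, nic, True)   # reconnect it
    view.Destroy()
finally:
    Disconnect(si)
```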

EugenRodekuhr
Contributor

Hello all,

we have been able to create a custom SPP from the HPE SPP site; this bundle was accepted by OneView, and we could update our environment to the versions below:

  • HP ProLiant BL460c G9 BIOS to I36 22.01.2018
  • HP ProLiant BL460c G9 FlexFabric 20Gb 2-Port 650FLB Firmware 11.4.1231.6
  • HP ProLiant BL460c G9 HP QMH2670 16Gb FC HBA Firmware v2.1.57.1
  • HP ProLiant BL460c G9 iLO-4 firmware 2.55
  • c7000 Onboard Administrator firmware 4.80
  • c7000 Virtual Connect firmware 4.62 (Ethernet 4.62 / FC 8Gb 20-port 2.15 / FC 8Gb 24-port 3.09)
  • VMware ESXi Version 6.0 Update 3 Build 6921384

It seems that our original issue is solved now, and after roughly one week we do not see any issues.

We have not been able to clearly identify the root cause of the issue, but it seems to be solved now.

From our point of view, we think the issue was caused by a mix of incompatibilities between the server NIC firmware, the VC firmware and the VMware ESXi version (build).

It seems that this combination is running as needed.

Keep you posted about the outcome.

Regards

Eugen

EugenRodekuhr
Contributor

We have not tested this inside Windows.

The only workaround in our setup was to vMotion the VM to another host.

We do not have access to the OS of the VMs, so we did not even try this method.

I hope our solution will help you get rid of the issue as well.

Regards

Eugen

sorenemig
Enthusiast

Do you still consider the problem to be resolved?

We tried upgrading only the NIC firmware/driver and kept VC at 4.50. This did not resolve the issue for us.

I will go ahead and upgrade the VC to 4.62.

petestone
Contributor

We have planned an upgrade to Virtual Connect 4.62 as well. We will do a rolling upgrade of the firmware and drivers on all hosts too, but we will start with the Virtual Connect.

Will post once we have seen results, good or bad.

sorenemig
Enthusiast

We have upgraded Virtual Connect to 4.62 on all our c7000s. I will post the result in a week's time.

EugenRodekuhr
Contributor

Sorry for the delay; yes, it still seems that the issues have been fixed by our update routine.
