Solved: Re: VMs intermittent loses network connectivity. - Page 2

Johan77 · ‎11-25-2017

Hi,

We have a strange problem/bug in our new VMware cluster.

Environment

BL460 gen10 with HP FlexFabric 20Gb 2-port 650FLB Adapter

HPE C7000 chassis

vSphere 6.0 (build 6775062)

ESX01,ESX03 and ESX03 are in chassi01

ESX04,ESX05 and ESX are in chassi02

VMs intermittent loses network connectivity.

When this happens the “remedy” is to migrate the specific VM to some other host in the cluster.

So far it seems that it doesn’t matter if I migrate the VM to a VMhost inside the same chassis or to the other chassis , just a migration seems to solve the issue. (I can’t migrate it back to the same host though)

I have around 150 VMs in this cluster and so far I’ve had issues with 5-6 of them , completely random.

They could be on any of my VMhosts in the cluster.

Haven’t created any support case with VMware or HPE yet , this forum post is my first advance to tackle this problem.

All firmware is updated to the latest from HPE

Someone who have seen similar issues?

Regards

Johan

bulabog · ‎02-12-2018

hi,

we are also experiencing the same issue...

RAJ_RAJ · ‎02-13-2018

Hi ,

you have to update firmware

Remove the NIC from profile add new one . configure ESxi host with nic , it should fix the issue .

RAJESH RADHAKRISHNAN VCA -DCV/WM/Cloud,VCP 5 - DCV/DT/CLOUD, ,VCP6-DCV, EMCISA,EMCSA,MCTS,MCPS,BCFA https://ae.linkedin.com/in/rajesh-radhakrishnan-76269335 Mark my post as "helpful" or "correct" if I've helped resolve or answered your query!

DeepakNegi420 · ‎02-13-2018

Is it possible for you to provide vmkernel.log, hostd.log & VM's vmware.log file?

1- Are all virtual machines isolated from the network or just one?

2- When one virtual machine is isolated from network, can you ping it from a different VM from the same VLAN and see if it's reachable?

3- I noticed that you mentioned about replacement of VC module, did you try to roll back your change?
4- Are you running VM snapshot based backups?

Regards, Deepak Negi

YushkovSergey · ‎02-14-2018

Our g9 servers are still stable on CNA firmware: 11.1.183.23 and driver: 11.1.196.3

But recently I got several new g10 servers with "HP FlexFabric 20Gb 2-port 650FLB Adapter".

I use VMware-ESXi-6.5.0-Update1-7388607-HPE-650.U1.10.2.0.23-Feb2018.iso for esxi installation. And there are no new software/firmware on 2017.10.1 spp for g10 servers, so nothing to update here.

Will try g10 with firmware version: 11.4.1231.6 Drivers & Software - HPE Support Center.

and driver version: 11.4.1205.0 (this version comes with hpe esxi iso)

steckrgx2 · ‎02-27-2018

Hello, Iam facing exactly the same issues with 8 BL460c Gen9 having HP Flex 20Gb 2-port 650FLB adapter and 2 c7000 enclosures.

Virtual Connect has been upgraded to version 4.60 and we then had few VMs randomly losing connectivity. HP then advised to downgrade VC in version 4.50 but we are still facing this issue.

We have tried the following FW and driver combination : 11.2.1263.19 (FW) + 11.2.1149.0 (driver) and 11.2.1226.20 (FW) + 11.2.1149.0 (driver) and 11.1.183.62 (FW) +

11.2.1149.0 (driver).

Issue is still present at the moment but my question is anyone logged a call with HPE ?

petestone · ‎03-01-2018

We are experiencing intermittent VM disconnects that we have to resolve by doing vmotion to another host. We have also experienced packet drops on some VMs in the past. Not sure how common it is right now.

None of our Driver/firmware combinations are supported according to the Vmware compatibility guide (VMware Compatibility Guide - I/O Device Search ) but it is also nearly impossible to find these firmware/driver combinations from HP. I have found Firmware 11.1.183.62 for download, but not 11.2.1226.5.

I have logged a support case with HP and I have supplied the requested logs, I have also put forward the question how to find firmware 11.2.1226.5 to comply with the VCG.

As steckrgx2 is also experiencing issues with 11.2.1226.20 (FW) + 11.2.1149.0 (driver) I suspect the best option is to downgrade/upgrade to Firmware 11.1.183.63 Driver 11.1.145.0 across the board as this combination is specified as supported by Vmware.

I am hoping HP will inform us on how to proceed and/or whether this is a confirmed issue with certain driver/firmware combinations. Will update if I get any information that I can relay.

See info below:

We have 2 sites, 2 , 2 separate vmware datacenters and each running multiple clusters.

Each blade is BL460 gen9 with 650FLB adapters.

Site A is running

vsphere 6.0 U3 b6921384

Firmware 11.2.1263.19

Driver 11.2.1149.0

Site B is running mainly

vSphere 6.0 U3 b5050593

Firmware 11.1.183.23

Driver 11.1.145.0

But some hosts are running Because they got updated drivers by mistake, so we elected to upgrade firmware to the closest we could find.

vsphere 6.0 U3 b6921384

Firmware 11.2.1263.19

Driver 11.2.1149.0

EugenRodekuhr · ‎03-06-2018

VMs intermittent loses network connectivity

We are experiencing intermittent VM disconnects that we have to resolve by doing vMotion to another host. We have also experienced packet drops on some VMs in the past. Not sure how common it is right now.

Our basis are HP ProLiant 460c G9 server in C7000 Enclosures.

The main issues is, that we can't nail down the root cause as it is intermittent. It is happening in our production environment and for that reason we can't do a deep troubleshooting and in most cases we need to vMotion the VM to get it back to work.

We have involved VMware and HPE but both are thinking that the issues is related to a network problem in the data centre network switches and is a layer 2 issue, somehow related to MAC address issues or duplicate MAC Addresses.

We are not in control of the DC Network switches for that reason we decided to tackle the issues from the server side onwards, to eliminate involved components one by one.

The start of the happenings could not be exactly fixed to one or the other action we took to keep the VMware environment in a supported status.

It seems that the first time we saw it was with the following setup:

OA 4.70
VC 4.60
Blade Bios I36 - 02-17-2017
FLB650 Firmware 11.2.1226.20
FLB650 Driver 11.2.1149.0
ESXi-Build 6.0.U3a 5572656

We did some research and updated in single steps to the versions below.

Current Setup:

OA 4.70
VC 4.60
Blade Bios I36 - 25-10-2017
FLB650 Firmware 11.2.1263.19
FLB650 Driver 11.2.1149.0
ESXi-Build 6.0.U3d 6921384

Is looks like the amount of failures has decreased, but we are not 100% sure, as the failure is intermittent.

So far no we could not find any final solution from anybody, for that reason we decided the do a step by upgrade using the following steps and try to finally solve the issue.

Future Setup:

OA 4.70
VC 4.62
Blade Bios I36 - 01-22-2018
FLB650 Firmware 11.4.1223.x
FLB650 Driver 11.4.1210.0
ESXi-Build 6.0.U3d 6921384

HPE has released VC Firmware 4.62 and SPP 03.2018, but the HPE has not committed that there is an issue, a fix or solution is not mentioned in the release notes.

To keep the setup in a supported condition we do not really have another chance then to proceed this way.

By the way the mentioned supported bios/firmware is nit available for download in all sites we searched for it, and at least it is a very old version,

The newer released versions should fix the issue anyway.

Our next steps are:

Verify if the amount of failures has really decreased
Update to VC firmware 4.62 and check if the failure is gone
Update to SPP 03.2018 and check if the failure is gone

We keep you posted about the outcome.

Any feedback and suggestions are welcome.

Best Regards

Eugen

stevenanderson · ‎03-09-2018

Hi Eugen

was you able to see if this fixed your environment. We have similar problems affecting VM communication in different VLANs, out of no where we would get network timeout affecting a particular VLAN.

as for the environment we are running exactly what you have with the exception that VC firmware is on 4.50. Now operations are blaming the Cisco Nexus switches but I am certain this is not the cause since I have another ESX cluster running off proliant dl580 and no reported issues.

thanks

EugenRodekuhr · ‎03-12-2018

Hello Steven,

we have not been able tp go forward so far due to the fact that we need an approved change ticket from our customer.

Hopefully the change will be approved and we can start with our activity on wednesday this week.

I keep you posted.

A collegue of mine had similiar failures with a Nexus vswitch, he solved it with adding uplink ports to his Nexus, maybe worth to check if you an buffer overflow somewhere on your Nexus.

another collegue had seen duplicate MAC Addresses, even if they are in different VLANS, some Component _ VC-Ethernet Module or vSwitch seem to see them and stop responding.

We are trying to identify the root cause step by step, will eliminate the VC Module from their setup, to verify where the issues is caused.

We will upgrade the VC Module, then upgrade the BIOS of the blades and then schek the VMware setuop again.

Keep you posted.

RSNTeam · ‎03-13-2018

Hello together,

has someone tried to solve the problem by updating the environment to the newest firmware?

CNA: 11.4.1205.0

VC: 4.62

This is from the latest SPP 03.2018.

We will begin with this starting next week, but I would be curious to know if anyone had success with this already.

Thanks in advance!

cheers,

RSNTeam

EugenRodekuhr · ‎03-14-2018

We will start this week - with uzpdating to VC firmware 4.62.

After cheicng the result and looking at the environment we will update to SPP 03.2018.

Keep you posted.

EugenRodekuhr · ‎03-14-2018

Hello All,

sorry for any inconvenience caused, but we need to postpone our upgrade activities for one week.

Our customer has released a change freeze und due to that we can't go forward right now.

Regards

Eugen

Keep you posted about our progress.

EugenRodekuhr · ‎03-16-2018

We will start our update activity on the 20th of match 2018. we will start updating to SSP 2018-03 to eliminate the Onboard NIC (FLB650) as the root cause. The second step will be to update the VC firmware to 4.62 to see if we are further improving.

Keep you posted about the outcome.

skumflum42 · ‎03-29-2018

We have the same problems. Have you had time to finish the upgrade?

One more thing. Have you tried to disable/enable the NIC from within Windows? This and doing a vMotion fixes things for us

EugenRodekuhr · ‎03-29-2018

Hello All.

we have been able to create an custom SSP from the HPE SSP site, this bundle was accepted from Oneview and we could update our environment to the versions below:

HP ProLiant BL460c G9 BIOS to I36 22.01.2018
HP ProLiant BL460c G9 FlexFabric 20Gb 2-Port 650FLB Firmware 11.4.1231.6
HP ProLiant BL460c G9 HP QMH2670 16Gb FC HBA Firmware v2.1.57.1
HP ProLiant BL460c G9 iLO-4 firmware 2.55
C7000 Onboard Adminiastrator firmware to 4.80
C700 Virtual Connect firmware 4.62 (Ehternet 4.62 / FC 8GB/20Port 2.15 - FC 8GB/24Port 3.09)
VMware ESXi Version 6.0 Update 3 Build 6921384

It seems that our original issue is solved now and after roughly one week we do not see any issues.

We have not been able to clearly identify thge root cause of the issue, but is seems to be solved now.

From our point of view we think that the issues was caused by a mix of incompatibility of the Server NIC firmware, the VC firmware and the VMware Server version (Build).

Seems that this version is running as needed.

Keep you posted about the outcome.

Regards

Eugen

EugenRodekuhr · ‎03-29-2018

We have not tested this inside windows.

The only way to workaround in our setup eas just to vMotion the Vm to another host.

we do not have access to the OS of the VMs and for that reason we did not even try this method.

hope our solution will help you to get rifd of the issue as well.

Regards

Eugen

sorenemig · ‎04-10-2018

Do you still consider the problem to be resolved??

We tried to upgrade the NIC firmware/driver only and remained VC at 4.50. This did not resolve the issue for me.

I will go ahead and upgrade the VC to 4.62.

petestone · ‎04-12-2018

We have planned and upgrade to virtual connect 4.62 as well. Will do a rolling upgrade of the firmware and drivers on all hosts as well but we will start with the virtual connect.

Will post once we have seen results, good or bad.

sorenemig · ‎04-13-2018

We have upgrade to Virtual Connect to 4.62 on all my C7000. I will post the result in a week's time

EugenRodekuhr · ‎04-18-2018

Sorry for the delay, yes it still seems that the issues have been fixed with our update routine.