We are now having very similar issues in our environment, but only since upgrading from 6.0 to 6.5. We've upgraded our Emulex drivers to 11.2.1269 and network drivers to 11.2.1149, but VMs continue to drop communication with VMs outside of the host on the same port-group (they can still communicate with VMs on the same port-group on the same host). Our VC firmware is on 4.45, but it seems from the discussion that the VC isn't the problem. Additionally, these VMs cannot talk to any other port-group on the same or a different host either. It's not until we vMotion the VM or disable/enable the VM NIC that the RARP brings the VM back online, as far as the upstream switches and gateway are concerned. We may have one or more VMs go down within a short time, or we may go a couple of days without an issue; we don't see any pattern in what triggers this event.
BL460C Gen9 blades
FlexFabric 20Gb 2-port 650FLB Adapter 11.2.1269 / 11.2.1149
Virtual Connect firmware 4.45
ESXi 6.5 with distributed switches
We have done a lot to try and stabilize the situation, including:
Initially upgraded our Emulex drivers from 10.5 to 11.2.1269
Recreated the port-groups for the original migrated DVS
Recreated the DVS from scratch along with all of the port-groups.
Rebooted the upstream switches
Changed the port-group load balancing method to 'route based on originating virtual port' from 'NIC load'
Created static MAC address entries on 2 VMs to test communication between each other (failed)
Created interface IP(s) on the upstream switch(es) on the failed VM subnet to test connectivity to the VM (failed)
Removed MAC address entry in the address table on the upstream switch
Upstream switches do not show any issues with flapping during a failure event
VMware logs, Log Insight and vROps give no visibility into the issue, as no events are logged during these failures
We had VMs fail on both sides of the chassis/VC
Any update to your own situation would be appreciated.
Did anyone get an answer to this issue? Does anyone have an HPE case number I can reference with my local HPE support team?
Have you tried this?
To reduce burst traffic drops, adjust the buffer settings in Windows:
- Click Start > Control Panel > Device Manager.
- Right-click vmxnet3 and click Properties.
- Click the Advanced tab.
- Click Small Rx Buffers and increase the value. The default value is 512 and the maximum is 8192.
- Click Rx Ring #1 Size and increase the value. The default value is 1024 and the maximum is 4096.
This applies to vmxnet3 adapters, and most of the time it resolves the issue.
Did anyone get an answer from HPE or VMware?
We had much the same issue using the FlexFabric 650M.
The issue went away after rebooting the host a few times, or after taking the vmnic down and up with the esxcli command.
The issue occurred on an E1000 adapter.
The guest's MAC address record on the Flex-10 did not move from the old port to the new port when I did a vMotion.
I think the Flex-10 is not receiving the RARP (or similar) packets needed to update its MAC address table...
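For anyone who wants to capture or replay the notification discussed above, here is a minimal Python sketch of what a RARP "reverse request" frame looks like on the wire. The field layout follows RFC 903; that ESXi fills the fields exactly this way after a vMotion is an assumption, and the MAC address is a made-up example in the VMware OUI range.

```python
import struct

RARP_ETHERTYPE = 0x8035      # EtherType assigned to RARP
OP_REQUEST_REVERSE = 3       # RARP "request reverse" opcode (RFC 903)


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' (or with dashes) to 6 raw bytes."""
    parts = mac.replace("-", ":").split(":")
    if len(parts) != 6:
        raise ValueError("expected 6 octets in MAC address: %r" % mac)
    return bytes(int(p, 16) for p in parts)


def build_rarp_frame(vm_mac):
    """Build a broadcast RARP frame announcing vm_mac (without padding/FCS)."""
    mac = mac_to_bytes(vm_mac)
    # Ethernet header: broadcast destination, VM MAC as source.
    ethernet = b"\xff" * 6 + mac + struct.pack("!H", RARP_ETHERTYPE)
    # RARP body: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # address lengths 6 and 4, then sender/target addresses.
    # Assumption: both hardware addresses carry the VM MAC, both IPs are zero.
    body = struct.pack("!HHBBH", 1, 0x0800, 6, 4, OP_REQUEST_REVERSE)
    body += mac + b"\x00" * 4 + mac + b"\x00" * 4
    return ethernet + body


frame = build_rarp_frame("00:50:56:aa:bb:cc")  # hypothetical VM MAC
print(len(frame), frame.hex())
```

The switch only needs the source MAC of this broadcast to relearn the port, so a capture filter on EtherType 0x8035 on the uplink should show whether the frame ever leaves the host.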
We are also experiencing the same issue...
You have to update the firmware.
Remove the NIC from the profile and add a new one, then configure the ESXi host with the new NIC. That should fix the issue.
RAJESH RADHAKRISHNAN
Is it possible for you to provide vmkernel.log, hostd.log & VM's vmware.log file?
1- Are all virtual machines isolated from the network or just one?
2- When one virtual machine is isolated from network, can you ping it from a different VM from the same VLAN and see if it's reachable?
3- I noticed that you mentioned replacing the VC module. Did you try rolling back that change?
4- Are you running VM snapshot based backups?
Our g9 servers are still stable on CNA firmware: 22.214.171.124 and driver: 126.96.36.199
But recently I got several new g10 servers with "HP FlexFabric 20Gb 2-port 650FLB Adapter".
I used VMware-ESXi-6.5.0-Update1-7388607-HPE-650.U188.8.131.52.23-Feb2018.iso for the ESXi installation. There is no new software/firmware for g10 servers in the 2017.10.1 SPP, so there is nothing to update here.
Will try the g10 servers with firmware version 11.4.1231.6 (Drivers & Software - HPE Support Center)
and driver version 11.4.1205.0 (this version comes with the HPE ESXi ISO).
Hello, I am facing exactly the same issues with 8 BL460c Gen9 blades that have the HP Flex 20Gb 2-port 650FLB adapter, in 2 c7000 enclosures.
Virtual Connect was upgraded to version 4.60, and we then had a few VMs randomly losing connectivity. HP then advised us to downgrade VC to version 4.50, but we are still facing this issue.
We have tried the following FW and driver combination : 11.2.1263.19 (FW) + 11.2.1149.0 (driver) and 11.2.1226.20 (FW) + 11.2.1149.0 (driver) and 184.108.40.206 (FW) +
The issue is still present at the moment. My question is: has anyone logged a call with HPE?
We are experiencing intermittent VM disconnects that we have to resolve by doing a vMotion to another host. We have also experienced packet drops on some VMs in the past. Not sure how common it is right now.
None of our driver/firmware combinations are supported according to the VMware Compatibility Guide (VMware Compatibility Guide - I/O Device Search), but it is also nearly impossible to find the supported firmware/driver combinations from HP. I have found Firmware 220.127.116.11 for download, but not 11.2.1226.5.
I have logged a support case with HP and supplied the requested logs. I have also asked how to find firmware 11.2.1226.5 in order to comply with the VCG.
As steckrgx2 is also experiencing issues with 11.2.1226.20 (FW) + 11.2.1149.0 (driver), I suspect the best option is to downgrade/upgrade to Firmware 18.104.22.168 Driver 22.214.171.124 across the board, as this combination is listed as supported by VMware.
I am hoping HP will inform us on how to proceed and/or whether this is a confirmed issue with certain driver/firmware combinations. Will update if I get any information that I can relay.
See info below:
We have 2 sites with 2 separate VMware datacenters, each running multiple clusters.
Each blade is a BL460c Gen9 with 650FLB adapters.
Site A is running
vsphere 6.0 U3 b6921384
Site B is running mainly
vSphere 6.0 U3 b5050593
But some hosts are running vSphere 6.0 U3 b6921384, because they got updated drivers by mistake, so we elected to upgrade the firmware to the closest version we could find.
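With mixed builds and firmware/driver pairs spread over two sites, it can help to tabulate which hosts run which combination before standardising "across the board". The following is only a sketch in Python: the host names are hypothetical, the inventory would have to be collected by hand or from your own tooling, and the target pair is an illustration taken from combinations mentioned in this thread, not a statement of what the VCG supports.

```python
from collections import defaultdict


def group_by_combination(inventory):
    """Map each (firmware, driver) pair to the sorted hosts running it."""
    groups = defaultdict(list)
    for host, combo in inventory.items():
        groups[combo].append(host)
    return {combo: sorted(hosts) for combo, hosts in groups.items()}


def hosts_off_target(inventory, target):
    """Return the hosts whose (firmware, driver) pair differs from target."""
    return sorted(host for host, combo in inventory.items() if combo != target)


# Hypothetical inventory: host name -> (adapter firmware, driver version).
inventory = {
    "esx-a01": ("11.2.1226.20", "11.2.1149.0"),
    "esx-a02": ("11.2.1263.19", "11.2.1149.0"),
    "esx-b01": ("11.2.1263.19", "11.2.1149.0"),
}

# Illustration only; take the real target pair from the VCG entry for your adapter.
target = ("11.2.1263.19", "11.2.1149.0")

for combo, hosts in group_by_combination(inventory).items():
    print(combo, hosts)
print("Needs change:", hosts_off_target(inventory, target))
```

Comparing the pair as a whole, rather than firmware and driver separately, matches how the compatibility guide lists them.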
VMs intermittently lose network connectivity
We are experiencing intermittent VM disconnects that we have to resolve by doing vMotion to another host. We have also experienced packet drops on some VMs in the past. Not sure how common it is right now.
Our platform is HP ProLiant BL460c Gen9 servers in c7000 enclosures.
The main issue is that we can't nail down the root cause, as it is intermittent. It is happening in our production environment, so we can't do any deep troubleshooting, and in most cases we need to vMotion the VM to get it working again.
We have involved VMware and HPE, but both think that the issue is related to a problem in the data centre network switches and is a layer 2 issue, somehow related to MAC addresses or duplicate MAC addresses.
We are not in control of the DC network switches. For that reason we decided to tackle the issue from the server side onwards and eliminate the involved components one by one.
We could not tie the start of the problem to any single action we took to keep the VMware environment in a supported state.
It seems that the first time we saw it was with the following setup:
- OA 4.70
- VC 4.60
- Blade Bios I36 - 02-17-2017
- FLB650 Firmware 11.2.1226.20
- FLB650 Driver 11.2.1149.0
- ESXi-Build 6.0.U3a 5572656
We did some research and updated in single steps to the versions below.
- OA 4.70
- VC 4.60
- Blade Bios I36 - 10-25-2017
- FLB650 Firmware 11.2.1263.19
- FLB650 Driver 11.2.1149.0
- ESXi-Build 6.0.U3d 6921384
It looks like the number of failures has decreased, but we are not 100% sure, as the failure is intermittent.
So far we could not get a final solution from anybody. For that reason we decided to do a step-by-step upgrade to the following versions and try to finally solve the issue.
- OA 4.70
- VC 4.62
- Blade Bios I36 - 01-22-2018
- FLB650 Firmware 11.4.1223.x
- FLB650 Driver 11.4.1210.0
- ESXi-Build 6.0.U3d 6921384
HPE has released VC Firmware 4.62 and SPP 03.2018, but HPE has not confirmed that there is an issue, and no fix or solution is mentioned in the release notes.
To keep the setup in a supported condition, we do not really have any option other than to proceed this way.
By the way, the bios/firmware listed as supported is not available for download on any of the sites we searched, and it is a very old version in any case.
The newer released versions should fix the issue anyway.
Our next steps are:
- Verify whether the number of failures has really decreased
- Update to VC firmware 4.62 and check whether the failure is gone
- Update to SPP 03.2018 and check whether the failure is gone
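Because the failure is intermittent and nothing is logged on the VMware side, verifying the steps above needs some independent record of when VMs drop off the network. Below is a minimal Python sketch that probes a list of VMs and records only the UP/DOWN transitions with a timestamp. The host names are hypothetical, the TCP port (3389) is an assumption to be replaced with a port your VMs actually listen on, and the script has to run from a machine outside the affected ESXi host, since VMs on the same host and port-group stay reachable during a failure.

```python
import socket
import time
from datetime import datetime


def tcp_probe(host, port=3389, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def monitor(hosts, interval=10.0, probe=tcp_probe, rounds=None):
    """Probe hosts repeatedly; record and print reachability changes only.

    rounds=None runs until interrupted; a number limits the probe rounds.
    Returns the list of (timestamp, host, "UP" or "DOWN") events.
    """
    state = {}
    events = []
    completed = 0
    while rounds is None or completed < rounds:
        for host in hosts:
            reachable = probe(host)
            previous = state.get(host)
            state[host] = reachable
            # The first observation only sets the baseline, it is not an event.
            if previous is not None and previous != reachable:
                stamp = datetime.now().isoformat(timespec="seconds")
                event = (stamp, host, "UP" if reachable else "DOWN")
                events.append(event)
                print(*event)
        completed += 1
        time.sleep(interval)
    return events


# Example call (hypothetical names), runs until interrupted with Ctrl+C:
# monitor(["vm-app01.example.local", "vm-db01.example.local"])
```

Counting the DOWN events per week before and after each firmware step gives a number to compare, instead of an impression that failures have decreased.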
We will keep you posted about the outcome.
Any feedback and suggestions are welcome.
Were you able to see whether this fixed your environment? We have similar problems affecting VM communication in different VLANs; out of nowhere we get network timeouts affecting a particular VLAN.
As for the environment, we are running exactly what you have, except that VC firmware is on 4.50. Operations are now blaming the Cisco Nexus switches, but I am certain this is not the cause, since I have another ESX cluster running on ProLiant DL580 servers with no reported issues.
We have not been able to go forward so far, because we need an approved change ticket from our customer.
Hopefully the change will be approved and we can start our activity on Wednesday this week.
I will keep you posted.
A colleague of mine had similar failures with a Nexus vSwitch. He solved it by adding uplink ports to his Nexus, so it may be worth checking whether you have a buffer overflow somewhere on your Nexus.
Another colleague has seen duplicate MAC addresses. Even if they are in different VLANs, some component (the VC Ethernet module or the vSwitch) seems to see them and stops responding.
We are trying to identify the root cause step by step, and will eliminate the VC module from the setup to verify where the issue is caused.
We will upgrade the VC module, then upgrade the BIOS of the blades, and then check the VMware setup again.
Will keep you posted.
Has anyone tried to solve the problem by updating the environment to the newest firmware?
This is from the latest SPP 03.2018.
We will begin with this starting next week, but I would be curious to know if anyone had success with this already.
Thanks in advance!