bjp106
Contributor

VM stops pinging after vmotion

I know there are many, many previous discussions on this topic, but I have not yet found the cause of my issue.

I have two hosts in a cluster, both of which were running ESXi 6.5.0, build 7388607. The hosts are HP BL460c Gen9 blades in an HP c7000 enclosure with FlexFabric Virtual Connect modules. I wanted to upgrade them, but because they were using the old net-bnx2x driver rather than the qfle3 driver, the normal VUM upgrade would fail: the host would come back up to a black screen. So instead of doing a lot of manual driver updates, my team opted for a "fresh" install of 6.5.0, 11925212 (latest), since these hosts had been upgraded through multiple versions over the years. The fresh install would reuse the same management and vMotion IPs.

I did a fresh install of 6.5.0, 11925212 on the first host. Everything went as smoothly as any previous host reinstall I've done. When I was ready to start on the second host and its VMs vMotioned to the first host (now on the latest build), half of the VMs lost their network connection when they landed on it. Other VMs in the same VLAN were not affected after vMotion. A quick vMotion back didn't seem to fix the issue (my recollection of this part is fuzzy; I was just trying to do everything I could to get the VMs back on the network). I have another cluster of hosts on 6.5.0, 11925212 that sees the same storage for these VMs, and when I vMotioned the VMs to that cluster, some of them still had no network connection (while others would ping without intervention). Once the Windows guest was restarted or powered off and on, the network connection came back.

Since then I have upgraded the second host in the original two-host cluster to 6.5.0, 11925212, so the hosts now match versions. I still have a few VMs left in this cluster that randomly lose their network connection when I vMotion them to the other host (in either direction). Sometimes they don't. For example:

Scenario 1 - VM1 on Host1 is pinging. I migrate VM1 to Host2 and it stops pinging. I migrate VM1 back to Host1 and it is still not pinging. Powering off or restarting the VM makes it resume pinging.

Scenario 2 - VM1 on Host1 is pinging. I migrate VM1 to Host2 and it keeps pinging.

Before 6.5.0, 11925212 we did not have this issue on these hosts with the same set of VMs in this cluster; it only appeared after the upgrade. As additional troubleshooting, I can edit the VM settings and choose a different open port for the network adapter, and sometimes the VM resumes its network connection. Sometimes not. I am at a loss as to what the issue is, but it seems related to 6.5.0, 11925212. The same vDS is used across ALL clusters in the environment, so it's not that. I'm less inclined to think it's our upstream physical switch, because we did not have these issues before 6.5.0, 11925212. For now I have turned off DRS so VMs do not move around. Both hosts seem to have this issue. I have a VMware SR open and am waiting to hear back, but I am very curious what the community's thoughts are.

Other observations in VM1's current network state:

--------------------------------

From my workstation:

-Cannot ping VM1

From Host2 VM1 is on:

-Cannot ping VM1 from its host (host mgmt is in different vlan)

From VM1 with connectivity issues:

-Cannot ping its host

-CAN ping another VM on the same vlan and same host (interesting)

-Cannot ping its gateway

-Cannot ping another VM on the same vlan on a different host in the cluster

From another VM (let's say VM2) without network connectivity issues on the same host:

-CAN ping the problematic VM successfully

-----------------------------------
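
The reachability checks listed above can be repeated after each vMotion with a small helper. This is only a sketch: `probe` and `reach_report` are hypothetical names, the addresses in the usage comment are placeholders, and the `ping` flags (`-c`, `-W`) are assumed to be supported by the system the script runs on.

```shell
#!/bin/sh
# probe TARGET: succeeds if TARGET answers a single ping within 1 second.
# Assumes a ping that accepts -c (count) and -W (timeout); adjust as needed.
probe() {
    ping -c 1 -W 1 "$1" >/dev/null 2>&1
}

# reach_report TARGET...: prints one "CAN ping" / "Cannot ping" line per target.
reach_report() {
    for target in "$@"; do
        if probe "$target"; then
            echo "CAN ping $target"
        else
            echo "Cannot ping $target"
        fi
    done
}

# Placeholder addresses; replace with the real gateway, host, and VM IPs:
# reach_report 10.0.20.1 10.0.20.11 10.0.20.12
```

Running it from the problematic VM and from a healthy VM on the same host gives the same kind of list as the one above, which makes it easier to compare results between migrations.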

The esxtop network view (n) on the host shows that VM1 (problematic) and VM2 (no problems) are both using vmnic4.

vmnic4 shows 100 under %DRPTX, which on the surface seems alarming.

vmnic5 shows 0.00 under %DRPTX.

1 Solution

Accepted Solutions
bjp106
Contributor

Ok, think we found the culprit - the qfle3-1.0.60.3 driver.

We ended up installing a driver upgrade to qfle3-1.0.60.4 (the host was on 1.0.60.3). After applying the driver upgrade, I no longer see the loss of network connectivity after testing vMotion multiple times.

I opened an HPE support case asking about this driver version and whether this is a known issue, since it's included in their latest HP Custom ESXi 6.5 Update 2 ISO.
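
To check whether other hosts are still on the older driver, a version comparison along these lines may help. It is a sketch, not a vendor tool: `version_lt` is a hypothetical helper, the version numbers are the two from this thread, and the `esxcli` command in the comment assumes the VIB is listed under the name qfle3.

```shell
#!/bin/sh
# version_lt A B: succeeds when dotted numeric version A is older than B.
# Missing components are treated as 0 (so 1.0 equals 1.0.0).
version_lt() {
    a=$1
    b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        x=${a%%.*}
        y=${b%%.*}
        # Drop the component just read; clear the string once it is exhausted.
        [ "$a" = "$x" ] && a= || a=${a#*.}
        [ "$b" = "$y" ] && b= || b=${b#*.}
        x=${x:-0}
        y=${y:-0}
        [ "$x" -lt "$y" ] && return 0
        [ "$x" -gt "$y" ] && return 1
    done
    return 1   # versions are equal
}

# On the ESXi host, the installed driver VIB can be listed with (not run here):
#   esxcli software vib list | grep qfle3
installed="1.0.60.3"   # version the host was on in this thread
fixed="1.0.60.4"       # version that resolved the issue here

if version_lt "$installed" "$fixed"; then
    echo "qfle3 $installed is older than $fixed"
fi
```

A driver at or above 1.0.60.4 is not a guarantee; it is only the version that stopped the behavior in this environment.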

7 Replies
marcelo_soares
Champion

- In /var/log/vmkernel.log, are you seeing any errors for vmnic4?

- VM1 and VM2 are both using the same VLAN (and you are sure VM2 is really working)?

Marcelo Soares
paramoyoo
Enthusiast

Hi

As marcelo.soares just mentioned, check the vmkernel.log file and grep it for any 'dissociate' entries:

grep -i dissociate /var/log/vmkernel.log
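
If several keywords are worth checking at once, a small wrapper can count matches for each. This is a sketch: `scan_log` is a hypothetical helper name, and the log path in the usage comment assumes the default ESXi location.

```shell
#!/bin/sh
# scan_log FILE PATTERN...
# Prints "PATTERN: N" for each pattern, where N is the number of lines in FILE
# containing it (case-insensitive, fixed-string match).
scan_log() {
    file=$1
    shift
    for pat in "$@"; do
        # grep -c prints 0 and exits non-zero when nothing matches; keep the 0.
        count=$(grep -i -F -c -e "$pat" "$file" || true)
        echo "$pat: $count"
    done
}

# Example (path assumed):
# scan_log /var/log/vmkernel.log dissociate vmnic4 qfle3
```

A count of 0 for every keyword only shows that nothing was logged under those terms, not that the NIC or driver is healthy.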

bjp106
Contributor

I did not find anything in the vmkernel log file with 'dissociate' entries. I also grepped for 'vmnic4' and didn't find anything.

I did speak with VMware support and was able to reproduce all of these issues with them. We did some packet captures and saw no ARP responses from the physical switch; they think the switch's MAC address table is not updating. VMware wants me to take packet captures on the physical switches, but I'll have to involve my network team for that. No other updates since that call earlier.

Has anyone had any bad experiences with the qfle3 driver? From what I understand, that is the main difference: the older version of ESXi we were running used the bnx2x driver, and this host now uses the qfle3 driver.

bjp106
Contributor
- VM1 and VM2 are both using the same VLAN (and you are sure VM2 is really working)?

That's correct. Yes VM2 is communicating fine on the network on the same host.

bjp106
Contributor

Some interesting links VMware support just shared with me:

Found a couple of external links that seem dissatisfied with the qfle3 driver:

1. For more information on the same refer to the following : https://blog.zoomik.pri.ee/posts/iscsi-hba-packet-loss-may-crash-your-vsphere/

     Please note this above link has not been verified or vetted by VMware or me personally.

2. https://serverfault.com/questions/950301/qfle3-driver-crashing-vmware-hosts-solved-reverting-to-bnx2...

3. Some other known issue : https://kb.vmware.com/s/article/52044

Most of these talk about PSODs and iSCSI, though, which is not related to what I'm experiencing.

RubensSanches
VMware Employee

Hi bjp106,

It seems this problem is back in later versions of the qfle3 driver. I have observed the same behavior with versions 1.0.72, 1.0.77 and 1.0.86. After upgrading to qfle3-1.0.60.4, was the problem solved for good? Did you upgrade to a higher version after that, or are you currently running 1.0.60.4?

thank you,

Rubens
