VMware Cloud Community
wrf3f34ff
Enthusiast

ESXi dropping ARP packets?

We're widening our ESXi 3.5.1 test deployment to a second server, and we've encountered an unusual problem: ESXi doesn't appear to be receiving ARP packets. Periodically the switch just "loses" the IPs behind the server's NIC, and we've confirmed that this happens when ARP cache entries expire. Inside a VM, we can run a packet dump and never see any ARP traffic. Sometimes doing a "Restart Networking" from the console helps for a while, and as long as traffic continually passes to an IP, its ARP cache entry obviously does not expire.

The first machine we loaded ESXi on does not have this problem; running all the same tests there, we get exactly the results one would expect.

We have already exhaustively tested the hardware and switch port. Both are working fine. Switch port monitoring shows that the ARP requests are outbound to the VMware box, but they never make it.
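For anyone wanting to repeat the comparison described above (ARP requests visible on the switch's monitor port, nothing visible on the host side), here is a minimal sketch. It assumes tcpdump 4.x-style text output, run on a capture box or inside a guest rather than on ESXi itself. The function name arp_summary and the interface name eth0 are made up for illustration.

```shell
#!/bin/sh
# Summarise ARP traffic for one IP from tcpdump text output on stdin.
# Assumed line format (tcpdump 4.x, run with -n), for example:
#   ARP, Request who-has 172.20.0.101 tell 172.20.0.1, length 28
#   ARP, Reply 172.20.0.101 is-at 00:30:48:00:00:01, length 28
# Hypothetical usage:
#   tcpdump -n -c 200 -i eth0 arp | arp_summary 172.20.0.101
arp_summary() {
    target="$1"
    awk -v ip="$target" '
        # Count requests asking who has the target IP.
        /Request who-has/ {
            for (i = 1; i <= NF; i++)
                if ($i == "who-has" && $(i + 1) == ip) req++
        }
        # Count replies sent on behalf of the target IP.
        /Reply/ {
            for (i = 1; i <= NF; i++)
                if ($i == "Reply" && $(i + 1) == ip) rep++
        }
        END { printf "requests=%d replies=%d\n", req, rep }
    '
}
```

Requests counted on the monitor-port side but none on the host side, with zero replies on both, would match the behaviour reported in this thread.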

This is not related to the guest networking stack; it happens with the ESXi management NIC as well, even right after startup when no guests are running. ("Test Management Network" has to be run to make the host accessible to the VI client.) So the problem is somewhere in the ESXi physical LAN driver or the vSwitch.

Both the working ESXi server and the new test server are Supermicro servers that are either on the HCL or from the same "series" as machines on the HCL (same electronics, different disk configuration or form factor). The machines are identical in terms of CPUs, RAM, disks, etc. The only material difference between the one that works and the one that doesn't is that the server having the problem is a bit older; its onboard dual-port NIC is an Intel (ESB2/Gilgal) 82563EB. However, the 82563EB is a component in many of the HCL systems, so I don't think it's a fundamental compatibility issue.

Does anyone have any idea what might be going on?

Thanks!

(I originally managed to post this to the wrong forum. I'm easily confused! :) )

wrf3f34ff
Enthusiast

Just an update: the working ESXi machine is using the "igb" LAN driver and the non-working one is using the "e1000" driver.

I've also discovered that the second LAN interface on the "e1000" machine seems to work fine. Both the vmknic and guests receive and respond to ARP requests exactly as expected.

But I want to reiterate that we can boot the machine using a FreeBSD livecd and see ARP packets on both interfaces without issue. So this is either specific to the VMware e1000 driver or it's a configuration issue.

I'm certainly hoping for the latter, but I can't find anything else to check. I've already tried rebuilding the vswitch/portgroup/vmknic configuration from scratch in the "unsupported" console, with no change.

Texiwill
Leadership

Hello,

Moved to ESXi forum.


Best regards,

Edward L. Haletky

VMware Communities User Moderator

====

Author of the book 'VMWare ESX Server in the Enterprise: Planning and Securing Virtualization Servers', Copyright 2008 Pearson Education.

CIO Virtualization Blog: http://www.cio.com/blog/index/topic/168354

As well as the Virtualization Wiki at http://www.astroarch.com/wiki/index.php/Virtualization

--
Edward L. Haletky
vExpert XIV: 2009-2023,
VMTN Community Moderator
vSphere Upgrade Saga: https://www.astroarch.com/blogs
GitHub Repo: https://github.com/Texiwill

wrf3f34ff
Enthusiast

Thanks, Edward.

Here's the latest update.

First I tried the obvious: swapping the LAN cables and vmknic IP addresses to make sure the problem stayed with the port and not the LAN segment. It did.

Then I went into the service console and ripped down networking to the bare vmnics (no vswitches, no port groups, no vmknics). I rebuilt it all from scratch, by hand, from the command line, and I got the same behavior. So I had the LAN cables swapped again and rebuilt it all again. Same vmnic, same problem. The sequence of commands used is here:

esxcfg-vswitch -a vSwitch0
esxcfg-vswitch -a vSwitch1
esxcfg-vswitch -L vmnic0 vSwitch0
esxcfg-vswitch -L vmnic1 vSwitch1
esxcfg-vswitch -A pGroup0 vSwitch0
esxcfg-vswitch -A pGroup1 vSwitch1
esxcfg-vmknic -a -i 172.20.0.101 -n 255.255.0.0 pGroup0
esxcfg-vmknic -a -i 172.21.0.101 -n 255.255.0.0 pGroup1
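
To double-check what a rebuild like this produced, the read-only listing forms of the same tools can be run afterwards. This is a minimal sketch, assuming the same "unsupported" console as above. The helper names run and show_network_config and the DRY_RUN variable are invented here and are not part of ESXi.

```shell
#!/bin/sh
# Print each command before running it.
# Set DRY_RUN=1 to print the commands without executing them.
run() {
    echo "+ $*"
    [ -n "$DRY_RUN" ] || "$@"
}

# List the current network configuration without changing anything.
show_network_config() {
    run esxcfg-nics -l      # physical NICs: driver, link state, speed
    run esxcfg-vswitch -l   # vSwitches with their uplinks and port groups
    run esxcfg-vmknic -l    # VMkernel NICs and their IP settings
}
```

Calling show_network_config after the sequence above should show vmnic0 under vSwitch0 and vmnic1 under vSwitch1, each with its port group and vmknic, and the driver bound to each vmnic.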

I believe the 82563EB is really two PHYs on a single controller. The problem is always with the first port (vmnic0), so that is consistent with some sort of driver oddity. (Again, the hardware is on the VMware supported list, and if I load the same box with other OSes in the same IP configuration, ARP packets are correctly delivered to both ports.) The box has one PCIe x8 slot, so adding a NIC is tempting even if this one is supported, but I need to keep that free for an iSCSI HBA.

I'm not sure where to go from here.

wrf3f34ff
Enthusiast

Two other threads about people having the same problem with the same Ethernet controller:

http://communities.vmware.com/thread/125883

http://communities.vmware.com/thread/173549

I've also found that while other current operating systems don't have this problem, a number of them did at one point, and it was fixed by driver updates. I'm still investigating the specifics.

wrf3f34ff
Enthusiast

I have not been able to learn much more about this problem, unfortunately, and it remains unsolved. The only new detail is that it may be related to the way this particular chip handles being (mis-)used as a shared LAN port for the BMC/IPMI card in the machine. (We don't use it that way; we have a dedicated IPMI LAN. But the circuitry/firmware for this option is still there.)

I'm kind of concerned by the lack of response on this. Is this an issue that VMware will address, given that it is a VMware driver problem related to systems on the official HCL?

What are we supposed to use for purchasing guidance if the official HCL can't be trusted?

dominic7
Virtuoso

Have you opened an SR with VMware?

You can't be too concerned about a lack of response until you do that.

Dave_Mishchenko
Immortal

Even if you don't have a support contract, I would suggest opening a support case for this - http://www.vmware.com/go/support_request/. While VMware staff do browse the forums, it's much better to log a support call, especially for an issue like this.

wrf3f34ff
Enthusiast

> Have you opened an SR with VMware?
>
> You can't be too concerned about a lack of response until you do that.

We're stuck at the evaluation stage. Deploying VI3 network-wide will be staggeringly expensive for us; we're not going to spend the money if the product isn't going to work, and we can't open a service request about the product not working unless we spend the money.

Chicken and egg. :(

wrf3f34ff
Enthusiast

> Even if you don't have a support contract, I would suggest opening a support case for this - http://www.vmware.com/go/support_request/. While VMware staff do browse the forums, it's much better to log a support call, especially for an issue like this.

Thanks for the suggestion. I don't see anywhere on the referenced page to open a support request without a contract. If there's a way, I'll certainly do it.

Dave_Mishchenko
Immortal

Support numbers are listed here - https://www.vmware.com/support/phone_support.html. Are you working with a VMware account manager or partner?

wrf3f34ff
Enthusiast

> Support numbers are listed here - https://www.vmware.com/support/phone_support.html.

"VMware technical phone support is available to customers covered by Platinum (24x7) and Gold (12x5) support contracts."

> Are you working with a VMware account manager or partner?

Not yet; my general inclination is to avoid "account managers" like the plague. But in this case, my general inclination is clearly wrong, so we'll start there and see how far we get.

wrf3f34ff
Enthusiast

Just to follow up on this for the benefit of future searchers, we had no response on this from either VMware sales or support. The problem remains unresolved and we've given up.

I hope the next person who encounters it will have better luck.

Also for the benefit of searchers, the problem NIC is sometimes also known as the Intel "80003ES2LAN Gigabit Ethernet Controller." The only workaround is to disable the onboard NIC and use something else, if that's an option for you.

elazar
Enthusiast

Did you check /var/log/messages for anything when the NIC went down? Just wondering if this is your issue: http://kb.vmware.com/kb/1004650. Just to note, the header on the e1000 source for ESXi is dated 2006...

Falx
Contributor

I had a very similar problem on my Supermicro 6025B-3V platform (X7DB8 motherboard, according to the ESXi information) with 2x 80003ES2LAN NICs onboard. ESXi 3.5 U2 was dropping incoming ARP packets on the first port, so I was able to use this port only after setting a static ARP entry on the source computer (or after pinging the client computer from the service console, which adds the MAC to its ARP table).
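
The static ARP workaround mentioned above could look like the following on a Linux or BSD client. This is a minimal sketch, not a fix for the underlying driver problem. The helper name pin_static_arp, the DRY_RUN variable, and the example IP and MAC address are assumptions made for illustration; actually running arp -s needs root.

```shell
#!/bin/sh
# Pin a static ARP entry on a client so it can reach an address whose
# owner never answers ARP requests.
# Set DRY_RUN=1 to print the command without executing it.
pin_static_arp() {
    ip="$1"
    mac="$2"
    if [ -z "$ip" ] || [ -z "$mac" ]; then
        echo "usage: pin_static_arp IP MAC" >&2
        return 1
    fi
    echo "+ arp -s $ip $mac"
    [ -n "$DRY_RUN" ] || arp -s "$ip" "$mac"
}

# Hypothetical usage (both values are placeholders):
#   pin_static_arp 172.20.0.101 00:30:48:00:00:01
```

An entry pinned this way has to be repeated on every client, or on the router, that needs to reach the affected port, so it is only practical on small test networks.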

I observed the same problem when running SLES 10 SP1 on this server (it also ships a somewhat outdated e1000 driver), and it was fixed by a simple driver update for this NIC. I even found a fix in the driver sources that probably addresses this problem. SLES 10 SP2 already comes with the updated driver and worked properly right out of the box here.

But much later I discovered that the true source of this problem wasn't the motherboard or the integrated NIC but an external dual-port NIC (Intel PRO 1000 MT Dual Server, 82546 if I'm not mistaken), which didn't work properly by itself on this server (that's another story) and in addition somehow caused ARP trouble on the integrated NIC. I simply removed this additional NIC, and everything has been working fine for about a month now.

I have 2 identical servers and both had the described issue. Of course, both are now "fixed" by removing the external NIC.

Some more information: I already met a similar case about a year ago with another (desktop) motherboard. The integrated Intel gigabit NIC was dropping ARP packets as well under FreeBSD 6.2, and that was also fixed by a driver update (I had to compile the new driver from source).

So I suppose that in my case with Supermicro there were two problems: an outdated e1000 driver in ESXi (and perhaps in some other systems) and an incompatible external NIC. Note: the integrated NIC worked properly under SLES 10 SP1 after the e1000 driver update even without removing the "incompatible" external NIC (though that external NIC didn't fully work anyway).

Hope this info will be useful, as I spent quite a lot of time trying to solve this puzzle :)

Alexander Fronkin

wrf3f34ff
Enthusiast

> Just wondering if this is your issue: http://kb.vmware.com/kb/1004650

No, it isn't. Thanks though.

wrf3f34ff
Enthusiast

> I have 2 identical servers and both had the described issue. Of course, both are now "fixed" by removing the external NIC.

Our servers do not have external NICs. There may be other issues as well, but the one described in this thread is solely a driver issue with ESXi. Thanks though.

mike_laspina
Champion

Hello,

I have not seen any issues with the ESXi drivers yet; hopefully this is not one here. This issue sounds much more like a network issue and not an ESXi driver issue.

The MAC address initially responds when the host connects to the physical switch, until the ARP cache TTL expires; after that, the host fails to see the ARP broadcast from the switch and does not reply to the query.

Are you sure the ARP was forwarded to all ports?

Have you exceeded the MAC table entry limit on the switches? (sometimes a limit is set!)

Are you using 802.1q?

Are there multiple switches involved?

Are you teaming the NICs?

Are there port errors on the physical switch?

Could be a bad NIC!

Latest BIOS applied to the server?

http://blog.laspina.ca/ vExpert 2009

wrf3f34ff
Enthusiast

> I have not seen any issues with the ESXi drivers yet; hopefully this is not one here.

It is.

> This issue sounds much more like a network issue and not an ESXi driver issue.

It is an ESXi driver issue, not a network issue. It is a known issue with the particular Intel chip used that requires a driver workaround that VMware has never implemented. If I cared enough, I would track down the requisite patch from the Linux or FreeBSD e1000 drivers and post it, but as has already been stated, we have given up on this issue and VI3.

> Are you sure the ARP was forwarded to all ports?

Yes.

> Have you exceeded the MAC table entry limit on the switches? (sometimes a limit is set!)

No.

> Are you using 802.1q?

No.

> Are there multiple switches involved?

No.

> Are you teaming the NICs?

No.

> Are there port errors on the physical switch?

No.

> Could be a bad NIC!

It is not a bad NIC!

As has already been stated more than once, we can load any other OS on the same box in the same LAN configuration and the problem does not occur, because all of those other OSes (Windows, Linux, FreeBSD) have the fix.

We have 4 of these servers and they all experience the issue.

> Latest BIOS applied to the server?

Yes.

This is not a mystery, or something that needs to be figured out. We have already expended the significant effort and research necessary to identify the problem and solution. The solution is for VMware to bring in the needed patch, and there's currently no evidence that they plan to do that, or that they even know or care about the problem.

mike_laspina
Champion

Is it possible that the IPMI abstraction is causing your ARP issue? This management feature has been known to cause other issues with ARP. You can disable the ARP broadcast on the IPMI service.

BTW, I'm just trying to feed you ideas; I can sense the frustration in your response.

http://blog.laspina.ca/ vExpert 2009