VMware Cloud Community
JasonGillis
Enthusiast

DVS port group creation errors in 5.1 U2

Hi all,

We've recently upgraded a few (4 of 12) of our ESXi hosts to 5.1 U2.  On a couple of those hosts, over the last couple of days, we've seen issues with port group creation on the dvSwitch backing our vCloud Director environment.  (Our VCD environment is used by support and services engineers to reproduce customer issues, so there are A LOT of isolated networks and port groups to go with them.  "A LOT", in this case, is close to 300.)

The error we see in the vSphere Client is:

Cannot create Distributed Port Group dvportgroup-31884 of VDS dvLabNetwork on the host <host>

vDS operation failed on host <host>, An error occurred during host configuration. got (vim.fault.PlatformConfigFault) exception

Digging into the logs on that host, I was able to see this:

Jan 22 19:49:23 host Hostd: [360C4B90 info 'Solo.Vmomi' opID=activity=urn:uuid:28d0ad86-c7c3-4e6c-a54f-861a73f84e97-d9-3e-10] Throw vim.fault.PlatformConfigFault
Jan 22 19:49:23 host Hostd: [360C4B90 info 'Solo.Vmomi' opID=activity=urn:uuid:28d0ad86-c7c3-4e6c-a54f-861a73f84e97-d9-3e-10] Result:
Jan 22 19:49:23 host Hostd: --> (vim.fault.PlatformConfigFault) {
Jan 22 19:49:23 host Hostd: -->    dynamicType = <unset>,
Jan 22 19:49:23 host Hostd: -->    faultCause = (vmodl.MethodFault) null,
Jan 22 19:49:23 host Hostd: -->    faultMessage = (vmodl.LocalizableMessage) [
Jan 22 19:49:23 host Hostd: -->       (vmodl.LocalizableMessage) {
Jan 22 19:49:23 host Hostd: -->          dynamicType = <unset>,
Jan 22 19:49:23 host Hostd: -->          key = "com.vmware.esx.hostctl.default",
Jan 22 19:49:23 host Hostd: -->          arg = (vmodl.KeyAnyValue) [
Jan 22 19:49:23 host Hostd: -->             (vmodl.KeyAnyValue) {
Jan 22 19:49:23 host Hostd: -->                dynamicType = <unset>,
Jan 22 19:49:23 host Hostd: -->                key = "reason",
Jan 22 19:49:23 host Hostd: -->                value = "Failed to get DVS state from vmkernel Status (bad0014)= Out of memory",
Jan 22 19:49:23 host Hostd: -->             }
Jan 22 19:49:23 host Hostd: -->          ],
Jan 22 19:49:23 host Hostd: -->          message = <unset>,
Jan 22 19:49:23 host Hostd: -->       }
Jan 22 19:49:23 host Hostd: -->    ],
Jan 22 19:49:23 host Hostd: -->    text = "",
Jan 22 19:49:23 host Hostd: -->    msg = ""
Jan 22 19:49:23 host Hostd: --> }

I've implemented the workaround found in this KB article:  http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=203407...

(And the issue is listed as a known issue in the release notes for at least a couple of revs of 5.1, as far as I can tell.)

We've seen this on two hosts, and since increasing Net.DVSLargeHeapMaxSize the problem hasn't reappeared on either of them.  So far, so good.
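For anyone else hitting this, the change itself is just the advanced option from the KB article. A rough sketch of what we ran on each affected host (64 MB is simply the value we picked, not necessarily the right one for your environment, and I believe a reboot is needed before the larger heap maximum actually takes effect):

# esxcli system settings advanced list -o /Net/DVSLargeHeapMaxSize

# esxcli system settings advanced set -o /Net/DVSLargeHeapMaxSize -i 64

The first command just shows the current value; the second sets the new maximum in MB.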

I'm curious, though, whether others have seen this issue since updating their ESXi hosts to 5.1 U2.  Our vCenter is still running 5.1 U1c, which is a slight mismatch with the hosts, so bringing it up to date is on my list for this weekend.  We've only seen this on hosts upgraded to 5.1 U2, and never saw it before the upgrade, but I'm not 100% sure whether it's due to the update or simply because we have so many port groups in the first place.

Any feedback would be appreciated!

Thanks,

Jason

11 Replies
joernc
Enthusiast

Yes, similar situation here: we upgraded the ESXi hosts to 5.1U2, but vCenter is still at U1. I don't expect this mismatch to be a problem, though. In our case, VMs lost their connection to the dvSwitch, and trying to reconnect them resulted in the same error message. I also found the same KB article and applied the fix.

The disconnects were usually triggered when vRanger removed its snapshot. After increasing DVSLargeHeapMaxSize last Friday, one machine still experienced the same problem during its backup over the weekend. At the next backup cycle 24 hours later, the interface got reconnected, which puzzles me a little bit...

The affected dvSwitch has 16 port groups and 113 VMs. I can't believe these are problematic figures; I consider this a small setup. The current configuration options for DVSLargeHeapMaxSize don't leave much room for growth if this is really the limiting factor. In a way, I hope this is a bug that will get fixed.

Does anybody know if this has changed in 5.5? That is, is there still a Net.DVSLargeHeapMaxSize parameter you can configure?

JasonGillis
Enthusiast

No resolution here.  I do have a support case open with VMware tech support, though. 

I was able to update our vCenter server to 5.1 U2 yesterday and am doing some testing today.

I have seen the dvPortCreation failed error on a non-5.1 U2 host, too, so it's not strictly isolated to that version.  It does appear to be much more pronounced on 5.1 U2, though.

Jason

joernc
Enthusiast

Of course I opened a support case, but it's dragging on and I don't expect any significant revelations from it. Please let me (us) know if you get a serious analysis of the problem.

joernc
Enthusiast

I received some more information from VMware support. The relevant error message is this:

2014-01-27T13:11:06.017Z cpu16:218950)Net: 2285: associated dvPort 1168 with portID 0x4000040
2014-01-27T13:11:06.017Z cpu16:218950)WARNING: Heap: 2677: Heap netGPHeap already at its maximum size. Cannot expand.
2014-01-27T13:11:06.017Z cpu16:218950)WARNING: Heap: 3058: Heap_Align(netGPHeap, 18496/18496 bytes, 64 align) failed.  caller: 0x4180119984d5
2014-01-27T13:11:06.017Z cpu16:218950)WARNING: E1000: 8817: failed to enable port 0x4000040 on DvsPortset-2: Out of memory
2014-01-27T13:11:06.017Z cpu16:218950)WARNING: Net: vm 218951: 4454: cannot enable port 0x4000040: Out of memory
2014-01-27T13:11:06.020Z cpu16:218950)WARNING: Uplink: 3076: releasing cap 0x0!
2014-01-27T13:11:06.020Z cpu16:218950)WARNING: Uplink: 3076: releasing cap 0x0!

The netGPHeap runs out of memory, and that is why the dvSwitch configuration fails. You can check the heap's usage on the command line with:

# vsish -e get /system/heaps/netGPHeap-0x4100013cc000/stats | grep "lowest percent free"

Three of my four hosts showed reasonable values after increasing Net.DVSLargeHeapMaxSize (around 95%), but one host still exhausted this heap completely (i.e. the result was 0%).
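(Note that the address in the heap node name, 0x4100013cc000 above, differs from host to host, so you have to look up the right node first. I did it with

# vsish -e ls /system/heaps/ | grep netGPHeap

and then substituted the resulting node name into the stats command above. Treat this as a sketch of how I checked it, not an official procedure.)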

I was told that this was a bug fixed in 5.1U1, but it seems it resurfaced in U2.

JasonGillis
Enthusiast

I was going to tell a very similar story today.

I had set the heap size (Net.DVSLargeHeapMaxSize) to 64 MB in the advanced settings, but I did see today that one host hit 0% at some point, so it's not completely cured.  Despite hitting 0% free, we have not seen a recurrence of the port group problem.

I'm not sure if it needs to be set higher, though.  Looking at a 5.5 system I have in another lab, its netGPHeap setting (called netGPHeapMaxMBPerGB) is set to 4, but that's the max MB of heap per GB of RAM in the host.  So, for our hosts, that means we'd have a max heap size of 1536 MB.  That's a pretty large difference from 64 MB.  :-)
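(To spell the math out: 4 MB of heap per GB of RAM on our 384 GB hosts works out to 4 x 384 = 1536 MB.)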

I'm definitely going to follow up with support to get more feedback.

Jason

joernc
Enthusiast

I had put the affected host in maintenance mode for the weekend, but the heap space on one of the remaining hosts got depleted and one VM again lost its connection. I am currently downgrading all hosts back to 5.1U1.

The interesting thing is that it's again a VM that was already hit before. It seems VMs using the E1000 NIC emulation are far more prone to running into trouble. In addition, this VM does a database backup, transferring large amounts of data via NFS. I think these are contributing factors that trigger what I assume is a memory leak much faster.

admin
Immortal

Hi Jason,

Unfortunately, this is a bug in 5.1U2 that makes itself apparent if you are using E1000 adapters. There's a memory leak that causes the heap to become exhausted.

More detail on this bug, along with two possible workarounds, is available at http://kb.vmware.com/kb/2072694
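As a general note (this is my summary, not a quote from the KB, so please follow the KB for the supported steps): since the leak is tied to the E1000 emulation, moving affected VMs to VMXNET3 avoids that code path entirely. With the VM powered off, the adapter type can be changed in the vSphere Client, or in the .vmx file along the lines of

ethernet0.virtualDev = "vmxnet3"

where ethernet0 is just an example device name, and the guest needs VMware Tools installed so the vmxnet3 driver is available.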

Laga18
Contributor

Can I have 1 Gb and 10 Gb uplinks in one distributed switch without any issues?

admin
Immortal

You can, but you would need to consider additional settings (NIOC, active/standby uplinks for specific port groups, etc.) to ensure that VMs get consistent performance.

PS: It's probably best to start a new thread for a new topic rather than tagging onto an existing thread. You are much more likely to get on-topic answers.

sbosse
Contributor

VMware just released a fix for this issue:  VMware KB: VMware ESXi 5.1, Patch Release ESXi510-201402001

joernc
Enthusiast

This fix solves only part of the problem. Although the memory leak is gone, I still see massive network problems. The guest OS is Solaris 10, and a backup job that takes 10 minutes on 5.1U1 takes several hours on 5.1U2 (with and without ESXi510-201402001). What does the fix actually do? Does it "solve" the problem by fixing the memory leak, or does it turn off TSO as described in KB2072694? At least

esxcli system settings advanced list -o /Net/E1000TxTsoZeroCopy

esxcli system settings advanced list -o /Net/E1000TxZeroCopy

shows no difference between 5.1U1 and 5.1U2+ESXi510-201402001.
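(For comparison, applying the TSO workaround by hand would, as far as I understand KB2072694, amount to something like

# esxcli system settings advanced set -o /Net/E1000TxTsoZeroCopy -i 0

# esxcli system settings advanced set -o /Net/E1000TxZeroCopy -i 0

but that is my reading of the KB, not a verified procedure, so please check the KB before changing these values.)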

BTW: We did try turning off Large Segment Offload in Solaris (see Disable Large Segment Offload (LSO) in Solaris 10) with 5.1U2 but without ESXi510-201402001, and the heap was depleted nonetheless.
