cesprov
Enthusiast

ESXi 6.0U1a build 3073146 vpxa crashing randomly, host disconnected from vCenter

After recently upgrading vCenter and ESXi to 6.0U1a and installing all patches (now build numbers 3018524 and 3073146 respectively), we began experiencing random host disconnects from vCenter.  The host itself and all guests on it are still alive when it's disconnected; I can SSH to the host and RDP/SSH to guests.  If I literally do nothing, it eventually fixes itself and rejoins vCenter within 15-20 minutes.  We did not have this issue prior to upgrading to 6.0U1a.  This is not the "NETDEV WATCHDOG: vmnic4: transmit timed out" issue.  In fact, the reason we upgraded to the latest build was to get the fix for that particular issue.

I've personally witnessed this happen on three different hosts now, and as far as we have noticed it has never recurred on the same host twice.  The vmkernel.log simply shows:

2015-11-18T20:56:42.662Z cpu12:173086)User: 3816: wantCoreDump:vpxa-worker signal:6 exitCode:0 coredump:enabled

2015-11-18T20:56:42.819Z cpu15:173086)UserDump: 1907: Dumping cartel 172357 (from world 173086) to file /var/core/vpxa-worker-zdump.000 ...

The vpxa.log doesn't show anything building up to the disconnection and leaves a large gap after the agent crashes, like so:

2015-11-18T20:56:42.638Z info vpxa[FFF2AB70] [Originator@6876 sub=vpxLro opID=QS-host-311567-2883ed8a-1e-SWI-42a5654a] [VpxLroList::ForgetTask] Unregistering vim.Task:sessio

2015-11-18T20:56:42.641Z verbose vpxa[FFF6CB70] [Originator@6876 sub=VpxaHalCnxHostagent opID=QS-host-311567-2883ed8a-1e] [VpxaHalCnxHostagent::DoCheckForUpdates] CheckForUp

2015-11-18T20:56:42.641Z verbose vpxa[FFF6CB70] [Originator@6876 sub=vpxaMoService opID=QS-host-311567-2883ed8a-1e] [VpxaMoService] GetChanges: 97820 -> 97820

2015-11-18T20:56:42.641Z verbose vpxa[FFF6CB70] [Originator@6876 sub=VpxProfiler opID=QS-host-311567-2883ed8a-1e] [2+] VpxaStatsMetadata::PrepareStatsChanges

2015-11-18T21:10:20.328Z Section for VMware ESX, pid=3326854, version=6.0.0, build=3073146, option=Release

2015-11-18T21:10:20.329Z verbose vpxa[FF8A6A60] [Originator@6876 sub=Default] Dumping early logs:

2015-11-18T21:10:20.329Z info vpxa[FF8A6A60] [Originator@6876 sub=Default] Logging uses fast path: false

vCenter logs simply show the host becoming unreachable, so the problem is obviously host-side.

Anyone else seeing similar activity?  This has all the feel of another "known issue" but I don't see any talk about it.  I did open a case with VMware support and am awaiting contact now.

37 Replies
Aviator20111014
Enthusiast

Hello.

A few weeks ago, I ran into the "netdev_watchdog" problem and downgraded my hosts back to ESXi 5.5.

(See the thread "ESXi 6.0 causing Dell C6100 XS23-TY3 to hang randomly".)

On Friday I gave ESXi 6 a second try and upgraded my hosts from 5.5 to 6.0U1.

Yesterday I had a similar issue, which also does not seem to be the "netdev_watchdog" problem (that one should be fixed in 6.0U1).

But in my case the host lost its network connections, which is really bad since we use iSCSI storage.

The host was disconnected from vCenter and was also not accessible via ESXCLI, DCUI, or SSH.

I had to power off the hardware to trigger VMware HA to restart the hosted VMs.

Aviator20111014
Enthusiast

Status-Update:

Today one more host crashed. I opened a VMware support request ...

sappomannoz
Hot Shot

The same here :smileyangry: build 3247720

UCS blades here, you?

Bleeder
Hot Shot

Same problems here, same ESXi build.  It's actually worse in that I cannot even SSH to the host because we have SSH turned off by policy, and there is no way to enable SSH when the host is in this state.

Edit: It seems we are not alone.  See the Reddit thread "ESXi 6.0U1a fully patched...vpxa crashing randomly, host disconnected from vCenter".

sappomannoz
Hot Shot

Hello Bleeder,

have you opened an SR? Mine is 15815988811. They told me it could be hardware related, since I'm running an older version of UCS Manager, but I doubt it. Which servers are you using?

If you can't leave SSH enabled, at least enable the ESXi shell once you regain control of the hosts; restarting the agents from the command line is effective. I'm running a newer build, 3247720, because of a CBT issue, but I had the same problem with 3073146 as well. And by the way, we are also using NSX here.
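In case it helps, this is roughly what that looks like from the ESXi shell (just a minimal sketch; restarting vpxa alone may be enough, while services.sh restarts all of the management agents and briefly interrupts the host's management connectivity):

/etc/init.d/vpxa restart   # restart only the vCenter agent (vpxa)

services.sh restart        # or restart all management agents (hostd, vpxa, ...)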

sappomannoz
Hot Shot

Any news?

Aviator20111014
Enthusiast

Hi *,

VMware Support recommended updating my ixgbe driver to the latest release (ixgbe 4.1.1.1). No trouble so far, until yesterday: I rebooted one of our redundant storage nodes (HP P4500 G2) and all ESXi hosts stopped working.  My virtual datacenter was dead and I had to power cycle all ESXi hosts.

I have had no such problems with vSphere 5.5. I guess there's still some kind of networking bug in ESXi 6.0U1.

My Hosts: Dell R610 + 10GbE Intel 82599 + ESXi 6.0U1, build 3247720

sappomannoz
Hot Shot

Hello Aviator,

it seems you have a different issue. We don't lose network connectivity, just the management agents. In fact, I'm able to SSH into the host and restart them!

Kind regards

Cristiano

cesprov
Enthusiast

I had a ticket open with VMware support on this that went nowhere.  They showed me that the zdump generated by the vpxa crash indicated an out-of-memory condition, but they couldn't explain why.  I hadn't personally witnessed the problem in several weeks, so I ended up closing my ticket in the hope that it had gone away or been fixed on the sly in that CBT patch (current build 3247720).  But I just experienced another one of these last night on yet another host.  I've now seen the issue four times for sure, on a different host each time; never the same host twice thus far.  This time my syslogs picked up this:

<4>2015-12-30T03:00:07Z host.domain.com Unknown: out of memory [35136]

<182>2015-12-30T03:00:07.220Z host.domain.com vmkernel: cpu28:35977)User: 3816: wantCoreDump:vpxa-worker signal:6 exitCode:0 coredump:enabled

<182>2015-12-30T03:00:07.576Z host.domain.com vmkernel: cpu28:35977)UserDump: 1907: Dumping cartel 35136 (from world 35977) to file /var/core/vpxa-worker-zdump.000 ...


The "out of memory" issue seems to line up with what they showed me previously but still no reason as to why that I can find.


It's highly likely this issue has occurred more often than the four times I have personally witnessed, since it fixes itself within 20 minutes or so.  You either have to happen to be watching when the host disconnects due to the vpxa crash, or something has to be running at the time that trips over the disconnected host.  In my most recent case, vRanger reported that it couldn't back up VMs in the middle of the night because this host was disconnected.
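If you want to check whether a host has already hit this without anyone noticing, the leftovers are easy to find over SSH.  A rough sketch, based on the paths from the log excerpts above (the vmkernel/syslog locations are the ESXi defaults and may differ if you redirect logging):

ls -lh /var/core/vpxa-worker-zdump.*                   # core dumps left behind by previous vpxa crashes

grep "wantCoreDump:vpxa-worker" /var/log/vmkernel.log  # the crash signature in the vmkernel log

grep -i "out of memory" /var/log/syslog.log            # matching out-of-memory entries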


We're actually investigating migrating to a competitor's product, as VMware is going downhill in a hurry.  The myriad problems with 6.0 have been a complete disaster for us.  VMware support doesn't seem to care at all anymore; they just try to get you off the phone as quickly as possible, and it's obvious they don't know the product very well, at least compared to my experience level.  The VMware KB is always down (it's down again as I write this).  With all the turmoil surrounding VMware and the whole Dell/EMC thing, I'm not sure I want to continue investing in a product from a dying company.

sappomannoz
Hot Shot

Mate, I feel your pain

sappomannoz
Hot Shot

Opened a new SR, 16853210901, since the previous one was closed without a solution.

Gaurav_Baghla
VMware Employee

I cannot see any logs attached to this case yet. I will update the case and request that the logs be attached.

Regards, Gaurav Baghla. Opinions are my own and not the views of my employer. https://twitter.com/garry_14
sappomannoz
Hot Shot

Hello Gaurav,

just the logs of the last crash, which happened on 5/1/2016. I hope it's not too late.

You can also look at the logs from SR 15815988811.

cesprov
Enthusiast

My case was SR# 15809251611.  It has since been closed, as I hadn't "seen" the issue directly for a couple of weeks around the time it was closed and was getting pressure to close it since it was "no longer happening."  Since then I have been doing my own troubleshooting, trying to pinpoint the issue before re-opening a case.  Two more hosts' vpxa agents have definitely crashed with the same issue since Friday 1/8, and two other hosts may have hit the same or a similar issue, as their vpxa agents are using far less memory now (31MB, 51MB) than they were last week.  I have been watching the memory usage of the vpxa agent (go into esxtop and press m for memory) for the last couple of weeks to see if I can at least predict when vpxa is going to crash.  If you reboot a host or simply restart the management agents (which includes vpxa), you'll see that the vpxa process in esxtop starts out really low on memory use, around 28 to 30MB.  As days go by, the memory use for vpxa steadily climbs.  The highest I have witnessed it get is 309MB.  Somewhere after 309MB, it crashes with the following errors in the vpxa.log:


2016-01-09T19:33:50.319Z error vpxa[2DCA4B70] [Originator@6876 sub=Default] Unable to allocate memory

2016-01-09T19:33:50.319Z panic vpxa[2DCA4B70] [Originator@6876 sub=Default]

-->

--> Panic: Unable to allocate memory

--> Backtrace:

-->

--> [backtrace begin] product: VMware ESX, version: 6.0.0, build: build-3247720, tag: vpxa

--> backtrace[00] libvmacore.so[0x00311403]: Vmacore::System::Stacktrace::CaptureFullWork(unsigned int)

--> backtrace[01] libvmacore.so[0x00145049]: Vmacore::System::SystemFactoryImpl::CreateBacktrace(Vmacore::Ref<Vmacore::System::Backtrace>&)

--> backtrace[02] libvmacore.so[0x0030CE30]

--> backtrace[03] libvmacore.so[0x0030CF06]: Vmacore::PanicExit(char const*)

--> backtrace[04] libvmacore.so[0x0010B6D2]

--> backtrace[05] libstdc++.so.6[0x000A6C82]: operator new(unsigned int)

--> backtrace[06] libstdc++.so.6[0x0008D5A5]: std::string::_Rep::_S_create(unsigned int, unsigned int, std::allocator<char> const&)

--> backtrace[07] libstdc++.so.6[0x0008E868]: std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned int)

--> backtrace[08] libstdc++.so.6[0x0008E99A]: std::string::reserve(unsigned int)

--> backtrace[09] libstdc++.so.6[0x0008ED0F]: std::string::append(char const*, unsigned int)

--> backtrace[10] libvmomi.so[0x001BFB1D]

--> backtrace[11] libvmomi.so[0x001C06B8]

--> backtrace[12] libvmacore.so[0x001C9E21]

--> backtrace[13] libvmacore.so[0x001CA3D8]

--> backtrace[14] libvmacore.so[0x00179E5D]

--> backtrace[15] libvmacore.so[0x0017A007]

--> backtrace[16] libvmacore.so[0x0017C768]

--> backtrace[17] libvmacore.so[0x0013502C]

--> backtrace[18] libvmacore.so[0x0028F016]

--> backtrace[19] libvmacore.so[0x002915FC]

--> backtrace[20] libvmacore.so[0x0025A3CC]

--> backtrace[21] libvmacore.so[0x00252070]

--> backtrace[22] libvmacore.so[0x0025627A]

--> backtrace[23] libvmacore.so[0x0025649C]

--> backtrace[24] libvmacore.so[0x0025F882]

--> backtrace[25] libvmacore.so[0x002B2CAD]

--> backtrace[26] libvmacore.so[0x00252D68]

--> backtrace[27] libvmacore.so[0x00257603]

--> backtrace[28] libvmacore.so[0x0031B2CC]

--> backtrace[29] libpthread.so.0[0x00006D6A]

--> backtrace[30] libc.so.6[0x000D5D9E]

--> [backtrace end]

-->

and then you can see the vpxa agent start back up ~20 minutes later:

2016-01-09T19:53:16.646Z Section for VMware ESX, pid=3581936, version=6.0.0, build=3247720, option=Release

2016-01-09T19:53:16.646Z verbose vpxa[FF8DDA60] [Originator@6876 sub=Default] Dumping early logs:

Nothing unusual appears in the vpxa or vmkernel logs prior to the crash.  It just seems like the normal workload of vpxa has some sort of memory leak: usage continues to grow until it hits some defined memory limit, and then vpxa crashes when it can't allocate any more.

Currently I have three other hosts whose vpxa memory usage is over 200MB.  Based on the rate of memory leakage from vpxa, I would expect these three to crash either this upcoming weekend or early next week.

I saw that 6.0U1b was released on 1/7, but nothing in the release notes indicates any fix targeted at this issue, unless it was fixed on the sly.  Quite frankly, I'm afraid to install 6.0U1b at this point, because every 6.0 release seems to break something new...one step forward, two steps back.  I'd rather deal with my known unknowns at this point than introduce new unknown unknowns.

sappomannoz
Hot Shot

Hi CesProv,

thank you for taking the time to share this; I'm not sure we're seeing the same issue. When my management agents are dead, they stay dead. The only way to bring them back to life is to restart them.

Anyway, I'm starting to monitor the memory usage; let's see if I also have a leak.

Bleeder
Hot Shot

Updating to 6.0 U1b has not fixed the issue for me.

sappomannoz
Hot Shot

6.0U1b is not fixing the issue for me either.

aleksvasiukov
Contributor

Hi CesProv,

I have exactly the same issue. When the vpxa process's memory usage grows too high, the host gets disconnected with "Unable to allocate memory" in the vpxa log. All VMs on it are still alive. I can restart the management agents on the host and rejoin it to vCenter, or just wait and the host fixes itself after 20-30 minutes. VMware support suggested changing the max memAllocationInMB value for vpxa to 400MB and setting the minLimit value to unlimited. You can do it via SSH with these commands:

grpID=$(vsish -e set /sched/groupPathNameToID host vim vmvisor vpxa | cut -d' ' -f 1) #Obtaining groupID for VPXA.

vsish -e get /sched/groups/$grpID/memAllocationInMB #$grpID is output from previous command

The output looks like this:

memsched-allocation {

   min:201

   max:201

   shares:0

   minLimit:-1

   units:units: 3 -> mb

}

To change the max and minLimit values:

vsish -e set /sched/groups/$grpID/memAllocationInMB max=400 minLimit=unlimited

These changes do not persist after the host is rebooted.
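One way to re-apply them automatically at boot (just an idea on my side, not something support told me, and I have not tested it) would be to add the same commands to /etc/rc.local.d/local.sh, before the final "exit 0":

# re-apply the vpxa memory workaround at boot (untested sketch, same values as above)

grpID=$(vsish -e set /sched/groupPathNameToID host vim vmvisor vpxa | cut -d' ' -f 1)

vsish -e set /sched/groups/$grpID/memAllocationInMB max=400 minLimit=unlimited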

The problem is that the vpxa process can easily grow to 400MB and then crash anyway.

I have not been able to identify an optimal threshold yet.

Do you have any news? Maybe ESXi 6.0U2 has resolved the problem (I didn't find any information about that in the release notes)?

cesprov
Enthusiast

Hi aleksvasiukov.  I have no doubt your issue is the same as mine.  It's definitely some sort of bug that VMware won't admit to, although it sounds like they somewhat acknowledged it on your support call.  The problem is still there; I just saw it over this past weekend when my VM backups were running at night.  I am still on 6.0U1a, as others reported that this issue was not addressed in 6.0U1b, so I saw no reason to upgrade.  I don't see anything in the 6.0U2 release notes related to this, so either it was addressed silently or not at all; my money's on the latter.  I am planning on upgrading to 6.0U2 in a couple of weeks.

Running those commands on my hosts yields:

memsched-allocation {

   min:256

   max:256

   shares:0

   minLimit:-1

   units:units: 3 -> mb

}

But those are the defaults, as I haven't changed these values.  The highest I have ever seen vpxa memory usage get under this configuration is 309MB, and it crashes shortly after.  If it's a memory leak, I'm not sure raising those limits will help; I would assume that if you change the limit to 512MB, it would just grow and crash at around ~550MB, which is what it sounds like you found.

I'll follow up here once I upgrade to 6.0U2.  Just waiting to see if there are any major issues with it before I do.
