VMware Cloud Community
cesprov
Enthusiast

ESXi 6.0U1a build 3073146 vpxa crashing randomly, host disconnected from vCenter

After recently upgrading vCenter and ESXi to 6.0U1a and installing all patches (now build numbers 3018524 and 3073146 respectively), we began experiencing random host disconnects from vCenter.  The host itself and all guests on it are still alive when it's disconnected; I can SSH to the host and RDP/SSH to guests.  If I literally do nothing, it eventually fixes itself and rejoins vCenter within 15-20 minutes.  We did not have this issue prior to upgrading to 6.0U1a.  This is not the "NETDEV WATCHDOG: vmnic4: transmit timed out" issue.  In fact, the reason we upgraded to the latest build was to get the fix for that particular issue.

I've personally witnessed this happen now on three different hosts and never has it reoccurred on the same host twice that we have noticed.  The vmkernel.log simply shows:

2015-11-18T20:56:42.662Z cpu12:173086)User: 3816: wantCoreDump:vpxa-worker signal:6 exitCode:0 coredump:enabled

2015-11-18T20:56:42.819Z cpu15:173086)UserDump: 1907: Dumping cartel 172357 (from world 173086) to file /var/core/vpxa-worker-zdump.000 ...
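For anyone wanting to check a host for the same symptom, a couple of commands along these lines should show the crash signature and any core files left behind (this assumes the default log and core-dump locations on ESXi 6.x):

# Look for the vpxa-worker crash signature in the vmkernel log
grep -i "wantCoreDump:vpxa-worker" /var/log/vmkernel.log

# List any core dumps the crashes have left behind
ls -lh /var/core/vpxa-*zdump* 2>/dev/null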

The vpxa.log doesn't show anything building up to the disconnection and leaves a large gap after the agent crashes, like so:

2015-11-18T20:56:42.638Z info vpxa[FFF2AB70] [Originator@6876 sub=vpxLro opID=QS-host-311567-2883ed8a-1e-SWI-42a5654a] [VpxLroList::ForgetTask] Unregistering vim.Task:sessio

2015-11-18T20:56:42.641Z verbose vpxa[FFF6CB70] [Originator@6876 sub=VpxaHalCnxHostagent opID=QS-host-311567-2883ed8a-1e] [VpxaHalCnxHostagent::DoCheckForUpdates] CheckForUp

2015-11-18T20:56:42.641Z verbose vpxa[FFF6CB70] [Originator@6876 sub=vpxaMoService opID=QS-host-311567-2883ed8a-1e] [VpxaMoService] GetChanges: 97820 -> 97820

2015-11-18T20:56:42.641Z verbose vpxa[FFF6CB70] [Originator@6876 sub=VpxProfiler opID=QS-host-311567-2883ed8a-1e] [2+] VpxaStatsMetadata::PrepareStatsChanges

2015-11-18T21:10:20.328Z Section for VMware ESX, pid=3326854, version=6.0.0, build=3073146, option=Release

2015-11-18T21:10:20.329Z verbose vpxa[FF8A6A60] [Originator@6876 sub=Default] Dumping early logs:

2015-11-18T21:10:20.329Z info vpxa[FF8A6A60] [Originator@6876 sub=Default] Logging uses fast path: false

vCenter logs simply show the host becoming unreachable, so the problem is obviously host-side.

Anyone else seeing similar activity?  This has all the feel of another "known issue" but I don't see any talk about it.  I did open a case with VMware support and am awaiting contact now.

37 Replies
ivanerben
Enthusiast

Hi,

I'm experiencing vpxa crashing due to out-of-memory errors on 5.5 (ESXi build 3116895, vCenter build 3142196). Hosts go offline in vCenter and after a while they reconnect. This probably started with the update to this version in the autumn.

vpxa process:

2015-11-20T08:00:31Z Unknown: out of memory [34516]

2015-11-20T08:00:31.935Z cpu0:54442)UserDump: 1820: Dumping cartel 34516 (from world 54442) to file /var/core/vpxa-worker-zdump.000


I had opened a case with VMware support twice and they advised increasing ThreadStackSizeKb in vpxa.cfg. It was OK for a few days (or maybe that was just the restart of vpxa), and the period between crashes seems to be longer. Anyway, I have opened the case once more, and the answer from support was that it is due to a large number of snapshots - we had a few VMs with more than 32 (the officially supported maximum). We deleted more than 32 snapshots and... still crashing. :)
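In case anyone else wants to sanity-check the snapshot theory from the ESXi shell, a rough sketch like this lists the VM directories with the most snapshot delta disks (it assumes VMFS datastores and the default *-delta.vmdk naming; sesparse-backed snapshots would need a second pattern):

# Count snapshot delta disks per VM directory (busybox tools only)
# note: directories with spaces in their names will not be counted correctly
find /vmfs/volumes/ -name '*-delta.vmdk' 2>/dev/null | xargs -n1 dirname | sort | uniq -c | sort -rn | head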

We are not on the latest 5.5, but I'm unsure if an upgrade will help. Because of the SSLv3 patches I have to upgrade vCenter first, so I can't patch the hosts to try -- anyway, vpxa is installed and upgraded from/with vCenter, right?

cesprov
Enthusiast

Your "Unknown: out of memory" certainly sounds like the same issue I started seeing on 6.0U1a, just haven't seen it on any other version thus far.  Checking your build number, you're on ESXi 5.5 Update 3a  (Express Patch 😎 which was released the same day as 6.0U1a (10-6-15) .  So it's possible a change was made in 6.0U1a to start causing this issue and was then rolled back to 5.5U3a also.  I unfortunately don't have any clients on 5.5U3 or higher yet to confirm the issue is happening there also.  But seems a little more than coincidental.

The vpxa agent gets installed/enabled when you join a host to vCenter.  The vpxa agent on the hosts gets upgraded when you upgrade vCenter.
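To quickly confirm the agent's state on a host from the shell, these standard ESXi commands show whether vpxa is running; no extra tools needed:

# Check whether the vpxa agent is running on the host
/etc/init.d/vpxa status

# The vpxa-worker processes that show up in the crash logs
ps | grep vpxa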

JPSVM
Contributor

Hi guys,

This is a known issue (vpxa crash, out of memory error) from 5.5 U2 32485427 onwards, and VMware is releasing a patch to fix it as a high priority. The only temporary solution is changing the value as below, but the value is lost after a reboot (see the note after the steps for one way to re-apply it at boot).

1. Connect to each of the hosts mentioned above through SSH.

2. Run the following commands to change the default value for vpxa:

a) Run the following command to store the group ID of the vpxa process in a variable:

grpID=$(vsish -e set /sched/groupPathNameToID host vim vmvisor vpxa | cut -d' ' -f 1)

b) Run the following to increase the max memory allocation of the vpxa process to 400 MB (the default is 304 MB):

vsish -e set /sched/groups/$grpID/memAllocationInMB max=400 minLimit=unlimited

3. Confirm that the max memory allocation of the vpxa process has been changed

vsish -e get /sched/groups/$grpID/memAllocationInMB

The output should be similar to the following:

sched-allocation {
   min:0
   max:400
   shares:0
   minLimit:-1
   units:units: 3 -> mb
}
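If you want the limit to survive a reboot, one option (untested on my side, and assuming the vpxa resource group already exists by the time the local startup script runs) is to re-apply the two commands from /etc/rc.local.d/local.sh, which is preserved across reboots:

# Add these lines to /etc/rc.local.d/local.sh before the final 'exit 0'
# so the limit is re-applied at every boot; adjust max=400 to whatever limit you settled on.
grpID=$(vsish -e set /sched/groupPathNameToID host vim vmvisor vpxa | cut -d' ' -f 1)
vsish -e set /sched/groups/$grpID/memAllocationInMB max=400 minLimit=unlimited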

Hope this helps.

ivanerben
Enthusiast

Hi,

I had no confirmation from support that this is a known issue in either of my two opened cases. :(  After reading this discussion I changed the vpxa memory limit using the vSphere Client; I think that is the same as your solution using vsish, right?

(screenshot attached: vpxa-resources.png)

I doubled the original value (it seems to be computed from the host memory size). There was one crash on one host (of 5) after a week. Going to try it on another cluster...

JPSVM
Contributor

Hi ivanerben,

Have you had any crashes even after the change? If so, please let me know, as I still have the case open with them.

This is a known issue internally within VMware support. I don't think they have acknowledged it to everyone.

ivanerben
Enthusiast

Hi, yes, we had one crash on a host with the modified memory settings:

2016-03-30T01:40:54Z    XXX Unknown: out of memory [15946563]

2016-03-30T01:40:54.042Z    XXX vmkernel: cpu34:13686138)User: 2888: wantCoreDump : vpxa-worker -enabled : 1

2016-03-30T01:50:46.047Z     XXX Hostd: [49701B70 info 'Vimsvc.ha-eventmgr'] Event 502173 : /usr/lib/vmware/vpxa/bin/vpxa crashed (11 time(s) so far) and a core file might have been created at /var/core/vpxa-worker-zdump.000. This might have caused connections to the host to be dropped.

2016-03-30T01:50:46Z    XXX watchdog-vpxa: '/usr/lib/vmware/vpxa/bin/vpxa ++min=0,swapscope=system,group=host/vim/vmvisor/vpxa -D /etc/vmware/vpxa' exited after 972216 seconds 134
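For anyone trying to track how often this happens across hosts, the crash counter from that Hostd event can also be pulled straight from the host logs, roughly like this (default log location on ESXi; adjust if your logs are redirected):

# Count the vpxa crash events hostd has recorded since the log last rolled over
grep -c "vpxa crashed" /var/log/hostd.log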
JPSVM
Contributor

Alright, I just noticed that even after running the commands the values don't change; you have to restart the vpxa service for them to take effect (commands below).
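In case it saves anyone a lookup, restarting the agent and re-checking the limit is roughly this (it restarts only the vpxa management agent, not the host or the VMs):

# Restart the vpxa management agent so the new limit takes effect (VMs keep running)
/etc/init.d/vpxa restart

# Re-read the limit to confirm it stuck (re-run the grpID lookup from the earlier steps if needed)
vsish -e get /sched/groups/$grpID/memAllocationInMB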

Had the value definitely changed on the crashed host?

Changing it on a couple of hosts today with build 3248547; will keep you posted.

If it doesn't fix the issue, the only hope is to wait for VMware to release patches. This is their 2nd top priority, according to them.

ivanerben
Enthusiast

So... one crash is OK = just a restart. :)

ivanerben
Enthusiast

Hi, it seems that modifying the settings using the vSphere Client is persistent and stays after a host reboot.

cesprov
Enthusiast

But is it "fixed" with the increased settings or does it just take longer to crash now?

ivanerben
Enthusiast

We have to wait longer for confirmation, but I have had only one crash per ESXi host since April 1st, which is promising.

ivanerben
Enthusiast

Naah, just a longer period between crashes with the modified settings. Still crashing with 'Unknown: out of memory'.

Bleeder
Hot Shot

They finally made a public KB article for the problem, but still no fix:

https://kb.vmware.com/kb/2144799

cesprov
Enthusiast

It appears the article posted by Bleeder above, KB2144799, now indicates this was fixed in ESXi600-201608401-BG and ESXi550-201608401-BG, released 8/4/16. I haven't installed any of the patches released on 8/4 yet, so I can't confirm. While the issue did seem to be less frequent after dropping our stats collection levels, as someone above indicated, it didn't completely fix the issue; I just witnessed this problem again the other day. But I also never increased the memory above the default as the previous workarounds mentioned.
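For anyone checking whether a host already has that August patch level, the build and the esx-base VIB version can be read from the shell with standard ESXi commands:

# Report the ESXi version and build number of the running host
vmware -vl

# Show the installed esx-base VIB, whose version reflects the patch level
esxcli software vib list | grep -i esx-base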

nareshunik008
Contributor

I upgraded my ESXi version to 6.0U2 with build 4192238, and my ESXi server disconnects from vCenter. A few VMs also get disconnected from vCenter, and I had to migrate them to another ESXi host to resolve the issue. When I opened a case with VMware they said to upgrade the NIC firmware and driver to resolve this issue, but even after the upgrade the issue still exists.

Even after trying the KB below, the issue still exists.

KB 2144799

Has anyone got a solution? I feel like stopping upgrades to ESXi 6; on ESXi 5.5 U3 I did not have a single issue like this.

cesprov
Enthusiast

What NIC are you using, and what driver/firmware? Run ethtool -i vmnicX (with your NIC's name) and you should see something like:

driver: i40e

version: 1.4.26

firmware-version: 5.02 0x8000222e 17.5.10

bus-info: 0000:01:00.0
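If ethtool isn't available on your build, esxcli should give the same information; vmnic0 below is just an example, substitute your own uplink:

# List all physical NICs with their drivers
esxcli network nic list

# Driver, firmware and other details for one NIC
esxcli network nic get -n vmnic0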

nareshunik008
Contributor

Please find the details below:

Driver Info:

         Bus Info: 0000:0c:00:0

         Driver: elxnet

         Firmware Version: 4.2.433.604

         Version: 10.2.445.0

Driver Info:

         Bus Info: 0000:0c:00:0

         Driver: elxnet

         Firmware Version: 10.6.144.2702

         Version: 10.6.144.2712
