VMware Cloud Community
cyberjock1980
Contributor

Unknown: Out of Memory [5880] in syslog

Hello all.  First post.  Been playing with ESXi for a few months and I'm having an unusual issue.  Here's my hardware and VM setup:

Hardware:  Supermicro X9SCM-F-O with 32GB of ECC RAM and an E3-1230v2 CPU

Also included is an M1015 reflashed to IT mode and passed through to the FreeNAS VM with VT-d.

VMs:

FreeNAS 8.3.1 64-bit with 20GB of RAM assigned (3vCPU)

Linux Mint 13 LTS 4GB RAM (3vCPU)

Cyberpower UPS Appliance VM 2GB RAM (1vCPU)

A week or two after building the machine, I lost network connectivity to the entire box (this was when I first realized something might be wrong).  I couldn't use the vSphere client or access the guest machines.  I attempted to "Restart Management Network" locally, but that didn't solve anything; still no connectivity.  I then attempted to shut down the box, but it wouldn't shut down.  Ultimately I power cycled it.

So over the last 2 months I've lost network connectivity at random times.  Thanks to reading various sources, I've found an error in my syslog that ends up being logged 5-20+ times in a row, at random intervals of 2-10 minutes.  The error says:

Unknown: out of memory [5880]

I thought it was due to my VMs using a lot of network connections, but I could never find anyone else with the error.  It's very non-specific and I'm running out of leads to check, hence I've turned to the community.

Well, today I decided I want to figure this puzzle out, so I shut down all of the VMs, and the error still continues to show up in the syslog.  So clearly I have a configuration problem somewhere.  This is odd to me because I've tried to avoid changing settings that I wasn't 100% sure were appropriate for me; I'm a newbie to ESXi and I'd rather not break my own installation.  The only thing I've done that might be construed as an 'advanced' feature modification is enabling RDM for the Linux Mint installation.  It was already on a hard drive, and I thought I'd just do RDM passthrough and use it that way if it worked.  Well, it has, so I left it like that.

I can't find much info on what 5880 might represent, so I assumed it's a PID or something.  It has changed from one bootup to the next.  Here's the output from the CLI:

~ # ps -P | grep 5880
5880 5880 sfcb-vmware_raw      5606 /sbin/sfcbd
5881 5880 sfcb-vmware_raw      5606 /sbin/sfcbd
5882 5880 sfcb-vmware_raw      5606 /sbin/sfcbd
5883 5880 sfcb-vmware_raw      5606 /sbin/sfcbd
5884 5880 sfcb-vmware_raw      5606 /sbin/sfcbd
5885 5880 sfcb-vmware_raw      5606 /sbin/sfcbd
5887 5880 sfcb-vmware_raw      5606 /sbin/sfcbd
5888 5880 sfcb-vmware_raw      5606 /sbin/sfcbd
~ #

Again, I tried searching for sfcbd and it really doesn't give me any clues.
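For anyone else chasing a similar log line, one way to confirm that the bracketed number is a PID (or process group) is to match it against the first two columns of the `ps -P` listing. A minimal sketch, using a few rows captured above as stand-in data (on the host you would pipe live `ps -P` output into awk instead):

```shell
#!/bin/sh
# Sketch: confirm what the bracketed number in "Unknown: out of memory [5880]"
# refers to. A captured `ps -P` excerpt stands in for live output here.
PID=5880
ps_listing='5880 5880 sfcb-vmware_raw      5606 /sbin/sfcbd
5881 5880 sfcb-vmware_raw      5606 /sbin/sfcbd
5882 5880 sfcb-vmware_raw      5606 /sbin/sfcbd'
# Match the PID in either the first (PID) or second (parent/group) column.
echo "$ps_listing" | awk -v pid="$PID" '$1 == pid || $2 == pid'
```

In this case every matching row belongs to sfcb-vmware_raw, which points at the sfcbd CIM service rather than a VM.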

I actually built 2 identical systems: one for me and one for a friend.  He uses more RAM for his VMs (and different VMs), but he doesn't have this issue.  I'm figuring I've done something to my installation to deserve this since his is flawless, but I am out of leads.

tl;dr:  I get the "Unknown: out of memory" error along with a loss of all network connectivity at random, after anywhere from 1 day to 2 weeks; sometimes it happens while the server is idle (such as at night).  I don't know if the two are related, but I assume they are.

Anyone have ideas to try?  This is baffling me and I'm really at a loss to explain the problem let alone how to fix it.

Thanks!

15 Replies
cyberjock1980
Contributor

Well, I did a reboot this evening to facilitate adding a hard drive to the system, and within 15 minutes the network connections died.  The server wasn't even busy; I was streaming a movie and that was all.  In the syslog I have these entries:

Unknown: out of memory [7087]
Unknown: out of memory [7087]
Unknown: out of memory [7087]
Unknown: out of memory [7087]
Unknown: out of memory [7087]
Unknown: out of memory [7087]
Unknown: out of memory [7087]
Unknown: out of memory [7087]
crond[4529]: crond: USER root pid 20439 cmd /sbin/hostd-probe
syslog[20440] starting hostd probing.
syslog[20440]: hostd probing is done.
sfcb-vmware_base[5821]: TIMEOUT DOING SHARED SOCKET RECV RESULT (5821)
sfcb-vmware_base[5821]: Timeout (or other socket error) waiting for response from provider
sfcb-vmware_base[5821]: Request Header Id (1927) != Response Header reqId (0) in request to provider 94 in process 4. Drop response.
Unknown: out of memory [7087]
Unknown: out of memory [7087]
Unknown: out of memory [7087]
Unknown: out of memory [7087]
Unknown: out of memory [7087]
Unknown: out of memory [7087]
Unknown: out of memory [7087]
Unknown: out of memory [7087]

NIC lights are normal (except no blinking, since there's no traffic).

I tried a services.sh restart from the console, but it didn't fix the problem.  I got a lot of errors while stopping and restarting ntpd and starting ssh: "Connect to localhost failed: Connection failure".

One thing I did find from esxcli network nic list is that the Intel e1000 driver is loaded for my Intel 82574L network card, but the link status is down with a speed of 0.  I tried unplugging the cable and plugging it back in, but the link didn't come back up.  I tried esxcli network nic down -n vmnic1 (and up), but that didn't bring it back either.

admin
Immortal

What does the following command return as output?

vdf -h

Can you initiate a "/etc/init.d/sfcb-watchdog stop" after a reboot of the server and see if the issue still occurs?

Some explanation on the sfcb service: this is the service that provides the hardware status tab in vCenter, or on the configuration page when connected directly to the host.  It is an extensible system, which means third-party providers can use the base CIM system to add sensors to watch.

From time to time such a CIM provider can have a memory leak (not only third-party providers are at fault here; the basic ones installed in the vanilla image can cause this too).

Do you have any dumps in /var/core ?
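As a side note, when the vdf -h output comes back, the Ramdisk section is the part to watch for exhaustion. A sketch of scanning it for a nearly-full ramdisk (the sample rows stand in for live `vdf -h` output, and the 90% threshold is an illustrative choice, not an official limit):

```shell
#!/bin/sh
# Sketch: flag any ramdisk in the `vdf -h` Ramdisk table whose Use% is high.
# The excerpt below is illustrative sample output; on the host you would
# feed the real Ramdisk rows from `vdf -h` into the same awk filter.
vdf_excerpt='root                       32M      432K       31M   1%
etc                        28M      184K       27M   0%
tmp                       192M        4K      191M   0%'
echo "$vdf_excerpt" | awk '{ use = $5; sub(/%/, "", use); if (use + 0 >= 90) print $1, "is", $5, "full" }'
```

With the sample rows above nothing is printed, since every ramdisk is nearly empty; a filled-up root ramdisk would show up immediately.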

cyberjock1980
Contributor

The system is up and running right now, but here is the output of vdf -h:

~ # vdf -h
Tardisk                  Space      Used
s.v00                     347M      347M
ata_pata.v00               44K       42K
ata_pata.v01               32K       28K
ata_pata.v02               32K       30K
ata_pata.v03               32K       31K
ata_pata.v04               40K       36K
ata_pata.v05               36K       32K
ata_pata.v06               32K       29K
ata_pata.v07               40K       36K
block_cc.v00               84K       82K
ehci_ehc.v00               92K       90K
weaselin.t00               13M       13M
esx_dvfi.v00              368K      366K
xlibs.v00                   1M        1M
ima_qla4.v00                1M        1M
ipmi_ipm.v00               44K       41K
ipmi_ipm.v01              112K      111K
ipmi_ipm.v02              104K      101K
misc_cni.v00               28K       25K
misc_dri.v00                3M        3M
net_be2n.v00              348K      346K
net_bnx2.v00              292K      290K
net_bnx2.v01                1M        1M
net_cnic.v00              124K      120K
net_e100.v00              300K      297K
net_e100.v01              248K      246K
net_enic.v00              148K      144K
net_forc.v00              132K      131K
net_igb.v00               244K      240K
net_ixgb.v00              376K      374K
net_nx_n.v00                2M        2M
net_r816.v00              140K      137K
net_r816.v01               84K       82K
net_s2io.v00              244K      242K
net_sky2.v00              116K      112K
net_tg3.v00               324K      321K
net_vmxn.v00              104K      102K
ohci_usb.v00               60K       56K
sata_ahc.v00              100K       99K
sata_ata.v00               60K       58K
sata_sat.v00               88K       86K
sata_sat.v01               44K       41K
sata_sat.v02               44K       43K
sata_sat.v03               44K       40K
sata_sat.v04               36K       32K
scsi_aac.v00              172K      171K
scsi_adp.v00              432K      428K
scsi_aic.v00              296K      295K
scsi_bnx.v00              200K      196K
scsi_fni.v00              160K      159K
scsi_hps.v00              192K      190K
scsi_ips.v00              132K      129K
scsi_lpf.v00                1M        1M
scsi_meg.v00              100K       96K
scsi_meg.v01              168K      165K
scsi_meg.v02               96K       92K
scsi_mpt.v00              392K      389K
scsi_mpt.v01              520K      517K
scsi_mpt.v02              440K      436K
scsi_qla.v00                1M        1M
scsi_qla.v01              288K      285K
scsi_rst.v00              748K      745K
uhci_usb.v00               60K       57K
xorg.v00                    3M        3M
imgdb.tgz                 252K      250K
state.tgz                  20K       18K
-----
Ramdisk                   Size      Used Available Use% Mounted on
root                       32M      432K       31M   1% --
etc                        28M      184K       27M   0% --
tmp                       192M        4K      191M   0% --
hostdstats                249M        2M      246M   0% --
~ #

The /var/core directory is empty.

I assume you meant /etc/init.d/sfcbd-watchdog stop.  I have run that command, and I'll report back in the next week or so with a status update (unless it crashes again before then).  If the system acts up again, I'll post the output of vdf -h after it goes down.

Edit:  The server just did its disconnect thing.  It's been 2-3 hours or so.  I was making some connections via the internet instead of the LAN; not sure what that would have to do with it.  I tried to write the output to /usr/temp.txt and then reboot so I could retrieve the file, but on reboot the file was gone.  Sigh.  So what is a safe location to save the file so that I can retrieve the output after a reboot?
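On ESXi, /usr and the other paths shown in the vdf -h output live on a ramdisk, so anything written there vanishes at reboot; a VMFS datastore path under /vmfs/volumes/ persists. A minimal sketch ("datastore1" is a placeholder datastore name, and the temp-dir fallback is only so the sketch can run outside ESXi):

```shell
#!/bin/sh
# Sketch: save command output somewhere that survives a reboot.
# On the host you would set DEST to a real datastore path, e.g.
#   DEST=/vmfs/volumes/datastore1   ("datastore1" is a placeholder name)
# The fallback to a temp dir is only so this sketch runs anywhere.
DEST="${DEST:-$(mktemp -d)}"
# On the host this would be: vdf -h > "$DEST/vdf-snapshot.txt"
date > "$DEST/vdf-snapshot.txt"
ls "$DEST"
```

After the reboot, the file is still there under the same datastore path.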

cyberjock1980
Contributor

Well, I got 12 hours this time.  I didn't rerun the command you requested since the last failure, but whatever the problem is, it's happening much more frequently than it has been.  I was copying about 100GB of data off of the server at the time of the failure.  Anyway, I did save the output of the vdf -h command to the datastore, so I have that output for you.

Is there any chance this could be a bad NIC?  I built this system 2 months ago and installed ESXi without really testing the hardware.  My gut feeling is that if I have too many network connections open, the house of cards comes crashing down.  It seems to happen more frequently when I copy files or otherwise do things that open a lot of network ports.

Anyway, here's the output of vdf -h about 2 minutes after my loss of network connectivity...

Tardisk                  Space      Used
s.v00                     347M      347M
ata_pata.v00               44K       42K
ata_pata.v01               32K       28K
ata_pata.v02               32K       30K
ata_pata.v03               32K       31K
ata_pata.v04               40K       36K
ata_pata.v05               36K       32K
ata_pata.v06               32K       29K
ata_pata.v07               40K       36K
block_cc.v00               84K       82K
ehci_ehc.v00               92K       90K
weaselin.t00               13M       13M
esx_dvfi.v00              368K      366K
xlibs.v00                   1M        1M
ima_qla4.v00                1M        1M
ipmi_ipm.v00               44K       41K
ipmi_ipm.v01              112K      111K
ipmi_ipm.v02              104K      101K
misc_cni.v00               28K       25K
misc_dri.v00                3M        3M
net_be2n.v00              348K      346K
net_bnx2.v00              292K      290K
net_bnx2.v01                1M        1M
net_cnic.v00              124K      120K
net_e100.v00              300K      297K
net_e100.v01              248K      246K
net_enic.v00              148K      144K
net_forc.v00              132K      131K
net_igb.v00               244K      240K
net_ixgb.v00              376K      374K
net_nx_n.v00                2M        2M
net_r816.v00              140K      137K
net_r816.v01               84K       82K
net_s2io.v00              244K      242K
net_sky2.v00              116K      112K
net_tg3.v00               324K      321K
net_vmxn.v00              104K      102K
ohci_usb.v00               60K       56K
sata_ahc.v00              100K       99K
sata_ata.v00               60K       58K
sata_sat.v00               88K       86K
sata_sat.v01               44K       41K
sata_sat.v02               44K       43K
sata_sat.v03               44K       40K
sata_sat.v04               36K       32K
scsi_aac.v00              172K      171K
scsi_adp.v00              432K      428K
scsi_aic.v00              296K      295K
scsi_bnx.v00              200K      196K
scsi_fni.v00              160K      159K
scsi_hps.v00              192K      190K
scsi_ips.v00              132K      129K
scsi_lpf.v00                1M        1M
scsi_meg.v00              100K       96K
scsi_meg.v01              168K      165K
scsi_meg.v02               96K       92K
scsi_mpt.v00              392K      389K
scsi_mpt.v01              520K      517K
scsi_mpt.v02              440K      436K
scsi_qla.v00                1M        1M
scsi_qla.v01              288K      285K
scsi_rst.v00              748K      745K
uhci_usb.v00               60K       57K
xorg.v00                    3M        3M
imgdb.tgz                 252K      250K
state.tgz                  20K       18K
-----
Ramdisk                   Size      Used Available Use% Mounted on
root                       32M      432K       31M   1% --
etc                        28M      184K       27M   0% --
tmp                       192M        4K      191M   0% --
hostdstats                249M        2M      246M   0% --

Thanks for the help so far.

cyberjock1980
Contributor

Well, after rebooting I started copying files off the FreeNAS VM, and within 10 minutes the server was down again.  The other VMs were booted up, but I hadn't started running any of the services yet.  I'm about to leave town for the day, and I'm going to leave the server running with its broken network config in case any command output is wanted while the system is non-functional.

Thanks.

cyberjock1980
Contributor

Well, I decided to wipe ESXi and recreate everything from scratch.  So, a new install of ESXi (updated to the latest public build, 1157734); I did reuse the vmdk files, but the remainder of the VMs was recreated from scratch.  It worked fine for 3 whole days...

Well, this evening I was replacing a disk in my zpool (FreeNAS VM) and, wouldn't you know, the darn thing disconnected.  This time I didn't do any kind of experimenting; a very basic setup of the VMs.  And for whatever reason this evening, poof.  I checked the syslog: lots of "Unknown: out of memory [6955]" errors.  Just like above, but with 1 extra line.  This is getting dangerous because I can't even rebuild my pool to restore redundancy without having problems.

From the first post, right after the "Request Header Id" error I have a new entry:

sfcb-VMware_base[5830]: Dropped response operation details -- namespace: root/cimv2, className: OMC_RawIpmiSensor, Type: 0

Because of the pool resilvering, the M1015 (passed through to the FreeNAS machine with VT-d) was heavily loaded during this time.  Any chance this could be a trigger?  Does VT-d have a "cache" that may need to be bigger to store commands for the PCIe devices that are passed through?  CPU usage was very high during the resilvering.

Edit:  It turns out that my desktop began an automated backup to the FreeNAS VM, so the high traffic across the network port is probably responsible for the problem.  Is there some kind of network "cache" for ESXi that I may need to make bigger?

The VMs are clearly still running, since the resilvering is still going.  But last time I looked it had over 40 hours remaining, and I can't just wait until the hdd LEDs go idle to roll the dice on this thing.

cyberjock1980
Contributor

Well, I was forced to get rid of ESXi so I could rebuild the array.  Because the issue seems to occur frequently with large network transfers, I chose to set up a 10000-second iperf test with the server as the iperf server.  It transferred almost 900GB with iperf without any issues.  So I'm confident that there is no hardware issue and ESXi is to blame.

Anyone have any ideas before I abandon ESXi as a solution for our business needs?  The test platform is not proving reliable enough to use, despite the lack of complexity in our setup.  Our final design would have more VMs on the same physical hardware.

admin
Immortal

OMC_RawIpmiSensor


What is the BMC on the board saying?


To be fair, I could not find the board on the HCL, but Supermicro is special in terms of naming conventions... To me this still looks like a CIM provider issue, and I am pretty sure you would hit it without any VMs running at all.


Would you be able to upload a log bundle for that host to a file host and PM me the link?  If I have time over the weekend I might have a look to see if I can find something more; the sfcbd running out of memory might just be a symptom.

cyberjock1980
Contributor


The BMC system log was disabled; apparently that is the default.  I've enabled it, but I have about 30 hours left before I can switch back to ESXi, since I can't trust ESXi to last that long before the server disconnects from the network.  I can upload the logs somewhere and/or email them to you.

As for my BMC, I have it set up as a dedicated port.  I think the default was failover, but I changed it because I had some weird issues with failover on a previous system build: the network switch wouldn't bring all ports online at the exact same time, so it would fail over when it didn't need to and then never switch back when the network link was "restored" seconds later.  This behavior is documented on the web, so I'd just as soon avoid the potential issue and make it dedicated.

I'll be honest, I'm very vague on what a CIM provider actually is.  I've Googled it, and I'm really not seeing how it could be my problem, but I will push the "I believe" button and accept your knowledge, since I'm sure it's greater than mine.  As far as I know, I haven't installed any third-party CIM providers (hence my not understanding how it could be a CIM provider issue).  My issue seems to be ultra-rare at the least, and all of the ESXi-using friends I have consulted on the phone are at a loss to explain the actual issue, except to say "it sucks to be you".  LOL.  Even my friend's system with the identical hardware (we literally bought double of everything when we built the systems) doesn't have this issue.  I updated his ESXi build on Sunday night and he was at 37 days of uptime.  He has the exact same VMs, plus 1 more (a small XP system for printing).  I don't know how to download a log bundle, but I'm sure I can learn with some Google-fu and/or a call to my ESXi friend.

Supermicro's naming is confusing.  My board is technically the X9SCM-F; the -O means non-retail box, I believe.  I instinctively add the -O because it's technically the most accurate, although I guess it really adds no value (especially related to this problem).

cyberjock1980
Contributor

Just wanted to reply back and say that the server hasn't crashed all weekend, despite me trying to induce a crash.  So I guess we're going to play the waiting game until things go bad again.  I'm predicting I'll have those logs for you within the next week.

Thanks.

cyberjock1980
Contributor

PM sent with log bundle.

cyberjock1980
Contributor

Well, the other day I got a replacement Intel NIC and installed it in the system.  After 8 hours of testing various combinations of enabling/disabling and removing/reinstalling hardware, I have proven that the BMC was in fact bad.  Disable it via jumper and everything purrs like a kitten; enable it and expect nightmares.

I guess come Monday morning I'll be doing an RMA with Supermicro.

Thanks for the support Frank!  You were right.


miszcz
Contributor

Sorry to bring up this old discussion, but I experienced an almost identical problem with two of our ESXi 5.1 servers.  They are also Supermicro-based systems.

Just today I experienced some weird behaviour with one ESXi host and an assigned VM: the upgrade of the VM's hardware version simply never finished.  Before that, I could not power on the VM (which had been manually created some minutes before).  After checking /var/log/syslog.log on the host, I saw exactly the same messages as you described:

2014-08-28T09:08:10Z sfcb-vmware_base[5487072]: TIMEOUT DOING SHARED SOCKET RECV RESULT (5487072)
2014-08-28T09:08:10Z sfcb-vmware_base[5487072]: Timeout (or other socket error) waiting for response from provider
2014-08-28T09:08:10Z sfcb-vmware_base[5487072]: Request Header Id (557844) != Response Header reqId (0) in request to provider 324 in process 5. Drop response.
2014-08-28T09:08:10Z sfcb-vmware_base[5487072]: Dropped response operation details -- nameSpace: root/cimv2, className: OMC_RawIpmiSensor, Type: 0
2014-08-28T09:08:10Z Unknown: out of memory [12085999]

... repeated about a dozen times

However, I have not seen dropped network connections or other operational issues with that server.  Unfortunately, I found a second server with the exact same log messages, and both of the servers are running critical operational VMs.

Those two servers were part of a purchase of ten identical hosts; the other systems do not seem to have that problem.

In case you're reading this: what was the exact cause of the problem?  A bad BMC, which resulted in the complete replacement of the server itself?

Any info you might have would be greatly appreciated.

Thanks in advance!

Michael

miszcz
Contributor

Short update: it seems that the hardware is not to blame in my case; I think I was able to solve the problem for now.

The root cause of my specific problem seems to have been a large number of files in the /var/run/sfcb/ directory (around 5630 at the time).  This also led to various error events (which I didn't see at first because no alarm was raised) about filled-up file tables and inode tables in the root ramdisk (see also KB article 2037798).

After stopping the sfcbd-watchdog, deleting all the files in /var/run/sfcb/, restarting the sfcbd-watchdog, and restarting the hostd and vpxa agents, the system seemed to behave normally again (i.e. no more "out of memory" messages, and the hardware upgrade of the VM was possible again).
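The steps above can be sketched as a small script. The ESXi service commands are left as comments because they only exist on the host; a stand-in directory (populated with a few dummy files) takes the place of /var/run/sfcb/ so the runnable part works anywhere, and the file-count check is just an illustrative sanity check:

```shell
#!/bin/sh
# Sketch of the cleanup described above. /var/run/sfcb is ESXi-specific,
# so a stand-in temp directory is used here; the commented lines are the
# actual ESXi service commands.
SFCB_DIR="${SFCB_DIR:-$(mktemp -d)}"   # on the host: SFCB_DIR=/var/run/sfcb
for i in 1 2 3 4 5; do : > "$SFCB_DIR/req_$i"; done   # simulate stale files
echo "files in $SFCB_DIR: $(ls "$SFCB_DIR" | wc -l)"
# /etc/init.d/sfcbd-watchdog stop
rm -f "$SFCB_DIR"/*                    # clear the stale request files
# /etc/init.d/sfcbd-watchdog start
# /etc/init.d/hostd restart
# /etc/init.d/vpxa restart
echo "files after cleanup: $(ls "$SFCB_DIR" | wc -l)"
```

Restarting hostd and vpxa briefly drops the vSphere client connection, so this is best done in a maintenance window.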

Now I still have the other machine throwing the "out of memory" messages every once in a while, but no other symptoms on that server so far (and its /var/run/sfcb/ directory does not contain more than 20 files).  I'll keep a close watch.

Michael

suhaakin
Contributor

If it is ESXi 6.0, check out this VMware Knowledge Base article: kb.vmware.com/s/article/2144799
