14 Replies Latest reply on Aug 28, 2014 6:48 AM by miszcz

    Unknown:  Out of Memory [5880] in syslog

    cyberjock1980 Novice

      Hello all.  First post.  Been playing with ESXi for a few months and I'm having an unusual issue.  Here's my hardware and VM setup:

       

      Hardware:  Supermicro x9scm-f-o with 32GB of ECC RAM and E3-1230v2 CPU

       

      Also included is a M1015 reflashed to IT mode and used as passthrough with VT-d to the FreeNAS VM.

       

      VMs:

       

      FreeNAS 8.3.1 64-bit with 20GB of RAM assigned (3vCPU)

      Linux Mint 13 LTS 4GB RAM (3vCPU)

      Cyberpower UPS Appliance VM 2GB RAM (1vCPU)

       

      After building the machine a week or two went by, then I lost network connectivity to the entire box (this was when I first realized something might be wrong).  I couldn't use the vSphere client or access the guest machines.  I attempted to "Restart Management Network" locally but that didn't solve anything.  Still no connectivity.  I then attempted to shut down the box but it wouldn't shut down.  Ultimately I power cycled it.

       

      So over the last 2 months at random times I've lost network connectivity.  Thanks to reading various sources I've found an error in my syslog that ends up being logged 5-20+ times in a row, and happens randomly at intervals of 2-10 minutes.  I get an error that says:

       

      Unknown: out of memory [5880]

       

      I thought it was due to my VMs using a lot of network connections, but never really could find someone else with the error.  It's very non-specific and I'm running out of leads to check out the issue, hence I've turned to the community.

       

      Well, today I decided I want to figure this puzzle out, so I shut down all of the VMs and that error still continues to show up in the syslog.  So clearly I have a configuration problem somewhere.  This is odd to me because I've tried to avoid changing settings that I wasn't 100% sure were appropriate for me, because I'm a newbie to ESXi and I'd rather not break my own installation.  The only thing I've done that might be construed as an 'advanced' feature modification is that I enabled the RDM function for the Linux Mint installation.  It was already on a hard drive and I thought I'd just do RDM passthrough and use it that way if it worked.  Well, it has, so I left it like that.

       

      I can't find much info on what 5880 might represent, so I assumed it's a PID or something.  It has changed from one bootup to the next.  So here's the output from the CLI:

       

      ~ # ps -P | grep 5880

      5880 5880 sfcb-vmware_raw      5606 /sbin/sfcbd

      5881 5880 sfcb-vmware_raw      5606 /sbin/sfcbd

      5882 5880 sfcb-vmware_raw      5606 /sbin/sfcbd

      5883 5880 sfcb-vmware_raw      5606 /sbin/sfcbd

      5884 5880 sfcb-vmware_raw      5606 /sbin/sfcbd

      5885 5880 sfcb-vmware_raw      5606 /sbin/sfcbd

      5887 5880 sfcb-vmware_raw      5606 /sbin/sfcbd

      5888 5880 sfcb-vmware_raw      5606 /sbin/sfcbd

      ~ #

       

      Again, I tried searching for sfcbd and it really doesn't give me any clues.
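The lookup above (take the bracketed number from the syslog message, then search the process list for it) can be sketched as a small helper. This is only a sketch: the function name `oom_pids` is hypothetical, it assumes the message format shown in this post, and it assumes `sed`, `sort`, `uniq` and `awk` are available in the host's shell.

```shell
# Print each distinct number found in "Unknown: out of memory [N]" lines,
# followed by how many times it was logged. Reads log text on stdin.
# (oom_pids is a hypothetical helper name, not an ESXi command.)
oom_pids() {
    sed -n 's/.*[Oo]ut of [Mm]emory \[\([0-9][0-9]*\)].*/\1/p' |
        sort -n | uniq -c | awk '{print $2, $1}'
}

# Possible use on the host, with the log path and ps flag mentioned in this thread:
#   oom_pids < /var/log/syslog.log
#   ps -P | grep 5880
```

Each output line is a number and its count, so a single number repeated many times points at one process (or one process group) rather than at general memory pressure.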

       

      I actually built 2 identical systems.  One for me and one for a friend.  He uses more RAM for his VMs(and different VMs) but he doesn't have this issue.  I'm figuring I've done something to my installation to deserve this since his is flawless, but I am out of leads.

       

      tl;dr:  I get the "Unknown: out of memory" error along with a loss of all network connectivity randomly after 1 day to 2 weeks.  Sometimes when the server would be idle(such as at night).  I don't know if they are related, but I assume they are.

       

      Anyone have ideas to try?  This is baffling me and I'm really at a loss to explain the problem let alone how to fix it.

       

      Thanks!

        • 1. Re: Unknown:  Out of Memory [5880] in syslog
          cyberjock1980 Novice

          Well, did a reboot this evening to facilitate adding a hard drive to the system and within 15 minutes the network connections died.  The server wasn't even busy.  I was streaming a movie and that was all.  In the syslog I have these entries:

           

          Unknown: out of memory [7087]

          Unknown: out of memory [7087]

          Unknown: out of memory [7087]

          Unknown: out of memory [7087]

          Unknown: out of memory [7087]

          Unknown: out of memory [7087]

          Unknown: out of memory [7087]

          Unknown: out of memory [7087]

          crond[4529]: crond: USER root pid 20439 cmd /sbin/hostd-probe

          syslog[20440] starting hostd probing.

          syslog[20440]: hostd probing is done.

          sfcb-vmware_base[5821]: TIMEOUT DOING SHARED SOCKET RECV RESULT (5821)

          sfcb-vmware_base[5821]: Timeout (or other socket error) waiting for response from provider

          sfcb-vmware_base[5821]: Request Header Id (1927) != Response Header reqId (0) in request to provider 94 in process 4. Drop response.

          Unknown: out of memory [7087]

          Unknown: out of memory [7087]

          Unknown: out of memory [7087]

          Unknown: out of memory [7087]

          Unknown: out of memory [7087]

          Unknown: out of memory [7087]

          Unknown: out of memory [7087]

          Unknown: out of memory [7087]

           

          NIC lights are normal (except no blinking since there's no traffic).

           

          Tried a services.sh restart from the console but it didn't fix the problem.  I did get a lot of errors during things like stopping ntpd and then restarting ntpd and starting ssh, "Connect to localhost failed: Connection failure".

           

          One thing I did find: esxcli network nic list shows that the Intel e1000 driver is loaded on my Intel 82574L network card, but the link status is down with a speed of 0.  I tried unplugging the cable and then plugging it back in but the link didn't come back up.  I tried esxcli network nic down (and up) -n vmnic1 but that didn't bring it back either.

          • 2. Re: Unknown:  Out of Memory [5880] in syslog
            Hot Shot

            What does the following command return as output?

             

            vdf -h

             

            Can you initiate a "/etc/init.d/sfcb-watchdog stop" after a reboot of the server and see if the issue still occurs.

             

            Some explanation on the sfcb service: This is the service providing you with the hardware status tab in vCenter or on the configuration page when connected directly to the host. It is an extensible system, which means third-party providers can use the base CIM system to add sensors to watch.

             

            From time to time such a CIM provider can actually have a memory leak (not only third-party providers are at fault here; the basic ones installed in the vanilla image can cause this too).

             

            Do you have any dumps in /var/core ?

            • 3. Re: Unknown:  Out of Memory [5880] in syslog
              cyberjock1980 Novice

              The system is up and running right now, but here is the output of vdf -h

               

              ~ # vdf -h
              Tardisk                  Space      Used
              s.v00                     347M      347M
              ata_pata.v00               44K       42K
              ata_pata.v01               32K       28K
              ata_pata.v02               32K       30K
              ata_pata.v03               32K       31K
              ata_pata.v04               40K       36K
              ata_pata.v05               36K       32K
              ata_pata.v06               32K       29K
              ata_pata.v07               40K       36K
              block_cc.v00               84K       82K
              ehci_ehc.v00               92K       90K
              weaselin.t00               13M       13M
              esx_dvfi.v00              368K      366K
              xlibs.v00                   1M        1M
              ima_qla4.v00                1M        1M
              ipmi_ipm.v00               44K       41K
              ipmi_ipm.v01              112K      111K
              ipmi_ipm.v02              104K      101K
              misc_cni.v00               28K       25K
              misc_dri.v00                3M        3M
              net_be2n.v00              348K      346K
              net_bnx2.v00              292K      290K
              net_bnx2.v01                1M        1M
              net_cnic.v00              124K      120K
              net_e100.v00              300K      297K
              net_e100.v01              248K      246K
              net_enic.v00              148K      144K
              net_forc.v00              132K      131K
              net_igb.v00               244K      240K
              net_ixgb.v00              376K      374K
              net_nx_n.v00                2M        2M
              net_r816.v00              140K      137K
              net_r816.v01               84K       82K
              net_s2io.v00              244K      242K
              net_sky2.v00              116K      112K
              net_tg3.v00               324K      321K
              net_vmxn.v00              104K      102K
              ohci_usb.v00               60K       56K
              sata_ahc.v00              100K       99K
              sata_ata.v00               60K       58K
              sata_sat.v00               88K       86K
              sata_sat.v01               44K       41K
              sata_sat.v02               44K       43K
              sata_sat.v03               44K       40K
              sata_sat.v04               36K       32K
              scsi_aac.v00              172K      171K
              scsi_adp.v00              432K      428K
              scsi_aic.v00              296K      295K
              scsi_bnx.v00              200K      196K
              scsi_fni.v00              160K      159K
              scsi_hps.v00              192K      190K
              scsi_ips.v00              132K      129K
              scsi_lpf.v00                1M        1M
              scsi_meg.v00              100K       96K
              scsi_meg.v01              168K      165K
              scsi_meg.v02               96K       92K
              scsi_mpt.v00              392K      389K
              scsi_mpt.v01              520K      517K
              scsi_mpt.v02              440K      436K
              scsi_qla.v00                1M        1M
              scsi_qla.v01              288K      285K
              scsi_rst.v00              748K      745K
              uhci_usb.v00               60K       57K
              xorg.v00                    3M        3M
              imgdb.tgz                 252K      250K
              state.tgz                  20K       18K
              -----
              Ramdisk                   Size      Used Available Use% Mounted on
              root                       32M      432K       31M   1% --
              etc                        28M      184K       27M   0% --
              tmp                       192M        4K      191M   0% --
              hostdstats                249M        2M      246M   0% --
              ~ #

               

              The /var/core directory is empty.

               

              I assume you meant /etc/init.d/sfcbd-watchdog stop.  I have run that command and I'll report back in the next week or so with a status update(unless it crashes again before then).  If the system acts up again I'll post the output of vdf -h after it goes down.

               

              Edit:  The server just did its disconnect thing.  It's been 2-3 hours or so.  I was doing some connections via the internet instead of LAN, not sure what that would have to do with it.  I tried to write the output to /usr/temp.txt, then reboot so I could retrieve the file, but on reboot the file was gone.  sigh.  So what is a safe location to save the file so on reboot I can retrieve the output of the command?

              • 4. Re: Unknown:  Out of Memory [5880] in syslog
                cyberjock1980 Novice

                Well, I got 12 hours this time.  I didn't rerun the command you had requested since the last failure.  But whatever the problem is, it's happening much more frequently than it has been.  I was copying about 100GB of data off of the server at the time of the failure.  Anyway, I did save the output of the vdf -h command to the datastore, so I have that output for you.
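Saving command output to a datastore rather than to the RAM-backed root filesystem is what lets it survive a reboot. A minimal sketch of that step follows; the helper name `save_output` and the datastore name `datastore1` are assumptions for illustration, not taken from this thread.

```shell
# Run a command and save its output (stdout and stderr) to a timestamped
# file inside DIR, then print the file's path.
# Usage: save_output DIR LABEL COMMAND [ARGS...]
# (save_output is a hypothetical helper name.)
save_output() {
    dir=$1
    label=$2
    shift 2
    out="$dir/$label-$(date +%Y%m%d-%H%M%S).txt"
    "$@" > "$out" 2>&1
    echo "$out"
}

# Possible use on the host, assuming a datastore named datastore1:
#   save_output /vmfs/volumes/datastore1 vdf vdf -h
```

The timestamp in the file name keeps captures from before and after a failure side by side for comparison.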

                 

                Is there any chance that this could be a bad NIC?  I built this system 2 months ago and I installed ESXi without really testing the hardware.  My gut feeling is that if I have too many network connections open then the house of cards comes crashing down.  It seems to happen more frequently when I try to copy files or otherwise do things that opens a lot of network ports.

                 

                Anyway, here's the output of vdf -h about 2 minutes after my loss of network connectivity...

                 

                Tardisk                  Space      Used
                s.v00                     347M      347M
                ata_pata.v00               44K       42K
                ata_pata.v01               32K       28K
                ata_pata.v02               32K       30K
                ata_pata.v03               32K       31K
                ata_pata.v04               40K       36K
                ata_pata.v05               36K       32K
                ata_pata.v06               32K       29K
                ata_pata.v07               40K       36K
                block_cc.v00               84K       82K
                ehci_ehc.v00               92K       90K
                weaselin.t00               13M       13M
                esx_dvfi.v00              368K      366K
                xlibs.v00                   1M        1M
                ima_qla4.v00                1M        1M
                ipmi_ipm.v00               44K       41K
                ipmi_ipm.v01              112K      111K
                ipmi_ipm.v02              104K      101K
                misc_cni.v00               28K       25K
                misc_dri.v00                3M        3M
                net_be2n.v00              348K      346K
                net_bnx2.v00              292K      290K
                net_bnx2.v01                1M        1M
                net_cnic.v00              124K      120K
                net_e100.v00              300K      297K
                net_e100.v01              248K      246K
                net_enic.v00              148K      144K
                net_forc.v00              132K      131K
                net_igb.v00               244K      240K
                net_ixgb.v00              376K      374K
                net_nx_n.v00                2M        2M
                net_r816.v00              140K      137K
                net_r816.v01               84K       82K
                net_s2io.v00              244K      242K
                net_sky2.v00              116K      112K
                net_tg3.v00               324K      321K
                net_vmxn.v00              104K      102K
                ohci_usb.v00               60K       56K
                sata_ahc.v00              100K       99K
                sata_ata.v00               60K       58K
                sata_sat.v00               88K       86K
                sata_sat.v01               44K       41K
                sata_sat.v02               44K       43K
                sata_sat.v03               44K       40K
                sata_sat.v04               36K       32K
                scsi_aac.v00              172K      171K
                scsi_adp.v00              432K      428K
                scsi_aic.v00              296K      295K
                scsi_bnx.v00              200K      196K
                scsi_fni.v00              160K      159K
                scsi_hps.v00              192K      190K
                scsi_ips.v00              132K      129K
                scsi_lpf.v00                1M        1M
                scsi_meg.v00              100K       96K
                scsi_meg.v01              168K      165K
                scsi_meg.v02               96K       92K
                scsi_mpt.v00              392K      389K
                scsi_mpt.v01              520K      517K
                scsi_mpt.v02              440K      436K
                scsi_qla.v00                1M        1M
                scsi_qla.v01              288K      285K
                scsi_rst.v00              748K      745K
                uhci_usb.v00               60K       57K
                xorg.v00                    3M        3M
                imgdb.tgz                 252K      250K
                state.tgz                  20K       18K
                -----
                Ramdisk                   Size      Used Available Use% Mounted on
                root                       32M      432K       31M   1% --
                etc                        28M      184K       27M   0% --
                tmp                       192M        4K      191M   0% --
                hostdstats                249M        2M      246M   0% --

                 

                Thanks for the help so far.

                • 5. Re: Unknown:  Out of Memory [5880] in syslog
                  cyberjock1980 Novice

                  Well, After rebooting I started copying files off the FreeNAS VM and within 10 minutes the server was down again.  The other VMs were booted up, but I hadn't started running any of the services yet.  I'm about to leave town for the day, and I'm going to leave the server with its broken network config running in case any outputs of commands are wanted with the system non-functional.

                   

                  Thanks.

                  • 6. Re: Unknown:  Out of Memory [5880] in syslog
                    cyberjock1980 Novice

                    Well, I decided I'd wipe ESXi and recreate everything from scratch.  So new install of ESXi(updated to the latest public build, 1157734), I did reuse the vmdk files but the remainder of the VMs was recreated from scratch.  Worked fine for 3 whole days...

                     

                    Well, this evening I was replacing a disk in my zpool(FreeNAS VM) and wouldn't you know, the darn thing disconnected.  This time I didn't do any kind of experimenting.  Very basic setup of the VMs.  And for whatever reason this evening, poof.  Checked the syslog and lots of "Unknown: out of memory [6955]" errors.  Just like above, but 1 extra line.  This is getting dangerous because I can't even rebuild my pool to restore redundancy without having problems.

                     

                    From the first post, right after the "Request Header Id" error I have a new entry:

                     

                    sfcb-VMware_base[5830]: Dropped response operation details -- namespace: root/cimv2, className: OMC_RawIpmiSensor, Type: 0

                     

                    Because of the pool resilvering the M1015(using VT-d to pass through to the FreeNAS machine) was heavily loaded during this time.  Any chance this could be a trigger?    Does VT-d have a "cache" that may need to be bigger to store commands for the PCIe devices that are passed through?  CPU usage was very high during the resilvering.

                     

                    Edit:  It turns out that my desktop began an automated backup to the FreeNAS VM, so the high traffic across the network port is probably responsible for the problem.  Is there some kind of network "cache" for ESXi that I may need to make bigger?

                     

                    The VMs are clearly still running since the resilvering is still running.  But last time I looked it had over 40 hours remaining, and I can't just wait until the HDD LEDs go idle to roll the dice on this thing.

                    • 7. Re: Unknown:  Out of Memory [5880] in syslog
                      cyberjock1980 Novice

                      Well, I was forced to get rid of ESXi so I could rebuild the array.  Because the issue seems to occur frequently with large network transfers, I chose to set up a 10000 second iperf test with the server as the iperf server.  It transferred almost 900GB with iperf without any issues.  So I'm confident that there is no hardware issue and ESXi is to blame.
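For scale, the figures in the post above work out to an average throughput as follows. This is a rough sketch using the post's round numbers, assuming decimal units (1 GB = 1000 MB); since the post says "almost 900GB", the result is an upper bound. The iperf invocations in the comments are an assumption about how such a test is typically started; the exact commands used were not posted.

```shell
# Typical shape of such a test (assumed, not quoted from the post):
#   on the server:  iperf -s
#   on the client:  iperf -c SERVER_ADDRESS -t 10000

total_gb=900        # "almost 900GB" transferred
duration_s=10000    # length of the iperf run in seconds

# Average rate in megabytes per second, then in megabits per second.
mb_per_s=$(( total_gb * 1000 / duration_s ))
mbit_per_s=$(( mb_per_s * 8 ))

echo "${mb_per_s} MB/s, ${mbit_per_s} Mbit/s"
```

That is roughly 90 MB/s, or about 720 Mbit/s, sustained for the whole run on a gigabit NIC.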

                       

                      Anyone have any ideas before I abandon ESXi as a solution for our business needs?  The test platform is not proving reliable enough to use despite the lack of complexity in our setup.  Our final design would have more VMs but on the same physical hardware.

                      • 8. Re: Unknown:  Out of Memory [5880] in syslog
                        Hot Shot

                        OMC_RawIpmiSensor


                        What is the BMC on the board saying?


                        To be fair I could not find the board on the HCL but Supermicro is special in terms of naming conventions... To me this still looks like a CIM provider issue and I am pretty sure that you would hit it without any VMs running at all.


                        Would you be able to upload a log bundle for that host to a file hoster and PM me the link?  If I have time over the weekend I might have a look to see if I can find something more; the sfcbd running out of memory might just be a symptom.

                        • 9. Re: Unknown:  Out of Memory [5880] in syslog
                          cyberjock1980 Novice


                          The BMC system log was disabled.  Apparently that is the default.  I've enabled it but I have about 30 hours left before I can switch back to ESXi since I can't trust ESXi to last that long before the server disconnects from the network.  I can upload the logs to somewhere and/or email them to you.

                           

                          As for my BMC, I have it set up as a dedicated port.  I think the default was failover, but I changed it because I had some weird issues with failover on a previous system build: the network switch wouldn't bring all ports online at the exact same time, so it would fail over when it didn't need to and then never switch back when a network link was "restored" seconds later.  This behavior is documented on the web, so I'd just as soon avoid the potential issue and make it dedicated.

                           

                          I'll be honest, I'm very vague on what a CIM provider actually is.  I've Googled it but I'm really not seeing how it could be my problem, but I will push the "I believe" button and accept your knowledge since I'm sure it's more than mine.  As far as I know I haven't installed any 3rd party CIM providers (hence my reason for not understanding how it could be a CIM provider issue).  My issue seems to be ultra-rare at the least.  And all of my ESXi-using friends I have consulted with on the phone are at a loss to explain the actual issue except to say "it sucks to be you".  LOL.  Even my friend's system, which is identical hardware (we literally bought double of everything when we built the systems), doesn't have this issue.  I updated his ESXi build on Sunday night and he was at 37 days of uptime.  He has the exact same VMs, plus 1 more (a small XP system for printing).  I don't know how to download a log bundle, but I'm sure I can learn with some Google-fu and/or a call to my ESXi friend.

                           

                          Supermicro's naming is confusing.  My board is technically the X9SCM-F.  The -O means non-retail box, I believe.  I instinctively add the -O because it's technically the most accurate, although I guess it really adds no value (especially related to this problem).

                          • 10. Re: Unknown:  Out of Memory [5880] in syslog
                            cyberjock1980 Novice

                            Just wanted to reply back and say that the server hasn't crashed all weekend despite me trying to induce it to crash.  So I guess we're going to play the waiting game with it until things go bad again.  I'm predicting in the next week I'll have those logs for you.

                             

                            Thanks.

                            • 11. Re: Unknown:  Out of Memory [5880] in syslog
                              cyberjock1980 Novice

                              PM sent with log bundle.

                              • 12. Re: Unknown:  Out of Memory [5880] in syslog
                                cyberjock1980 Novice

                                Well, the other day I got a replacement Intel NIC and installed it in the system.  After 8 hours of testing various combinations of enabling/disabling hardware and removing/reinstalling hardware I have proven that the BMC was in fact bad.  Disable it via jumper and everything purrs like a kitten.  Enable it and expect nightmares.

                                 

                                I guess come Monday morning I'll be doing an RMA with Supermicro.

                                 

                                Thanks for the support Frank!  You were right.

                                 


                                • 13. Re: Unknown:  Out of Memory [5880] in syslog
                                  miszcz Lurker

                                  Sorry to bring up that old discussion but I experienced an almost identical problem with two of our ESXi 5.1 servers. They are also Supermicro based systems.

                                   

                                  Just today I experienced some weird behaviour with one ESXi and an assigned VM - the upgrade of the hardware version of the VM simply did not come to an end. Before that, I could not power on the VM (which was manually created some minutes before). After checking /var/log/syslog.log on the host, I saw exactly the same messages as you described:

                                   

                                  2014-08-28T09:08:10Z sfcb-vmware_base[5487072]: TIMEOUT DOING SHARED SOCKET RECV RESULT (5487072)
                                  2014-08-28T09:08:10Z sfcb-vmware_base[5487072]: Timeout (or other socket error) waiting for response from provider
                                  2014-08-28T09:08:10Z sfcb-vmware_base[5487072]: Request Header Id (557844) != Response Header reqId (0) in request to provider 324 in process 5. Drop response.
                                  2014-08-28T09:08:10Z sfcb-vmware_base[5487072]: Dropped response operation details -- nameSpace: root/cimv2, className: OMC_RawIpmiSensor, Type: 0
                                  2014-08-28T09:08:10Z Unknown: out of memory [12085999]

                                  ... repeated about a dozen times

                                   

                                  However, I have not seen dropped network connections or other operational issues with that server. Unfortunately, I found a second server with the exact same log messages. And both of the servers are running critical operational VMs.

                                  Those two servers were part of a purchase of ten identical hosts - the other systems do not seem to have that problem.

                                   

                                  In case you're reading this: what was the exact cause of the problem? A bad BMC which resulted in the complete replacement of the server itself?

                                   

                                  Any info you might have would be greatly appreciated.

                                   

                                  Thanks in advance!

                                  Michael

                                  • 14. Re: Unknown:  Out of Memory [5880] in syslog
                                    miszcz Lurker

                                    Short update: it seems that the hardware is not to blame in my case. I think I was able to solve the problem for now.

                                     

                                     

                                    The root cause of my specific problem seems to have been a large number of files in the /var/run/sfcb/ directory (around 5630 at the time). This also led to various error events (which I didn't see at first because no alarm was raised) about filled up file tables and inode tables in the root ramdisk (see also KB article 2037798).

                                     

                                     

                                    After stopping the sfcbd-watchdog, deleting all the files in /var/run/sfcb/, restarting the sfcbd-watchdog and restarting the hostd and vpxa agents, the system seemed to behave normally again (= no more "out of memory" messages and the hardware upgrade of the VM was also possible again).
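The check and the recovery sequence described above can be sketched as follows. This is a sketch under stated assumptions, not a tested procedure: the helper names and the threshold of 1000 files are hypothetical, the exact commands used were not posted, and the plan function only prints commands (the init scripts named are the usual ESXi ones) so that nothing is run by accident.

```shell
# Count regular files directly inside a directory.
# (count_files, check_dir and sfcb_cleanup_plan are hypothetical helper names.)
count_files() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Report WARN if the directory holds more files than LIMIT (default 1000,
# an arbitrary illustrative threshold), otherwise OK.
check_dir() {
    n=$(count_files "$1")
    if [ "$n" -gt "${2:-1000}" ]; then
        echo "WARN $n"
    else
        echo "OK $n"
    fi
}

# Print, without executing, the sequence described in the post:
# stop the watchdog, clear the directory, start the watchdog, restart the agents.
sfcb_cleanup_plan() {
    dir=${1:-/var/run/sfcb}
    echo "/etc/init.d/sfcbd-watchdog stop"
    echo "rm -f $dir/*"
    echo "/etc/init.d/sfcbd-watchdog start"
    echo "/etc/init.d/hostd restart"
    echo "/etc/init.d/vpxa restart"
}

# Possible use on the host:
#   check_dir /var/run/sfcb
#   sfcb_cleanup_plan
```

Printing the plan first makes it easy to review the order of steps before running them by hand on a host carrying production VMs.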

                                     

                                    Now I still have this other machine throwing the "out of memory" messages every once in a while but no other symptoms on this server so far (and the /var/run/sfcb/ directory does not contain more than 20 files). I'll be keeping a close watch.

                                     

                                    Michael