VMware Cloud Community
mrudloff
Enthusiast

IPMI / sfcbd-watchdog Freespace problem

Hiya,

I have a problem with one of our hosts.

/var/log/ipmi/0 fills up in less than a day. The files sel and sel.raw are getting massive and I cannot seem to find a reason for it. The only way to get the host to behave seems to be to delete those files and restart sfcbd-watchdog.

I actually have two questions:

What does the sfcbd-watchdog service do and, more importantly, what is causing those files to grow so massively and how can I stop this?
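To put a number on "fills up in less than a day", one option is to snapshot the file sizes in the directory twice and compare. This is only a minimal sketch in Python 3: the function names are made up for illustration, and it assumes a Python interpreter is available wherever the directory can be read.

```python
import os
import time

def dir_file_sizes(path):
    """Return {filename: size_in_bytes} for regular files directly in path."""
    sizes = {}
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isfile(full):
            sizes[name] = os.path.getsize(full)
    return sizes

def growth(before, after):
    """Bytes gained per file between two snapshots (new files count from 0)."""
    return {name: size - before.get(name, 0) for name, size in after.items()}

def report_growth(path, interval):
    """Take two snapshots `interval` seconds apart and print the difference."""
    first = dir_file_sizes(path)
    time.sleep(interval)
    second = dir_file_sizes(path)
    for name, delta in sorted(growth(first, second).items()):
        print("%s grew by %d bytes in %d s" % (name, delta, interval))
```

Calling `report_growth("/var/log/ipmi/0", 60)` would print how many bytes each file gained in one minute; the 60-second interval is an arbitrary choice.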

19 Replies
admin
Immortal

The sfcb daemon runs on the ESX/ESXi server and monitors your server's health status (processor, memory, power supply, and so on). You can see the health status under the Configuration tab.

The job of sfcbd-watchdog is to monitor the sfcb daemon: if sfcbd stops, sfcbd-watchdog restarts it.

Can you please attach the sel log output and a screenshot of the health status? Then I may be able to help you with it.
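To illustrate the general watchdog idea described above (one process checks another and starts it again when it has stopped), here is a generic Python 3 sketch. It is not how sfcbd-watchdog is actually implemented; the function name, the polling interval, and the `max_restarts` limit are assumptions made only for the example.

```python
import subprocess
import time

def watchdog(command, poll_interval=5.0, max_restarts=None):
    """Keep `command` running: start it, and start it again whenever it exits.

    Returns the number of restarts performed once max_restarts is reached.
    With max_restarts=None the loop never returns.
    """
    restarts = 0
    proc = subprocess.Popen(command)
    while True:
        if proc.poll() is not None:  # the monitored process has exited
            if max_restarts is not None and restarts >= max_restarts:
                return restarts
            proc = subprocess.Popen(command)  # restart it
            restarts += 1
        time.sleep(poll_interval)
```

The `max_restarts` limit is there so the sketch can stop; a real watchdog would normally run until it is stopped itself.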

titanlee
Contributor

Hiya,

thank you for your reply.

Here is the hardware status:

The content of the file 'sel' (the same content repeated, 40 MB worth):

And 'sel.raw' (the same content repeated, 130 MB worth):

Edit: sorry, I was logged into the wrong account :)

abaum
Hot Shot

We've been having problems with these two services: when the watchdog service restarts, it also restarts hostd, which causes the hosts to show up as disconnected in VC for a few minutes. Tech Support just had us disable the sfcbd-watchdog service. I used to have this problem on my HP servers and now I am seeing it on UCS. Looks like CIM/IPMI and VM don't get along.

adam

admin
Immortal

Hi,

I saw the screenshot and the sel entries. I suspect there is an issue with OEM integration.

Have you integrated any OEM components with the ESX build? I ask because the screenshot you pasted shows "Asset tag : To be filed by O.E.M."

The workaround is:

In the same Health Status screen there is a drop-down list. Select "Sensor event log" and click "Reset event log", which will remove the entries from sel and sel.raw.

Which build have you installed? Is it a fresh install or an upgrade? Please let me know so I can try to reproduce this issue.

mrudloff
Enthusiast

The server runs the latest possible release and it is a fresh install (4.1.0, 260247).

We run several more hosts in the same cluster and this seems to be the only one with this problem.

Even when I click reset, the files immediately start to grow again. As you can see in the attachment, even the date seems to be an odd one.

admin
Immortal

I hope your other servers are from the same hardware vendor and have the same ESX build installed? May I know which hardware vendor you have?

Please compare the BIOS firmware and BMC firmware versions with those of the servers where you are not facing this issue. If they are not the same, try upgrading the firmware.

mrudloff
Enthusiast

All three servers are identical in every way, including every firmware / BIOS revision.

Motherboard is a Supermicro X8DTN+-F

VirtualEquality
Contributor

Hi,

we have exactly the same problem, and also three servers from the same vendor but only one with this issue. Stopping the watchdog service fixes the problem, but of course it is no real solution.

Have you or somebody else found a solution for the problem?

Thanks in advance.

aaiitsupport
Contributor

I am having the same problem with a Supermicro X8SIL-F-O. Has anybody had luck getting this working? Part of me thinks that it has something to do with the Asset Tag not being set properly. Does anybody know how to change the asset tag?

ds236
Contributor

Count me as yet another with this issue. The machine gets into trouble within 30 minutes of reboot. A reboot allows things to proceed as normal, then we must reboot once again.

ds236
Contributor

From vSphere Client, Configuration, Software/Advanced Settings, I found:

VMKernel.Boot.ipmiEnabled

I rebooted the box, then immediately went in and unchecked this setting, then rebooted again before the sel and sel.raw files again filled up the space.

After reboot, the /var/log/ipmi/0/ directory on that machine is empty.

This isn't a great workaround, as most of the sensor data now can't be monitored by VMware, but it does keep VMware from filling these two log files with the same information over and over and over. I'd really like to see a fix from VMware. The misbehavior is easy to spot: it writes "efef" forever in one log and "System Boot:" over and over in the other. This can't be that hard to fix in software.
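A quick way to confirm that a log really is the same entry over and over is to count how often each line occurs. The Python 3 sketch below assumes the log can be read as line-oriented text, which is an assumption and not something verified for sel or sel.raw; the function name is made up for illustration.

```python
from collections import Counter

def dominant_line(lines):
    """Return (line, share) for the most common non-empty line.

    `share` is the fraction of non-empty lines equal to that line.
    Returns (None, 0.0) if there are no non-empty lines.
    """
    counts = Counter(l.strip() for l in lines if l.strip())
    if not counts:
        return None, 0.0
    line, n = counts.most_common(1)[0]
    return line, n / sum(counts.values())
```

Passing an open file object works because iterating over it yields lines. A share close to 1.0 would mean the file is almost entirely one repeated entry.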

zero1
Contributor

I am having the exact same issue with a Supermicro X8SIL-F-O.  Workaround seems to work but I hope this gets a permanent fix soon.

ds236
Contributor

Filling in the Asset Tag value (and, for that matter, any other blank values in the FRU) has ZERO impact on this issue. It looks like the issue here is a bug in ESXi as it interacts with this motherboard. I'd sure like to see this fixed, and it will affect my purchase of vSphere licensing, as we are evaluating the product set now.

Disabling IPMI entirely in the VMWare configuration is presently the only way to make these platforms functional.

That VMWare fails with cryptic messages about the disk being full, rather than properly handle the issue, gives an appearance of poor software quality, and questionable software quality assurance testing to the entire product. Surely you can do better.

acarrasco201110
Contributor

We have the same problem on one of our six ESX 4.1 servers. It's an IBM x3650 M3.

Red alert and in System event log:

01/01/9999 1:00:00 AM      OEM Defined:0xefefefefefefefefefefef

Any solution?

Thanks in advance.
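The event shown above has two telltale signs: a year of 9999 and a payload made of a single repeated byte (0xef). A small Python 3 sketch can flag such entries. The regular expression is only modeled on the one line quoted in this post, so the entry format, the field names, and the two heuristics are assumptions rather than a documented format.

```python
import re

# Pattern modeled on the single example line in the post (assumed format).
ENTRY_RE = re.compile(
    r"^(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})\s+"
    r"(?P<time>\d{1,2}:\d{2}:\d{2} [AP]M)\s+"
    r"(?P<kind>[^:]+):0x(?P<payload>(?:[0-9a-fA-F]{2})+)$"
)

def parse_entry(line):
    """Return (year, kind, payload_bytes), or None if the line does not match."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    return int(m.group("year")), m.group("kind"), bytes.fromhex(m.group("payload"))

def looks_bogus(line):
    """Heuristic: placeholder year 9999, or a payload of one repeated byte."""
    parsed = parse_entry(line)
    if parsed is None:
        return False
    year, _kind, payload = parsed
    return year == 9999 or (len(payload) > 1 and len(set(payload)) == 1)
```

Lines that do not match the assumed format are treated as not bogus, so the check only reports entries it can actually parse.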

acarrasco201110
Contributor

At the IBM IMM (ILO) level, we changed the hostname and rebooted the IMM. Now all the problems have disappeared.

bert_cl
Contributor

The problem is caused by a firmware update (an IMM, UEFI, or DSA update). If the ESX host has not been restarted after those updates, you will get this error.

benny_hauk
Enthusiast

We had the exact same symptom (ESXi 4.1, build 260247; only happened on one system; would disconnect from vCenter temporarily) and VMware tech support told us to disable the CIM agent [link]. The issue went away, but we can no longer view hardware status, driver versions, etc. The problem occurred on an IBM BladeCenter HS22v blade that was up to date firmware-wise.

Has anyone heard which update or patch, if any, fixed this? Also: the VMKernel.Boot.ipmiEnabled fix vs. the disable-the-CIM-agent fix... which workaround is best? If they both work, does the VMKernel.Boot.ipmiEnabled fix allow some functionality from the Hardware Status tab, or does it disable everything as well?

Benny Hauk Systems Admin, VCP3/VCP4 LifeWay Christian Resources
ds236
Contributor

We moved to ESXi 5.0, and the problem no longer exists. Someone at VMWare knew about it and fixed it. Lobby for it to be patched in 4.1?
